AquaLog: An ontology-driven question answering system for organizational semantic intranets

Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

AquaLog: An ontology-driven question answering system for organizational semantic intranets

Vanessa Lopez, Victoria Uren, Enrico Motta, Michele Pasin

Knowledge Media Institute & Centre for Research in Computing, The Open University, Walton Hall, Milton Keynes MK7 6AA, United Kingdom

Received 29 July 2005; accepted 23 March 2007
Available online 31 March 2007

Abstract

The semantic web vision is one in which rich, ontology-based semantic markup will become widely available. The availability of semantic markup on the web opens the way to novel, sophisticated forms of question answering. AquaLog is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs). We say that AquaLog is portable because the configuration time required to customize the system for a particular ontology is negligible. AquaLog presents an elegant solution in which different strategies are combined together in a novel way. It makes use of the GATE NLP platform, string metric algorithms, WordNet and a novel ontology-based relation similarity service to make sense of user queries with respect to the target KB. Moreover, it also includes a learning component, which ensures that the performance of the system improves over time, in response to the particular community jargon used by end users.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Semantic web; Question answering; Ontologies; Natural language

1. Introduction

The semantic web vision [7] is one in which rich, ontology-based semantic markup is widely available, enabling an author to improve content by adding meta-information.
This means that with structured or semi-structured documents, texts can be semantically marked-up and ontological support for term definition provided, both to enable sophisticated interoperability among agents, e.g., in the e-commerce area, and to support human web users in locating and making sense of information. For instance, tools such as Magpie [19] support semantic web browsing by allowing users to select a particular ontology and use it as a kind of ‘semantic lens’, which assists them in making sense of the information they are looking at. As discussed by McGuinness in her essay on “Question Answering on the Semantic Web” [41], the availability of semantic markup on the web also opens the way to novel, sophisticated forms of question answering, which not only can potentially provide increased precision and recall compared to today’s search engines, but are also capable of offering additional functionalities, such as (i) proactively offering additional information about an answer, (ii) providing measures of reliability and trust and/or (iii) explaining how the answer was derived. Therefore, most emphasis in QA is currently on the use of ontologies on the fly to mark-up text and make its retrieval smarter by using query expansion [41]. While semantic information can be used in several different ways to improve question answering, an important (and fairly obvious) consequence of the availability of semantic markup on the web is that it can indeed be queried directly. In other words, we can exploit the availability of semantic statements to provide precise answers to complex queries, allowing the use of inference and object manipulation.

* Corresponding author. Fax: +44 1908 653169.
E-mail addresses: [email protected] (V. Lopez), [email protected] (V. Uren), [email protected] (E. Motta), [email protected] (M. Pasin).
Moreover, as semantic markup becomes ubiquitous, it will become advantageous to be able to ask queries and obtain answers using natural language (NL) expressions, rather than the keyword-based retrieval mechanisms used by the current search engines.

1570-8268/$ – see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.websem.2007.03.003

For instance, we are currently augmenting our departmental web site in the Knowledge Media Institute (KMi), http://kmi.open.ac.uk, with semantic markup, by instantiating an ontology



describing academic life [1], with information about our personnel, projects, technologies, events, etc., which is automatically extracted from departmental databases and unstructured web pages. In the context of standard keyword-based search, this semantic markup makes it possible to ensure that standard search queries such as “Peter Scott home page KMi” actually return Dr Peter Scott’s home page as their first result, rather than some other resource (as is the case when using current non-semantic search engines on this particular query). Moreover, as pointed out above, we can also query this semantic markup directly. For instance, we can ask a query such as “which are the projects in KMi related to the semantic web area?” and, thanks to an inference engine able to reason about the semantic markup and draw inferences from axioms in the ontology, we can then get the correct answer.

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities [8,30,2,10,28], even if in the past decade it has somewhat gone out of fashion [3,27]. However, it is our view that the semantic web provides a new and potentially very important context in which results from this area of research can be applied. Moreover, interestingly from a research point of view, it provides a new ‘twist’ on the old issues associated with NLIDB research.

As pointed out in Ref. [24], the key limitation of NL interfaces to databases is that they presume that the knowledge the system is using to answer the question is a structured knowledge base in a limited domain. However, an important advantage is that a knowledge-based question answering system can help with answering questions requiring situation-specific answers, where multiple pieces of information need to be combined and, therefore, the answers are inferred at run time, rather than reciting a pre-written paragraph of text [12].

Hence, in the first instance, the work on the AquaLog query answering system described in this paper is based on the premise that the semantic web will benefit from the availability of natural language query interfaces, which allow users to query semantic markup viewed as a structured knowledge base. Moreover, similarly to the approach we have adopted in the Magpie system, we believe that in the semantic web scenario it makes sense to provide query answering systems which are portable with respect to ontologies. In other words, just as in the case of tools such as Magpie, where the user is able to select an ontology (essentially a semantic viewpoint) and then browse the web through this semantic filter, our AquaLog system allows the user to choose an ontology and then ask queries with respect to the universe of discourse covered by the ontology.

Hence, AquaLog, our natural language front-end for the semantic web, is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs) which instantiate the input ontology with domain-specific information. Thus, AquaLog is especially suitable as a front end to organizational semantic intranets, where an organizational ontology is used as the basis for semantic mark-up. Of relevance is the valuable work done in AquaLog on the syntactic and semantic analysis of the questions.

AquaLog makes use of the GATE NLP platform, string metric algorithms, WordNet and novel ontology-based similarity services for relations and classes to make sense of user queries with respect to the target knowledge base. Also, AquaLog is coupled with a portable and contextualized learning mechanism to obtain domain-dependent knowledge by creating a lexicon; the learning component ensures that the performance of the system improves over time, in response to the particular community jargon of the end users.

Finally, although AquaLog has primarily been designed for use with semantic web languages, it makes use of a generic plug-in mechanism, which means it can be easily interfaced to different ontology servers and knowledge representation platforms.

The paper is organized as follows: in Section 2 we present an example of AquaLog in action. In Section 3 we describe the AquaLog architecture. In Section 4 we describe the Linguistic Component embedded in AquaLog; in Section 5, the novel Relation Similarity Service; in Section 6, the Learning Mechanism; in Section 7, the evaluation scenario, followed by discussion and directions for future work in Section 8. In Section 9 we summarize related work. In Section 10 we give a summary. Finally, an appendix is attached with examples of NL queries and their equivalent triple representation.

2. AquaLog in action: illustrative example

First of all, at a coarse-grained level of abstraction, the AquaLog architecture can be characterized as a cascaded model in which a NL query gets translated by the Linguistic Component into a set of intermediate triple-based representations, which are referred to as the Query-Triples. Then, the Relation Similarity Service (RSS) component takes as input these Query-Triples and further processes them to produce the ontology-compliant queries, called Onto-Triples, as shown in Fig. 1.

Fig. 1. The AquaLog data model.

The data model is triple-based, namely it takes the form of 〈subject, predicate, object〉. There are two main reasons for adopting a triple-based data model. First of all, as pointed out by Katz et al. [31], although not all possible queries can be represented in the binary relational model, in practice these exceptions occur very infrequently. Secondly, RDF-based knowledge representation (KR) formalisms for the semantic web, such as RDF itself [47] or OWL [40], also subscribe to this binary relational model and express statements as 〈subject, predicate, object〉. Hence, it makes sense for a query system targeted at the semantic web to adopt a triple-based model that shares the same format as many millions of other triples on the Semantic Web.

For example, in the context of the academic domain in our department, AquaLog is able to translate the question “what is the homepage of Peter, who has an interest on the semantic web?” into the following ontology-compliant logical query: 〈what is, has-web-address, peter-scott〉 and 〈person, has-research-interest, Semantic-Web-area〉, expressed as a conjunction of non-ground triples (i.e. triples containing variables).

In particular, consider the query “what is the homepage of Peter?”: the role of the RSS is to map the intermediate form


〈what is, homepage, peter〉, obtained by the linguistic component, into the target ontology-compliant query.

The RSS calls on the user to participate in the QA process if no information is available to AquaLog to disambiguate the query directly. Let’s consider the query in Fig. 2. On the left screen, we are looking for the homepage of Peter. By using string metrics, the system is unable to disambiguate between Peter-Scott, Peter-Sharpe, Peter-Whalley, etc. Therefore, user feedback is required. Moreover, on the right screen, user feedback is required to disambiguate the term “homepage” (is it the same as “has-web-address”?). As it is the first time the system has come across this term, no synonyms have been identified in WordNet that relate both labels together, and the ontology does not provide further ways to disambiguate. We ask the system to learn the user’s vocabulary and context for future occasions.

Fig. 2. Illustrative example of user interactivity to disambiguate a basic query.

In Fig. 3, we are asking for the web address of Peter, who has an interest in semantic web. In this case, AquaLog does not need any assistance from the user, given that, by analyzing the ontology, only one of the “Peters” has an interest in Semantic Web, and only one possible ontology relation, “has-research-interest” (taking into account taxonomy inheritance), exists between “person” and the concept “research area”, of which “semantic web” is an instance. Also, the similarity relation between “homepage” and “has-web-address” has been learned by the Learning Mechanism in the context of that query’s arguments, so the performance of the system improves over time. When the RSS comes across a similar query, it has to access the ontology information to recreate the context and complete the ontology triples; in that way, it realizes that the second part of the query, “who has an interest on the Semantic Web”, is a modifier of the term “Peter”.

3. AquaLog architecture

AquaLog is implemented in Java as a modular web application using a client–server architecture. Moreover, AquaLog provides an API which allows future integration in other platforms and independent use of its components. Fig. 4 shows


Fig. 3. Illustrative example of AquaLog disambiguation.


the different components of the AquaLog architectural solution. As a result of this design, our existing version of AquaLog is modular, flexible and scalable.¹

Fig. 4. The AquaLog global architecture.

The Linguistic Component and the Relation Similarity Service (RSS) are the two central components of AquaLog, both of them portable and independent in nature. Currently, AquaLog automatically classifies the query following a linguistic criterion. The linguistic coverage can be extended through the use of regular expressions, by augmenting the patterns covered by an existing classification or creating a new one. This functionality is discussed in Section 4.

¹ AquaLog is available for developers as Open Source, licensed under the Apache License, Version 2.0: http://sourceforge.net/projects/aqualog

A key feature of AquaLog is the use of a plug-in mechanism, which allows AquaLog to be configured for different Knowledge Representation (KR) languages. Currently, we subscribe to the Operational Conceptual Modeling Language (OCML), using our own OCML-based KR infrastructure [52]. However, in future, our aim is also to provide direct plug-in mechanisms for the emerging RDF and OWL servers.²

² We are able to import and export RDF(S) and OWL from OCML, so the lack of an explicit RDF/OWL plug-in is not a problem in practice.


When developing AquaLog, we extended the set of annotations returned by GATE by identifying noun terms, relations,⁴ question indicators (which/who/when, etc.) and patterns or types of questions.


To reduce the number of calls/requests to the target knowledge base, and to guarantee real-time question answering even when multiple users access the server simultaneously, the AquaLog server accesses and caches basic indexing data from the target Knowledge Bases (KBs) at initialization time. The cached information in the server can be efficiently accessed by the remote clients. In our research department, KMi, since AquaLog is also integrated into the KMi Semantic Portal [35], the KB is dynamically and constantly changing, with new information obtained from KMi websites through different agents. Therefore, a mechanism is provided to update the cached indexing data on the AquaLog server. This mechanism is called by these agents when they update the KB.

As regards portability, the only entry points which require human intervention for adapting AquaLog to a new domain are the configuration files. Through the configuration files, it is possible to specify the parameters needed to initialize the AquaLog server. The most important parameters are the ontology name and server login details if necessary, the name of the plug-in, and the slots that correspond to pretty names. Pretty names are alternative names that an instance may have. Optionally, the main concepts of interest in an ontology can be specified.

The other ontology-dependent parameters required in the configuration files are the ones which specify that the terms “who”, “where” and “when” correspond, for example, to the ontology terms “person/organization”, “location” and “time-position”, respectively. The independence of the application from the ontology is guaranteed through these parameters.
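Such a configuration file might look roughly as follows (a hypothetical sketch in properties style: the parameter names and values are invented for illustration — only the kinds of entries described above are taken from the paper):

```properties
# Ontology name and server login details (if necessary)
ontology.name = kmi-academic-ontology
server.user = aqualog
server.password = ****

# Name of the plug-in for the target KR language
plugin = ocml

# Slots that correspond to pretty names (alternative names of an instance)
pretty-name.slots = has-pretty-name, has-variant-name

# Ontology terms corresponding to "who", "where" and "when"
term.who = person, organization
term.where = location
term.when = time-position
```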

There is no single strategy by which a query can be interpreted, because this process is highly dependent on the type of the query. So, in the next sections, we will not attempt to give a comprehensive overview of all the possible strategies, but rather we will show a few examples which are illustrative of the way the different AquaLog components work together.

3.1. AquaLog triple approach

Core to the overall architecture is the triple-based data representation approach. A major challenge in the development of the current version of AquaLog is to efficiently deal with complex queries in which there could be more than one or two terms. These terms may take the form of modifiers that change the meaning of other syntactic constituents, and they can be mapped to instances, classes, values or combinations of them, in compliance with the ontology to which they subscribe. Moreover, the wider expressiveness adds another layer of complexity, mainly due to the ambiguous nature of human language. So, naturally, the question arises of how far the engineering functionality of the triple-based model can be extended to map a triple obtained through a NL query into a triple that can be realized by a given ontology.

To accomplish this task, the triple in the current AquaLog version has been slightly extended. Therefore, an existing linguistic triple now consists of one, two or even three terms connected through a relationship. A query can be translated into one or more linguistic-triples, and then each linguistic triple can be translated into one or more ontology-compliant-triples. Each triple also has additional lexical features, in order to facilitate reasoning about the answer, such as the voice and tense of the relation. Another key feature for each triple is its category (see an example table of query types in Appendix A in Section 11). These categories identify different basic structures of the NL query and their equivalent representation in the triple. Depending on the category, the triple tells us how to deal with its elements, what inference process is required, and what kind of answer can be expected. For instance, different queries may be represented by triples of the same category, since in natural language there can be different ways of asking the same question, i.e. “who works in akt?”³ and “Show me all researchers involved in the akt project”. The classification of the triple may be modified during its life cycle, in compliance with the target ontology it subscribes to.

In what follows, we provide an overview of each component of the AquaLog architecture.

4. Linguistic component: from questions to query triples

When a query is asked, the Linguistic Component’s task is to translate from NL to the triple format used to query the ontology (Query-Triples). This preprocessing step helps towards the accurate classification of the query and its components, by using standard research tools and query classification. Classification identifies the type of question and hence the kind of answer required. In particular, AquaLog uses the GATE [15,49] infrastructure and resources (language resources, processing resources like ANNIE, serial controllers, pipelines, etc.) as part of the Linguistic Component. Communication between AquaLog and GATE takes place through the standard GATE API.

After the execution of the GATE controller, a set of syntactic annotations associated with the input query are returned. In particular, AquaLog makes use of GATE processing resources for English: tokenizer, sentence splitter, POS tagger and VP chunker. The annotations returned after the sequential execution of these resources include information about sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs, and categories for the nouns, such as determinant, singular/plural, conjunction, possessive determiner, preposition, existential, wh-determiner, etc. These features returned for each annotation are important to create the triples; for instance, in the case of verb annotations, together with the voice and tense, it also includes information that indicates if it is the main verb of the query (the one that separates the nominal group and the predicate). This information is used by the Linguistic Component to establish the proper attachment of prepositional phrases previous to the creation of the Query-Triples (an example is shown in Section 5.3).

³ AKT is an EPSRC funded project in which the Open University is one of the partners: http://www.aktors.org/akt

⁴ Hence, for example, we exploited the fact that natural language commonly uses a preposition to express a relationship. For instance, in the query “who has research interest in semantic services in KMi?”, the relation of the query is a combination of the verb “has” and the noun “research interest”.


Fig. 5. Example of GATE annotations and linguistic triples for basic queries.

Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.

This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular-expression-based rule language offered by GATE⁵ (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars’ power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.⁶ Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.
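A JAPE pattern rule of the kind described has roughly the following shape (a schematic sketch, not one of the actual AquaLog grammars: the phase, rule and annotation names are invented for illustration):

```jape
Phase: QuestionClassification
Input: Token
Options: control = appelt

// Recognize a simple wh-question opening such as "which projects ..."
Rule: BasicWhQuestion
(
  {Token.category == "WDT"}   // wh-determiner, e.g. "which"
  {Token.category == "NNS"}   // plural noun, e.g. "projects"
):whq
-->
:whq.QuestionType = {type = "wh-basic"}
```

The left-hand side matches a sequence of earlier annotations (here, POS-tagged tokens); the right-hand side adds a new annotation over the matched span, which later phases can then match in turn.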

Currently, the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types or intermediate representations (examples can be seen in Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question, like “are there any phd students in dotkom?”, where the relation is implicit or unknown, or “which is the job title of John?”, where no information about the type of the expected answer is provided, etc.

⁵ http://gate.ac.uk/sale/tao/index.html
⁶ LC binaries and JAPE grammars can be downloaded from http://kmi.open.ac.uk/technologies/aqualog

Categories not only tell us the kind of solution that needs to be achieved, but they also give an indication of the most likely common problems that the Linguistic Component and Relation Similarity Service will need to deal with to understand this particular NL query and, in consequence, this guides the process of creating the equivalent intermediate representation or Query-Triple. For instance, in the previous example, “what are the research areas covered by the akt project?”, thanks to the category, the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them, and generates the following Query-Triple: 〈research areas, covered, akt project〉, in which “research areas” has correctly been identified as the query term (instead of “what”).

For the intermediate representation, we use the triple-based data model rather than logic, mainly because at this stage we do


not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate the input for the RSS. An alternative would be to use more complex linguistic parsing but, as discussed by Katz et al. [33,29], although parse trees (as, for example, in the Stanford NLP parser [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We are dealing with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that, at this stage, all the terms are still strings or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL-query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries

There are several different types of basic queries. We can mention the affirmative/negative query type, which are those queries that require an affirmation or negation as an answer, i.e. “is John Domingue a phd student?”, “is Motta working in ibm?”. Another big set of queries are those constituted by a “wh-question”, such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also, imperative commands like list, give, tell, name, etc. are treated as wh-queries (see an example table of query types in Appendix A in Section 11).


uistic triples for combination of queries

In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in “who are the academics involved in the semantic web?”, the generic triple will be of the form 〈generic term, relation, second term〉, in our example 〈academics, involved, semantic web〉. A query with an equivalent triple representation is “which technologies has KMi produced?”, where the triple will be 〈technologies, has produced, KMi〉. However, a query like “are there any phd students in akt?” has another equivalent representation, where the relation is implicit or unknown: 〈phd students, akt〉. Other queries may provide little or no information about the type of the expected answer, e.g. “what is the job title of John?”, or they can be just a generic enquiry about someone or something, e.g. “who is Vanessa?”, “what is an ontology?” (see Fig. 5).
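The flavour of this categorization can be caricatured with a couple of regular expressions (a deliberately naive Python sketch: the patterns and category labels are ours and cover only two of the roughly 14 question types the real grammars distinguish):

```python
import re

def classify_basic_query(q):
    """Classify a question into one of two toy categories and build its Query-Triple.
    Returns (category, (term, relation, second_term)); None marks an unknown slot."""
    q = q.strip().rstrip("?").lower()
    # Explicit relation: "who are the academics involved in the semantic web"
    m = re.match(r"(?:who|which|what) (?:are|is) the (.+?) (\w+) (?:in|by) (?:the )?(.+)$", q)
    if m:
        return "generic term + relation", (m.group(1), m.group(2), m.group(3))
    # Implicit/unknown relation: "are there any phd students in akt"
    m = re.match(r"are there any (.+?) in (.+)$", q)
    if m:
        return "implicit relation", (m.group(1), None, m.group(2))
    return "unclassified", None
```

For instance, the first pattern turns “who are the academics involved in the semantic web?” into the triple (academics, involved, semantic web), while the second leaves the relation slot empty for the ontology-mapping step to fill.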

4.1.2. Linguistic procedure for basic queries with clauses (three-terms queries)

Let us consider the request “List all the projects in the knowledge media institute about the semantic web”, where both “in knowledge media institute” and “about semantic web” are modifiers (i.e. they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, through the use of the ontology, or in an interactive way by asking for user feedback. The linguistic component’s task is therefore to pass the ambiguity problem to the RSS, through the intermediate representation, as part of the translation process.

We must remember that all these queries are always repre-ented by just one Query-Triple at the linguistic stage In thease that modifiers are presented the triple will consist of threeerms instead of two for instance in ldquoare there any planet newsritten by researchers from aktrdquo in which we get the terms

planet newsrdquo ldquoresearchersrdquo and ldquoaktrdquo (see Fig 6)The resultant Query-Triple is the input for the Relation Sim-

larity Services that process the basic queries with clauses asxplained in Section 52


4.1.3. Linguistic procedure for combination of queries

Nevertheless, a query can be a composition of two explicit relationships between terms, or a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt?" the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force used to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web?". This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned on a second query, as in "which researchers wrote publications related to social aspects?", where the second clause modifies one of the previous terms. In this example, and at this stage, the ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. "what is the web address of Peter who works for akt?" or "which are the projects led by Enrico which are related to multimedia technologies?" (see Fig. 7).
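The first combination mechanism can be sketched as follows. This is an illustrative toy only: resolve() stands in for the full triple-resolution machinery, and the mini knowledge base is invented.

```python
# Each KB fact is a (subject, relation, object) tuple -- invented examples.
KB = {
    ("magpie", "funded", "epsrc"),
    ("akt", "funded", "epsrc"),
    ("akt", "about", "semantic web"),
}

def resolve(triple, kb):
    """Return the subjects satisfying a Query-Triple's relation and object."""
    _, relation, term2 = triple
    return {s for (s, r, o) in kb if r == relation and o == term2}

# "which projects are funded by epsrc and are about semantic web?"
t1 = ("projects", "funded", "epsrc")
t2 = ("projects", "about", "semantic web")
conj_answer = resolve(t1, KB) & resolve(t2, KB)  # "and": intersect the lists
disj_answer = resolve(t1, KB) | resolve(t2, KB)  # "or": union them instead
```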

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do" and "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping between Query-Triples and ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.

4.2. The use of JAPE grammars in AquaLog: an illustrative example

Consider the basic query in Fig. 5, "what are the research areas covered by the akt project?", as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "researchers" and "akt projects", the relations "are" and "covered by", and the query term "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e. before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET)* part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the ((RB)*ADJ|NOUN|CC)* part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as do, for example, the macro LIST used by the rule RuleQU1 or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g. the last pattern identifies words assigned to the VB POS tags. POS tags are assigned in the "category" feature of the "Token" annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side; e.g. rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QU WHICH_IS_ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QU, WHICH_IS_ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
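Outside GATE, the shape of the RuleNP pattern can be imitated with an ordinary regular expression over a sequence of POS tags. This is only an illustration of the pattern described above, not JAPE code:

```python
import re

# RuleNP's left hand side: (DET)* ((RB)*ADJ | NOUN | CC)* NOUN
# A noun phrase starts with zero or more determiners, continues with any mix
# of adverb/adjective runs, nouns or coordinating conjunctions, and must
# finish with a noun.
NP_PATTERN = re.compile(r"^(DET )*((RB )*ADJ |NOUN |CC )*NOUN$")

def is_noun_phrase(pos_tags):
    """Check whether a POS-tag sequence matches the noun-phrase pattern."""
    return bool(NP_PATTERN.match(" ".join(pos_tags)))
```

For example, "the akt project", tagged DET NOUN NOUN, matches, while a sequence ending in an adjective does not.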

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 6.

Fig. 9. Set of JAPE rules used in the query example to classify the queries.

An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user's term. Moreover, relation names can be considered as "second class citizens": the rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts. In the case of concepts, labels often capture the meaning of the semantic entity, while for relations the meaning is given by the type of the domain and the range, rather than by the name (typically vaguely defined, e.g. "related to"), with the exception of the cases in which the relations are represented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. in the query "is Enrico working in ibm?", where "Enrico" is mapped into "Enrico-motta" in the KB but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let us consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type wh-query (one represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that "research areas" is actually a "research area" in the target KB and that "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses", and it therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, let us consider a case in which we are talking about a generic term "person"; the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".
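The relation search just described can be sketched as follows. Candidate relations are those whose domain covers a subclass of the first term and whose range covers a superclass of the second; WordNet-style synonyms of the user's verb then select among them. All tables below are invented for illustration, not taken from the AKT ontology:

```python
# Toy subsumption and relation tables (invented).
SUBCLASSES = {"research area": {"research area", "technology-area"}}
SUPERCLASSES = {"project": {"project", "organization-unit"}}
RELATIONS = [  # (name, domain, range)
    ("addresses-generic-area-of-interest", "research area", "project"),
    ("uses-resource", "technology-area", "project"),
]
SYNONYMS = {"covered": {"addresses", "covers"}}  # WordNet stand-in

def candidate_relations(term1, term2):
    """Relations linking any subclass of term1 to any superclass of term2."""
    return [name for name, dom, rng in RELATIONS
            if dom in SUBCLASSES[term1] and rng in SUPERCLASSES[term2]]

def pick_relation(user_verb, candidates):
    """Choose the candidate matching a synonym of the user's verb."""
    syns = SYNONYMS.get(user_verb, set())
    for name in candidates:
        if any(name.startswith(s) for s in syns):
            return name
    return None  # no synonym match: fall back to string matching or the user

chosen = pick_relation("covered", candidate_relations("research area", "project"))
```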

Whenever multiple relations are possible candidates for interpreting the query, and the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. The translation process from lexical to ontological triples can therefore go ahead without misinterpretations.

Fig. 12. Example of AquaLog in action for relations formed by a concept.


Note that some additional lexical features which help in the translation, or put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The last case means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.
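A minimal sketch of that extra check, with an invented toy ontology: the noun inside the relation is looked up first as a concept, then as a slot of the domain concept.

```python
CONCEPTS = {"secretary", "person", "project"}           # toy ontology
SLOTS = {"person": {"has-job-title", "has-web-address"}}

def classify_relation_noun(noun, domain_concept):
    """Decide whether a noun inside a relation names a concept or a slot."""
    if noun in CONCEPTS:
        return ("concept", noun)               # e.g. "is the secretary"
    for slot in SLOTS.get(domain_concept, ()):
        if noun.replace(" ", "-") in slot:
            return ("slot", slot)              # e.g. "is the job title"
    return ("unknown", None)
```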

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end and after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (and "in akt" in the previous examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g. in this case the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of the list of projects related to the semantic web and the list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, it is first necessary to get the list of researchers related to akt and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any projects on semantic web sponsored by epsrc?") and Fig. 14 ("Are there any projects by researchers working in AKT?"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".
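The attachment heuristic can be sketched as a right-to-left scan for the closest non-ground (class) term; is_class() and the class list stand in for the real ontology lookup:

```python
CLASSES = {"project", "researcher", "planet news story"}  # toy class names

def is_class(term):
    return term in CLASSES

def attach_modifier(terms, modifier):
    """Attach the trailing modifier to the closest preceding class term."""
    for term in reversed(terms):                # scan from the end backwards
        if is_class(term):
            return (term, modifier)
    return (terms[0], modifier)                 # default when none is a class

# "are there any projects on the semantic web sponsored by epsrc?"
a1 = attach_modifier(["project", "semantic web"], "sponsored by epsrc")
# "which planet news stories are written by researchers in akt?"
a2 = attach_modifier(["planet news story", "researcher"], "in akt")
```

In the first case "semantic web" is not a class, so the modifier climbs back to "project"; in the second, "researcher" is the closest class term, so "in akt" attaches there.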

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Depending on the type of queries, the triples are combined following different criteria.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

To clarify the kind of ambiguity we deal with, let us consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology, or if needed by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in such a way that it can determine that "akt" is a "project" and therefore is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a "concept" in the ontology which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there can be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter who has an interest in semantic web?". In this case, since "academic" and "peter" are, respectively, a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc". Consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g. "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its use of nominal compounds (e.g. WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually, or it can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there would be no ambiguity if the user reformulated the question as "which academic works with Peter and has an interest in semantic web", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies; e.g. an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect) but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g. in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of the people who have an interest in ontologies and of the people who are interested in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by epsrc?" the second triple, 〈akt project, sponsored, epsrc〉, must be resolved and the instance representing the "akt project sponsored by epsrc" identified, in order to get the list of academics required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?" the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and to disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language-universe and the reduced ontology-language-universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105 87

mec

niomit6iof

tswtsdebmd

wiiptsfooei

os

hfoInR

6

du

wtldquowtldquots

ldquortte

pidr

Fig 17 The learning

ormally represented just by its lower and upper bounds (max-mally general and maximally specific hypothesis sets) insteadf listing all consistent hypotheses In the case of our learningechanism these bounds are partially inferred through a general-

zation process (Section 63) that will in future be applicable alsoo the user-related and community derived knowledge (Section4) The creation of specific algorithms for combining the lex-cal and the community dimensions will increase the precisionf the context-definition and provide additional personalizationeatures

Thus when a question with a similar context is prompted ifhe RSS cannot disambiguate the relation-name the database iscanned for some matching results Subsequently these resultsill be context-proved in order to check their consistency with

he stored version spaces By tightening and loosening the con-traints of the version space the learning mechanism is able toetermine when to propose a substitution and when not to Forxample the user-constraint is a feature that is often bypassedecause we are inside a generic user session or because weight want to have all the results of all the users from a single

atabase queryBefore the matching method we are always in a situation

here the onto-triple is incomplete the relation is unknown or its a concept If the new word is found in the database the contexts checked to see if it is consistent with what has been recordedreviously If this gives a positive result we can have a valid onto-riple substitution that triggers the inference engine (this lattercans the knowledge base for results) instead if the matchingails a user disambiguation is needed in order to complete thento-triple In this case before letting the inference engine workut the results the context is drawn from the particular questionntered and it is learned together with the relation and the other

nformation in the version space

Of course the matching methodrsquos movement through thentology is opposite to the learning methodrsquos one The lattertarting from the arguments tries to go up until it reaches the

tsid

hanism architecture

ighest valid classes possible (GetContext method) while theormer takes the two arguments and checks if they are subclassesf what has been stored in the database (CheckContext method)t is also important to notice that the Learning Mechanism doesot have a question classification on its own but relies on theSS classification
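The interplay between the two methods can be sketched as follows. This is a minimal Python illustration of the control flow described above, not AquaLog's actual implementation; the record fields mirror the description (word, relation, two context-arguments, ontology name, user), and all identifiers are ours:

```python
# Minimal sketch of the learning/matching interplay described above.
# Illustrative only; not AquaLog's actual code or schema.

learned_lexicon = []  # stands in for the database of learned mappings

def learn(word, relation, context_args, ontology, user):
    """Called after the user manually disambiguates an unrecognized term."""
    learned_lexicon.append({
        "word": word,              # key field for later lookups
        "relation": relation,      # the ontology relation it maps to
        "context": context_args,   # the two arguments of the question triple
        "ontology": ontology,
        "user": user,
    })

def match(word, context_args, ontology, user=None):
    """Called when the RSS cannot relate a linguistic triple to the ontology.

    The user constraint is often bypassed (user=None), so that mappings
    learned by any user can be reused in a generic session."""
    for rec in learned_lexicon:
        if rec["word"] != word or rec["ontology"] != ontology:
            continue
        if user is not None and rec["user"] != user:
            continue
        if rec["context"] == context_args:   # CheckContext, much simplified
            return rec["relation"]
    return None  # no match: fall back to user disambiguation, then learn()

learn("collaborates", "has-affiliation-to-unit",
      ("person", "organization-unit"), "akt-ontology", "user-1")
assert match("collaborates", ("person", "organization-unit"),
             "akt-ontology") == "has-affiliation-to-unit"
```

The `user=None` default mimics the "loosened" user-constraint of the version space: a generic session sees mappings from all users, at the cost of retrieving more (possibly undesired) meanings.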

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16), and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separated mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,


these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field in order to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition), and goes through all the connected superclasses of the other argument while checking if they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for a future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we

Fig. 18. User's disambiguation example.

asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification on its own, but it relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching, in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.
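The tightened query for the concept-case can be sketched as follows. Records and field names here are illustrative assumptions, not AquaLog's actual database schema:

```python
# Sketch of tightening the version space with an extra constraint when
# the "relation" of the linguistic triple was recognized as a concept.
# Hypothetical records; "is_concept" is the extra reminder field.

records = [
    {"word": "publication", "relation": "has-publication", "is_concept": True},
    {"word": "publication", "relation": "cites",           "is_concept": False},
]

def match(word, is_concept):
    # Query as for an unknown relation, plus the is_concept constraint:
    # this reduces the version space to the cases we are interested in.
    return [r["relation"] for r in records
            if r["word"] == word and r["is_concept"] == is_concept]

assert match("publication", True) == ["has-publication"]
```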

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information, and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material, having no constraints on the user field:

Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user's profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between user's choices and user's information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.
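The proposed profile-based propagation can be sketched as follows. This is a hypothetical illustration of the envisaged (not yet implemented) feature; the profile names, data and the `min_users` threshold are our own assumptions:

```python
# Sketch of profile-based mapping propagation: if enough users sharing a
# profile use the same jargon mapping, extend it to other users with that
# profile. Hypothetical data and threshold; not an implemented feature.

profiles = {"ana": "phd-student", "bob": "phd-student",
            "eve": "phd-student", "joe": "lecturer"}

mappings = [("ana", "collaborates", "works-in-the-same-project"),
            ("bob", "collaborates", "works-in-the-same-project")]

def propagate(mappings, profiles, min_users=2):
    extended = list(mappings)
    by_key = {}
    for user, word, rel in mappings:
        by_key.setdefault((profiles[user], word, rel), set()).add(user)
    for (profile, word, rel), users in by_key.items():
        if len(users) >= min_users:
            for other, p in profiles.items():
                if p == profile and other not in users:
                    extended.append((other, word, rel))
    return extended

ext = propagate(mappings, profiles)
assert ("eve", "collaborates", "works-in-the-same-project") in ext
```

Here "eve" inherits the mapping because two other PhD students already use it, while "joe" (a lecturer) does not.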

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind and know something about the domain to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability



across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure. However, the performance of AquaLog was also evaluated for it. The user


can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate if a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries, 10.14% of the total, present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33, i.e., 30.30%, have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).



the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered, because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious for the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parenthesis in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could, in theory (if not in practice), be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning: e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉; or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) non-recognition of concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of services have been identified, ranking, similarity/comparative, and proportions/percentages, which remain a future line of work for future versions of the system.
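The string-metric pitfall in point (2) can be reproduced with a toy similarity measure: "contact" is a letter-subsequence of "contract", so a subsequence-based metric ranks "employment-contract-type" above semantically closer labels. This is a stand-in metric and a hypothetical candidate list, not the actual string-metric package AquaLog uses:

```python
# Toy illustration of why purely syntactic label matching can mislead.
# Stand-in metric and hypothetical candidate relation labels.

def lcs(a, b):
    """Length of the longest common subsequence of a and b."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def similarity(a, b):
    return 2 * lcs(a, b) / (len(a) + len(b))

candidates = ["employment-contract-type", "has-email", "works-for"]
best = max(candidates, key=lambda c: similarity("contact", c))
assert best == "employment-contract-type"
```

Every letter of "contact" appears, in order, inside "contract", so the syntactically driven score is high even though the meaning is wrong.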


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total | (43 iterations) | (40 iterations) | (16 iterations)
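The totals row of Table 2 follows from the per-row counts, which gives a quick consistency check of the table:

```python
# Consistency check for Table 2: column totals of user iterations.
rows = {  # iterations with user: (without LM, LM 1st iteration, LM 2nd iteration)
    0: (17, 18, 29),
    1: (16, 16, 16),
    2: (9, 9, 0),
    3: (3, 2, 0),
}
totals = tuple(sum(n * counts[col] for n, counts in rows.items())
               for col in range(3))
assert totals == (43, 40, 16)  # the "(N iterations)" totals row
assert all(sum(c[col] for c in rows.values()) == 45 for col in range(3))
```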




7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because, in this version of AquaLog, the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology-portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto




the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintage-year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "meal-course", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessert-course". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software-evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog

makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web, in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
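As an illustration, the path-length requirement can be checked mechanically. The following sketch (a hand-built toy graph with invented names, not AquaLog's actual data structures) represents a populated ontology as a directed labeled graph and verifies that two query terms are at most two relations apart:

```python
from collections import deque

# Toy graph in the spirit of the wine example discussed in the text:
# nodes are instances, edges are labeled relations (all names illustrative).
GRAPH = {
    "new-course7": [("has-food", "pizza"), ("has-drink", "marietta-zinfadel")],
    "pizza": [],
    "marietta-zinfadel": [],
}

def undirected(graph):
    """View the labeled digraph as undirected, since a query term may sit
    at either end of a relation."""
    adj = {}
    for src, edges in graph.items():
        adj.setdefault(src, set())
        for _label, dst in edges:
            adj[src].add(dst)
            adj.setdefault(dst, set()).add(src)
    return adj

def path_length(graph, a, b):
    """Number of relations on the shortest path between two terms (BFS),
    or None when no connecting path exists (no answer obtainable)."""
    adj = undirected(graph)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

# "pizza" and "marietta-zinfadel" are two relations apart via the mediating
# course instance, so the query satisfies the two-relation requirement.
print(path_length(GRAPH, "pizza", "marietta-zinfadel"))  # 2
```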

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 meal-course
  ((has-food fowl) (has-drink Riesling)))


Fig. 22. OWL example of the wine and food ontology.

The ontology does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure)
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%

AquaLog failures
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22.05%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures

The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". Another case is the basic generic query "what should we drink with a starter of seafood?": AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog is related to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.
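The compound-noun case above can be sketched as follows. This is a toy illustration, not AquaLog's actual RSS code: the modifier lexicon and the "unknown relation" placeholder are invented for the example.

```python
# Hypothetical sketch: expanding a compound-noun term into an extra triple.
KNOWN_MODIFIERS = {"french", "semantic web", "research"}  # assumed lexicon

def expand_compound(term):
    """If `term` starts with a known modifier, return (head, extra_triple),
    where the extra triple links head and modifier by an as-yet-unknown
    relation that the RSS must later resolve against the ontology."""
    words = term.lower().split()
    # try the longest possible modifier first ("semantic web" before "semantic")
    for cut in range(len(words) - 1, 0, -1):
        modifier = " ".join(words[:cut])
        if modifier in KNOWN_MODIFIERS:
            head = " ".join(words[cut:])
            return head, (head, "?unknown-relation", modifier)
    return term, None

head, extra = expand_compound("French researchers")
print(head, extra)  # researchers ('researchers', '?unknown-relation', 'french')
```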

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
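The query expansion needed for the "who collaborates with IBM?" case could look like the sketch below. The facts and relation names are invented examples, not drawn from any real knowledge base, and the function is an illustration of the idea rather than an AquaLog component.

```python
# Illustrative mini-KB: no direct person-organization relation exists,
# only person-grant and grant-organization, as in the example above.
FACTS = [
    ("alice", "holds-grant", "grant-42"),
    ("grant-42", "funded-by", "IBM"),
    ("bob", "holds-grant", "grant-7"),
    ("grant-7", "funded-by", "ACME"),
]

def bridge(candidates, target):
    """Link candidate subjects to `target` through one mediating term,
    returning the synthesized two-triple chains (the kind of dynamic
    triple creation a next-generation RSS would need)."""
    chains = []
    for s in candidates:
        for (x, r1, mid) in FACTS:
            if x != s:
                continue
            for (y, r2, o) in FACTS:
                if y == mid and o == target:
                    chains.append([(s, r1, mid), (mid, r2, target)])
    return chains

# "who collaborates with IBM?" resolved via the mediating "grant" term
print(bridge(["alice", "bob"], "IBM"))
```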

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology this may be represented as "Enrico Motta – has-job-title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with peter who has interests in the semantic web?" and "which academics work with peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
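The agreement heuristic just described can be sketched crudely as follows. The plurality tests are toy assumptions (an "ends with s" rule plus a few irregular verbs), standing in for a real morphological analyzer or POS tagger.

```python
def is_plural(noun):
    # crude toy test; a real system would consult a lexicon or tagger
    return noun.lower().endswith("s")

def is_third_singular(verb):
    # "has"/"is"/"does" handled explicitly; otherwise the English
    # third-person-singular -s ending is used as a rough signal
    v = verb.lower()
    return v in {"has", "is", "does"} or v.endswith("s")

def attach_clause(head_noun, nearest_noun, clause_verb):
    """Decide whether a relative clause attaches to the head noun or the
    nearest noun using number agreement; return None when truly ambiguous."""
    verb_singular = is_third_singular(clause_verb)
    candidates = [n for n, plural in ((head_noun, is_plural(head_noun)),
                                      (nearest_noun, is_plural(nearest_noun)))
                  if (not plural) == verb_singular]
    return candidates[0] if len(candidates) == 1 else None  # None: ask user

# "which academics work with peter who has ...": only "peter" agrees with "has"
print(attach_clause("academics", "peter", "has"))  # peter
# "which academic works with peter who has ...": both singular, ambiguous
print(attach_clause("academic", "peter", "has"))   # None
```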


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
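One way to realize such community-dependent mappings is a lookup keyed by both community and phrase. The entries below are the invented examples from the text, and the function name is hypothetical, not part of AquaLog.

```python
# Learned jargon mappings, scoped per user community so the same phrase
# can resolve to different ontology terms for different communities.
JARGON = {
    ("phd-student", "cool projects"): ["research area", "publications"],
    ("professor", "cool projects"): ["funding opportunity"],
}

def resolve_jargon(community, phrase):
    """Return the learned ontology terms for a jargon phrase, or None if the
    system must ask the user and store a new mapping for this community."""
    return JARGON.get((community, phrase.lower()))

print(resolve_jargon("phd-student", "cool projects"))  # ['research area', 'publications']
print(resolve_jargon("professor", "cool projects"))    # ['funding opportunity']
```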

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies. In any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them, and, using the semantics offered by the ontology, look for a path between them.
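Such a fallback could be sketched as follows. The stopword list and the function are illustrative assumptions for this example, not part of AquaLog.

```python
import re

# Fallback sketch: when no grammar matches, drop politeness boilerplate and
# keep the candidate content terms, forming a bare triple whose relation is
# left for an ontology path search to fill in.
STOPWORDS = {"could", "you", "please", "tell", "me", "the", "of",
             "what", "is", "a", "an"}

def fallback_triple(question):
    words = re.findall(r"[a-zA-Z]+", question.lower())
    terms = [w for w in words if w not in STOPWORDS]
    if len(terms) == 2:
        # relation unknown ("?"); the ontology supplies the path between terms
        return (terms[0], "?", terms[1])
    return None  # too few or too many content terms for this simple fallback

print(fallback_triple("Could you please tell me the capital of Italy?"))
# ('capital', '?', 'italy')
```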

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account, so if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE, because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases, this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").
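The kind of lexical expansion described above can be sketched as follows. This is a toy stand-in for the WordNet-based expansion both families of systems perform: the mini-lexicon is invented for illustration, and the real systems consult WordNet rather than a hand-written table.

```python
# Toy stand-in for WordNet-based lexical expansion: a term is expanded with
# synonyms and morphological variants before matching. The entries below are
# invented for illustration only.
SYNONYMS = {
    "person": {"human", "individual"},
    "work": {"works", "working", "worked"},   # morphological variants
}

# Asking points may stay ambiguous: "who" can denote a person or an organization.
ASKING_POINTS = {"who": {"person", "organization"}}

def expand(term: str) -> set:
    """Return the term plus its synonyms/variants; ambiguity is left unresolved."""
    term = term.lower()
    if term in ASKING_POINTS:
        return set(ASKING_POINTS[term])       # non-deterministic asking point
    return {term} | SYNONYMS.get(term, set())
```

Note that `expand("who")` deliberately returns both candidate types, mirroring the non-deterministic output described above.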

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web, but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, while systems such as START [31], REXTOR [32] and AnswerBus [54] aim to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".
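The two data models for the Guernsey example can be written out side by side. The values come from the discussion above; the encoding as a dict and a tuple is ours, purely for illustration.

```python
# The Guernsey example from Ref. [31], rendered in both data models.

# START's "object-property-value" view:
start_view = {"object": "Guernsey", "property": "languages", "value": "French"}

# AquaLog's relational view: a relation between a term and a location.
aqualog_triple = ("language", "are spoken", "Guernsey")

def relation_of(triple):
    """In AquaLog the middle element is a relation, not a property name."""
    term, relation, location = triple
    return relation
```

The contrast is visible in the middle slot: START stores a property name ("languages"), whereas AquaLog stores the relation lifted from the question itself ("are spoken").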

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases, additional information is needed. For example, with "how" questions the category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage. The query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".
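The homepage query is a case where two pieces of information must be combined: who has an interest in the Semantic Web, and what their homepages are. A minimal sketch over a toy triple store shows the join; the KB contents are invented, and the property names merely echo the examples in Appendix A.

```python
# Toy triple store; all subjects and URLs are invented for illustration.
KB = {
    ("researcher-1", "has-research-interest", "semantic-web-area"),
    ("researcher-1", "has-web-address", "http://example.org/researcher-1"),
    ("researcher-2", "has-research-interest", "hypermedia"),
    ("researcher-2", "has-web-address", "http://example.org/researcher-2"),
}

def homepages_with_interest(area: str) -> set:
    """Join two triples: who has interest in `area`, then each match's homepage."""
    interested = {s for (s, p, o) in KB if p == "has-research-interest" and o == area}
    return {o for (s, p, o) in KB if p == "has-web-address" and s in interested}
```

The answer to the example query falls out of the join of the two triple patterns, which is exactly the kind of composition that retrieval of whole documents cannot offer.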

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries, where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (Dot.Kom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. Dot.Kom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple^a

Who are the researchers in the semantic web research area | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉

What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

h-generic withwh-clause

Linguistic triple Ontology triple

hat researcherswho work in

〈researchers workcompendium〉

〈researcherhas-project-member

hypermedia〉


Appendix A (Continued)

Wh-patterns, their linguistic triples and ontology triples:

"What is the homepage of Peter who has interest in semantic web?"
  Linguistic triples: 〈which-is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which-is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

"Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which-is, projects, academics〉 〈which-is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: The 2nd Edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



describing academic life [1], with information about our personnel, projects, technologies, events, etc., which is automatically extracted from departmental databases and unstructured web pages. In the context of standard keyword-based search, this semantic markup makes it possible to ensure that standard search queries such as "Peter Scott home page KMi" actually return Dr. Peter Scott's home page as their first result, rather than some other resource (as is the case when using current non-semantic search engines on this particular query). Moreover, as pointed out above, we can also query this semantic markup directly. For instance, we can ask a query such as "which are the projects in KMi related to the semantic web area?" and, thanks to an inference engine able to reason about the semantic markup and draw inferences from axioms in the ontology, we can then get the correct answer.

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities [8,30,2,10,28], even if in the past decade it has somewhat gone out of fashion [3,27]. However, it is our view that the semantic web provides a new and potentially very important context in which results from this area of research can be applied. Moreover, interestingly from a research point of view, it provides a new 'twist' on the old issues associated with NLIDB research.

As pointed out in Ref. [24], the key limitation of NL interfaces to databases is that they presume the knowledge the system uses to answer the question is a structured knowledge base in a limited domain. However, an important advantage is that a knowledge-based question answering system can help with answering questions requiring situation-specific answers, where multiple pieces of information need to be combined; the answers are therefore inferred at run time, rather than reciting a pre-written paragraph of text [12].

Hence, in the first instance, the work on the AquaLog query answering system described in this paper is based on the premise that the semantic web will benefit from the availability of natural language query interfaces, which allow users to query semantic markup viewed as a structured knowledge base. Moreover, similarly to the approach we have adopted in the Magpie system, we believe that in the semantic web scenario it makes sense to provide query answering systems which are portable with respect to ontologies. In other words, just as in the case of tools such as Magpie, where the user is able to select an ontology (essentially a semantic viewpoint) and then browse the web through this semantic filter, our AquaLog system allows the user to choose an ontology and then ask queries with respect to the universe of discourse covered by the ontology.

Hence AquaLog, our natural language front-end for the semantic web, is a portable question-answering system which takes queries expressed in natural language and an ontology as input, and returns answers drawn from one or more knowledge bases (KBs), which instantiate the input ontology with domain-specific information. Thus, AquaLog is especially suitable as a front end to organizational semantic intranets, where an organizational ontology is used as the basis for semantic mark-up. Of relevance is the valuable work done in AquaLog on the syntactic and semantic analysis of the questions: AquaLog makes use of the GATE NLP platform, string metric algorithms, WordNet and novel ontology-based similarity services for relations and classes to make sense of user queries with respect to the target knowledge base. Also, AquaLog is coupled with a portable and contextualized learning mechanism to obtain domain-dependent knowledge by creating a lexicon. The learning component ensures that the performance of the system improves over time, in response to the particular community jargon of the end users.

Finally, although AquaLog has primarily been designed for use with semantic web languages, it makes use of a generic plug-in mechanism, which means it can be easily interfaced to different ontology servers and knowledge representation platforms.

The paper is organized as follows: in Section 2 we present an example of AquaLog in action. In Section 3 we describe the AquaLog architecture, and in Section 4 the Linguistic Component embedded in AquaLog; in Section 5, the novel Relation Similarity Service; in Section 6, the Learning Mechanism; in Section 7, the evaluation scenario, followed by discussion and directions for future work in Section 8. In Section 9 we summarize related work, and in Section 10 we give a summary. Finally, an appendix is attached with examples of NL queries and their equivalent triple representation.

2. AquaLog in action: illustrative example

First of all, at a coarse-grained level of abstraction, the AquaLog architecture can be characterized as a cascaded model in which a NL query gets translated by the Linguistic Component into a set of intermediate triple-based representations, which are referred to as the Query-Triples. Then the Relation Similarity Service (RSS) component takes these Query-Triples as input and further processes them to produce the ontology-compliant queries, called Onto-Triples, as shown in Fig. 1.

The data model is triple-based, namely it takes the form of 〈subject, predicate, object〉. There are two main reasons for adopting a triple-based data model. First of all, as pointed out by Katz et al. [31], although not all possible queries can be represented in the binary relational model, in practice these exceptions occur very infrequently. Secondly, RDF-based knowledge representation (KR) formalisms for the semantic web, such as RDF itself [47] or OWL [40], also subscribe to this binary relational model and express statements as 〈subject, predicate, object〉. Hence, it makes sense for a query system targeted at the semantic web to adopt a triple-based model that shares the same format as many millions of other triples on the Semantic Web.
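The triple-based data model can be sketched in a few lines of Java. The type name, record layout and the "?"-variable convention below are our own illustration, not AquaLog's actual API:

```java
// Illustrative sketch of the <subject, predicate, object> data model.
// Names and the '?' variable convention are ours, not AquaLog's API.
public class TripleDemo {

    record Triple(String subject, String predicate, String object) {
        // A triple is "non-ground" when any position holds a variable;
        // by our convention here, variables start with '?'.
        boolean isGround() {
            return !subject.startsWith("?") && !predicate.startsWith("?")
                    && !object.startsWith("?");
        }
        @Override public String toString() {
            return "<" + subject + ", " + predicate + ", " + object + ">";
        }
    }

    public static void main(String[] args) {
        // A conjunctive query as two triples; '?what-is' is a variable,
        // so the first triple is non-ground.
        Triple t1 = new Triple("?what-is", "has-web-address", "peter-scott");
        Triple t2 = new Triple("person", "has-research-interest", "semantic-web-area");
        System.out.println(t1 + " AND " + t2);
    }
}
```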

For example, in the context of the academic domain in our department, AquaLog is able to translate the question "what is the homepage of Peter who has an interest on the semantic web?" into the following ontology-compliant logical query: 〈what-is, has-web-address, peter-scott〉 and 〈person, has-research-interest, semantic-web-area〉, expressed as a conjunction of non-ground triples (i.e., triples containing variables).

In particular, consider the query "what is the homepage of Peter?": the role of the RSS is to map the intermediate form


Fig. 1. The AquaLog data model.

〈what is, homepage, peter〉 obtained by the linguistic component into the target ontology-compliant query.

The RSS calls on the user to participate in the QA process if no information is available to AquaLog to disambiguate the query directly. Let us consider the query in Fig. 2. On the left screen, we are looking for the homepage of Peter. By using string metrics, the system is unable to disambiguate between Peter-Scott, Peter-Sharpe, Peter-Whalley, etc. Therefore, user feedback is required. Moreover, on the right screen, user feedback is required to disambiguate the term "homepage" (is it the same as "has-web-address"?). As this is the first time the system has come across the term, no synonyms have been identified in WordNet that relate both labels together, and the ontology does not provide further ways to disambiguate. We ask the system to learn the user's vocabulary and context for future occasions.

Fig. 2. Illustrative example of user interactivity to disambiguate a basic query.

In Fig. 3 we are asking for the web address of Peter, who has an interest in semantic web. In this case AquaLog does not need any assistance from the user, given that by analyzing the ontology only one of the "Peters" has an interest in Semantic Web, and only one possible ontology relation, "has-research-interest" (taking into account taxonomy inheritance), exists between "person" and the concept "research area", of which "semantic web" is an instance. Also, the similarity relation between "homepage" and "has-web-address" has been learned by the Learning Mechanism in the context of that query's arguments, so the performance of the system improves over time. When the RSS comes across a similar query, it has to access the ontology information to recreate the context and complete the ontology triples. In that way, it realizes that the second part of the query, "who has an interest on the Semantic Web", is a modifier of the term "Peter".
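The string-metric matching step can be illustrated with a small standalone sketch. AquaLog uses string metric algorithms from the literature [13]; the plain Levenshtein distance below is only a stand-in for whatever metrics the system actually applies, and the candidate instance names are the ones from the example above:

```java
import java.util.Comparator;
import java.util.List;

// Illustrative only: rank candidate ontology instances for the query term
// "peter" by edit distance. Plain Levenshtein is a stand-in metric, not
// necessarily one of the metrics AquaLog itself uses [13].
public class NameMatchDemo {

    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        List<String> candidates =
                List.of("peter-scott", "peter-sharpe", "peter-whalley", "john-domingue");
        // The three "Peter" instances all score closely, which is exactly the
        // situation where AquaLog has to fall back on user feedback.
        candidates.stream()
                .sorted(Comparator.comparingInt(c -> levenshtein("peter", c)))
                .forEach(c -> System.out.println(c + " -> " + levenshtein("peter", c)));
    }
}
```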

3. AquaLog architecture

AquaLog is implemented in Java as a modular web application using a client–server architecture. Moreover, AquaLog provides an API which allows future integration in other platforms and independent use of its components. Fig. 4 shows


Fig. 3. Illustrative example of AquaLog disambiguation.


Fig. 4. The AquaLog global architecture.

the different components of the AquaLog architectural solution. As a result of this design, our existing version of AquaLog is modular, flexible and scalable.1

The Linguistic Component and the Relation Similarity Service (RSS) are the two central components of AquaLog, both of them portable and independent in nature. Currently, AquaLog automatically classifies the query following a linguistic criterion. The linguistic coverage can be extended through the use of regular expressions, by augmenting the patterns covered by an

1 AquaLog is available for developers as Open Source, licensed under the Apache License, Version 2.0: https://sourceforge.net/projects/aqualog


existing classification or creating a new one. This functionality is discussed in Section 4.

A key feature of AquaLog is the use of a plug-in mechanism, which allows AquaLog to be configured for different Knowledge Representation (KR) languages. Currently we subscribe to the Operational Conceptual Modeling Language (OCML), using our own OCML-based KR infrastructure [52]. However, in future our aim is also to provide direct plug-in mechanisms for the emerging RDF and OWL servers.2

2 We are able to import and export RDF(S) and OWL from OCML, so the lack of an explicit RDF/OWL plug-in is not a problem in practice.


When developing AquaLog we extended the set of annotations returned by GATE by identifying noun terms, relations,4 question indicators (which/who/when, etc.) and patterns or types of questions.


To reduce the number of calls/requests to the target knowledge base, and to guarantee real-time question answering even when multiple users access the server simultaneously, the AquaLog server accesses and caches basic indexing data from the target Knowledge Bases (KBs) at initialization time. The cached information in the server can be efficiently accessed by the remote clients. In our research department, KMi, since AquaLog is also integrated into the KMi Semantic Portal [35], the KB is dynamically and constantly changing, with new information obtained from KMi websites through different agents. Therefore, a mechanism is provided to update the cached indexing data on the AquaLog server. This mechanism is called by these agents when they update the KB.

As regards portability, the only entry points which require human intervention for adapting AquaLog to a new domain are the configuration files. Through the configuration files it is possible to specify the parameters needed to initialize the AquaLog server. The most important parameters are the ontology name and server login details, if necessary the name of the plug-in, and the slots that correspond to pretty names. Pretty names are alternative names that an instance may have. Optionally, the main concepts of interest in an ontology can be specified.

The other ontology-dependent parameters required in the configuration files are the ones which specify that the terms "who", "where" and "when" correspond, for example, to the ontology terms "person/organization", "location" and "time-position", respectively. The independence of the application from the ontology is guaranteed through these parameters.
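Putting the above together, an AquaLog configuration could look roughly like the following sketch. Every key and value here is hypothetical, invented for illustration; the paper does not reproduce the actual configuration file syntax:

```properties
# Hypothetical sketch of an AquaLog configuration file -- all keys invented
ontology.name     = akt-reference-ontology
server.login      = kmi-user
server.password   = secret
plugin.name       = ocml-plugin

# slots holding "pretty names" (alternative names an instance may have)
pretty-name.slots = has-pretty-name, has-variant-name

# optional: main concepts of interest in the ontology
main.concepts     = person, project, publication

# ontology-dependent mappings for wh-terms
wh.who            = person, organization
wh.where          = location
wh.when           = time-position
```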

There is no single strategy by which a query can be interpreted, because this process is highly dependent on the type of the query. So, in the next sections we will not attempt to give a comprehensive overview of all the possible strategies; rather, we will show a few examples which are illustrative of the way the different AquaLog components work together.

3.1. AquaLog triple approach

Core to the overall architecture is the triple-based data representation approach. A major challenge in the development of the current version of AquaLog is to efficiently deal with complex queries in which there could be more than one or two terms. These terms may take the form of modifiers that change the meaning of other syntactic constituents, and they can be mapped to instances, classes, values or combinations of them, in compliance with the ontology to which they subscribe. Moreover, the wider expressiveness adds another layer of complexity, mainly due to the ambiguous nature of human language. So, naturally, the question arises of how far the engineering functionality of the triple-based model can be extended to map a triple obtained through a NL query into a triple that can be realized by a given ontology.

To accomplish this task, the triple in the current AquaLog version has been slightly extended. Therefore, an existing linguistic triple now consists of one, two or even three terms connected through a relationship. A query can be translated into one or more linguistic triples, and then each linguistic triple can be translated into one or more ontology-compliant triples. Each triple


also has additional lexical features in order to facilitate reasoning about the answer, such as the voice and tense of the relation. Another key feature of each triple is its category (see an example table of query types in Appendix A in Section 11). These categories identify different basic structures of the NL query and their equivalent representation in the triple. Depending on the category, the triple tells us how to deal with its elements, what inference process is required and what kind of answer can be expected. For instance, different queries may be represented by triples of the same category, since in natural language there can be different ways of asking the same question, e.g., "who works in akt?"3 and "Show me all researchers involved in the akt project". The classification of the triple may be modified during its life cycle, in compliance with the target ontology it subscribes to.
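The extended linguistic triple described above can be pictured with a small sketch; the type and field names are ours, invented for illustration, since the paper does not show AquaLog's real classes:

```java
// Illustrative sketch of an extended linguistic triple: up to three terms
// connected through a relation, plus lexical features (voice, tense) and a
// category recording the basic question structure. Names are ours, not
// AquaLog's actual classes.
public class ExtendedTripleDemo {

    enum Category { BASIC, BASIC_WITH_CLAUSE, COMBINATION }

    record LinguisticTriple(String[] terms, String relation,
                            String voice, String tense, Category category) { }

    public static void main(String[] args) {
        // "who works in akt" -- a basic query; a rephrasing such as "show me
        // all researchers involved in the akt project" can share its category.
        LinguisticTriple t = new LinguisticTriple(
                new String[] {"person", "akt"}, "works",
                "active", "present", Category.BASIC);
        System.out.println(t.category() + ": " + t.relation());
    }
}
```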

In what follows, we provide an overview of each component of the AquaLog architecture.

4. Linguistic component: from questions to query triples

When a query is asked, the Linguistic Component's task is to translate it from NL to the triple format used to query the ontology (Query-Triples). This preprocessing step helps towards the accurate classification of the query and its components by using standard research tools and query classification. Classification identifies the type of question and hence the kind of answer required. In particular, AquaLog uses the GATE [15,49] infrastructure and resources (language resources, processing resources like ANNIE, serial controllers, pipelines, etc.) as part of the Linguistic Component. Communication between AquaLog and GATE takes place through the standard GATE API.

After the execution of the GATE controller, a set of syntactic annotations associated with the input query is returned. In particular, AquaLog makes use of the GATE processing resources for English: tokenizer, sentence splitter, POS tagger and VP chunker. The annotations returned after the sequential execution of these resources include information about sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs, and categories for the nouns, such as determinant, singular/plural, conjunction, possessive, determiner, preposition, existential, wh-determiner, etc. The features returned for each annotation are important for creating the triples; for instance, a verb annotation, together with the voice and tense, also includes information that indicates whether it is the main verb of the query (the one that separates the nominal group and the predicate). This information is used by the Linguistic Component to establish the proper attachment of prepositional phrases prior to the creation of the Query-Triples (an example is shown in Section 5.3).

3 AKT is an EPSRC funded project in which the Open University is one of the partners: http://www.aktors.org/akt/

4 Hence, for example, we exploited the fact that natural language commonly uses a preposition to express a relationship. For instance, in the query "who has


Fig. 5. Example of GATE annotations and linguistic triples for basic queries.


Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.

This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular-expression-based rule language offered by GATE5 (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars' power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.6 Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.

Currently, the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types or intermediate representations (examples can be seen in

research interest in semantic services in KMi", the relation of the query is a combination of the verb "has" and the noun "research interest".
5 http://gate.ac.uk/sale/tao/index.html
6 LC binaries and JAPE grammars can be downloaded from http://kmi.open.ac.uk/technologies/aqualog


Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question like "are there any phd students in dotkom?", where the relation is implicit or unknown, or "which is the job title of John?", where no information about the type of the expected answer is provided, etc.

Categories not only tell us the kind of solution that needs to be achieved, but they also give an indication of the most likely common problems that the Linguistic Component and Relation Similarity Service will need to deal with to understand this particular NL query; in consequence, this guides the process of creating the equivalent intermediate representation or Query-Triple. For instance, in the previous example, "what are the research areas covered by the akt project?", thanks to the category the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them, and generates the following Query-Triple: 〈research areas, covered, akt project〉, in which "research areas" has correctly been identified as the query term (instead of "what").
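At this stage the triple holds nothing but strings. A minimal sketch of such a structure (illustrative Python only; AquaLog's actual Java classes are not shown in this paper) might be:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriple:
    """Intermediate representation: plain strings, no ontology mapping yet."""
    term1: str               # generic/query term, e.g. "research areas"
    relation: Optional[str]  # None when the relation is implicit or unknown
    term2: str               # second term, e.g. "akt project"
    category: str = "wh-generic"  # question type assigned by the classifier

# "what are the research areas covered by the akt project?"
triple = QueryTriple("research areas", "covered", "akt project")
```

The field and category names are invented for illustration; the point is only that every slot is an unresolved string until the RSS maps it into the ontology.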

For the intermediate representation we use the triple-based data model rather than logic, mainly because at this stage we do not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate input for the RSS. An alternative would be to use more complex linguistic parsing but, as discussed by Katz et al. [33,29], although parse trees (as, for example, the NLP parser from Stanford [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We deal with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that at this stage all the terms are still strings, or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries
There are several different types of basic queries. We can mention the affirmative/negative query type, which are those queries that require an affirmation or negation as an answer, i.e. "is John Domingue a phd student?", "is Motta working in ibm?". Another big set of queries are those constituted by a "wh-question", such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also, imperative commands like list, give, tell, name, etc. are treated as wh-queries (see an example table of query types in Appendix A).
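The surface distinction just described can be conveyed with a toy classifier (an illustrative approximation only; in AquaLog the classification is performed by the JAPE grammars, and the keyword lists below are deliberately partial):

```python
# Illustrative sketch of the basic question-type distinction; the real
# classification is done by JAPE grammars over GATE annotations, not by
# string prefixes, and these keyword lists are incomplete stand-ins.
WH_STARTS = ("what", "who", "when", "where", "which", "how many",
             "are there any", "does anybody", "does anyone")
IMPERATIVES = ("list", "give", "tell", "name")

def basic_query_type(question: str) -> str:
    q = question.lower().rstrip("?").strip()
    if q.startswith(WH_STARTS) or q.startswith(IMPERATIVES):
        return "wh-query"          # imperatives are treated as wh-queries
    if q.startswith(("is ", "are ", "was ", "were ")):
        return "affirmative/negative"
    return "unclassified"

print(basic_query_type("is John Domingue a phd student?"))  # affirmative/negative
print(basic_query_type("list all the projects in KMi"))     # wh-query
```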


In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in "who are the academics involved in the semantic web?" the generic triple will be of the form 〈generic term, relation, second term〉, in our example 〈academics, involved, semantic web〉. A query with an equivalent triple representation is "which technologies has KMi produced?", where the triple will be 〈technologies, has produced, KMi〉. However, a query like "are there any Phd students in akt?" has another equivalent representation, where the relation is implicit or unknown: 〈phd students, akt〉. Other queries may provide little or no information about the type of the expected answer, e.g. "what is the job title of John?", or they can be just a generic enquiry about someone or something, e.g. "who is Vanessa?", "what is an ontology?" (see Fig. 5).

4.1.2. Linguistic procedure for basic queries with clauses (three-term queries)
Let us consider the request "List all the projects in the knowledge media institute about the semantic web", where both "in knowledge media institute" and "about semantic web" are modifiers (i.e., they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, through the use of the ontology or in an interactive way by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in "are there any planet news written by researchers from akt?" we get the terms "planet news", "researchers" and "akt" (see Fig. 6). The resultant Query-Triple is the input for the Relation Similarity Service, which processes the basic queries with clauses as explained in Section 5.2.


4.1.3. Linguistic procedure for combination of queries
A query can be a composition of two explicit relationships between terms, or a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt?" the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web?". This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned to a second query, as in "which researchers wrote publications related to social aspects?", where the second clause modifies one of the previous terms. In this example, and at this stage, ambiguity cannot be solved by linguistic procedures; therefore, the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. "what is the web address of Peter, who works for akt?" or "which are the projects led by Enrico which are related to multimedia technologies?" (see Fig. 7).
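The first of these combination mechanisms, the "and"/"or" conjunction, amounts to a set operation over the answer lists of the two Query-Triples. A toy sketch (the project names are invented for illustration):

```python
# Sketch of the and/or combination: each Query-Triple resolves to a list of
# answers, and the conjunction operator decides how the lists are merged.
def combine(answers1, answers2, operator):
    if operator == "and":
        return sorted(set(answers1) & set(answers2))
    if operator == "or":
        return sorted(set(answers1) | set(answers2))
    raise ValueError(operator)

# "which projects are funded by epsrc and are about semantic web?"
funded_by_epsrc = ["akt", "magpie", "dip"]   # illustrative answer lists
about_sw = ["akt", "dip", "xmed"]
print(combine(funded_by_epsrc, about_sw, "and"))  # ['akt', 'dip']
```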

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do", "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in mapping the Query-Triples into the ontology-compliant triples (Onto-Triples). The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.
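The auxiliary-word filtering can be pictured as a simple stop-list filter (an illustrative sketch; the word list is the one given above):

```python
# Sketch of the auxiliary-word filtering applied to Query-Triple terms.
# The stop list below is the one quoted in the text; the function itself
# is an illustrative stand-in, not AquaLog's implementation.
AUXILIARY = {"in", "about", "of", "at", "for", "by", "the", "does", "do", "did"}

def strip_auxiliaries(term: str) -> str:
    return " ".join(w for w in term.split() if w.lower() not in AUXILIARY)

print(strip_auxiliaries("the web address of Peter"))  # web address Peter
```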

4.2. The use of JAPE grammars in AquaLog: illustrative example
Consider the basic query in Fig. 5, "what are the research areas covered by the akt project?", as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "researchers" and "akt projects", the relations "are" and "covered by", and query terms such as "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e., before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs, adjectives, nouns or coordinating conjunctions, in any order (identified by the (((RB)ADJ)|NOUN|CC) part of the pattern).

A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as for example the macro LIST used by the rule RuleQU1 or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g. the last pattern identifies words assigned to the VB POS tags.

Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags are assigned in the "category" feature of the "Token" annotation, used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side; e.g. rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QU-Which IS-ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QU-Which, IS-ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
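To convey the flavour of such a classification pattern, the following loose textual analogue uses ordinary regular expressions. Note that this is only a sketch: JAPE matches over annotation sequences produced by earlier phases, not over raw text, and the RELATION alternatives below are invented stand-ins for the macro's real contents.

```python
import re

# Loose textual analogue of the macros used by the classification grammar.
# These alternations are illustrative; the actual JAPE macros match
# annotations (POS tags, earlier rule results), not literal words.
QU_WHICH = r"(what|which)"
IS_ARE   = r"(is|are)"
TERM     = r"[\w ]+?"
RELATION = r"(covered by|produced|working in)"

PATTERN = re.compile(
    rf"^{QU_WHICH} {IS_ARE} (the )?(?P<term1>{TERM}) "
    rf"(?P<rel>{RELATION}) (the )?(?P<term2>[\w ]+)\?$", re.I)

m = PATTERN.match("what are the research areas covered by the akt project?")
print(m.group("term1"), "|", m.group("rel"), "|", m.group("term2"))
# research areas | covered by | akt project
```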

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 6.


An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user term. Moreover, relation names can be considered "second class citizens". The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: labels in the case of concepts often capture the meaning of the semantic entity, while for relations the meaning is given by the type of their domain and range, rather than by their name (typically vaguely defined, e.g. "related to"), with the exception of the cases in which the relations are represented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. the query "is Enrico working in ibm?", where "Enrico" is mapped into "Enrico-motta" in the KB but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, they are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let's consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic component as a basic generic-type one (a wh-query represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. Then the first step for the RSS is to identify that "research areas" is actually a "research area" in the target KB and "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym of "covered" is "addresses", and it therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, let's consider a case in which we are talking about a generic term "person": the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".
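The search for candidate relations over the taxonomy can be sketched as follows. The ontology below is a toy: the relation "addresses-generic-area-of-interest" and "uses-resource" are taken from the example above, but the other class and relation names are invented for illustration.

```python
# Toy ontology sketch (not the real AKT reference ontology): relations are
# declared with a domain and a range, and candidates must link a subclass of
# the generic term to a superclass of the second term's class.
SUBCLASSES = {"research-area": ["research-area"],
              "person": ["person", "researcher", "student", "academic"]}
SUPERCLASSES = {"project": ["project", "organization-unit"]}
RELATIONS = [("addresses-generic-area-of-interest", "project", "research-area"),
             ("uses-resource", "project", "research-area"),
             ("has-contact-person", "project", "person")]

def candidate_relations(term1_class, term2_class):
    doms = SUPERCLASSES.get(term2_class, [term2_class])
    rngs = SUBCLASSES.get(term1_class, [term1_class])
    return [name for name, dom, rng in RELATIONS if dom in doms and rng in rngs]

print(candidate_relations("research-area", "project"))
# ['addresses-generic-area-of-interest', 'uses-resource']
```

With several candidates left, synonym lookup (e.g. "covered" ≈ "addresses") or user feedback would pick the final relation, as described above.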

Whenever multiple relations are possible candidates for interpreting the query, and the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

Fig. 12. Example of AquaLog in action for relations formed by a concept.

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation among all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore, the translation process from lexical to ontological triples can go ahead without misinterpretations.


Note that some additional lexical features which help in the translation, or put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while e.g. the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)
When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the previous examples), refers to.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g. in this case the class "project"). Hence, the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of the list of projects related to the semantic web and the list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query it is first necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by epsrc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".
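The attachment heuristic can be sketched as a scan for the closest non-ground term. This is an illustrative simplification: AquaLog checks term status against the ontology, which is abstracted here as a lookup table passed in by the caller.

```python
# Sketch of the attachment heuristic: attach the modifier to the closest
# preceding term that maps to a class (non-ground term) rather than an
# instance. The is_class table stands in for a real ontology lookup.
def attach_modifier(terms, is_class, modifier):
    """terms: triple terms in sentence order; is_class: which are non-ground."""
    for term in reversed(terms):            # closest term first
        if is_class[term]:
            return (term, modifier)
    return (terms[0], modifier)             # fallback

# "are there any projects on the semantic web sponsored by epsrc?"
print(attach_modifier(["projects", "semantic web"],
                      {"projects": True, "semantic web": False},
                      "sponsored by epsrc"))   # ('projects', 'sponsored by epsrc')

# "which planet news stories are written by researchers in akt?"
print(attach_modifier(["planet news stories", "researchers"],
                      {"planet news stories": True, "researchers": True},
                      "in akt"))               # ('researchers', 'in akt')
```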

5.3. RSS procedure for combination of queries
Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of queries, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project" and therefore is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a "concept" in the ontology which is a subclass of "person", and therefore the second clause "who have an interest in knowledge reuse" is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter, who has an interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person",

the sentence is truly ambiguous, even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause "sponsored by epsrc" is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" and the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
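The role of the main verb can be sketched as follows. This is a deliberately naive illustration: identifying the main verb itself is assumed to have been done by the linguistic analysis, and the split-on-substring trick merely mimics the nominal-group/predicate division described above.

```python
# Sketch: the clause after the main verb forms the predicate. If the trailing
# modifier lies inside the predicate, it attaches to a term of the predicate;
# otherwise it attaches to the subject (nominal group).
def attachment_for(question, main_verb, modifier):
    predicate = question.split(main_verb, 1)[1]
    return "predicate" if modifier in predicate else "subject"

q1 = "which academics work in the akt project sponsored by epsrc"
q2 = "which academics working in the akt project are sponsored by epsrc"
print(attachment_for(q1, " work ", "sponsored by epsrc"))          # predicate
print(attachment_for(q2, " are sponsored ", "sponsored by epsrc")) # subject
```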

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on String Distance Metrics for Name-Matching Tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g. "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in the use of nominal compounds (e.g. WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually, or it can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies (e.g. an ontology can have slots such as "variant name" or "pretty name"). AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
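The thresholded string matching can be sketched with the standard library's similarity ratio as a stand-in for the Jaro-based metrics of Ref. [13]. The threshold values below are invented for illustration and are not AquaLog's actual settings.

```python
from difflib import SequenceMatcher

# Stand-in for the Jaro-based metrics of Ref. [13], using stdlib difflib.
# Per-task thresholds are illustrative only.
THRESHOLDS = {"concept": 0.7, "relation": 0.6, "instance": 0.8}

def similar(user_term: str, onto_term: str, task: str) -> bool:
    score = SequenceMatcher(None, user_term.lower(), onto_term.lower()).ratio()
    return score >= THRESHOLDS[task]

print(similar("research areas", "research-area", "concept"))  # True
print(similar("researcher", "academic", "concept"))           # False
```

The second call illustrates exactly the failure mode discussed above: "researcher" and "academic" are near-synonyms with dissimilar labels, so string metrics alone cannot map them, and synonym resources or the learning mechanism have to step in.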

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS It is invokedhen the Onto-Triple is completed It contains the methodshich take as an input the Onto-Triple and infer the required

nswer to the userrsquos queries In order to provide a coherentnswer the category of each triple tells the answer engine notnly about how the triple must be resolved (or what answer toxpect) but also how triples can be linked with each other Fornstance AquaLog provides three mechanisms (depending onhe triple categories) for operationally integrating the triplersquosnformation to generate an answer These mechanisms are

1) Andor linking eg in the query ldquowho has an interest inontologies or in knowledge reuserdquo the result will be afusion of the instances of people who have an interest inontologies and the people who are interested in semanticweb

(2) Conditional link to a term, for example in "which KMi academics work in the akt project sponsored by eprsc?" the second triple 〈akt project, sponsored, eprsc〉 must be resolved and the instance representing the "akt project sponsored by eprsc" identified to get the list of academics

86 V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105



Fig. 16. Example of jargon mapping through user-interaction.

required for the first triple 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple, for instance in "What is the homepage of the researchers working on the semantic web?" the second triple 〈researchers, working, semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈?, homepage, researchers〉.
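The three integration mechanisms can be sketched as set operations over resolved triples. The toy knowledge base, the `resolve` helper and all data below are our own illustrative assumptions, not AquaLog's actual implementation, which resolves Onto-Triples against an ontology server:

```python
# Illustrative sketch of the answer engine's triple-linking mechanisms.
# The toy KB and all names here are hypothetical stand-ins.

# Toy KB: (relation, object) -> set of subjects satisfying the triple.
KB = {
    ("has-interest", "ontologies"): {"alice", "bob"},
    ("has-interest", "knowledge-reuse"): {"bob", "carol"},
    ("sponsored-by", "eprsc"): {"akt"},
    ("works-in", "akt"): {"alice", "dan"},
}

def resolve(relation, obj):
    """Resolve a single Onto-Triple, returning the matching subjects."""
    return KB.get((relation, obj), set())

def or_link(triple1, triple2):
    """(1) And/or linking: fuse the answers of the two triples."""
    return resolve(*triple1) | resolve(*triple2)

def conditional_link(relation1, triple2):
    """(2)/(3) Conditional link: resolve the second triple first, then use
    each resulting instance to complete and resolve the first triple."""
    answers = set()
    for instance in resolve(*triple2):
        answers |= resolve(relation1, instance)
    return answers

# "who has an interest in ontologies or in knowledge reuse"
print(sorted(or_link(("has-interest", "ontologies"),
                     ("has-interest", "knowledge-reuse"))))
# -> ['alice', 'bob', 'carol']

# "which KMi academics work in the akt project sponsored by eprsc"
print(sorted(conditional_link("works-in", ("sponsored-by", "eprsc"))))
# -> ['alice', 'dan']
```

Mechanism (3) follows the same pattern as (2), except that the instances feeding the first triple come from resolving a whole second triple rather than a single term.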

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language-universe and the reduced ontology-language-universe.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case, the training examples are given by the user-refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.



Fig. 17. The learning mechanism architecture.

In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context-definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation-name, the database is scanned for some matching results. Subsequently these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); if instead the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.
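The control flow just described, try a learned substitution first, fall back to user disambiguation, then learn, can be sketched as follows. The dictionary store, the simplified `is_consistent` test and the callback are our own illustrative assumptions; the real LM queries a database and uses the CheckContext method:

```python
# Sketch of the LM matching/learning control flow (all names are ours,
# not AquaLog's actual implementation).

LEARNED = {}  # (word, ontology) -> (ontology_relation, stored_context)

def is_consistent(stored_context, context):
    # Stand-in for the CheckContext test (are the question's arguments
    # subclasses of the stored, generalized context?); simplified to equality.
    return stored_context == context

def match(word, context, ontology, ask_user):
    entry = LEARNED.get((word, ontology))
    if entry is not None:
        relation, stored_context = entry
        if is_consistent(stored_context, context):
            # Valid substitution: the inference engine can run directly.
            return relation
    # Matching failed: the user disambiguates, and the choice is learned
    # together with the context for future reuse.
    relation = ask_user(word)
    LEARNED[(word, ontology)] = (relation, context)
    return relation
```

With this sketch, the first question containing an unknown relation word triggers a user interaction, while a repeat question with a consistent context is resolved automatically from the learned lexicon.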

Of course, the matching method's movement through the ontology is the opposite of the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,


these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field in order to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
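The record enumerated above can be summarized as a small schema. The field names below are our own illustrative choices, not AquaLog's actual database columns:

```python
# The fields recorded for each learned mapping, as listed above.
# Field names are illustrative, not AquaLog's actual schema.
from dataclasses import dataclass

@dataclass(frozen=True)
class LearnedMapping:
    word: str               # key field for accessing the lexicon when matching
    ontology_mapping: str   # e.g. "has-affiliation-to-unit"
    context_arg1: str       # first argument of the triple, e.g. "person"
    context_arg2: str       # second argument, e.g. "knowledge media institute"
    ontology_name: str      # the referring (pluggable) ontology
    user: str               # user details, kept for the version space

entry = LearnedMapping("collaborates", "has-affiliation-to-unit",
                       "person", "knowledge media institute",
                       "akt-ontology", "generic-user")
```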

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User's disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing, since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material: having no constraints on the user field,


Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of users' profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.


Fig. 20. Architecture to support a users' community.


7. Evaluation

AquaLog allows users who have a question in mind and know something about the domain to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability


across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application built using AquaLog with the AKT ontology10 and the KMi knowledge base satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal.11 Therefore, the knowledge base is dynamically populated and constantly changing, and as a result of this a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries (40 of 69): 58%; from this, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures
  Total number of failures (same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.
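The derived figures above can be re-checked from the raw counts in Table 1; the reported numbers are consistent with truncating (rather than rounding) the percentages to two decimals, an assumption we make explicit below:

```python
import math

# Re-deriving the headline figures of the evaluation from the raw counts
# (69 questions collected). Percentages appear truncated to two decimals.

def pct(part, whole):
    return math.floor(part / whole * 10000) / 100

total = 69
successful = 40          # queries handled correctly
mapped = 33              # successful queries without conceptual failure
unpopulated = 10         # of those 33, no data in the ontology

print(pct(successful, total))            # 57.97, reported as 58%
print(pct(mapped, total))                # 47.82
print(pct(unpopulated, mapped))          # 30.3
print(pct(mapped - unpopulated, total))  # 33.33: queries with a non-empty answer
```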

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉; or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a future line of work for future versions of the system.


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)     (40 iterations)                     (16 iterations)


7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after being easily reformulated; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (total 45) is also quite small. Logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
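The iteration totals at the bottom of Table 2 can be cross-checked from the per-row query counts reported there (the labels below are ours):

```python
# Cross-checking the totals in Table 2: each column's total number of user
# iterations is sum(iterations * queries) over the four rows, and each
# column accounts for all 45 queries.

iterations = [0, 1, 2, 3]  # iterations with the user, per row
columns = {
    "without LM":          [17, 16, 9, 3],
    "LM first iteration":  [18, 16, 9, 2],
    "LM second iteration": [29, 16, 0, 0],
}

for name, counts in columns.items():
    assert sum(counts) == 45  # every column covers the full query set
    total = sum(i * n for i, n in zip(iterations, counts))
    print(name, total)        # 43, 40 and 16 iterations, respectively
```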

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12 so that the AquaLog WebOnto plugin could be used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location, and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto




A quick familiarization with the ontology allowed us, in the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack" and "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software-evaluation approach in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

This "light-weight" use of semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (with triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
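The two-relation requirement can be illustrated with a minimal sketch (Python; the toy KB and the function are ours for illustration, not AquaLog's implementation):

```python
from collections import deque

def short_path_exists(triples, source, target, max_hops=2):
    """True if target is reachable from source in at most max_hops
    relations, treating the KB as an undirected labeled graph."""
    adj = {}
    for s, _, o in triples:                 # (subject, relation, object)
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    frontier, seen = deque([(source, 0)]), {source}
    while frontier:
        node, hops = frontier.popleft()
        if node == target:
            return True
        if hops < max_hops:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, hops + 1))
    return False

# Toy KB in the spirit of the wine ontology (names illustrative):
kb = [("new-course7", "has-food", "pizza"),
      ("new-course7", "has-drink", "marietta-zinfandel")]
print(short_path_exists(kb, "pizza", "marietta-zinfandel"))  # True: two relations
print(short_path_exists(kb, "pizza", "marietta-zinfandel", max_hops=1))  # False
```

AquaLog's actual mapping also matches relation labels and types; the sketch only checks reachability within two relations.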

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but it is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instance definitions). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instance definitions in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfandel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink riesling)))
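Rendered as plain (subject, relation, object) triples, the instances of Table 3 provide exactly the two-triple path AquaLog needs; a minimal sketch of the resulting join (Python; names follow Table 3):

```python
# The instances of Table 3 as (subject, relation, object) triples.
triples = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfandel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def wines_for(food):
    """Join has-food and has-drink through the mediating course instance."""
    courses = {s for s, r, o in triples if r == "has-food" and o == food}
    return sorted(o for s, r, o in triples if r == "has-drink" and s in courses)

print(wines_for("pizza"))  # ['marietta-zinfandel']
print(wines_for("fowl"))   # ['riesling']
```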



Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected 68 questions in total. As the wine ontology is an example for the OWL language, it does not present complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (failure to complete the Onto-Triple is due to an ontology conceptual failure):
  Total number of successful queries                      (47 of 68)  69.11%
  Total number of successful queries without
    conceptual failure                                    (12 of 68)  17.64%
  Total number of queries with only conceptual failure    (35 of 68)  51.47%

AquaLog failures:
  Total number of failures (the same query can fail
    for more than one reason)                             (21 of 68)  30.88%
  RSS failure                                             (15 of 68)  22.05%
  Service failure                                          (0 of 68)   0%
  Data model failure                                       (0 of 68)   0%
  Linguistic failure                                       (9 of 68)  13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated             (14 of 21)  66.66% of failures
  With conceptual failure                                  (9 of 14)  64.28%

Bold values represent the final results (%).
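The percentages in Table 4 follow from the raw counts; a quick check (Python; we assume the paper's convention of truncating, rather than rounding, to two decimals):

```python
def pct(n, total=68):
    # Truncate (not round) to two decimals, matching figures such as 47/68 -> 69.11.
    return int(n / total * 10000) / 100

print(pct(47), pct(12), pct(35))   # 69.11 17.64 51.47 (successes)
print(pct(21), pct(15), pct(9))    # 30.88 22.05 13.23 (failures)
print(pct(14, 21), pct(9, 14))     # 66.66 64.28 (reformulated)
```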


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS can generate a correct ontology mapping.


That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course". For instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
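The reclassification proposed here can be sketched as a schema-level search for a mediating concept (Python; the mini-schema and function are ours for illustration, not AquaLog code):

```python
# Schema-level relations of a toy wine ontology: (domain, relation, range).
schema = [
    ("meal", "course", "mealcourse"),
    ("mealcourse", "has-food", "food"),
    ("mealcourse", "has-drink", "wine"),
]

def expand_via_mediator(cls_a, cls_b, schema):
    """If no direct relation links cls_a and cls_b, look for a mediating
    concept m with one relation to each, and return the two-triple rewrite."""
    links = {}
    for d, r, rng in schema:
        links.setdefault(d, []).append((r, rng))
        links.setdefault(rng, []).append((r, d))  # relations traversable both ways
    for r1, m in links.get(cls_a, []):
        if m == cls_b:
            return [(cls_a, r1, cls_b)]           # a direct relation exists
        for r2, b in links.get(m, []):
            if b == cls_b:
                return [(cls_a, r1, m), (m, r2, cls_b)]
    return None

# "which wines should be served with ... pork" has no direct wine-food
# relation, but it can be rewritten through the mediating concept:
print(expand_via_mediator("wine", "food", schema))
# [('wine', 'has-drink', 'mealcourse'), ('mealcourse', 'has-food', 'food')]
```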

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]?".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, while DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain at once the minimum required set of Query-Triples to represent a NL query, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the path in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered, for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so that we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
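A simple, ontology-driven mitigation is to prefer the longest multiword term known to the ontology before treating a modifier as a separate relation; a sketch (Python; the vocabulary and function are illustrative, not AquaLog's mechanism):

```python
vocabulary = {"smoked salmon", "grilled swordfish", "semantic web", "salmon"}

def chunk(tokens, vocabulary, max_len=3):
    """Greedy longest match: read 'smoked salmon' as one ontology term
    rather than treating 'smoked' as a separate relation or modifier."""
    out, i = [], 0
    while i < len(tokens):
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in vocabulary:
                out.append(candidate)
                i += size
                break
    return out

print(chunk("what goes with smoked salmon".split(), vocabulary))
# ['what', 'goes', 'with', 'smoked salmon']
```

Adjective-noun compounds like "large company" would still need their meaning declared, as the text notes, since no vocabulary lookup can disambiguate them.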

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
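The "answer for both" policy suggested above for the connective ambiguity ("applicants who live in California and Arizona") can be sketched as follows (Python; the toy data are ours for illustration):

```python
lives_in = {
    "alice": {"California"},
    "bob": {"Arizona"},
    "carol": {"California", "Arizona"},
}

def readings(state_a, state_b):
    """Both interpretations of 'applicants who live in A and B'."""
    in_a = {p for p, states in lives_in.items() if state_a in states}
    in_b = {p for p, states in lives_in.items() if state_b in states}
    return {"conjunction": in_a & in_b, "disjunction": in_a | in_b}

r = readings("California", "Arizona")
print(sorted(r["conjunction"]), sorted(r["disjunction"]))
# ['carol'] ['alice', 'bob', 'carol']
```

Presenting both readings sidesteps the need for the domain knowledge that would be required to rule one of them out.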


8.4. Reasoning mechanisms and services

A future AquaLog version should include ranking services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge, as, for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

ssociation of meta-relations to domain relationsFor example if the question ldquowhat are the cool projects for a

hD studentrdquo is asked the system would not be able to solve theinguistic ambiguity of the words ldquocool projectrdquo The first timeome help from the user is needed who may select ldquoall projectshich belong to a research area in which there are many publica-

ionsrdquo A mapping is therefore created between ldquocool projectsrdquoresearch areardquo and ldquopublicationsrdquo so that the next time theearning Mechanism is called it will be able to recognize thispecific user jargon

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunities". Therefore the two mappings entail two different contexts, depending on the users' community.
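Such community-relative jargon mappings can be sketched as a lookup keyed by (community, phrase) (Python; the substitutions follow the examples in the text but are otherwise illustrative):

```python
# Learned jargon, keyed by (community, phrase); substitutions illustrative.
jargon = {
    ("phd-student", "cool projects"):
        "projects in research areas with many publications",
    ("professor", "cool projects"):
        "projects with funding opportunities",
}

def interpret(community, question):
    """Rewrite community-specific jargon before the query is mapped."""
    for (comm, phrase), meaning in jargon.items():
        if comm == community and phrase in question:
            question = question.replace(phrase, meaning)
    return question

print(interpret("professor", "what are the cool projects for a professor"))
# what are the projects with funding opportunities for a professor
```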

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB.


To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, and (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus, they could not easily be modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
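Such a domain-independent fallback can be sketched in a few lines. The tiny triple store and the substring-based matching rule below are simplifying assumptions for illustration, not AquaLog's actual implementation.

```python
# Toy knowledge base of (subject, relation, object) triples; the entries
# are illustrative, not taken from any real AquaLog ontology.
TRIPLES = [
    ("italy", "has-capital", "rome"),
    ("italy", "part-of", "europe"),
    ("rome", "located-in", "italy"),
]

def fallback_answer(term_rel, term_known, triples):
    """Fallback for unrecognized question structures: pair the two
    extracted terms and scan the KB for a triple whose relation label
    matches one term and whose subject or object matches the other."""
    return [(s, r, o) for s, r, o in triples
            if term_known in (s, o) and term_rel in r]

# "Could you please tell me the capital of Italy?" -> terms: capital, italy
print(fallback_answer("capital", "italy", TRIPLES))
# -> [('italy', 'has-capital', 'rome')]
```

A real system would replace the substring test with the similarity services described in this paper, but the control flow is the same: extract terms, form a candidate triple, search the ontology for a connecting path.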

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A, Symantec NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front ends, which have a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
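This behaviour, falling back on the relations that the ontology defines for the class of the known term, can be sketched as follows; the ontology fragment and function names here are hypothetical, chosen only to mirror the "neighborhoods of Chicago" example.

```python
# Hypothetical ontology fragment: classes with the relations defined on
# them, plus an instance-to-class index (not AquaLog's real data model).
CLASS_RELATIONS = {"city": ["has-district", "has-population", "located-in"]}
INSTANCE_CLASS = {"chicago": "city"}

def candidate_relations(unknown_relation, known_term):
    """When the query relation (e.g. 'neighborhood') has no match in
    the KB, return every relation defined for the known term's class;
    a later similarity check or user dialogue picks among them."""
    cls = INSTANCE_CLASS.get(known_term)
    return CLASS_RELATIONS.get(cls, [])

candidate_relations("neighborhood", "chicago")
# -> ['has-district', 'has-population', 'located-in']
```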

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53], these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
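The idea of grouping differently phrased questions under the same triple category can be sketched as follows; the class and category names are illustrative assumptions, not AquaLog's internal API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LinguisticTriple:
    generic_term: str    # the ground element, i.e. the expected answer type
    relation: str
    second_term: str     # a known instance the answer must relate to

def triple_category(t: LinguisticTriple) -> str:
    """Classify by triple shape, not by wh-word: a ground term related
    to a known instance forms a 'basic generic' triple."""
    return "basic-generic" if t.second_term else "definition"

# "Who works in akt?" and "Which are the researchers involved in the
# akt project?" yield triples of the same category, processed alike.
q1 = LinguisticTriple("person", "works", "akt")
q2 = LinguisticTriple("researchers", "involved", "akt")
```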

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships), and in the last resort ambiguity is solved by asking the user.
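A minimal sketch of this mapping order (exact label first, then lexically related words, then user feedback as a last resort); the synonym table and function names below are hypothetical, not AquaLog's implementation.

```python
# Hypothetical stand-in for WordNet-style lexically related words.
RELATED_WORDS = {"works": ["has-project-member", "collaborates"]}

def map_term(term, ontology_labels, ask_user=None):
    """Map a query term to an ontology label: exact match first, then
    lexically related candidates, then an interactive user choice."""
    if term in ontology_labels:
        return term
    candidates = [w for w in RELATED_WORDS.get(term, [])
                  if w in ontology_labels]
    if len(candidates) == 1:
        return candidates[0]
    if candidates and ask_user is not None:
        return ask_user(candidates)   # last resort: ask the user
    return None
```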

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somehow similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?", for START the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy", consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLIDB systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format; open-domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also classifies questions depending on the triples they generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved; it groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries, where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

(All examples use the KMi ontology.)

Wh-generic term
- "Who are the researchers in the semantic web research area?"
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
- "What are the phd students working for buddy space?"
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term
- "Show me the job title of Peter Scott"
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉
- "What are the projects of Vanessa?"
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
- "Are there any projects about semantic web?"
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
- "Where is the Knowledge Media Institute?"
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
- "Who are the academics?"
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉
- "What is an ontology?"
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
- "Is Liliana a phd student in the dip project?"
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
- "Has Martin Dzbor any interest in ontologies?"
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
- "Does anybody work in semantic web on the dotkom project?"
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member/has-project-leader, dot-kom〉
- "Tell me all the planet news written by researchers in akt"
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
- "Which projects on information extraction are sponsored by epsrc?"
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
- "Is there any publication about magpie in akt?"
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
- "What are the contact details from the KMi academics in compendium?"
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
- "Does anyone has interest in ontologies and is a member of akt?"
  Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
- "Which academics work in akt or in dotcom?"
  Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
- "Which KMi academics work in the akt project sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉
- "Which KMi academics working in the akt project are sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
- "Give me the publications from phd students working in hypermedia"
  Linguistic triples: 〈which is, publications, phd students〉, 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉, 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
- "What researchers who work in compendium have interest in hypermedia?"
  Linguistic triples: 〈researchers, work, compendium〉, 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
- "What is the homepage of Peter who has interest in semantic web?"
  Linguistic triples: 〈which is, homepage, peter〉, 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉, 〈person, has-research-interest, semantic-web-area〉
- "Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which is, projects, academics〉, 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉, 〈academic-staff-member, has-research-interest, semantic-web-area〉

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd edition, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

• AquaLog: An ontology-driven question answering system for organizational semantic intranets
  • Introduction
  • AquaLog in action: illustrative example
  • AquaLog architecture
    • AquaLog triple approach
  • Linguistic component: from questions to query triples
    • Classification of questions
      • Linguistic procedure for basic queries
      • Linguistic procedure for basic queries with clauses (three-terms queries)
      • Linguistic procedure for combination of queries
    • The use of JAPE grammars in AquaLog: illustrative example
  • Relation similarity service: from triples to answers
    • RSS procedure for basic queries
    • RSS procedure for basic queries with clauses (three-term queries)
    • RSS procedure for combination of queries
    • Class similarity service
    • Answer engine
  • Learning mechanism for relations
    • Architecture
    • Context definition
    • Context generalization
    • User profiles and communities
  • Evaluation
    • Correctness of an answer
    • Query answering ability
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Reformulation of questions
      • Performance
    • Portability across ontologies
      • AquaLog assumptions over the ontology
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Similarity failures and reformulation of questions
  • Discussion: limitations and future lines
    • Question types
    • Question complexity
    • Linguistic ambiguities
    • Reasoning mechanisms and services
    • New research lines
  • Related work
    • Closed-domain natural language interfaces
    • Open-domain QA systems
    • Open-domain QA systems using triple representations
    • Ontologies in question answering
    • Conclusions of existing approaches
  • Conclusions
  • Acknowledgements
  • Appendix A. Examples of NL queries and equivalent triples
  • References

〈what is, homepage, peter〉 obtained by the linguistic component into the target ontology-compliant query.

Fig. 1. The AquaLog data model.

The RSS calls the user to participate in the QA process if no information is available to AquaLog to disambiguate the query directly. Let's consider the query in Fig. 2. On the left screen we are looking for the homepage of Peter. By using string metrics, the system is unable to disambiguate between Peter-Scott, Peter-Sharpe, Peter-Whalley, etc. Therefore user feedback is required. Moreover, on the right screen user feedback is required to disambiguate the term "homepage" (is it the same as "has-web-address"?). As it is the first time the system has come across this term, no synonyms have been identified in WordNet that relate both labels together, and the ontology does not provide further ways to disambiguate. We ask the system to learn the user's vocabulary and context for future occasions.

Fig. 2. Illustrative example of user interactivity to disambiguate a basic query.

In Fig. 3 we are asking for the web address of Peter, who has an interest in semantic web. In this case AquaLog does not need any assistance from the user, given that by analyzing the ontology only one of the "Peters" has an interest in Semantic Web, and only one possible ontology relation, "has-research-interest" (taking into account taxonomy inheritance), exists between "person" and the concept "research area", of which "semantic web" is an instance. Also, the similarity relation between "homepage" and "has-web-address" has been learned by the Learning Mechanism in the context of that query's arguments, so the performance of the system improves over time. When the RSS comes across a similar query, it has to access the ontology information to recreate the context and complete the ontology triples. In that way it realizes that the second part of the query, "who has an interest in the Semantic Web", is a modifier of the term "Peter".

Fig. 3. Illustrative example of AquaLog disambiguation.

3. AquaLog architecture

AquaLog is implemented in Java as a modular web application using a client–server architecture. Moreover, AquaLog provides an API which allows future integration in other platforms and independent use of its components. Fig. 4 shows the different components of the AquaLog architectural solution. As a result of this design, our existing version of AquaLog is modular, flexible and scalable.¹

Fig. 4. The AquaLog global architecture.

The Linguistic Component and the Relation Similarity Service (RSS) are the two central components of AquaLog, both of them portable and independent in nature. Currently AquaLog automatically classifies the query following a linguistic criterion. The linguistic coverage can be extended through the use of regular expressions, by augmenting the patterns covered by an existing classification or creating a new one. This functionality is discussed in Section 4.

A key feature of AquaLog is the use of a plug-in mechanism, which allows AquaLog to be configured for different Knowledge Representation (KR) languages. Currently we subscribe to the Operational Conceptual Modeling Language (OCML), using our own OCML-based KR infrastructure [52]. However, in future our aim is also to provide direct plug-in mechanisms for the emerging RDF and OWL servers.²

¹ AquaLog is available for developers as Open Source, licensed under the Apache License, Version 2.0: https://sourceforge.net/projects/aqualog.
² We are able to import and export RDF(S) and OWL from OCML, so the lack of an explicit RDF/OWL plug-in is not a problem in practice.
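The string-metric disambiguation illustrated in Fig. 2 can be sketched in a few lines of Java. This is an illustrative reconstruction only, with class and method names of our own choosing: AquaLog itself draws on the family of string metrics compared by Cohen et al. [13] and combines them with WordNet and ontology context rather than relying on a single edit distance.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Minimal sketch: rank candidate ontology instances against a user term
// using normalized Levenshtein similarity. When no candidate clears the
// threshold, return null -- the point at which a system like AquaLog
// would fall back to asking the user (as in Fig. 2).
public class InstanceMatcher {

    // Classic dynamic-programming edit distance.
    static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        return d[a.length()][b.length()];
    }

    // Similarity in [0,1]: 1.0 means identical strings.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1.0 : 1.0 - (double) levenshtein(a, b) / max;
    }

    // Best candidate above the threshold, or null if ambiguity remains.
    static String bestMatch(String term, List<String> candidates, double threshold) {
        return candidates.stream()
                .max(Comparator.comparingDouble(c -> similarity(term, c)))
                .filter(c -> similarity(term, c) >= threshold)
                .orElse(null);
    }

    public static void main(String[] args) {
        List<String> peters = Arrays.asList("peter-scott", "peter-sharpe", "peter-whalley");
        System.out.println(bestMatch("peter-sco", peters, 0.7));
        // A bare "peter" is roughly equidistant from all three candidates,
        // so no match clears the threshold and user feedback is needed.
        System.out.println(bestMatch("peter", peters, 0.7));
    }
}
```

The threshold is the design knob: set it high and the system asks the user more often; set it low and it risks silently picking the wrong instance.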

To reduce the number of calls/requests to the target knowledge base, and to guarantee real-time question answering even when multiple users access the server simultaneously, the AquaLog server accesses and caches basic indexing data from the target Knowledge Bases (KBs) at initialization time. The cached information in the server can be efficiently accessed by the remote clients. In our research department, KMi, since AquaLog is also integrated into the KMi Semantic Portal [35], the KB is dynamically and constantly changing, with new information obtained from KMi websites through different agents. Therefore, a mechanism is provided to update the cached indexing data on the AquaLog server. This mechanism is called by these agents when they update the KB.

As regards portability, the only entry points which require human intervention for adapting AquaLog to a new domain are the configuration files. Through the configuration files it is possible to specify the parameters needed to initialize the AquaLog server. The most important parameters are the ontology name and server login details if necessary, the name of the plug-in, and the slots that correspond to pretty names. Pretty names are alternative names that an instance may have. Optionally, the main concepts of interest in an ontology can be specified.

The other ontology-dependent parameters required in the configuration files are the ones which specify that the terms "who", "where" and "when" correspond, for example, to the ontology terms "person/organization", "location" and "time-position", respectively. The independence of the application from the ontology is guaranteed through these parameters.

There is no single strategy by which a query can be interpreted, because this process is highly dependent on the type of the query. So, in the next sections we will not attempt to give a comprehensive overview of all the possible strategies; rather, we will show a few examples which are illustrative of the way the different AquaLog components work together.

3.1. AquaLog triple approach

Core to the overall architecture is the triple-based data representation approach. A major challenge in the development of the current version of AquaLog is to efficiently deal with complex queries in which there could be more than one or two terms. These terms may take the form of modifiers that change the meaning of other syntactic constituents, and they can be mapped to instances, classes, values or combinations of them, in compliance with the ontology to which they subscribe. Moreover, the wider expressiveness adds another layer of complexity, mainly due to the ambiguous nature of human language. So, naturally, the question arises of how far the engineering functionality of the triple-based model can be extended to map a triple obtained through a NL query into a triple that can be realized by a given ontology.

To accomplish this task, the triple in the current AquaLog version has been slightly extended. Therefore, an existing linguistic triple now consists of one, two or even three terms connected through a relationship. A query can be translated into one or more linguistic triples, and then each linguistic triple can be translated into one or more ontology-compliant triples. Each triple also has additional lexical features in order to facilitate reasoning about the answer, such as the voice and tense of the relation. Another key feature for each triple is its category (see an example table of query types in Appendix A). These categories identify different basic structures of the NL query and their equivalent representation in the triple. Depending on the category, the triple tells us how to deal with its elements, what inference process is required, and what kind of answer can be expected. For instance, different queries may be represented by triples of the same category, since in natural language there can be different ways of asking the same question, i.e., "who works in akt?"³ and "Show me all researchers involved in the akt project". The classification of the triple may be modified during its life cycle, in compliance with the target ontology it subscribes to.

In what follows we provide an overview of each component of the AquaLog architecture.

4. Linguistic component: from questions to query triples

When a query is asked, the Linguistic Component's task is to translate from NL to the triple format used to query the ontology (Query-Triples). This preprocessing step helps towards the accurate classification of the query and its components by using standard research tools and query classification. Classification identifies the type of question and hence the kind of answer required. In particular, AquaLog uses the GATE [15,49] infrastructure and resources (language resources, processing resources like ANNIE, serial controllers, pipelines, etc.) as part of the Linguistic Component. Communication between AquaLog and GATE takes place through the standard GATE API.

After the execution of the GATE controller, a set of syntactic annotations associated with the input query are returned. In particular, AquaLog makes use of GATE processing resources for English: tokenizer, sentence splitter, POS tagger and VP chunker. The annotations returned after the sequential execution of these resources include information about sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs, and categories for the nouns, such as determinant, singular/plural, conjunction, possessive, determiner, preposition, existential, wh-determiner, etc. These features returned for each annotation are important to create the triples; for instance, in the case of verb annotations, together with the voice and tense it also includes information that indicates if it is the main verb of the query (the one that separates the nominal group and the predicate). This information is used by the Linguistic Component to establish the proper attachment of prepositional phrases previous to the creation of the Query-Triples (an example is shown in Section 5.3).

When developing AquaLog we extended the set of annotations returned by GATE, by identifying noun terms, relations,⁴ question indicators (which/who/when, etc.) and patterns or types of questions.

³ AKT is an EPSRC funded project in which the Open University is one of the partners: http://www.aktors.org/akt.
⁴ Hence, for example, we exploited the fact that natural language commonly uses a preposition to express a relationship. For instance, in the query "who has research interest in semantic services in KMi?" the relation of the query is a combination of the verb "has" and the noun "research interest".
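A linguistic triple as described in Section 3.1 — one to three terms, an optional relation, lexical features such as voice and tense, and a category — can be sketched as a small Java data class. Field names, the enum values and the printing format below are ours, chosen to mirror the 〈term, relation, term〉 notation used in this paper; they are not AquaLog's actual API.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the intermediate Query-Triple of Section 3.1:
// terms connected by a relation, plus lexical features (voice, tense)
// and a category that drives later processing by the RSS.
public class QueryTriple {

    // A few of the ~14 question categories mentioned in Section 4
    // (names invented here for illustration).
    public enum Category { WH_GENERIC, WH_UNKNOWN_RELATION, AFFIRM_NEGATE, DESCRIPTION }

    public final Category category;
    public final String relation;     // null when the relation is implicit/unknown
    public final List<String> terms;  // one, two or three terms
    public final String voice, tense; // lexical features of the relation

    public QueryTriple(Category category, String relation, String voice,
                       String tense, String... terms) {
        this.category = category;
        this.relation = relation;
        this.voice = voice;
        this.tense = tense;
        this.terms = Arrays.asList(terms);
    }

    // Print in the paper's angle-bracket notation, e.g.
    // <academics, involved, semantic web> or <phd students, akt>.
    @Override public String toString() {
        if (relation == null || terms.size() < 2)
            return "<" + String.join(", ", terms)
                 + (relation == null ? "" : ", " + relation) + ">";
        return "<" + terms.get(0) + ", " + relation + ", "
             + String.join(", ", terms.subList(1, terms.size())) + ">";
    }

    public static void main(String[] args) {
        // "who are the academics involved in the semantic web?"
        System.out.println(new QueryTriple(Category.WH_GENERIC, "involved",
                "passive", "past", "academics", "semantic web"));
        // "are there any phd students in akt?" -> relation implicit/unknown
        System.out.println(new QueryTriple(Category.WH_UNKNOWN_RELATION,
                null, null, null, "phd students", "akt"));
    }
}
```

The point of carrying the category on the triple is visible even in this toy: downstream code can branch on it without re-parsing the question.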

This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular-expression-based rule language offered by GATE⁵ (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars' power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.⁶ Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.

Fig. 5. Example of GATE annotations and linguistic triples for basic queries.

Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.
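The deterministic matching idea behind JAPE can be mimicked in plain Java by flattening the annotation sequence into a string of part-of-speech symbols and applying an ordinary regular expression. The encoding and the pattern below are a toy of our own invention, only meant to convey the principle; real JAPE rules operate over GATE annotation graphs, not strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of the JAPE principle: encode each token's POS
// category as one symbol (W=wh-word, V=verb, D=determiner, N=noun,
// I=preposition), then match patterns over the resulting sequence
// with an ordinary regular expression.
public class PosPatternDemo {

    // wh-word, verb, determiner, noun run, verb, optional prep/det, noun run;
    // roughly the shape of "what are the research areas covered by the akt project".
    static final Pattern GENERIC_TERM_QUERY = Pattern.compile("WVD(N+)VI?D?(N+)");

    static String extract(String posSequence) {
        Matcher m = GENERIC_TERM_QUERY.matcher(posSequence.replace(" ", ""));
        if (!m.matches()) return null;
        // The captured groups mark where the two query terms sit in the sequence.
        return "term1=" + m.group(1) + " term2=" + m.group(2);
    }

    public static void main(String[] args) {
        // POS symbols for "what are the research areas covered by the akt project"
        System.out.println(extract("W V D N N V I D N N"));
    }
}
```

A real JAPE phase would attach new annotations to the matched spans instead of returning a string, but the match step itself is just this kind of deterministic regular-expression run over a sequence.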

Currently the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types or intermediate representations (examples can be seen in Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question, like "are there any phd students in dotkom?", where the relation is implicit or unknown, or "which is the job title of John?", where no information about the type of the expected answer is provided, etc.

Categories not only tell us the kind of solution that needs to be achieved, but they also give an indication of the most likely common problems that the Linguistic Component and the Relation Similarity Service will need to deal with to understand this particular NL query, and in consequence this guides the process of creating the equivalent intermediate representation or Query-Triple. For instance, in the previous example "what are the research areas covered by the akt project?", thanks to the category the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them, and generates the following Query-Triple: 〈research areas, covered, akt project〉, in which "research areas" has correctly been identified as the query term (instead of "what").

For the intermediate representation we use the triple-based data model rather than logic, mainly because at this stage we do not have to worry about getting the representation completely right.

⁵ http://gate.ac.uk/sale/tao/index.html.
⁶ LC binaries and JAPE grammars can be downloaded from http://kmi.open.ac.uk/technologies/aqualog.

78 V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105

d ling

nrpRtpct

4

nb4eoq

awdopa

4

ttDbstcqS

aaweaKdiipaaV

4(

kldquomswitati

sctw

Fig 7 Example of GATE annotations an

ot have to worry about getting the representation completelyight The role of the intermediate representation is simply torovide an easy and sufficient way to manipulate input for theSS An alternative would be to use more complex linguis-

ic parsing but as discussed by Katz et al [3329] althougharse trees (as for example the NLP parser for Stanford [34])apture syntactic relations and dependencies they are oftenime-consuming and difficult to manipulate

1 Classification of questions

Here we present the different kind of questions and the ratio-ale behind distinguishing these categories We are dealing withasic queries (Section 411) basic queries with clauses (Section12) and combinations of queries (Section 413) We consid-red these three main groups of queries based on the numberf triples needed to generate an equivalent representation of theuery

It is important to emphasize that at this stage all the termsre still strings or arrays of strings without any correspondenceith the ontology This is because the analysis is completelyomain independent and is entirely based on the GATE analysisf the English language The Query-Triple is only a formal sim-lified way of representing the NL-query Each triple representsn explicit relationship between terms

11 Linguistic procedure for basic queriesThere are several different types of basic queries we can men-

ion the affirmative negative query type which are those querieshat requires an affirmation or negation as an answer ie ldquois Johnomingue a phd studentrdquo ldquois Motta working in ibmrdquo Anotherig set of queries are those constituted by a ldquowh-questionrdquouch as the ones starting with what who when where are

here any does anybodyanyone or how many Also imperativeommands like list give tell name etc are treated as wh-ueries (see an example table of query types in Appendix A inection 11)

ldquo

ie

uistic triples for combination of queries

In fact wh-queries are categorized depending on the equiv-lent intermediate representation For example in ldquowho are thecademics involved in the semantic webrdquo the generic tripleill be of the form 〈generic term relation second term〉 in our

xample 〈academics involved semantic web〉 A query withn equivalent triple representation is ldquowhich technologies hasMi producedrdquo where the triple will be technologies has pro-uced KMi〉 However a query like ldquoare there any Phd studentsn aktrdquo has another equivalent representation where the relations implicit or unknown 〈phd students akt〉 Other queries mayrovide little or no information about the type of the expectednswer eg ldquowhat is the job title of Johnrdquo or they can be justgeneric enquiry about someone or something eg ldquowho isanessardquo ldquowhat is an ontologyrdquo (see Fig 5)

4.1.2. Linguistic procedure for basic queries with clauses (three-term queries)

Let us consider the request “List all the projects in the knowledge media institute about the semantic web”, where both “in knowledge media institute” and “about semantic web” are modifiers (i.e. they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, through the use of the ontology or in an interactive way by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS, through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in “are there any planet news written by researchers from akt?” we get the terms “planet news”, “researchers” and “akt” (see Fig. 6).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the basic queries with clauses as explained in Section 5.2.


4.1.3. Linguistic procedure for combination of queries

Nevertheless, a query can be a composition of two explicit relationships between terms, or a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship. For instance, in “are there any planet news written by researchers working in akt?” the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an “and” or “or” conjunction operator, as in “which projects are funded by epsrc and are about semantic web?”. This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned to a second query, as in “which researchers wrote publications related to social aspects?”, where the second clause modifies one of the previous terms. In this example, and at this stage, the ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. “what is the web address of Peter who works for akt?” or “which are the projects led by Enrico which are related to multimedia technologies?” (see Fig. 7).
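The first combination mechanism, the “and”/“or” conjunction operator, amounts to intersecting or uniting the two answer lists. A minimal sketch, with invented answer lists:

```python
def combine(answers1, answers2, operator):
    """Combine the answer lists of two resolved Query-Triples."""
    if operator == "and":   # both conditions must hold
        return sorted(set(answers1) & set(answers2))
    if operator == "or":    # either condition suffices
        return sorted(set(answers1) | set(answers2))
    raise ValueError(f"unknown operator: {operator}")

# "which projects are funded by epsrc and are about semantic web?"
funded_by_epsrc = ["AKT", "Dot.Kom", "Magpie"]        # invented answer lists
about_semantic_web = ["AKT", "Magpie", "X-Media"]
both = combine(funded_by_epsrc, about_semantic_web, "and")  # ['AKT', 'Magpie']
```
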

Once the intermediate representation is created, prepositions and auxiliary words such as “in”, “about”, “of”, “at”, “for”, “by”, “the”, “does”, “do”, “did” are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping between Query-Triples and ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.

Fig. 8. Set of JAPE rules used in the query example to generate annotations.
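The removal of prepositions and auxiliary words from a triple term can be sketched as a simple stop-list filter (the function name is ours, not AquaLog's):

```python
# Auxiliary words listed in the text above; their semantics are discarded.
AUXILIARIES = {"in", "about", "of", "at", "for", "by", "the", "does", "do", "did"}

def strip_auxiliaries(term: str) -> str:
    """Drop prepositions/auxiliaries from a Query-Triple term."""
    return " ".join(w for w in term.split() if w.lower() not in AUXILIARIES)

cleaned = strip_auxiliaries("the projects in the knowledge media institute")
# -> "projects knowledge media institute"
```
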

4.2. The use of JAPE grammars in AquaLog: illustrative example

Consider the basic query in Fig. 5, “what are the research areas covered by the akt project?”, as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns “research areas” and “akt project”, the relations “are” and “covered by”, and the query term “what” (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e. before “→”) identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the (((RB)ADJ)|NOUN|CC) part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as do, for example, the macro LIST used by the rule RuleQU1 or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g., the last pattern identifies words assigned to the VB POS tag.


Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags are assigned in the “category” feature of the “Token” annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side; e.g., the annotation REL (with feature rule = “REL1”) is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as “wh-generic-term”. The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, “QUWhich IS_ARE TERM RELATION TERM”, is marked in bold. This pattern relies on the macros QUWhich, IS_ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
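The spirit of these cascaded JAPE rules can be imitated with ordinary regular expressions. The fragment below is only a toy analogue of the pattern in Fig. 9: the macro names echo the figure, but this is not GATE/JAPE syntax.

```python
import re

QU = r"(what|which)"         # stand-in for the QUWhich macro
IS_ARE = r"(is|are)"
TERM = r"(?:\w+ )+?"         # crude stand-in for the noun-phrase rule
RELATION = r"\w+ed by "      # e.g. a past participle followed by "by"

RULE_WH_GENERIC = re.compile(
    rf"^{QU} {IS_ARE} (?P<term1>{TERM})(?P<rel>{RELATION})(?P<term2>.+)$"
)

m = RULE_WH_GENERIC.match("what are the research areas covered by the akt project")
```

Extending the covered subset of natural language then means adding or widening such patterns, which mirrors the portability argument above.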

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 6.


An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user term. Moreover, relation names can be considered as “second class citizens”. The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: in the case of concepts, labels often capture the meaning of the semantic entity, while the meaning of a relation is given by the type of its domain and its range rather than by its name (typically vaguely defined, e.g. “related to”), with the exception of the cases in which relations are presented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like “is there any researcher called Thompson?” would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized; e.g., in the query “is Enrico working in ibm?”, “Enrico” is mapped into “Enrico-motta” in the KB, but “ibm” is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point in the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system “knows about”, i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction: intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let us consider a basic question like “what are the research areas covered by akt?” (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type wh-query (one represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that “research areas” is actually a “research-area” in the target KB and “akt” is a “project”, through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links “research areas”, or any of its subclasses, to “projects”, or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are “addresses-generic-area-of-interest” and “uses-resource”. Using the WordNet lexicon, AquaLog identifies that a synonym for “covered” is “addresses”, and therefore suggests to the user that the question could be interpreted in terms of the relation “addresses-generic-area-of-interest”. Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term “research areas” need to be considered. To clarify this, let us consider a case in which we are talking about a generic term “person”: the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept “projects”.
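The search for candidate relations over subclasses and superclasses, followed by the WordNet-based tie-break, can be sketched as follows. The tiny taxonomy and synonym table are invented stand-ins for the target KB and WordNet:

```python
# Invented mini-KB: subclass/superclass closures and relations indexed by
# (domain, range). Real AquaLog queries the ontology instead.
subclasses = {"research-area": ["research-area"]}
superclasses = {"project": ["project", "activity"]}
relations = {
    ("research-area", "project"): ["addresses-generic-area-of-interest"],
    ("research-area", "activity"): ["uses-resource"],
}
synonyms = {"covered": {"addresses"}}   # stand-in for a WordNet lookup

def candidate_relations(term1, term2):
    """Collect relations linking any subclass of term1 to any superclass of term2."""
    found = []
    for sub in subclasses.get(term1, [term1]):
        for sup in superclasses.get(term2, [term2]):
            found.extend(relations.get((sub, sup), []))
    return found

def pick(user_relation, candidates):
    """Prefer a candidate whose name contains a synonym of the user's relation."""
    syns = synonyms.get(user_relation, set())
    for cand in candidates:
        if any(s in cand for s in syns):
            return cand
    return None   # no match: fall back to asking the user

cands = candidate_relations("research-area", "project")
chosen = pick("covered", cands)   # "addresses-generic-area-of-interest"
```
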

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

Whenever multiple relations are possible candidates for interpreting the query, and the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query “who is the secretary in Knowledge Media Inst?”, as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query, and generate the correct logical query. The procedure is as follows. The question “who is the secretary in Knowledge Media Inst?” is classified as a “basic generic type” by the Linguistic Component. The first step is to identify that KMi is a “research institute”, which is also part of an “organization” in the target KB, and that “who” could be pointing to a “person” (or sometimes an “organization”). Similarly to the previous example, the problem becomes one of finding a relation which links a “person” (or any of its subclasses, like “academic”, “students”, . . .) to a “research institute” (or an “organization”). But in this particular case there is also a successful match for “secretary” in the KB, in which “secretary” is not a relation but a subclass of “person”. Therefore the triple is re-classified: from a generic one formed by a binary relation “secretary” between the terms “person” and “research institute”, to a generic one in which the relation is unknown or implicit between the terms “secretary” (which is more specific than “person”) and “research institute”.

If we take the example question presented in Ref. [14], “who takes the database course?”, it may be that we are asking for “lecturers who teach databases” or for “students who study databases”. In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, . . .) and a course. Therefore the translation process from lexical to ontological triples can go ahead without misinterpretations.

Fig. 12. Example of AquaLog in action for relations formed by a concept.


Note that some additional lexical features help in the translation, or put a restriction on the answer presented in the Onto-Triple: whether the relation is passive or reflexive, or whether the relation is formed by “is”. The last means that the relation cannot be an “ad hoc” relation between terms, but is a relation of type “subclass-of” between terms related in the taxonomy, as in the query “is John Domingue a PhD student?”.

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation “is the secretary”: the noun “secretary” may correspond to a concept in the target ontology, while, e.g., the relation “is the job title” may correspond to the slot “has-job-title” for the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance “are there any projects in KMi related to natural language?” or “are there any projects from researchers related to natural language?”; and one where the clause is at the end, after the relation, as in “which projects are headed by researchers in akt?” or “what are the contact details for researchers in akt?”. The different classifications, or query types, determine whether the verb is going to be part of the first triple, as in the latter example, or part of the second triple, as in the former example. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, “related to natural language” (or “in akt” in the previous examples), refers to.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query “are there any projects on the semantic web sponsored by epsrc?”, the RSS uses a heuristic which suggests attaching the last part of the sentence, “sponsored by epsrc”, to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case, the class “project”). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to the semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query “which planet news stories are written by researchers in akt?” and apply the same heuristic to disambiguate the modifier “in akt”, the RSS will select the non-ground term “researchers” as the correct attachment for “akt”. Therefore, to answer this query, first it is necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, “are written” is translated into “owned-by” or “has-author”.

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 (“Are there any projects on semantic web sponsored by epsrc?”) and Fig. 14 (“Are there any projects by researchers working in AKT?”). For the former, the modifier “sponsored by epsrc” is attached to the query term in the first triple, i.e. “project”. For the latter, the modifier “working in akt” is attached to the second term in the first triple, i.e. “researchers”.
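The attachment heuristic (attach the modifier to the closest non-ground term) can be sketched like this; the `is_class` table stands in for the ontology lookup and is invented for illustration:

```python
# Which terms map to classes (non-ground terms) vs. instances -- an
# invented stand-in for the ontology lookup.
is_class = {"project": True, "planet news": True, "researchers": True,
            "akt": False, "epsrc": False, "semantic web": False}

def attach_modifier(terms):
    """Attach the trailing modifier to the closest (right-most) term that is
    a class in the ontology; fall back to the first term."""
    for term in reversed(terms):
        if is_class.get(term, False):
            return term
    return terms[0]

# "are there any projects on the semantic web sponsored by epsrc?"
a1 = attach_modifier(["project", "semantic web"])     # -> "project"
# "which planet news stories are written by researchers in akt?"
a2 = attach_modifier(["planet news", "researchers"])  # -> "researchers"
```
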

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of queries, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let us consider the query “who are the researchers in akt who have an interest in knowledge reuse?”. By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that “akt” is a “project” and therefore is not an instance of the classes “persons” or “organizations” or any of their subclasses. Moreover, in this case the term “researchers” is mapped into a concept in the ontology which is a subclass of “person”, and therefore the second clause, “who have an interest in knowledge reuse”, is without any doubt assumed to be related to “researchers”. The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query “which academic works with Peter who has interest in semantic web?”. In this case, since “academic” and “peter” are respectively a subclass and an instance of “person”, the sentence is truly ambiguous, even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions “which academic works with peter?” and “which academic has an interest in semantic web?”, or as the relative query “which academic works with peter?”, where the peter we are looking for has an interest in the semantic web. In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like “which researchers wrote publications related to social aspects?”, the second term, “publication”, is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, “related to social aspects”. In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) “which academics work in the akt project sponsored by epsrc?”; (2) “which academics working in the akt project are sponsored by epsrc?”. In both examples the second term, “akt project”, is an instance, so the problem becomes to identify whether the second clause, “sponsored by epsrc”, is related to “akt project” or to “academics”. In the first example the main verb is “work”, which separates the nominal group “which academics” and the predicate “akt project sponsored by epsrc”; thus “sponsored by epsrc” is related to “akt project”. In the second, the main verb is “are sponsored”, whose nominal group is “which academics working in akt” and whose predicate is “sponsored by epsrc”. Consequently, “sponsored by epsrc” is related to the subject of the previous clause, which is “academics”.
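This lexical-verb reasoning can be rendered as a toy rule. A real implementation would locate the main verb via POS tagging; here it is supplied by hand, and the rule itself is our simplification of the heuristic described above:

```python
def clause_attaches_to(question, main_verb, clause):
    """If the trailing clause is itself the predicate of the main verb, it
    modifies the subject of the nominal group; otherwise it modifies the
    last term inside the predicate."""
    nominal_group, _, predicate = question.partition(main_verb)
    if predicate.strip() == clause:
        return "subject of nominal group"   # e.g. "academics"
    return "last term of predicate"         # e.g. "akt project"

q1 = "which academics work in the akt project sponsored by epsrc"
q2 = "which academics working in the akt project are sponsored by epsrc"
r1 = clause_attaches_to(q1, " work ", "sponsored by epsrc")
r2 = clause_attaches_to(q2, " are ", "sponsored by epsrc")
```
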

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g. “researcher” and “academic”), and a generic thesaurus like WordNet may not include all the synonyms and is limited in the use of nominal compounds (e.g. WordNet contains the term “municipality” but not an ontology term like “municipal-unit”). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain the ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there would be no ambiguity if the user reformulated the question as “which academic works with Peter and has an interest in semantic web?”, where “has an interest in semantic web” would be correctly attached to “academic” by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies; e.g., an ontology can have slots such as “variant name” or “pretty name”. AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
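The threshold-per-task matching described above can be sketched as follows. AquaLog itself mainly uses Jaro-based metrics from the API of [13]; here Python's `difflib` ratio stands in, and the threshold values are invented for illustration:

```python
from difflib import SequenceMatcher

# Invented per-task thresholds; AquaLog tunes them per lookup type.
THRESHOLDS = {"concept": 0.75, "relation": 0.80, "instance": 0.85}

def best_match(term, ontology_names, task):
    """Return the best-scoring ontology name if it clears the threshold for
    the given task (concept/relation/instance lookup), else None."""
    score, name = max(
        (SequenceMatcher(None, term.lower(), n.lower()).ratio(), n)
        for n in ontology_names
    )
    return name if score >= THRESHOLDS[task] else None

match = best_match("research areas", ["research-area", "project"], "concept")
```

As the text notes, such purely syntactic matching fails on pairs like “researcher”/“academic”, which is where the lexicon and learning mechanism come in.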

The WordNet component uses the Java implementation of WordNet (JWNL) with WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's query. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g. in the query “who has an interest in ontologies or in knowledge reuse?” the result will be a fusion of the instances of people who have an interest in ontologies and the people who have an interest in knowledge reuse.

(2) Conditional link to a term: for example, in “which KMi academics work in the akt project sponsored by epsrc?” the second triple, 〈akt project, sponsored, epsrc〉, must be resolved and the instance representing the “akt project sponsored by epsrc” identified, to get the list of academics required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in “What is the homepage of the researchers working on the semantic web?” the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈?, homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them, in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally express with a number of different phrasings: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses,

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105


Fig. 17. The learning mechanism architecture.

normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context-definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation-name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's one. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification.
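The division of labour between the two methods can be sketched as follows. This is only an illustrative reading of the text: AquaLog itself is a Java system, only GetContext and CheckContext are named in the paper, and the toy taxonomy, record fields and helper names below are assumptions made for the sketch.

```python
# Hedged sketch of the LM's two methods (learning and matching); only
# CheckContext's subclass test is described in the paper, the rest is invented.

# Toy taxonomy: child -> parent (a stand-in for the real ontology)
SUPERCLASS = {"researcher": "people", "kmi": "organization-unit"}

def is_subclass_of(cls, ancestor):
    """True if cls equals ancestor or lies (transitively) below it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUPERCLASS.get(cls)
    return False

learned = []  # the database of learned mappings

def learn(word, relation, arg1_class, arg2_class, ontology, user):
    """Learning method: store a user-confirmed mapping with its context."""
    learned.append({"word": word, "relation": relation,
                    "context": (arg1_class, arg2_class),
                    "ontology": ontology, "user": user})

def match(word, arg1_class, arg2_class, ontology):
    """Matching method: reuse a stored mapping if the new question's
    arguments fall under the stored context (the CheckContext step)."""
    for rec in learned:
        c1, c2 = rec["context"]
        if (rec["word"] == word and rec["ontology"] == ontology
                and is_subclass_of(arg1_class, c1)
                and is_subclass_of(arg2_class, c2)):
            return rec["relation"]
    return None  # no consistent record: fall back to user disambiguation
```

For instance, after `learn("collaborates", "has-affiliation-to-unit", "people", "organization-unit", "akt", "u1")`, a later question whose arguments are "researcher" and "kmi" matches, because both are subclasses of the stored context.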

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separated mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology


these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments: "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field in order to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
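The record just described can be summarised as a small schema. The field names below are illustrative, not AquaLog's actual database layout:

```python
# Illustrative schema for one record of the learned lexicon; the fields mirror
# the list in the text, but the names and types are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class LearnedMapping:
    word: str      # key field used to access the lexicon during matching
    relation: str  # the ontology relation the word maps to
    arg1: str      # first context-argument (class of the first triple argument)
    arg2: str      # second context-argument
    ontology: str  # name of the (pluggable) ontology
    user: str      # user details, for per-user constraints

record = LearnedMapping("collaborates", "has-affiliation-to-unit",
                        "person", "knowledge-media-institute",
                        "akt", "anonymous")
```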

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.
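A minimal sketch of this generalization step, under the assumption that "handling a relation" means the relation is declared on the class; the toy taxonomy and relation signature below are invented for illustration:

```python
# Hedged sketch of the generalization climb (the GetContext direction): walk
# up the superclasses of one argument and keep the highest class on which the
# relation is still defined. Taxonomy and signatures are toy examples.

SUPERCLASS = {"knowledge-media-institute": "organization-unit",
              "organization-unit": "organization",
              "organization": "thing"}

# classes on which a relation is declared (its admissible argument classes)
RELATION_SIGNATURE = {"has-affiliation-to-unit": {"organization-unit"}}

def generalize_argument(relation, arg_class):
    """Return the highest (super)class of arg_class that still handles
    relation; fall back to arg_class itself if none does."""
    best = arg_class
    cls = arg_class
    while cls is not None:
        if cls in RELATION_SIGNATURE.get(relation, set()):
            best = cls  # a higher suitable class; keep climbing
        cls = SUPERCLASS.get(cls)
    return best
```

With this toy data, the argument "knowledge media institute" generalizes to "organization-unit", so a later question about any other organization-unit falls within the same stored context.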

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for a future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we


Fig. 18. User's disambiguation example.

asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing since it can be mapped to both a person or an organization, or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.
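The tightening and loosening of the version-space constraints described in this section (the bypassable per-user constraint, and the concept-case flag here) might be queried roughly like this; the schema and field names are hypothetical:

```python
# Hedged sketch: selecting candidate mappings from the learned lexicon while
# tightening (is_concept=True) or loosening (user=None) the version-space
# constraints. The record layout is an assumption, not AquaLog's schema.

def query_learned(db, word, ontology, user=None, is_concept=False):
    """user=None loosens the per-user constraint (a generic session sees
    everybody's mappings); is_concept=True restricts the space to the
    concept-case records only."""
    return [rec for rec in db
            if rec["word"] == word
            and rec["ontology"] == ontology
            and (user is None or rec["user"] == user)
            and rec.get("is_concept", False) == is_concept]

db = [
    {"word": "collaborates", "relation": "has-affiliation-to-unit",
     "ontology": "akt", "user": "alice"},
    {"word": "collaborates", "relation": "works-in-the-same-project",
     "ontology": "akt", "user": "bob"},
]
```

A generic session retrieves both of Alice's and Bob's mappings; a logged-in session retrieves only the owner's, which is exactly why the generic user "has the roughest access to the learned material".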

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material, having no constraints on the user field.



Fig. 19. Learning mechanism matching example.

The database query will then return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.


Fig. 20. Architecture to support a users' community.


7. Evaluation

AquaLog allows users who have a question in mind, and who know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn a specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).





We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure. However, the performance of AquaLog was also evaluated for it. The user


can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate if a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries (40 of 69): 58%; from this, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).


the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology, and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious for the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [. . .]?"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parenthesis in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of the type "how long" are not implemented in this version; (4) non-recognition of concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab: people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM     LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)  40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)  35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)      20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)    4.44% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)    (40 iterations)                     (16 iterations)


services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because, in this version of AquaLog, the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and among all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.



Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
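The per-row counts in Table 2 can be cross-checked against the iteration totals reported in its last row:

```python
# Cross-checking Table 2: for each configuration, the total number of user
# iterations is the sum of (iterations needed x number of queries) per row.
counts = {
    "without LM":           [17, 16, 9, 3],  # queries needing 0, 1, 2, 3 iterations
    "LM, first iteration":  [18, 16, 9, 2],
    "LM, second iteration": [29, 16, 0, 0],
}

totals = {name: sum(i * n for i, n in enumerate(row))
          for name, row in counts.items()}
# totals: 43, 40 and 16 iterations, matching the table's last row
```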

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology-portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts: there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto


7.3.2. Conclusions and discussion
7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society's "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose in the evaluation: while in the first experiment one of the aims was the satisfaction of the user in respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog

makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query languages SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plug-

13 httpwwwthewinesocietycom

TE

(

Agents on the World Wide Web 5 (2007) 72ndash105 93

ins provide an API (or ontology interface) that allows us touery and access the values of the ontology properties in theocal or remote servers
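The GetData-style interaction can be sketched as a minimal triple-pattern lookup over a directed labeled graph. The sketch below is illustrative only: it is neither TAP's GetData nor AquaLog's plug-in API, and the sample data is invented from Table 3.

```python
from typing import Iterator, Optional, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

class TripleStore:
    def __init__(self, triples):
        self.triples = list(triples)

    def getdata(self, subject: Optional[str] = None,
                relation: Optional[str] = None,
                obj: Optional[str] = None) -> Iterator[Triple]:
        """Yield every triple matching the pattern; None is a wildcard,
        so no prior knowledge of the ontological structure is needed."""
        for s, r, o in self.triples:
            if ((subject is None or s == subject)
                    and (relation is None or r == relation)
                    and (obj is None or o == obj)):
                yield (s, r, o)

kb = TripleStore([("new-course7", "has-food", "pizza"),
                  ("new-course7", "has-drink", "marietta-zinfadel")])
print(list(kb.getdata(relation="has-drink")))
```

The pattern-with-wildcards interface is what keeps the query layer independent of any particular schema.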

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
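The "short direct path" requirement can be illustrated with a small breadth-first search over the triple graph; the helper and the sample relation names below are our own sketch, not AquaLog code.

```python
from collections import deque

def path_length(triples, start, goal, max_hops=2):
    """BFS over an undirected view of the triple graph; returns the number
    of relations between start and goal, or None if more than max_hops."""
    adjacency = {}
    for s, _, o in triples:
        adjacency.setdefault(s, set()).add(o)
        adjacency.setdefault(o, set()).add(s)
    frontier, seen = deque([(start, 0)]), {start}
    while frontier:
        node, hops = frontier.popleft()
        if node == goal:
            return hops
        if hops == max_hops:
            continue  # do not expand beyond the allowed path length
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, hops + 1))
    return None

triples = [("new-course8", "has-food", "fowl"),
           ("new-course8", "has-drink", "riesling")]
print(path_length(triples, "fowl", "riesling"))  # 2: via the mealcourse instance
```

If the only paths between two query terms are longer than two relations, the query falls outside what this light-weight mapping can answer.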

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink Riesling)))



Fig. 22. OWL example of the wine and food ontology.

The ontology does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, although from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of the questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog succeeds in creating a valid Onto-Triple, or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure:
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only a conceptual failure: (35 of 68) 51.47%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22.05%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).



7.3.2.2. Analysis of AquaLog failures

The identified AquaLog limitations are explained in Section 8; however, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that AquaLog does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question for which the RSS can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" must be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
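Such a triple-reclassification step can be sketched as follows; the class and relation names are assumptions for illustration, not AquaLog's code.

```python
def expand_via_mediator(term1, term2, schema):
    """schema: list of (domain, relation, range) triples describing the
    ontology. Returns one direct triple, a two-triple bridge through a
    mediating concept, or None."""
    def touches(triple, term):
        return term in (triple[0], triple[2])

    # Prefer a direct relation between the two terms, if any exists.
    direct = [t for t in schema if touches(t, term1) and touches(t, term2)]
    if direct:
        return direct[:1]
    # Otherwise look for two triples sharing a mediating concept.
    for t1 in schema:
        if not touches(t1, term1):
            continue
        for t2 in schema:
            if t2 == t1 or not touches(t2, term2):
                continue
            mediators = ({t1[0], t1[2]} - {term1}) & ({t2[0], t2[2]} - {term2})
            if mediators:
                return [t1, t2]
    return None

schema = [("mealcourse", "has-drink", "wine"),
          ("mealcourse", "has-food", "pork")]
# "which wines should be served with ... pork?" has no direct wine-pork
# relation; the bridge goes through the mediating concept "mealcourse".
print(expand_via_mediator("wine", "pork", schema))
```

The dynamically generated second triple is exactly what lets the original question succeed without the user reformulating it to mention "mealcourse".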

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" onto an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the AKT project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, genitives ('s), and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".
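Such a count feature could amount to extracting the numeral from the command and attaching it to the triple; the triple layout below is an assumption for illustration, not AquaLog's data model.

```python
import re

def parse_count(question):
    """Return the first integer mentioned in the question, if any."""
    m = re.search(r"\b(\d+)\b", question)
    return int(m.group(1)) if m else None

# Hypothetical triple with an extra "count" slot limiting retrieval.
triple = {"subject": "people", "relation": None, "object": None,
          "count": parse_count("Name 30 different people")}
print(triple["count"])  # 30
```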

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are merged into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.
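The paraphrase normalisation mentioned earlier in this subsection, where "who was the author of X" and "who wrote X" yield the same minimal Query-Triple, can be sketched with a couple of illustrative patterns (the patterns and the slot values are our own assumption):

```python
import re

# Hypothetical surface patterns mapping to one canonical relation.
PATTERNS = [
    (r"who (?:was|is) the author of (.+)", "author"),
    (r"who wrote (.+)", "author"),
]

def to_query_triple(question):
    """Normalise the question and return a (subject, relation, object)
    Query-Triple, or None if no pattern matches."""
    q = question.strip().rstrip("?").lower()
    for pattern, relation in PATTERNS:
        m = re.fullmatch(pattern, q)
        if m:
            return ("person", relation, m.group(1))
    return None

print(to_query_triple("Who wrote Hamlet?"))
print(to_query_triple("Who is the author of Hamlet?"))
```

Both phrasings collapse to the same triple, which is the point: the rest of the pipeline never sees the surface variation.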

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations with non-atomic semantic relations. In other words, it cannot infer an answer if there is no direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered: for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.
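One hedged way around this limitation (not part of AquaLog) would be an index from literal tokens back to (instance, relation, value) triples, built once offline, so that "who is the director in KMi?" could reach the string value without a real-time scan. All names below are illustrative.

```python
from collections import defaultdict

def build_literal_index(triples):
    """Map each token appearing in a literal value to the triples that
    contain it; intended to be computed once, offline."""
    index = defaultdict(list)
    for subject, relation, value in triples:
        for token in value.lower().replace("-", " ").split():
            index[token].append((subject, relation, value))
    return index

kb = [("Enrico-Motta", "has-job-title", "KMi-director")]
index = build_literal_index(kb)
print(index["director"])
```

At query time, a lookup on "director" immediately surfaces the instance and relation holding that string, at the cost of keeping the index in sync with the KB.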

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
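One possible mechanism, sketched here with an invented vocabulary, is a longest-match segmentation of the question against the ontology labels, so that "semantic web papers" is kept as one term rather than three.

```python
def segment(tokens, ontology_labels):
    """Greedy longest-match segmentation of a token list: prefer the
    longest multi-word run that matches an ontology label; otherwise
    fall back to single tokens."""
    terms, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):
            candidate = " ".join(tokens[i:j])
            if j - i == 1 or candidate in ontology_labels:
                terms.append(candidate)
                i = j
                break
    return terms

labels = {"semantic web papers", "research department"}
print(segment("list all semantic web papers".split(), labels))
```

Greedy longest-match is only a heuristic: it cannot disambiguate compounds like "large company", whose meaning genuinely depends on the domain.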

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter, who has interests in the semantic web?" and "which academics work with Peter, who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
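The verb-number heuristic just described (which, as noted, AquaLog does not implement) can be sketched as follows; the singular-verb list and the example nouns are illustrative assumptions.

```python
def attach_clause(subject, subject_is_plural, clause_verb, proper_noun):
    """Decide which noun a relative clause ("who <verb> ...") modifies;
    return None when the sentence remains truly ambiguous."""
    verb_is_singular = clause_verb in {"has", "is", "works"}  # 3rd person sing.
    if subject_is_plural and verb_is_singular:
        return proper_noun   # "academics ... who has ..." -> the proper noun
    if not subject_is_plural and verb_is_singular:
        return None          # "academic ... who has ..." -> ambiguous
    return subject           # a plural clause verb agrees with the subject

# "which academics work with Peter, who has interests in the semantic web?"
print(attach_clause("academics", True, "has", "Peter"))
# "which academic works with Peter, who has interests ...?" stays ambiguous
print(attach_clause("academic", False, "has", "Peter"))
```

As the text notes, the heuristic only helps when number agreement actually disambiguates, and it presumes well-formed input.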


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that can solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
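The community-sensitive mapping could be stored as a small lexicon keyed by (community, phrase); everything below, including the community names and ontology terms, is an illustrative sketch rather than AquaLog's implementation.

```python
class JargonLexicon:
    def __init__(self):
        self._mappings = {}  # (community, phrase) -> list of ontology terms

    def learn(self, community, phrase, ontology_terms):
        """Record a user-approved substitution, scoped to one community."""
        self._mappings[(community, phrase)] = list(ontology_terms)

    def lookup(self, community, phrase):
        """Return the learned ontology terms, or None if never seen."""
        return self._mappings.get((community, phrase))

lexicon = JargonLexicon()
lexicon.learn("phd-students", "cool projects", ["research-area", "publications"])
lexicon.learn("professors", "cool projects", ["funding-opportunity"])
print(lexicon.lookup("professors", "cool projects"))
```

Keying on the community is what keeps the two readings of "cool projects" from clobbering each other.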

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries of databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 A detailed overview of the state of the art for these systems can be found in Ref. [3].

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them, and, using the semantics offered by the ontology, look for a path between them.
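A domain-independent version of this fallback can be sketched as follows; the stopword list is our assumption, and the relation slot is deliberately left open for the ontology to fill.

```python
import re

# Words carrying no content for triple building (illustrative list).
STOPWORDS = {"could", "you", "please", "tell", "me", "the", "of",
             "what", "is", "print", "a", "an"}

def fallback_triple(question):
    """Strip stopwords and, if exactly two content terms remain, build a
    triple whose relation is left for the ontology to supply."""
    words = re.findall(r"[a-z]+", question.lower())
    terms = [w for w in words if w not in STOPWORDS]
    if len(terms) == 2:
        return (terms[0], "?", terms[1])  # relation resolved via the ontology
    return None

for q in ["What is the capital of Italy?",
          "Print the capital of Italy",
          "Could you please tell me the capital of Italy?"]:
    print(fallback_triple(q))  # each yields ('capital', '?', 'italy')
```

Unlike the brittle pattern rules of early NLIDBs, the domain knowledge lives entirely in the ontology path search that would consume this triple, not in the surface patterns.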

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGE ACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front ends, which have a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.
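This two-stage, triple-centred process can be illustrated with a deliberately naive sketch; the parsing rule and lexicon below are invented for illustration, not AquaLog's actual components:

```python
# Sketch of a triple-based pipeline in the spirit described above:
# a query is first turned into a linguistic triple, then mapped to an
# ontology-compliant triple by a (here trivial) similarity service.

LEXICON = {  # invented mapping from query vocabulary to ontology terms
    "researchers": "researcher",
    "works": "has-project-member",
}

def to_linguistic_triple(query):
    """Very naive parse: 'who <verb> in <value>' -> (focus, relation, value)."""
    tokens = query.lower().rstrip("?").split()
    # e.g. "who works in akt" -> ("person/organization", "works", "akt")
    return ("person/organization", tokens[1], tokens[-1])

def to_ontology_triple(ling_triple, lexicon=LEXICON):
    """Map each triple element onto the ontology vocabulary where possible."""
    subj, rel, obj = ling_triple
    return (lexicon.get(subj, subj), lexicon.get(rel, rel), obj)
```

The point of the intermediate triple is that only the second mapping step touches the ontology, which is what keeps the front end portable.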

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
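This fallback strategy can be sketched as a string-metric ranking over the relations an ontology defines for a class; the mini-ontology, relation names and threshold below are invented for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical mini-ontology: relations defined for each class.
ONTOLOGY_RELATIONS = {
    "city": ["has-district", "has-population", "located-in"],
}

def candidate_relations(unknown_term, cls, threshold=0.4):
    """Rank the relations defined for `cls` by string similarity to an
    unknown query term. If nothing scores above `threshold`, every
    relation remains a candidate (and the user can be asked to pick)."""
    scored = [(SequenceMatcher(None, unknown_term, r).ratio(), r)
              for r in ONTOLOGY_RELATIONS[cls]]
    good = [r for score, r in scored if score >= threshold]
    return good or [r for _, r in scored]
```

Even when the term itself is unknown, restricting the search to relations defined for the identified class (here, cities) is often enough to interpret the query.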

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "identifying the semantic type of the entity sought by the question"; (2) "determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.
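Answer-type detection from the wh-word can be sketched as a simple lookup; the type table below is illustrative rather than LASSO's actual hierarchy, and it deliberately keeps "who" ambiguous between person and organization, as discussed later in this section:

```python
# Invented wh-word -> expected answer type table (illustration only).
ANSWER_TYPES = {
    "who": ["person", "organization"],   # deliberately ambiguous
    "where": ["location"],
    "when": ["time"],
}

def classify(question):
    """Return the candidate answer types for a question's wh-word."""
    wh = question.lower().split()[0]
    return ANSWER_TYPES.get(wh, ["unknown"])
```

Real systems refine this with a question focus and hierarchy, since a bare wh-word often underdetermines the answer type.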

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

where point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point: triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships), and in the last resort ambiguity is solved by asking the user.
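TAP's two steps, mapping a term to graph nodes via their labels and then collecting triples in their vicinity, can be sketched over a toy triple store (stdlib only; the graph data below is invented):

```python
# Tiny invented triple store: (subject, predicate, object).
TRIPLES = [
    ("n1", "rdfs:label", "Yo-Yo Ma"),
    ("n1", "rdf:type", "Musician"),
    ("n1", "plays", "cello"),
    ("n2", "rdfs:label", "Chicago"),
]

def denotations(term):
    """Map a search term to candidate nodes via their rdfs:label."""
    return [s for s, p, o in TRIPLES
            if p == "rdfs:label" and o.lower() == term.lower()]

def vicinity(node):
    """Collect the triples in which the node participates."""
    return [t for t in TRIPLES if t[0] == node or t[2] == node]
```

As the quoted passage notes, the lookup may return zero, one, or several denotations; the ambiguous cases are the ones that need popularity measures or user feedback.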

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites. Systems such as START [31], REXTOR [32] and AnswerBus [54] also have the goal of extracting answers from text. START focuses on questions about geography and the MIT infolab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey": for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during its life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
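The quoted short-circuiting rule can be implemented directly over a set of dependency links; the relevance judgments are supplied by the caller, and the representation below is our own simplification:

```python
def short_circuit(edges, relevant):
    """edges: set of (dependent, head) links. Returns the edges plus a
    direct link whenever two relevant nodes are connected only through
    an irrelevant intermediate node (the rule quoted from PiQASso)."""
    extra = set()
    for a, mid in edges:
        if a in relevant and mid not in relevant:
            for m2, b in edges:
                if m2 == mid and b in relevant and b != a:
                    extra.add((a, b))
    return edges | extra
```

On the paper's example, where "1969" depends on "in" and "in" depends on "moon", the rule adds the direct "1969"-"moon" link.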

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the queries generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt" is the same as "Which are the researchers involved in the akt project".
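Grouping surface variants under one triple category can be illustrated with a toy normalization; the synonym table and the kept vocabulary below are invented for illustration:

```python
def triple_category(query):
    """Map surface variants to one normalized triple via simple rewrites
    (synonym table invented for illustration)."""
    syn = {"researchers": "person", "who": "person", "involved": "works"}
    tokens = [syn.get(t, t) for t in query.lower().rstrip("?").split()]
    keep = {"person", "works", "akt"}
    return tuple(t for t in tokens if t in keep)
```

Both example phrasings from the text collapse to the same triple, so they can share one processing strategy.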

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together. At run time, AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
- Who are the researchers in the semantic web research area
  Linguistic triple: 〈person/organization researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
- What are the phd students working for buddy space
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉
- Name the planet stories that are related to akt
  Linguistic triple: 〈planet stories, related to, akt〉
  Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉




Appendix A (Continued)

Wh-unknown term
- Show me the job title of Peter Scott
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉
- What are the projects of Vanessa
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
- Are there any projects about semantic web
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
- Where is the Knowledge Media Institute
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
- Who are the academics
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉
- What is an ontology
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
- Is Liliana a phd student in the dip project
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
- Has Martin Dzbor any interest in ontologies
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
- Does anybody work in semantic web on the dotkom project
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
- Tell me all the planet news written by researchers in akt
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
- Which projects on information extraction are sponsored by epsrc
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
- Is there any publication about magpie in akt
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
- What are the contact details from the KMi academics in compendium
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
- Does anyone has interest in ontologies and is a member of akt
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
- Which academics work in akt or in dotcom
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
- Which KMi academics work in the akt project sponsored by epsrc
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
- Which KMi academics working in the akt project are sponsored by epsrc
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
- Give me the publications from phd students working in hypermedia
  Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
- What researchers who work in compendium have interest in hypermedia
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

NL query: What is the homepage of Peter who has interest in semantic web
Linguistic triple: ⟨which is homepage peter⟩ ⟨person/organization has interest semantic web⟩
Ontology triple: ⟨which is has-web-address peter-scott⟩ ⟨person has-research-interest semantic-web-area⟩

NL query: Which are the projects of academics that are related to the semantic web
Linguistic triple: ⟨which is projects academics⟩ ⟨which is related semantic web⟩
Ontology triple: ⟨project has-project-member academic-staff-member⟩ ⟨academic-staff-member has-research-interest semantic-web-area⟩

a Using the KMi ontology.
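Read as data, each row of the table above pairs one NL query with its intermediate (Query-Triple) and ontology-compliant (Onto-Triple) forms. A minimal sketch of that pairing in Python — the field names are ours for illustration, not AquaLog's:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TripleMapping:
    """One table row: a query category, the Query-Triples produced by the
    Linguistic Component, and the Onto-Triples produced against the ontology."""
    category: str
    nl_query: str
    query_triples: List[Tuple[str, str, str]]
    onto_triples: List[Tuple[str, str, str]]

# The "wh-combination (or)" row from the table, using the KMi ontology.
row = TripleMapping(
    category="wh-combination (or)",
    nl_query="which academics work in akt or in dotcom",
    query_triples=[("academics", "work", "akt"),
                   ("academics", "work", "dotcom")],
    onto_triples=[("academic-staff-member", "has-project-member", "akt"),
                  ("academic-staff-member", "has-project-member", "dot-kom")],
)
```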

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

• AquaLog: An ontology-driven question answering system for organizational semantic intranets
  • Introduction
  • AquaLog in action: illustrative example
  • AquaLog architecture
    • AquaLog triple approach
  • Linguistic component: from questions to query triples
    • Classification of questions
      • Linguistic procedure for basic queries
      • Linguistic procedure for basic queries with clauses (three-terms queries)
      • Linguistic procedure for combination of queries
    • The use of JAPE grammars in AquaLog: illustrative example
  • Relation similarity service: from triples to answers
    • RSS procedure for basic queries
    • RSS procedure for basic queries with clauses (three-term queries)
    • RSS procedure for combination of queries
    • Class similarity service
    • Answer engine
    • Learning mechanism for relations
      • Architecture
      • Context definition
      • Context generalization
      • User profiles and communities
  • Evaluation
    • Correctness of an answer
    • Query answering ability
      • Conclusions and discussion
        • Results
        • Analysis of AquaLog failures
        • Reformulation of questions
        • Performance
    • Portability across ontologies
      • AquaLog assumptions over the ontology
      • Conclusions and discussion
        • Results
        • Analysis of AquaLog failures
        • Similarity failures and reformulation of questions
  • Discussion: limitations and future lines
    • Question types
    • Question complexity
    • Linguistic ambiguities
    • Reasoning mechanisms and services
    • New research lines
  • Related work
    • Closed-domain natural language interfaces
    • Open-domain QA systems
    • Open-domain QA systems using triple representations
    • Ontologies in question answering
    • Conclusions of existing approaches
  • Conclusions
  • Acknowledgements
  • Examples of NL queries and equivalent triples
  • References


Fig. 3. Illustrative example of AquaLog disambiguation.

Fig. 4. The AquaLog global architecture.

…the different components of the AquaLog architectural solution. As a result of this design, our existing version of AquaLog is modular, flexible and scalable.1

The Linguistic Component and the Relation Similarity Service (RSS) are the two central components of AquaLog, both of them portable and independent in nature. Currently AquaLog automatically classifies the query following a linguistic criterion. The linguistic coverage can be extended through the use of regular expressions, by augmenting the patterns covered by an existing classification or creating a new one. This functionality is discussed in Section 4.

A key feature of AquaLog is the use of a plug-in mechanism which allows AquaLog to be configured for different Knowledge Representation (KR) languages. Currently we subscribe to the Operational Conceptual Modeling Language (OCML), using our own OCML-based KR infrastructure [52]. However, in future our aim is also to provide direct plug-in mechanisms for the emerging RDF and OWL servers.2

1 AquaLog is available for developers as Open Source, licensed under the Apache License, Version 2.0: https://sourceforge.net/projects/aqualog.
2 We are able to import and export RDF(S) and OWL from OCML, so the lack of an explicit RDF/OWL plug-in is not a problem in practice.
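The plug-in mechanism just described can be pictured as a small abstract interface that the engine codes against, with one concrete back-end per KR language. This is a hypothetical sketch — the class and method names are ours, not AquaLog's actual API:

```python
from abc import ABC, abstractmethod

class KRPlugin(ABC):
    """Illustrative interface the engine would use, whatever the KR language."""
    @abstractmethod
    def get_instances(self, concept: str) -> list: ...
    @abstractmethod
    def get_relations(self, concept: str) -> list: ...

class OCMLPlugin(KRPlugin):
    """Toy back-end standing in for the OCML infrastructure the paper mentions."""
    def __init__(self, kb: dict):
        self.kb = kb
    def get_instances(self, concept):
        return self.kb.get(concept, {}).get("instances", [])
    def get_relations(self, concept):
        return self.kb.get(concept, {}).get("relations", [])

kb = {"project": {"instances": ["akt", "dot-kom"],
                  "relations": ["has-project-member"]}}
plugin: KRPlugin = OCMLPlugin(kb)
```

An RDF or OWL back-end would implement the same two methods, leaving the rest of the engine untouched.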


When developing AquaLog we extended the set of annotations returned by GATE by identifying noun terms, relations,4 question indicators (which/who/when, etc.) and patterns or types


To reduce the number of calls/requests to the target knowledge base, and to guarantee real-time question answering even when multiple users access the server simultaneously, the AquaLog server accesses and caches basic indexing data from the target Knowledge Bases (KBs) at initialization time. The cached information in the server can be efficiently accessed by the remote clients. In our research department, KMi, since AquaLog is also integrated into the KMi Semantic Portal [35], the KB is dynamically and constantly changing, with new information obtained from KMi websites through different agents. Therefore, a mechanism is provided to update the cached indexing data on the AquaLog server. This mechanism is called by these agents when they update the KB.
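A minimal sketch of the caching scheme just described, with an explicit refresh hook for the agents that update the KB (names are illustrative, not AquaLog's):

```python
class ServerCache:
    """Caches basic indexing data at initialization; agents that modify the
    KB call refresh() so the cached data never goes stale."""
    def __init__(self, kb_loader):
        self._load = kb_loader          # callable returning indexing data
        self.index = self._load()       # cached once, at start-up

    def lookup(self, term):
        return self.index.get(term)

    def refresh(self):                  # called by KB-updating agents
        self.index = self._load()

data = {"akt": "project"}
cache = ServerCache(lambda: dict(data))
data["magpie"] = "project"              # an agent updates the KB...
cache.refresh()                         # ...and notifies the server
```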

As regards portability, the only entry points which require human intervention for adapting AquaLog to a new domain are the configuration files. Through the configuration files it is possible to specify the parameters needed to initialize the AquaLog server. The most important parameters are the ontology name and server login details if necessary, the name of the plug-in, and the slots that correspond to pretty names. Pretty names are alternative names that an instance may have. Optionally, the main concepts of interest in an ontology can be specified.

The other ontology-dependent parameters required in the configuration files are the ones which specify that the terms "who", "where", "when" correspond, for example, to the ontology terms "person/organization", "location" and "time-position", respectively. The independence of the application from the ontology is guaranteed through these parameters.
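The configuration described above could be pictured as a small key-value file; the keys and helper below are illustrative assumptions, not AquaLog's actual file format:

```python
# Hypothetical configuration, mirroring the parameters listed in the text.
aqualog_config = {
    "ontology_name": "kmi-ontology",
    "plugin": "ocml",
    "pretty_name_slots": ["has-pretty-name", "has-variant-name"],
    # wh-word to ontology-term mappings that keep the engine ontology-independent
    "who":   ["person", "organization"],
    "where": ["location"],
    "when":  ["time-position"],
}

def expected_answer_types(wh_word, config):
    """Resolve a question word to its candidate ontology classes."""
    return config.get(wh_word, [])
```

Swapping the ontology then means editing this mapping, not the engine.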

There is no single strategy by which a query can be interpreted, because this process is highly dependent on the type of the query. So in the next sections we will not attempt to give a comprehensive overview of all the possible strategies, but rather we will show a few examples which are illustrative of the way the different AquaLog components work together.

3.1. AquaLog triple approach

Core to the overall architecture is the triple-based data representation approach. A major challenge in the development of the current version of AquaLog is to efficiently deal with complex queries in which there could be more than one or two terms. These terms may take the form of modifiers that change the meaning of other syntactic constituents, and they can be mapped to instances, classes, values or combinations of them, in compliance with the ontology to which they subscribe. Moreover, the wider expressiveness adds another layer of complexity, mainly due to the ambiguous nature of human language. So naturally the question arises of how far the engineering functionality of the triple-based model can be extended to map a triple obtained through a NL query into a triple that can be realized by a given ontology.

To accomplish this task, the triple in the current AquaLog version has been slightly extended. Therefore, an existing linguistic triple now consists of one, two or even three terms connected through a relationship. A query can be translated into one or more linguistic-triples, and then each linguistic triple can be translated into one or more ontology-compliant-triples. Each triple also has additional lexical features in order to facilitate reasoning about the answer, such as the voice and tense of the relation. Another key feature for each triple is its category (see an example table of query types in Appendix A in Section 11). These categories identify different basic structures of the NL query and their equivalent representation in the triple. Depending on the category, the triple tells us how to deal with its elements, what inference process is required and what kind of answer can be expected. For instance, different queries may be represented by triples of the same category, since in natural language there can be different ways of asking the same question, i.e. "who works in akt"3 and "Show me all researchers involved in the akt project". The classification of the triple may be modified during its life cycle, in compliance with the target ontology it subscribes to.
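The extended triple just described can be sketched as a record carrying its terms, an optional relation, lexical features and a mutable category (field names are ours, for illustration only):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LinguisticTriple:
    terms: Tuple[str, ...]           # one, two or even three terms
    relation: Optional[str] = None   # may be implicit or unknown
    voice: Optional[str] = None      # lexical features used for reasoning
    tense: Optional[str] = None
    category: str = "unclassified"   # may be re-classified against the ontology

# "are there any phd students in akt?" -- relation implicit/unknown
t = LinguisticTriple(terms=("phd students", "akt"),
                     category="wh-unknown-relation")
```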

In what follows we provide an overview of each component of the AquaLog architecture.

4. Linguistic component: from questions to query triples

When a query is asked, the Linguistic Component's task is to translate from NL to the triple format used to query the ontology (Query-Triples). This preprocessing step helps towards the accurate classification of the query and its components by using standard research tools and query classification. Classification identifies the type of question, and hence the kind of answer required. In particular, AquaLog uses the GATE [15,49] infrastructure and resources (language resources, processing resources like ANNIE, serial controllers, pipelines, etc.) as part of the Linguistic Component. Communication between AquaLog and GATE takes place through the standard GATE API.

After the execution of the GATE controller, a set of syntactic annotations associated with the input query are returned. In particular, AquaLog makes use of GATE processing resources for English: tokenizer, sentence splitter, POS tagger and VP chunker. The annotations returned after the sequential execution of these resources include information about sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs, and categories for the nouns such as determinant, singular/plural, conjunction, possessive, determiner, preposition, existential, wh-determiner, etc. These features returned for each annotation are important to create the triples; for instance, in the case of verb annotations, together with the voice and tense it also includes information that indicates if it is the main verb of the query (the one that separates the nominal group and the predicate). This information is used by the Linguistic Component to establish the proper attachment of prepositional phrases previous to the creation of the Query-Triples (an example is shown in Section 5.3).
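The sequential execution of processing resources reads naturally as a pipeline of annotation passes. A toy analogue (the real system uses GATE's controller and resources; this only shows the accumulation pattern, and the tagger is deliberately naive):

```python
def tokenise(doc):
    """Split the question into tokens (stand-in for GATE's tokenizer)."""
    doc["tokens"] = doc["text"].split()
    return doc

def pos_tag(doc):
    """Naive tagger: a few known verbs, everything else tagged as a noun."""
    verbs = {"works", "is", "are"}
    doc["pos"] = ["VB" if t in verbs else "NN" for t in doc["tokens"]]
    return doc

def run_pipeline(text, resources):
    doc = {"text": text}
    for resource in resources:   # resources run in order, like a GATE controller
        doc = resource(doc)
    return doc

doc = run_pipeline("who works in akt", [tokenise, pos_tag])
```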

3 AKT is an EPSRC-funded project in which the Open University is one of the partners: http://www.aktors.org/akt.

4 Hence, for example, we exploited the fact that natural language commonly uses a preposition to express a relationship. For instance, in the query "who has


Fig. 5. Example of GATE annotations and linguistic triples for basic queries.


Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.

of questions. This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular-expression-based rule language offered by GATE5 (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars' power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.6 Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.
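The idea of matching annotation sequences deterministically with regular expressions can be mimicked by encoding each token's tag as one character and running an ordinary regex over the resulting string — a rough analogue of a JAPE noun-phrase rule, not its actual syntax:

```python
import re

# One character per tag so a standard regex can scan the annotation sequence.
POS_CODES = {"DET": "d", "ADJ": "a", "NOUN": "n", "VERB": "v", "WH": "w"}

def encode(tags):
    return "".join(POS_CODES[t] for t in tags)

# Roughly in the spirit of RuleNP: optional determiners, then adjectives or
# nouns in any order, mandatorily finishing with a noun.
NP_PATTERN = re.compile(r"d*[an]*n")

np = encode(["DET", "ADJ", "NOUN"])      # e.g. "the semantic web"
not_np = encode(["DET"])                 # a bare determiner is not an NP
```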

Currently the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types or intermediate representations (examples can be seen in

research interest in semantic services in KMi", the relation of the query is a combination of the verb "has" and the noun "research interest".

5 http://gate.ac.uk/sale/tao/index.html.
6 LC binaries and JAPE grammars can be downloaded: http://kmi.open.ac.uk/technologies/aqualog.


Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question, like "are there any phd students in dotkom", where the relation is implicit or unknown, or "which is the job title of John", where no information about the type of the expected answer is provided, etc.

Categories not only tell us the kind of solution that needs to be achieved, but also they give an indication of the most likely common problems that the Linguistic Component and Relation Similarity Service will need to deal with to understand this particular NL query, and in consequence it guides the process of creating the equivalent intermediate representation or Query-Triple. For instance, in the previous example, "what are the research areas covered by the akt project", thanks to the category the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them, and generates the following Query-Triple: ⟨research areas covered akt project⟩, in which "research areas" has correctly been identified as the query term (instead of "what").
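The triple-creation step just described can be caricatured in a few lines — given pre-sorted annotations, drop the wh-word and assemble the Query-Triple (the real logic is category-driven and far richer):

```python
def build_query_triple(annotations):
    """Toy assembly of a Query-Triple from noun and relation annotations,
    for a wh-generic-term query; field names are illustrative."""
    nouns = annotations["nouns"]        # wh-word already filtered out
    relation = annotations["relation"]
    return (nouns[0], relation, nouns[1])

# "what are the research areas covered by the akt project"
ann = {"nouns": ["research areas", "akt project"], "relation": "covered"}
triple = build_query_triple(ann)
```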

For the intermediate representation we use the triple-based data model rather than logic, mainly because at this stage we do


Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate input for the RSS. An alternative would be to use more complex linguistic parsing, but as discussed by Katz et al. [33,29], although parse trees (as for example the Stanford NLP parser [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We are dealing with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that at this stage all the terms are still strings, or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL-query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries
There are several different types of basic queries. We can mention the affirmative/negative query type, which are those queries that require an affirmation or negation as an answer, i.e. "is John Domingue a phd student?", "is Motta working in ibm?". Another big set of queries are those constituted by a "wh-question", such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also imperative commands like list, give, tell, name, etc. are treated as wh-queries (see an example table of query types in Appendix A in Section 11).


In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in "who are the academics involved in the semantic web?" the generic triple will be of the form ⟨generic term, relation, second term⟩, in our example ⟨academics involved semantic web⟩. A query with an equivalent triple representation is "which technologies has KMi produced?", where the triple will be ⟨technologies has produced KMi⟩. However, a query like "are there any Phd students in akt?" has another equivalent representation, where the relation is implicit or unknown: ⟨phd students akt⟩. Other queries may provide little or no information about the type of the expected answer, e.g. "what is the job title of John?", or they can be just a generic enquiry about someone or something, e.g. "who is Vanessa?", "what is an ontology?" (see Fig. 5).
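A toy version of this classification (the patterns are deliberately simplified; AquaLog's real classification is done by the JAPE grammars of Section 4.2):

```python
def classify_basic_query(q: str) -> str:
    """Illustrative surface-pattern classifier for basic query types."""
    q = q.lower().strip("?").strip()
    if q.startswith(("is ", "are ", "does ", "did ")):
        return "affirmative/negative"
    wh_starts = ("what", "who", "when", "where", "which", "how many")
    imperative_starts = ("list ", "give ", "tell ", "name ")
    if q.startswith(wh_starts) or q.startswith(imperative_starts):
        return "wh-query"
    return "unknown"

examples = {
    "is John Domingue a phd student?": classify_basic_query("is John Domingue a phd student?"),
    "list all the projects in KMi": classify_basic_query("list all the projects in KMi"),
}
```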

4.1.2. Linguistic procedure for basic queries with clauses (three-terms queries)
Let us consider the request "List all the projects in the knowledge media institute about the semantic web", where both "in knowledge media institute" and "about semantic web" are modifiers (i.e. they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, through the use of the ontology or in an interactive way by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in "are there any planet news written by researchers from akt", we get the terms "planet news", "researchers" and "akt" (see Fig. 6).

The resultant Query-Triple is the input for the Relation Similarity Service that processes the basic queries with clauses, as explained in Section 5.2.


4.1.3. Linguistic procedure for combination of queries
Nevertheless, a query can be a composition of two explicit relationships between terms, or a query can be a composition of two basic queries. In this case, the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt", the two linguistic triples will be ⟨planet news written researchers⟩ and ⟨which are working akt⟩.

Not only are these complex queries classified depending on the kind of triples they generate, but also the resultant triple categories are the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web". This query will generate two Query-Triples, ⟨funded (projects, epsrc)⟩ and ⟨about (projects, semantic web)⟩, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned to a second query, as in "which researchers wrote publications related to social aspects", where the second clause modifies one of the previous terms. In this example, and at this stage, ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. "what is the web address of Peter who works for akt", or "which are the projects led by Enrico which are related to multimedia technologies" (see Fig. 7).
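A natural reading of the "and"/"or" combination above is set intersection versus set union over the per-triple answer lists — a sketch of the idea, not AquaLog's exact procedure (the project names below are made up):

```python
# Answers obtained after resolving each triple separately.
funded_by_epsrc = {"akt", "dot-kom", "x-media"}      # hypothetical answer sets
about_semantic_web = {"akt", "magpie"}

# "and": both conditions must hold; "or": either condition suffices.
and_answer = funded_by_epsrc & about_semantic_web
or_answer = funded_by_epsrc | about_semantic_web
```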

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do", "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping of Query-Triples into the ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service that processes the combinations of queries, as explained in Section 5.3.

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

4.2. The use of JAPE grammars in AquaLog: illustrative example

Consider the basic query in Fig. 5, "what are the research areas covered by the akt project", as an illustrative example in the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "researchers" and "akt projects", relations "are" and "covered by", and query terms "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e. before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the (((RB)ADJ)|NOUN|CC) part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, for example the macro LIST used by the rule RuleQU1, or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g. the last pattern identifies words assigned to the VB




Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags. POS tags are assigned in the "category" feature of a "Token" annotation, used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side; e.g. rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic-term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QU-Which IS-ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QU-Which, IS-ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
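The shape of the RuleNP left-hand side can be sketched, outside GATE, as an ordinary regular expression over coarse POS-tag sequences. This is a minimal illustration under our reading of the rule (Kleene stars over DET and the adverb/adjective/noun/conjunction group), not AquaLog's actual JAPE code; the tag names are assumptions.

```python
import re

# Approximation of RuleNP as a regex over POS tags: zero or more determiners,
# then any mix of (adverbs* + adjective), nouns or coordinating conjunctions,
# ending mandatorily with a noun. Tag names are illustrative, not GATE's.
NP_PATTERN = re.compile(r"(DET\s)*((RB\s)*ADJ\s|NOUN\s|CC\s)*NOUN")

def is_noun_phrase(pos_tags):
    """True if the whole POS-tag sequence matches the noun-phrase pattern."""
    return NP_PATTERN.fullmatch(" ".join(pos_tags)) is not None
```

For instance, `is_noun_phrase(["DET", "NOUN", "NOUN"])` accepts a sequence like "the akt project", while `["DET", "ADJ"]` is rejected because the phrase does not end with a noun.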

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 6.



An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user term. Moreover, relation names can be considered as "second class citizens": the rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts. Labels in the case of concepts often capture the meaning of the semantic entity, while for relations the meaning is given by the type of the domain and the range rather than by the name (typically vaguely defined, e.g., "related to"), with the exception of the cases in which relations are represented in some ontologies as a concept (e.g., has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g., in the query "is Enrico working in ibm?",


where "Enrico" is mapped into "Enrico-motta" in the KB but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e., which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.


Fig. 10. Example of AquaLog in action for basic generic-type queries.


5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let's consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type one (a wh-query represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. Then the first step for the RSS is to identify that "research areas" is actually a "research area" in the target KB and that "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses", and it therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, let's consider a case in which we are talking about a generic term "person": the possible relation could be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".
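The relation-finding step can be sketched as a search over the relations whose domain and range cover the two mapped terms or their ancestors. The miniature hierarchy and relation table below are invented for illustration; AquaLog obtains this information from the target KB through its plug-in mechanism.

```python
# Toy class hierarchy (child -> parent) and relations as (name, domain, range).
# All names are invented examples, not the actual AKT ontology.
PARENT = {
    "akt": "project",
    "project": "activity",
    "research-area": "generic-area-of-interest",
}
RELATIONS = [
    ("addresses-generic-area-of-interest", "project", "generic-area-of-interest"),
    ("uses-resource", "project", "resource"),
]

def with_ancestors(cls):
    """The class itself plus all its superclasses."""
    out = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        out.append(cls)
    return out

def candidate_relations(term, other_term):
    """Relations linking other_term (or an ancestor) to term (or an ancestor)."""
    domains = set(with_ancestors(other_term))
    ranges = set(with_ancestors(term))
    return [name for name, d, r in RELATIONS if d in domains and r in ranges]
```

For the example triple, `candidate_relations("research-area", "akt")` keeps "addresses-generic-area-of-interest" and discards "uses-resource"; WordNet synonymy between "covered" and "addresses" would then select the final relation.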

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further




Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, eventual aliases, or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to


Fig. 12. Example of AquaLog in action for relations formed by a concept.


a "person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic" or "students") to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore, the translation process from lexical to ontological triples can go ahead without misinterpretations.



Note that some additional lexical features which help in the translation, or put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of the basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say that, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and of the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate the term in the sentence to which the last part of the sentence, "related to


Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.


natural language" (and "in akt" in the previous examples), refers.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case the class "project"). Hence, the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of the list of projects related to the semantic web and the list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, first it is necessary to get the list of researchers related to akt and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any projects on the semantic web sponsored by epsrc?") and Fig. 14 ("Are there any projects by researchers working in AKT?"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e., "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e., "researchers".
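The attachment heuristic can be sketched as a right-to-left scan that attaches the trailing modifier to the closest term mapping to a class (non-ground term) rather than an instance. The lookup table below stands in for a real KB check and is purely illustrative.

```python
# Sketch of the modifier-attachment heuristic: the trailing clause is attached
# to the closest preceding non-ground term (a class, not an instance).
def attach_modifier(terms, is_class):
    """Index of the query term the trailing modifier should attach to."""
    for i in range(len(terms) - 1, -1, -1):
        if is_class.get(terms[i], False):
            return i
    return 0  # fallback: attach to the first term

# "are there any projects on the semantic web sponsored by epsrc"
# -> "sponsored by epsrc" attaches to the class "project", not "semantic web"
kb_is_class = {"project": True, "semantic web": False, "researchers": True}
```

For "which planet news stories are written by researchers in akt", the same scan picks "researchers" as the attachment point for "in akt", since it is the closest non-ground term.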

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.




Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of query, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology, or if needed by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project" and therefore not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation term "researchers" is mapped into a "concept" in the ontology,


Fig. 15. Example of context disambiguation.


which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g., in the query "which academic works with Peter who has an interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous, even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on String Distance Metrics for Name-Matching Tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concepts, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its coverage of nominal compounds (e.g., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually, or it can be built through a learning mechanism (a simplified approach similar to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate if one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there would be no ambiguity if the user reformulated the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
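As a concrete illustration of the string-matching layer, the following is a self-contained Jaro similarity (one of the Jaro-based metrics AquaLog mainly relies on) with a threshold-based label lookup. The 0.85 threshold is an illustrative assumption; AquaLog's actual thresholds depend on whether it is matching concepts, relations or instances, and it uses the implementations from the API of Ref. [13] rather than this sketch.

```python
def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(len(s), len(t)) // 2 - 1
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):  # count characters matching within the window
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    matched_s = [s[i] for i in range(len(s)) if s_hit[i]]
    matched_t = [t[j] for j in range(len(t)) if t_hit[j]]
    transpositions = sum(a != b for a, b in zip(matched_s, matched_t)) // 2
    m = matches
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3

def best_label(term, labels, threshold=0.85):
    """Closest ontology label, or None if nothing clears the threshold.
    The threshold value is an assumption for illustration."""
    best = max(labels, key=lambda lab: jaro(term, lab))
    return best if jaro(term, best) >= threshold else None
```

For example, `best_label("researcher", ["researchers", "publication"])` picks "researchers", while a term with a dissimilar label (the "researcher"/"academic" case above) falls below any sensible threshold, which is exactly where the lexicon-learning fallback is needed.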

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?", the result will be a fusion of the instances of people who have an interest in ontologies and the people who have an interest in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by epsrc?", the second triple, 〈akt project, sponsored, epsrc〉, must be resolved and the instance representing the "akt project sponsored by epsrc" identified, to get the list of academics required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?", the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈?, homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally express with a number of different phrasings: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is



normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets), instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Fig. 17. The learning mechanism architecture.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for matching results. Subsequently, these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, either because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method is called we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (the latter scans the knowledge base for results); if, instead, the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is the opposite of the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (the GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (the CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly select "works-in-the-same-project".

The problem, therefore, is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,


these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology, and the user details.

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the edinburgh department of informatics?" we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by ⟨people⟩ and ⟨organization-unit⟩, while in the second case the context will be ⟨organization⟩ and ⟨organization-unit⟩. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology, and the user details.
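The stored record and its reuse during matching might be pictured like this; the field names and the subclass test are our own simplification, not AquaLog's schema:

```python
# Illustrative shape of a learned-lexicon entry (field names are ours):
# the new word is the key used during matching; the context is the pair of
# generalized argument classes; ontology name and user details complete it.
from dataclasses import dataclass

@dataclass(frozen=True)
class LearnedMapping:
    word: str                 # user jargon, e.g. "working"
    ontology_relation: str    # e.g. "has-research-member"
    context: tuple            # (arg1 class, arg2 class), e.g. ("people", "project")
    ontology: str             # name of the ontology the mapping belongs to
    user: str                 # details of the user who confirmed the mapping

lexicon = {}

def store(entry: LearnedMapping):
    lexicon.setdefault(entry.word, []).append(entry)

def lookup(word, arg1, arg2, falls_under):
    """Reuse a mapping when both query arguments fall under the stored context."""
    return [e for e in lexicon.get(word, ())
            if falls_under(arg1, e.context[0]) and falls_under(arg2, e.context[1])]

store(LearnedMapping("working", "has-research-member",
                     ("people", "project"), "akt", "user-1"))

# "who are the people working in buddyspace?" matches because "buddyspace"
# falls under "project" in the KB (the falls_under test is supplied by the caller):
hits = lookup("working", "people", "buddyspace",
              lambda a, b: a == b or (a, b) in {("buddyspace", "project")})
print(hits[0].ontology_relation)   # -> has-research-member
```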

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?" we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. Users disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own: it relies on the RSS classification. Namely, the question is first recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing since it can be mapped to both a person or an organization, or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field,



Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it could extend the same mappings to all the other registered PhD students.
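The role of the user field in the version space can be illustrated with a toy lookup; the data layout below is hypothetical, not AquaLog's:

```python
# Sketch (our own simplification) of user-scoped retrieval from the learned
# lexicon: a logged-in user's session only sees mappings tied to their profile,
# while the generic user queries with no constraint on the user field and may
# therefore retrieve jargon meant by other users.

mappings = [
    {"word": "collaborate", "relation": "has-affiliation-to-unit", "user": "user-a"},
    {"word": "collaborate", "relation": "works-in-the-same-project", "user": "user-b"},
]

def retrieve(word, user=None):
    """user=None models the generic user: no filter on the user field."""
    return [m["relation"] for m in mappings
            if m["word"] == word and (user is None or m["user"] == user)]

print(retrieve("collaborate", user="user-a"))  # one meaning, scoped to the session
print(retrieve("collaborate"))                 # generic user: both meanings returned
```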


Fig. 20. Architecture to support a users' community.


7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability





across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user


can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal.11 Therefore the knowledge base is dynamically populated and constantly changing, and as a result a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components; i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e. approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage.

10 http://kmi.open.ac.uk/projects/akt/ref-onto.
11 http://semanticweb.kmi.open.ac.uk.


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple; if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure
  Total number of successful queries                              (40 of 69)  58% (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure   (33 of 69)  47.82%
  Total number of queries with only conceptual failure            (7 of 69)   10.14%

AquaLog failures
  Total number of failures (same query can fail for more than one reason)  (29 of 69)  42%
  RSS failure         (8 of 69)   11.59%
  Service failure     (10 of 69)  14.49%
  Data model failure  (0 of 69)   0%
  Linguistic failure  (16 of 69)  23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated  (19 of 29)  65.5% of failures
  With conceptual failure                      (7 of 19)   36.84%


  No conceptual failure but not populated      (6 of 19)   31.57%

Bold values represent the final results (%).

Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem; in fact we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g. in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE (e.g. "present results") and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g. "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g. terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning; e.g. in "what is John Domingue working on?" we map into the triple ⟨organization, works-for, john-domingue⟩ instead of ⟨research-area, have-interest, john-domingue⟩, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total user iterations          (43 iterations)     (40 iterations)                     (16 iterations)



services have been identified: ranking, similarity/comparative, and proportions/percentages; these remain a future line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%) even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.



Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small. Logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
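The per-column iteration totals reported with Table 2 can be cross-checked from the row counts; the 3-iterations entry for the first LM run, garbled in the source, is reconstructed as 2 of 45, consistent with each column summing to 45 queries:

```python
# Cross-check of Table 2: queries needing k user iterations contribute
# k iterations each, so each column's total iterations follows from its rows.
columns = {
    "without LM":          {0: 17, 1: 16, 2: 9, 3: 3},
    "LM first iteration":  {0: 18, 1: 16, 2: 9, 3: 2},
    "LM second iteration": {0: 29, 1: 16, 2: 0, 3: 0},
}

for name, rows in columns.items():
    assert sum(rows.values()) == 45                     # 45 queries per run
    print(name, sum(k * n for k, n in rows.items()))    # total user iterations
```

The printed totals reproduce the (43), (40) and (16) iterations reported in the table.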

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is no direct relation between wines and desserts: there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location, and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto.


7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food, a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software-evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query languages SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com.

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology: for AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g. see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink riesling)))




Fig. 22. OWL example of the wine and food ontology.

not present a complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions

(without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are avoided by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple; if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure
  Total number of successful queries                              (47 of 68)  69.11%
  Total number of successful queries without conceptual failure   (12 of 68)  17.64%
  Total number of queries with only conceptual failure            (35 of 68)  51.47%

AquaLog failures
  Total number of failures (same query can fail for more than one reason)  (21 of 68)  30.88%
  RSS failure         (15 of 68)  22%
  Service failure     (0 of 68)   0%
  Data model failure  (0 of 68)   0%
  Linguistic failure  (9 of 68)   13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated  (14 of 21)  66.66% of failures
  With conceptual failure                      (9 of 14)   64.28%

Bold values represent the final results (%).



7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

- Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
- Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
- RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
- Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

3.2.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of possible extensions needed to the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". Another example is the basic generic query "what should we drink with a starter of seafood?", where AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that hold little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]?".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is no direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
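The kind of expansion needed here, inserting a mediating term such as "grant", amounts to a path search over the schema of the ontology. The following is a minimal sketch under assumed names: the toy schema, the relation names and the `expand_query` helper are illustrative, not AquaLog's actual implementation:

```python
from collections import deque

# Toy relation graph as (subject class, relation, object class) triples.
# The schema and relation names are illustrative, not AquaLog's real data.
SCHEMA = [
    ("person", "holds", "grant"),
    ("grant", "funded-by", "organization"),
    ("person", "works-in", "unit"),
]

def expand_query(src, dst, schema=SCHEMA):
    """Breadth-first search for a chain of triples linking two classes
    that have no direct relation, inserting mediating concepts."""
    edges = {}
    for s, r, o in schema:              # treat relations as undirected links
        edges.setdefault(s, []).append((r, o))
        edges.setdefault(o, []).append((r, s))
    queue, seen = deque([(src, [])]), {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path                 # list of (class, relation, class) hops
        for rel, nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

For "who collaborates with IBM?" this yields the two-triple bridge person–holds–grant, grant–funded-by–organization, i.e. the query expanded with the additional term "grant".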

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology this may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.
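The asymmetry between the two formulations can be seen with a toy knowledge base (the dictionary layout and the helper names are illustrative only, not AquaLog's internals):

```python
# Toy KB: instance -> {relation: value}. The layout is illustrative only;
# instance names follow the example in the text.
KB = {
    "Enrico-Motta": {"has-job-title": "KMi director"},
    "Vanessa-Lopez": {"has-job-title": "research student"},
}

def answer_direct(instance, relation, kb=KB):
    """'what is the job title of Enrico?' is a single direct lookup."""
    return kb.get(instance, {}).get(relation)

def answer_by_string(value, kb=KB):
    """'who is the director in KMi?' would force a scan of every string
    value of every instance -- the costly case AquaLog avoids."""
    return [inst for inst, rels in kb.items()
            if any(value.lower() in str(v).lower() for v in rels.values())]
```

The direct lookup is constant-time; the string scan grows with the whole knowledge base, which is why AquaLog does not attempt it at query time.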

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with peter who has interests in the semantic web?" and "which academics work with peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods based on the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
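A minimal form of this agreement heuristic could look as follows (a sketch only; the verb tables are tiny illustrative stand-ins for a real morphological lexicon, and the function name is our own):

```python
# Sketch of the person/number heuristic discussed above. The verb tables
# are illustrative; a real system would use a morphological lexicon.
THIRD_SINGULAR = {"has", "works", "is"}
PLURAL = {"have", "work", "are"}

def attach_clause(subject_is_plural, clause_verb):
    """Decide whether a subordinate clause ('who has interests ...')
    attaches to the plural head noun or to the singular proper noun.
    Returns 'proper-noun', 'head-noun', or 'ambiguous'."""
    if clause_verb in THIRD_SINGULAR and subject_is_plural:
        return "proper-noun"   # "academics ... who has ..." -> Peter
    if clause_verb in PLURAL and not subject_is_plural:
        return "head-noun"     # "academic ... who have ..." -> academics
    return "ambiguous"         # numbers agree: cannot decide syntactically
```

When subject and verb agree in number, the heuristic returns "ambiguous" and the system would still need to fall back to user interaction, exactly as in the first example sentence.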


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called, it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts depending on the users' community.
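Such community-scoped mappings can be sketched as a lookup keyed by both community and jargon phrase (all names below are illustrative assumptions, not AquaLog's stored format):

```python
# Sketch of community-scoped jargon mappings. Community labels, phrases
# and relation names are illustrative only.
JARGON = {
    ("phd-student", "cool projects"):
        ["project", "has-research-area", "research-area", "has-publication"],
    ("professor", "cool projects"):
        ["project", "has-funding-opportunity"],
}

def interpret(phrase, community, jargon=JARGON):
    """Return the learnt ontology mapping for this community's jargon,
    or None so the system can fall back to asking the user."""
    return jargon.get((community, phrase))
```

Keying on the community means the same phrase resolves differently for a PhD student and a professor, while an unknown community triggers the interactive fallback.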

We also need fallback options to provide some answer when the more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query; in the SW we envision a scenario where the user has to interact with several large ontologies, therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation of the user's queries, identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
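The last requirement, deciding whether two instances from different sources denote the same individual, can be approximated in a first cut by normalised string comparison (a sketch only; the threshold is arbitrary and a real matcher would also exploit ontological context such as shared relations):

```python
from difflib import SequenceMatcher

def same_individual(label_a, label_b, threshold=0.85):
    """Crude instance-matching sketch: normalise the labels and compare
    them with a string-similarity ratio. Threshold is an assumption."""
    def norm(s):
        # lowercase, unify separators, collapse whitespace
        return " ".join(s.lower().replace("-", " ").replace("_", " ").split())
    return SequenceMatcher(None, norm(label_a), norm(label_b)).ratio() >= threshold
```

This catches trivial variants such as "Enrico Motta" vs. "enrico-motta", but of course cannot distinguish two different people who share a name, which is why ontological context is needed in practice.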

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
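Such a fallback could be sketched as follows (the pattern and the gazetteer are illustrative assumptions; AquaLog would search the ontology for a path between the recognised terms rather than consult a hard-coded table):

```python
import re

# Illustrative gazetteer; a real system would query the ontology instead.
CAPITALS = {"italy": "Rome", "france": "Paris"}

# Ignore any unrecognised sentence structure around the content terms.
PATTERN = re.compile(r"capital of (\w+)", re.IGNORECASE)

def fallback_answer(question):
    """Pattern-matching fallback: skip politeness wrapping such as
    'Could you please tell me ...' and keep only 'capital of <country>'."""
    m = PATTERN.search(question)
    if m:
        return CAPITALS.get(m.group(1).lower())
    return None
```

The same rule answers "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?", which is exactly the strength, and the shallowness, of the technique.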

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in a XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the US National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE recognition is necessary but not complete in answering questions, because it by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
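The grouping idea can be sketched as follows, assuming a hand-built linguistic triple for each question (parsing is out of scope here). The Triple shape and category names loosely follow the appendix examples; this is illustrative, not AquaLog's internal code:

```python
# Minimal sketch: linguistically different questions that reduce to the
# same kind of triple fall into one category and receive the same
# processing. Triples here are hand-built; category names are illustrative.

from collections import namedtuple

# generic term, relation (None when the query names no relation), second term
Triple = namedtuple("Triple", "generic_term relation second_term")

def category(t):
    """Classify by triple structure, not by the wording of the question."""
    return "wh-generic" if t.relation is not None else "wh-unknown-relation"

# "Who works in akt?" and "Which are the researchers involved in the akt
# project?" are different surface forms but yield equivalent triples.
q1 = Triple("researchers", "work", "akt")
q2 = Triple("researchers", "work", "akt")
q3 = Triple("projects", None, "semantic web")  # "Are there any projects about semantic web?"

assert category(q1) == category(q2) == "wh-generic"
assert category(q3) == "wh-unknown-relation"
```

The point of the sketch is that the category, and hence the processing strategy, is read off the structure of the triple rather than off the wh-word.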

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."
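The denotation step described above can be pictured with a toy in-memory triple store; everything here (node names, labels, the store itself) is invented for illustration and is not TAP's implementation:

```python
# Toy sketch of the TAP denotation step under simplifying assumptions: an
# in-memory list of triples, a search term mapped to candidate nodes via
# rdfs:label, and the triples in the vicinity of the chosen node collected
# as a starting point. Node names and labels are invented.

triples = [
    ("ex:guernsey", "rdfs:label", "Guernsey"),
    ("ex:guernsey", "rdf:type", "ex:Island"),
    ("ex:guernsey", "ex:official-language", "ex:french"),
    ("ex:french", "rdfs:label", "French"),
]

def denotations(term):
    """All nodes whose rdfs:label matches the term (may be 0, 1 or more)."""
    return [s for s, p, o in triples
            if p == "rdfs:label" and o.lower() == term.lower()]

def vicinity(node):
    """Triples in which the node occurs as subject or object."""
    return [t for t in triples if node in (t[0], t[2])]

nodes = denotations("guernsey")
print(nodes)                    # → ['ex:guernsey']
print(len(vicinity(nodes[0])))  # → 3
```

A real system would rank multiple denotations (by popularity, profile, or user choice, as described above) before collecting the vicinity.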

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
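Both FALCON's cached answers and AquaLog's learning mechanism, discussed above, hinge on deciding whether a new question is a reformulation of one already answered. A minimal sketch, using Python's difflib ratio and an arbitrary 0.8 threshold as stand-ins for the systems' actual similarity measures (the cached question/answer pair is invented):

```python
# Hedged sketch of the cached-answer idea: before fully processing a new
# question, compare it with previously answered ones and reuse the stored
# answer if it looks like a reformulation. difflib's ratio and the 0.8
# threshold are illustrative stand-ins, not FALCON's or AquaLog's measure.

from difflib import SequenceMatcher

cache = {"what do penguins eat": "food"}

def answer(question, threshold=0.8):
    q = question.lower().strip("? ")
    for cached_q, cached_a in cache.items():
        if SequenceMatcher(None, q, cached_q).ratio() >= threshold:
            return cached_a  # treat as a reformulation: reuse the answer
    return None              # unseen question: fall through to full processing

print(answer("What do penguins eat?"))  # → food
print(answer("Who founded KMi?"))       # → None
```

In AquaLog the comparison is richer than plain string similarity, since the context stored in the triple is taken into account.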

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the


web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, unlike systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey": for START the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for a discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from the wh-word: "who", "when", "where". In other cases, additional information is needed. For example, with "how" questions, the


category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
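That link-adding rule can be sketched directly on a set of dependency edges. The relevance test below (a small closed-class stopword set) is a stand-in for PiQASso's own notion of relevance:

```python
# Hedged sketch of the quoted rule: in a dependency graph, whenever two
# relevant nodes are connected only through an irrelevant one, a direct
# edge is added between them. The IRRELEVANT set is an assumption.

IRRELEVANT = {"in", "on", "of", "the"}

def short_circuit(edges):
    """edges: set of (head, dependent) pairs; returns the augmented set."""
    extra = set()
    for head, mid in edges:
        if mid in IRRELEVANT and head not in IRRELEVANT:
            for mid2, dep in edges:
                if mid2 == mid and dep not in IRRELEVANT:
                    extra.add((head, dep))  # bypass the irrelevant node
    return edges | extra

# "man first walked on the moon in 1969": "1969" depends on "in",
# which in turn depends on "moon" (the example from Ref. [5]).
deps = {("walked", "man"), ("walked", "moon"), ("moon", "in"), ("in", "1969")}
print(short_circuit(deps) - deps)  # → {('moon', '1969')}
```

The added ("moon", "1969") edge is exactly the direct link the relation matching filter needs.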

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose: it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge



and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries, where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
  Query: Who are the researchers in the semantic web research area?
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple(a): 〈researcher, has-research-interest, semantic-web-area〉

  Query: What are the phd students working for buddy space?
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

  Query: Name the planet stories that are related to akt.
  Linguistic triple: 〈planet stories, related to, akt〉
  Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term
  Query: Show me the job title of Peter Scott.
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉

  Query: What are the projects of Vanessa?
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
  Query: Are there any projects about semantic web?
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  Query: Where is the Knowledge Media Institute?
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
  Query: Who are the academics?
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉

  Query: What is an ontology?
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
  Query: Is Liliana a phd student in the dip project?
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  Query: Has Martin Dzbor any interest in ontologies?
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
  Query: Does anybody work in semantic web on the dotkom project?
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member/has-project-leader, dot-kom〉

  Query: Tell me all the planet news written by researchers in akt.
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
  Query: Which projects on information extraction are sponsored by epsrc?
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
  Query: Is there any publication about magpie in akt?
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
  Query: What are the contact details from the KMi academics in compendium?
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
  Query: Does anyone has interest in ontologies and is a member of akt?
  Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
  Query: Which academics work in akt or in dotcom?
  Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
  Query: Which KMi academics work in the akt project sponsored by epsrc?
  Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉

  Query: Which KMi academics working in the akt project are sponsored by epsrc?
  Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  Query: Give me the publications from phd students working in hypermedia.
  Linguistic triples: 〈which is, publications, phd students〉, 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉, 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
  Query: What researchers who work in compendium have interest in hypermedia?
  Linguistic triples: 〈researchers, work, compendium〉, 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
  Query: What is the homepage of Peter who has interest in semantic web?
  Linguistic triples: 〈which is, homepage, peter〉, 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉, 〈person, has-research-interest, semantic-web-area〉

  Query: Which are the projects of academics that are related to the semantic web?
  Linguistic triples: 〈which is, projects, academics〉, 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉, 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology httpkmiopenacukprojectsaktref-ontoindexhtml

[2] I Androutsopoulos GD Ritchie P Thanisch MASQUESQLmdashan effi-cient and portable natural language query interface for relational databasesin PW Chung G Lovegrove M Ali (Eds) Proceedings of the 6thInternational Conference on Industrial and Engineering Applications ofArtificial Intelligence and Expert Systems Edinburgh UK Gordon andBreach Publishers 1993 pp 327ndash330

[3] I Androutsopoulos GD Ritchie P Thanisch Natural language interfacesto databasesmdashan introduction Nat Lang Eng 1 (1) (1995) 29ndash81

[4] AskJeeves AskJeeves httpwwwaskcouk[5] G Attardi A Cisternino F Formica M Simi A Tommasi C Zavat-

tari PIQASso PIsa question answering system in Proceedings of thetext Retrieval Conferemce (Trec-10) 599-607 NIST Gaithersburg MDNovember 13ndash16 2001

[6] R Basili DH Hansen P Paggio MT Pazienza FM Zanzotto Ontolog-ical resources and question data in answering in Workshop on Pragmaticsof Question Answering held jointly with NAACL Boston MA May 2004

[7] T Berners-Lee J Hendler O Lassila The semantic web Sci Am 284 (5)(2001)

[8] J Burger C Cardie V Chaudhri et al Tasks and Program Structuresto Roadmap Research in Question amp Answering (QampA) NIST Tech-nical Report 2001 httpwwwaimitedupeoplejimmylin0ApapersBurger00-Roadmappdf

[9] RD Burke KJ Hammond V Kulyukin Question answering fromfrequently-asked question files experiences with the FAQ finder systemTech Rep TR-97-05 Department of Computer Science University ofChicago 1997

10] J Chu-Carroll D Ferrucci J Prager C Welty Hybridization in questionanswering systems in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

12] P Clark J Thompson B Porter A knowledge-based approach to question-answering in In the AAAI Fall Symposium on Question-AnsweringSystems CA AAAI 1999 pp 43ndash51

13] WW Cohen P Ravikumar SE Fienberg A comparison of string distancemetrics for name-matching tasks in IIWeb Workshop 2003 httpwww-2cscmuedusimwcohenpostscriptijcai-ws-2003pdf

14] A Copestake KS Jones Natural language interfaces to databases KnowlEng Rev 5 (4) (1990) 225ndash249

15] H Cunningham D Maynard K Bontcheva V Tablan GATE a frame-work and graphical development environment for robust NLP tools and

applications in Proceedings of the 40th Anniversary Meeting of the Asso-ciation for Computational Linguistics (ACLrsquo02) Philadelphia 2002

16] De Boni M TREC 9 QA track overview17] AN De Roeck CJ Fox BGT Lowden R Turner B Walls A natu-

ral language system based on formal semantics in Proceedings of the

[

[

Agents on the World Wide Web 5 (2007) 72ndash105

International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: The 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie – towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon – boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).


[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.


[49] V. Tablan, D. Maynard, K. Bontcheva, GATE – A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


To reduce the number of calls/requests to the target knowledge base, and to guarantee real-time question answering even when multiple users access the server simultaneously, the AquaLog server accesses and caches basic indexing data from the target Knowledge Bases (KBs) at initialization time. The cached information in the server can be efficiently accessed by the remote clients. In our research department, KMi, since AquaLog is also integrated into the KMi Semantic Portal [35], the KB is dynamically and constantly changing, with new information obtained from KMi websites through different agents. Therefore, a mechanism is provided to update the cached indexing data on the AquaLog server. This mechanism is called by these agents when they update the KB.

As regards portability, the only entry points which require human intervention for adapting AquaLog to a new domain are the configuration files. Through the configuration files it is possible to specify the parameters needed to initialize the AquaLog server. The most important parameters are the ontology name and server login details (if necessary), the name of the plug-in, and the slots that correspond to pretty names. Pretty names are alternative names that an instance may have. Optionally, the main concepts of interest in an ontology can be specified.

The other ontology-dependent parameters required in the configuration files are the ones which specify that the terms "who", "where" and "when" correspond, for example, to the ontology terms "person/organization", "location" and "time-position", respectively. The independence of the application from the ontology is guaranteed through these parameters.

There is no single strategy by which a query can be interpreted, because this process is highly dependent on the type of the query. So, in the next sections we will not attempt to give a comprehensive overview of all the possible strategies, but rather we will show a few examples which are illustrative of the way the different AquaLog components work together.

3.1. AquaLog triple approach

Core to the overall architecture is the triple-based data representation approach. A major challenge in the development of the current version of AquaLog is to efficiently deal with complex queries in which there could be more than one or two terms. These terms may take the form of modifiers that change the meaning of other syntactic constituents, and they can be mapped to instances, classes, values or combinations of them, in compliance with the ontology to which they subscribe. Moreover, the wider expressiveness adds another layer of complexity, mainly due to the ambiguous nature of human language. So, naturally, the question arises of how far the engineering functionality of the triple-based model can be extended to map a triple obtained through a NL query into a triple that can be realized by a given ontology.

To accomplish this task, the triple in the current AquaLog version has been slightly extended. Therefore, an existing linguistic triple now consists of one, two or even three terms connected through a relationship. A query can be translated into one or more linguistic triples, and then each linguistic triple can be translated into one or more ontology-compliant triples. Each triple also has additional lexical features in order to facilitate reasoning about the answer, such as the voice and tense of the relation. Another key feature for each triple is its category (see an example table of query types in Appendix A in Section 11). These categories identify different basic structures of the NL query and their equivalent representation in the triple. Depending on the category, the triple tells us how to deal with its elements, what inference process is required and what kind of answer can be expected. For instance, different queries may be represented by triples of the same category, since in natural language there can be different ways of asking the same question, i.e., "who works in akt"3 and "Show me all researchers involved in the akt project". The classification of the triple may be modified during its life cycle, in compliance with the target ontology it subscribes to.

In what follows we provide an overview of each component of the AquaLog architecture.

4. Linguistic component: from questions to query triples

When a query is asked, the Linguistic Component's task is to translate from NL to the triple format used to query the ontology (Query-Triples). This preprocessing step helps towards the accurate classification of the query and its components by using standard research tools and query classification. Classification identifies the type of question and hence the kind of answer required. In particular, AquaLog uses the GATE [15,49] infrastructure and resources (language resources, processing resources like ANNIE, serial controllers, pipelines, etc.) as part of the Linguistic Component. Communication between AquaLog and GATE takes place through the standard GATE API.

After the execution of the GATE controller, a set of syntactic annotations associated with the input query are returned. In particular, AquaLog makes use of GATE processing resources for English: tokenizer, sentence splitter, POS tagger and VP chunker. The annotations returned after the sequential execution of these resources include information about sentences, tokens, nouns and verbs. For example, we get voice and tense for the verbs, and categories for the nouns, such as determinant, singular/plural, conjunction, possessive determiner, preposition, existential, wh-determiner, etc. These features returned for each annotation are important to create the triples; for instance, in the case of verb annotations, together with the voice and tense it also includes information that indicates if it is the main verb of the query (the one that separates the nominal group and the predicate). This information is used by the Linguistic Component to establish the proper attachment of prepositional phrases previous to the creation of the Query-Triples (an example is shown in Section 5.3).

When developing AquaLog we extended the set of annotations returned by GATE by identifying noun terms, relations,4 question indicators (which/who/when, etc.) and patterns or types

3 AKT is an EPSRC funded project in which the Open University is one of the partners: http://www.aktors.org/akt.
4 Hence, for example, we exploited the fact that natural language commonly uses a preposition to express a relationship. For instance, in the query "who has
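The extended triple of Section 3.1 is easy to render as a small data structure. The sketch below is our own Python illustration, with invented field names (`terms`, `relation`, `category`, `voice`, `tense`) rather than AquaLog's actual classes: a Query-Triple carries up to three terms, an optional relation, the lexical features used to reason about the answer, and a category that may be re-classified during the triple's life cycle.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LinguisticTriple:
    """Intermediate Query-Triple: up to three terms connected through a
    relationship, plus lexical features and a mutable category."""
    terms: List[str]                  # one, two or even three terms
    relation: Optional[str] = None    # None when implicit or unknown
    category: str = "unclassified"    # e.g. "wh-generic-term"; may change
    voice: str = "active"             # lexical features of the relation
    tense: str = "present"

    def reclassify(self, new_category: str) -> None:
        # The category can be modified once the target ontology is consulted.
        self.category = new_category

# Two differently phrased questions sharing one category:
t1 = LinguisticTriple(["person", "akt"], relation="works",
                      category="wh-generic-term")
t2 = LinguisticTriple(["researchers", "akt project"], relation="involved",
                      category="wh-generic-term")
```

Keeping the category mutable mirrors the paper's point that classification happens in stages: linguistically first, then against the ontology.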

Fig. 5. Example of GATE annotations and linguistic triples for basic queries.

Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.

of questions. This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular expression based rule language offered by GATE5 (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars' power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.6 Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.

Currently the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types or intermediate representations (examples can be seen in Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question, like "are there any phd students in dotkom", where the relation is implicit or unknown, or "which is the job title of John", where no information about the type of the expected answer is provided, etc.

research interest in semantic services in KMi", the relation of the query is a combination of the verb "has" and the noun "research interest".
5 http://gate.ac.uk/sale/tao/index.html
6 LC binaries and JAPE grammars can be downloaded: http://kmi.open.ac.uk/technologies/aqualog.
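As a toy illustration of this classification phase, the mapping from a question string to a category plus Query-Triple can be approximated with ordinary regular expressions. This is not AquaLog's actual JAPE grammar (which matches over GATE annotation graphs, not raw text); the category names and patterns below are invented for the example:

```python
import re

# Simplified stand-in for the JAPE phases: ordered pattern rules over the
# raw question string (the real system matches POS-tagged annotations).
RULES = [
    ("wh-generic-term",
     re.compile(r"^(?:what|which)\s+(?:is|are)\s+the\s+(?P<term>.+?)\s+"
                r"(?P<rel>\w+ed(?:\s+by)?)\s+(?:the\s+)?(?P<term2>.+)$",
                re.I)),
    ("wh-unknown-relation",
     re.compile(r"^are\s+there\s+any\s+(?P<term>.+?)\s+in\s+(?P<term2>.+)$",
                re.I)),
]

def classify(question):
    """Return (category, query_triple) for the first matching rule."""
    question = question.strip().rstrip("?")
    for category, pattern in RULES:
        m = pattern.match(question)
        if m:
            g = m.groupdict()
            return category, (g.get("term"), g.get("rel"), g.get("term2"))
    return "unclassified", None
```

For example, `classify("what are the research areas covered by the akt project?")` yields the category `wh-generic-term` and the triple `("research areas", "covered by", "akt project")`, while the dotkom question yields a triple with no relation, matching the "implicit or unknown" case in the text.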

Categories not only tell us the kind of solution that needs to be achieved, but they also give an indication of the most likely common problems that the Linguistic Component and the Relation Similarity Service will need to deal with to understand this particular NL query, and in consequence it guides the process of creating the equivalent intermediate representation or Query-Triple. For instance, in the previous example, "what are the research areas covered by the akt project", thanks to the category the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them and generates the following Query-Triple: 〈research areas, covered, akt project〉, in which "research areas" has correctly been identified as the query term (instead of "what").

For the intermediate representation we use the triple-based data model rather than logic, mainly because at this stage we do

Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate input for the RSS. An alternative would be to use more complex linguistic parsing but, as discussed by Katz et al. [33,29], although parse trees (as, for example, those of the Stanford NLP parser [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We are dealing with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that at this stage all the terms are still strings or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL-query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries
There are several different types of basic queries. We can mention the affirmative/negative query type, which are those queries that require an affirmation or negation as an answer, i.e., "is John Domingue a phd student?", "is Motta working in ibm?". Another big set of queries are those constituted by a "wh-question", such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also, imperative commands like list, give, tell, name, etc., are treated as wh-queries (see an example table of query types in Appendix A in Section 11).

In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in "who are the academics involved in the semantic web?" the generic triple will be of the form 〈generic term, relation, second term〉, in our example 〈academics, involved, semantic web〉. A query with an equivalent triple representation is "which technologies has KMi produced?", where the triple will be 〈technologies, has produced, KMi〉. However, a query like "are there any phd students in akt?" has another equivalent representation, where the relation is implicit or unknown: 〈phd students, akt〉. Other queries may provide little or no information about the type of the expected answer, e.g., "what is the job title of John?", or they can be just a generic enquiry about someone or something, e.g., "who is Vanessa?", "what is an ontology?" (see Fig. 5).

4.1.2. Linguistic procedure for basic queries with clauses (three-terms queries)
Let us consider the request "List all the projects in the knowledge media institute about the semantic web", where both "in knowledge media institute" and "about semantic web" are modifiers (i.e., they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity through the use of the ontology, or in an interactive way by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS, through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in "are there any planet news written by researchers from akt?" we get the terms "planet news", "researchers" and "akt" (see Fig. 6).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the basic queries with clauses as explained in Section 5.2.
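The three-term Query-Triple of Section 4.1.2 can be sketched as follows. This is our own illustration (the function and field names are assumptions, not AquaLog's API); the essential point it encodes is that modifier attachment is deliberately left unresolved for the RSS:

```python
from typing import List, Optional

def clause_query_triple(head: str, modifiers: List[str],
                        relation: Optional[str] = None) -> dict:
    """Build the single three-term Query-Triple for a basic query with
    clauses. Which constituent each modifier attaches to is left open:
    resolving that ambiguity is the RSS's job, using the ontology or
    user feedback."""
    if len(modifiers) != 2:
        raise ValueError("a three-term query carries exactly two modifiers")
    return {
        "terms": [head] + modifiers,   # e.g. planet news / researchers / akt
        "relation": relation,          # e.g. "written by"
        "attachment_resolved": False,  # flag consumed later by the RSS
    }

# "are there any planet news written by researchers from akt?"
triple = clause_query_triple("planet news", ["researchers", "akt"],
                             relation="written by")
```

Passing the unresolved flag downstream, rather than guessing an attachment linguistically, matches the division of labour described in the text.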

4.1.3. Linguistic procedure for combination of queries
Nevertheless, a query can be a composition of two explicit relationships between terms, or a query can be a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt?" the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web?". This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned to a second query, as in "which researchers wrote publications related to social aspects?", where the second clause modifies one of the previous terms. In this example, and at this stage, ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g., "what is the web address of Peter who works for akt?" or "which are the projects led by Enrico which are related to multimedia technologies?" (see Fig. 7).

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do", "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping between Query-Triples and ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.
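The first combination pattern, conjunction with "and"/"or", amounts to resolving each Query-Triple separately and then combining the two answer lists. A minimal sketch under a toy knowledge base (the dictionary-based KB format, function names and project names here are assumptions for illustration, not AquaLog's data):

```python
def resolve(kb, triple):
    """Toy resolver: the KB maps (relation, value) pairs to subject lists."""
    _subject, relation, value = triple
    return kb.get((relation, value), [])

def answer_combination(kb, triple_a, triple_b, operator="and"):
    """'and' intersects the two answer lists, 'or' unions them."""
    a, b = set(resolve(kb, triple_a)), set(resolve(kb, triple_b))
    return sorted(a & b) if operator == "and" else sorted(a | b)

# "which projects are funded by epsrc and are about semantic web?"
kb = {
    ("funded", "epsrc"): ["aqualog", "magpie", "irs"],
    ("about", "semantic web"): ["aqualog", "irs", "buddyspace"],
}
answers = answer_combination(kb,
                             ("projects", "funded", "epsrc"),
                             ("projects", "about", "semantic web"))
# answers -> ["aqualog", "irs"]
```

Swapping `operator="or"` returns the union instead, which is the other reading of "a combination of both lists".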

4.2. The use of JAPE grammars in AquaLog: illustrative example

Consider the basic query in Fig. 5, "what are the research areas covered by the akt project?", as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "researchers" and "akt projects", the relations "are" and "covered by", and the query term "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e., before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the ((RB)ADJ)|NOUN|CC) part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, for example the macro LIST used by the rule RuleQU1, or the macro VERB used by the rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns, e.g., the last pattern identifies words assigned to the VB

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags. POS tags are assigned in the "category" feature of the "Token" annotation, used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side, e.g., rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QUWhich IS ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QUWhich, IS ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 5.5.

An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have dissimilar labels from the user term. Moreover, relation names can be considered as "second class citizens": the rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts. Labels in the case of concepts often catch the meaning of the semantic entity, while in relations the meaning is given by the type of the domain and the range, rather than by the name (typically vaguely defined, e.g., "related to"), with the exception of the cases in which the relations are presented in some ontologies as a concept (e.g., has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.
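The interplay of string metrics and lexical resources can be sketched as follows. Here `difflib`'s similarity ratio stands in for the paper's string metric algorithms and a hand-made synonym table stands in for WordNet, so this is an illustration of the idea rather than AquaLog's actual matching code:

```python
from difflib import SequenceMatcher

# Hand-made synonym table standing in for the WordNet lexicon (assumption).
SYNONYMS = {"covered": {"addresses", "covers"}}

def best_relation(user_relation, candidates, threshold=0.8):
    """Pick the candidate ontology relation whose name, or any
    hyphen-separated token of it, best matches the user's relation word
    or one of its synonyms; return None when nothing scores above
    threshold (the user would then be asked to choose)."""
    words = {user_relation} | SYNONYMS.get(user_relation, set())
    def score(candidate):
        tokens = candidate.split("-") + [candidate]
        return max(SequenceMatcher(None, w, t).ratio()
                   for w in words for t in tokens)
    top = max(candidates, key=score)
    return top if score(top) >= threshold else None

# "what are the research areas covered by akt?" against the two candidate
# relations found between the matched concepts:
match = best_relation("covered",
                      ["addresses-generic-area-of-interest", "uses-resource"])
# match -> "addresses-generic-area-of-interest"
```

The `None` fallback mirrors the interactive step: when neither string distance nor synonyms discriminate, the choice is deferred to the user.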

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g., in the query "is Enrico working in ibm?",

s and

wbrw

tb(eisdobor

pgtiswntIta

bt

i

5

Fri(btldquoaaila

KaUotrtoactcr

V Lopez et al Web Semantics Science Service

here ldquoEnricordquo is mapped into ldquoEnrico-mottardquo in the KBut ldquoibmrdquo is not found The answer in this case is an indi-ect negative answer namely the place were Enrico Motta isorkingIn any non-trivial natural language system it is important

o deal with the various sources of ambiguity and the possi-le ways of treating them Some sentences are syntacticallystructurally) ambiguous and although general world knowl-dge does not resolve this ambiguity within a specific domaint may happen that only one of the interpretations is pos-ible The key issue here is to determine some constraintserived from the domain knowledge and to apply them inrder to resolve ambiguity [14] Where the ambiguity cannote resolved by domain knowledge the only reasonable coursef action is to get the user to choose between the alternativeeadings

Moreover since every item on the Onto-Triple is an entryoint in the knowledge base or ontology they are also clickableiving the user the possibility to get more information abouthem Also AquaLog scans the answers for words denotingnstances that the system ldquoknows aboutrdquo ie which are repre-ented in the knowledge base and then adds hyperlinks to theseordsphrases indicating that the user can click on them andavigate through the information In fact the RSS is designedo provide justifications for every step of the user interactionntermediate results obtained through the process of creatinghe Onto-Triple are stored in auxiliary classes which can beccessed within the triple

Note that the category to which each Onto-Triple belongs can

e modified by the RSS during its life cycle in order to satisfyhe appropriate mappings of the triple within the ontology

In what follows we will show a few examples which arellustrative of the way the RSS works

tr

p

Fig 10 Example of AquaLog in actio

Agents on the World Wide Web 5 (2007) 72ndash105 81

1 RSS procedure for basic queries

Here we describe how the RSS deals with basic queriesor example letrsquos consider a basic question like ldquowhat are theesearch areas covered by aktrdquo (see Figs 10 and 11) The querys classified by the Linguistic component as a basic generic-typeone wh-query represented by a triple formed by an explicitinary relationship between two define terms) as shown is Sec-ion 411 Then the first step for the RSS is to identify thatresearch areasrdquo is actually a ldquoresearch areardquo in the target KBnd ldquoaktrdquo is a ldquoprojectrdquo through the use of string distance metricsnd WordNet synonyms if needed Whenever a successful matchs found the problem becomes one of finding a relation whichinks ldquoresearch areasrdquo or any of its subclasses to ldquoprojectsrdquo orny of its superclasses

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between the two terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses" and therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, let's consider a case in which we are talking about a generic term, "person": the possible relation could then be defined only for researchers, students or academics rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".
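This matching step, mapping the query terms to KB classes and then searching for a relation whose signature links the two classes, can be sketched as follows. The toy taxonomy, relation signatures, synonym table and helper names are all illustrative assumptions, not AquaLog's actual code, and `difflib` stands in for the Jaro-based metrics of [13]:

```python
from difflib import SequenceMatcher

# Toy stand-ins for the target KB: relation signatures and a synonym lexicon.
# All names here are illustrative, not AquaLog's actual data structures.
RELATIONS = {  # relation name -> (domain class, range class)
    "addresses-generic-area-of-interest": ("project", "research-area"),
    "uses-resource": ("project", "research-area"),
    "has-research-member": ("project", "person"),
}
SYNONYMS = {"covered": {"addresses", "treats"}}  # e.g. drawn from WordNet

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    # Stand-in for the Jaro-based string metrics AquaLog uses [13].
    return SequenceMatcher(None, a, b).ratio() >= threshold

def candidate_relations(arg1: str, rel: str, arg2: str):
    """Return KB relations that could interpret the triple <arg1, rel, arg2>."""
    # Step 1: relations whose domain/range link the two mapped classes.
    linking = [name for name, (dom, rng) in RELATIONS.items()
               if {dom, rng} == {arg1, arg2}]
    # Step 2: prefer a relation whose name resembles the query relation
    # or one of its synonyms.
    words = {rel} | SYNONYMS.get(rel, set())
    preferred = [name for name in linking
                 if any(similar(w, part) for w in words for part in name.split("-"))]
    return preferred or linking

# "what are the research areas covered by akt?" after mapping the terms:
print(candidate_relations("research-area", "covered", "project"))
```

Under these toy signatures, "covered" is resolved to "addresses-generic-area-of-interest" via the synonym "addresses", mirroring the example above.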

Whenever multiple relations are possible candidates for interpreting the query and the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (or sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic" or "students") to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person". The triple is therefore re-classified from a generic one formed by a binary relation, "secretary", between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

Fig. 12. Example of AquaLog in action for relations formed by a concept.

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore the translation process from lexical to ontological triples can go ahead without misinterpretations.

Note that some additional lexical features which help in the translation, or put a restriction on the answer, are represented in the Onto-Triple: whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Another particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, when the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and of the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the previous examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to the semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query it is first necessary to get the list of researchers related to akt and then a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by eprsc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by eprsc" is attached to the query term in the first triple, i.e., "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e., "researchers".
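The attachment heuristic just described can be sketched in a few lines. The hard-coded class/instance classifications below are illustrative assumptions standing in for what AquaLog would derive from the target KB:

```python
# Illustrative sketch of the attachment heuristic: the trailing modifier is
# attached to the closest preceding term that maps to a class (non-ground
# term) rather than to an instance. The classifications are hard-coded here;
# AquaLog derives them from the target KB.
IS_CLASS = {"project": True, "researcher": True, "epsrc": False, "akt": False}

def attach_modifier(terms, modifier):
    """terms: query terms in sentence order; returns (head, modifier)."""
    for term in reversed(terms):          # closest term first
        if IS_CLASS.get(term, False):     # non-ground term in the ontology
            return (term, modifier)
    return (terms[-1], modifier)          # fall back to the nearest term

# "are there any projects on the semantic web sponsored by epsrc?"
print(attach_modifier(["project", "semantic web"], "sponsored by epsrc"))
# "which planet news stories are written by researchers in akt?"
print(attach_modifier(["planet news story", "researcher"], "in akt"))
```

On the two example queries this reproduces the attachments discussed above: "sponsored by epsrc" goes to "project", while "in akt" goes to "researchers".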

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3. Depending on the type of query, the triples are combined following different criteria.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project" and therefore not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the term "researchers" is mapped into a concept in the ontology which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g., in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. Frequently, however, this heuristic is not enough to disambiguate and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
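The main-verb criterion above can be reduced to a tiny decision rule. In this sketch the split into nominal group and predicate is hard-coded per example, standing in for the linguistic analysis; the function name and term lists are illustrative:

```python
# Sketch of the main-verb criterion: the lexical verb splits the question
# into a nominal group and a predicate, and the clause is attached within
# whichever side it falls on.
def attach(nominal_terms, predicate_terms, clause):
    """Return the term the clause modifies, per the main-verb criterion."""
    others = [t for t in predicate_terms if t != clause]
    # Clause inside a larger predicate -> modifies the predicate's own term;
    # clause constituting the whole predicate -> modifies the subject.
    return others[0] if others else nominal_terms[-1]

# (1) "which academics work in the akt project sponsored by epsrc?"
#     main verb "work": predicate = ["akt project", "sponsored by epsrc"]
print(attach(["academics"], ["akt project", "sponsored by epsrc"],
             "sponsored by epsrc"))  # akt project
# (2) "which academics working in the akt project are sponsored by epsrc?"
#     main verb "are sponsored": predicate = ["sponsored by epsrc"]
print(attach(["academics"], ["sponsored by epsrc"], "sponsored by epsrc"))  # academics
```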

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds; the thresholds are set up depending on whether the task is looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (i.e., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (i.e., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified version of the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there would be no ambiguity if the user reformulated the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.
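The Jaro-based matching with task-dependent thresholds described above can be illustrated with a minimal, self-contained Jaro implementation. This is a sketch for exposition only (AquaLog uses the open-source string-metrics API cited as [13]), and the threshold values below are invented:

```python
def jaro(s1: str, s2: str) -> float:
    # Minimal Jaro similarity, for illustration only.
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    m1, m2 = [], []  # matched chars of s1, matched positions in s2
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used[j] and s2[j] == c:
                used[j] = True
                m1.append(c)
                m2.append(j)
                break
    if not m1:
        return 0.0
    m2_chars = [s2[j] for j in sorted(m2)]  # matched chars in s2 order
    t = sum(a != b for a, b in zip(m1, m2_chars)) / 2  # transpositions
    m = len(m1)
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

# Illustrative task-dependent thresholds (the actual values are not given
# in the text): instances are matched more strictly than relations.
THRESHOLDS = {"concept": 0.9, "relation": 0.8, "instance": 0.95}

def matches(term: str, kb_name: str, task: str) -> bool:
    return jaro(term.lower(), kb_name.lower()) >= THRESHOLDS[task]
```

As the text notes, such metrics still fail on dissimilar labels like "researcher" vs. "academic", which is why the synonym and learning services are needed.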

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed, and it contains the methods which take the Onto-Triple as input and infer the required answer to the user's query. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect) but also how triples can be linked with each other. AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of the people who have an interest in ontologies and the people who are interested in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by eprsc?" the second triple, 〈akt project, sponsored, eprsc〉, must be resolved and the instance representing the "akt project sponsored by eprsc" identified, to get the list of academics required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?" the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈homepage, researchers〉.

Fig. 16. Example of jargon mapping through user-interaction.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.
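Mechanisms (1) and (3) can be sketched on toy data; `resolve()` stands in for the real inference over the KB, and every name and instance below is invented for illustration (mechanism (2) is analogous, with the second triple pinning down an instance rather than a list):

```python
# Toy sketch of the triple-linking mechanisms; all data is illustrative.
KB = {
    ("has-interest", "ontologies"): {"alice", "bob"},
    ("has-interest", "knowledge reuse"): {"carol"},
    ("working", "semantic web"): {"alice", "dan"},
}
HOMEPAGES = {"alice": "alice-home", "bob": "bob-home", "dan": "dan-home"}

def resolve(relation, argument):
    """Answer a single triple: who stands in `relation` to `argument`?"""
    return KB.get((relation, argument), set())

# (1) And/or linking: "who has an interest in ontologies or in knowledge
#     reuse?" is answered by the union of the two triples' answers.
or_answer = resolve("has-interest", "ontologies") | resolve("has-interest", "knowledge reuse")

# (3) Conditional link to a triple: "what is the homepage of the researchers
#     working on the semantic web?" resolves the second triple first, then
#     feeds each result into the first triple.
cond_answer = {p: HOMEPAGES[p] for p in resolve("working", "semantic web")}
```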

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which can be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language-universe and the reduced ontology-language-universe.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is normally represented just by its lower and upper bounds (the maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Fig. 17. The learning mechanism architecture.

9 Although making use of a concept borrowed from machine learning studies, we want to stress that the learning mechanism is not a machine learning technique in the classical sense.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for matching results. Subsequently these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (the latter scans the knowledge base for results); if instead the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered and is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is the opposite of the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (the GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (the CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification.

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's now imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?" the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.
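The learned-lexicon record just enumerated could be represented as follows. This is a sketch only: the field names and the "akt-portal" ontology name are hypothetical, and the context arguments shown are the generalized classes discussed in the next subsection:

```python
from dataclasses import dataclass

# Sketch of a single learned-lexicon record; field names are illustrative.
@dataclass
class LearnedMapping:
    word: str           # user jargon; the key used by the matching method
    mapping: str        # the relation it maps to in the ontology
    context_arg1: str   # first context argument of the triple
    context_arg2: str   # second context argument of the triple
    ontology: str       # name of the (pluggable) ontology
    user: str           # user details, for profile-aware matching

row = LearnedMapping(
    word="collaborates",
    mapping="has-affiliation-to-unit",
    context_arg1="people",
    context_arg2="organization-unit",
    ontology="akt-portal",  # hypothetical ontology name
    user="generic",
)
```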

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping makes sense in this case too.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple arguments and at the same time can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds, by definition, to the highest possible class of one argument) and goes through all the connected superclasses of the other argument while checking whether they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology branch are kept, and all questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in this case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Fig. 18. User's disambiguation example.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification: the question is first recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing since it can be mapped to either a person or an organization, or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Combinations of questions are also taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is instead recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced to only the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, quite likely including meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind andnow something about the domain to query the semantic markupiewed as a knowledge base The aim is to provide a systemhich does not require users to learn specialized vocabulary or

o know the structure of the knowledge base but as pointed out inef [14] though they have to have some idea of the contents of

he domain they may have some misconceptions Therefore toe realistic some process of familiarization at least is normallyequired

A full evaluation of AquaLog requires both an evaluationf its query answering ability and the overall user experienceSection 72) Moreover because one of our key aims is to makequaLog an interface for the semantic web the portability






across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.
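The counting rule used for these categories (a query may carry several failure labels, but conceptual failure is recorded only in the absence of any other failure) can be expressed as a small sketch. The function and label names are ours, not AquaLog's:

```python
# Illustrative tally of the failure-counting rule described above:
# a query may fail at several levels at once, but "conceptual" is only
# counted when no other kind of failure is present.

OTHER = {"linguistic", "data-model", "rss", "service"}

def tally_failures(queries):
    counts = {"linguistic": 0, "data-model": 0, "rss": 0,
              "service": 0, "conceptual": 0}
    for labels in queries:  # each query is a set of failure labels
        for kind in labels & OTHER:
            counts[kind] += 1
        if "conceptual" in labels and not (labels & OTHER):
            counts["conceptual"] += 1
    return counts

# A query failing both linguistically and at the service level counts in
# both categories; a conceptual label alongside an RSS failure is ignored.
result = tally_failures([{"linguistic", "service"},
                         {"conceptual"},
                         {"conceptual", "rss"}])
print(result)
```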

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user


can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.
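As a rough illustration of the formalization above, consider a toy knowledge base queried through a single triple with one variable. The KB content and the `answer` helper are invented for this sketch and are not AquaLog code:

```python
# Toy illustration of an answer being correct "in the ontology world":
# the NL query is formalized as a triple whose terms must align with the
# ontology vocabulary before instances can be retrieved.

KB = [
    ("vanessa-lopez", "works-for", "kmi"),
    ("enrico-motta", "works-for", "kmi"),
    ("akt", "has-project-member", "enrico-motta"),
]

def answer(query_triple):
    """query_triple: (subject, relation, object); '?' marks the variable."""
    s, r, o = query_triple
    return [(ks if s == "?" else ko)
            for ks, kr, ko in KB
            if kr == r
            and (s == "?" or ks == s)
            and (o == "?" or ko == o)]

# "Who works for KMi?" formalized as <?, works-for, kmi>:
print(answer(("?", "works-for", "kmi")))
# A well-formed, ontology-compliant triple may still return nothing
# if the KB is simply not populated with the relevant instances:
print(answer(("?", "works-for", "open-university")))  # []
```

The second call mirrors the case discussed in the text: a complete ontology-compliant representation that yields an empty answer is not counted as an AquaLog failure.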

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11. Therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate if a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components; i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure):
  Total number of successful queries: 40 of 69 (58%); of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure: 33 of 69 (47.82%)
  Total number of queries with only conceptual failure: 7 of 69 (10.14%)

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason): 29 of 69 (42%)
  RSS failure: 8 of 69 (11.59%)
  Service failure: 10 of 69 (14.49%)
  Data model failure: 0 of 69 (0%)
  Linguistic failure: 16 of 69 (23.18%)

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: 19 of 29 (65.51% of failures)
  With conceptual failure: 7 of 19 (36.84%)
  No conceptual failure but not populated: 6 of 19 (31.57%)

Bold values represent the final results (%).


the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious for the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parenthesis in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning; e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) failing to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab; people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total | (43 iterations) | (40 iterations) | (16 iterations)

services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a future line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?"

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the 1st iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?"

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location, and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto



7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology
In this section we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog

13 http://www.thewinesociety.com


plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.
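A GetData-style accessor of the kind described in [22] can be sketched in a few lines; the graph content and the function name are our own illustrative assumptions, not the actual TAP or AquaLog API:

```python
# Minimal sketch of a GetData-style interface: light-weight property
# access over a directed labeled graph, requiring no prior knowledge
# of the underlying ontology structure.

GRAPH = {
    ("chardonnay", "has-color"): ["white"],
    ("chardonnay", "has-body"): ["full"],
}

def get_data(resource, prop):
    """Return the values of `prop` for `resource` (empty if absent)."""
    return GRAPH.get((resource, prop), [])

print(get_data("chardonnay", "has-color"))  # ['white']
print(get_data("chardonnay", "has-sugar"))  # []
```

The design choice mirrors the quoted argument: a property lookup like this is cheap enough to expose on the open web, whereas a full query language is reserved for more controlled settings.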

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
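The "short direct path" requirement can be checked with a bounded breadth-first search over the populated triples. This sketch reuses Table 3-style instances; the function itself is our illustration, not AquaLog code:

```python
# Sketch of the path requirement: an answer is reachable only if the
# mapped query terms are connected by at most two relations in the
# populated triple graph.
from collections import deque

TRIPLES = [
    ("new-course8", "hasfood", "fowl"),
    ("new-course8", "hasdrink", "riesling"),
]

def within_two_relations(a, b):
    """Breadth-first search over the undirected triple graph,
    stopping after two hops."""
    edges = {}
    for s, _, o in TRIPLES:
        edges.setdefault(s, set()).add(o)
        edges.setdefault(o, set()).add(s)
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, depth = frontier.popleft()
        if node == b:
            return True
        if depth < 2:
            for nxt in edges.get(node, set()) - seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

# "fowl" and "riesling" are linked through the mediating instance:
print(within_two_relations("fowl", "riesling"))  # True
```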

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu.

When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but it is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore, we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))



Fig. 22. OWL example of the wine and food ontology.

not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure):
  Total number of successful queries: 47 of 68 (69.11%)
  Total number of successful queries without conceptual failure: 12 of 68 (17.64%)
  Total number of queries with only conceptual failure: 35 of 68 (51.47%)

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason): 21 of 68 (30.88%)
  RSS failure: 15 of 68 (22%)
  Service failure: 0 of 68 (0%)
  Data model failure: 0 of 68 (0%)
  Linguistic failure: 9 of 68 (13.23%)

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: 14 of 21 (66.66% of failures)
  With conceptual failure: 9 of 14 (64.28%)

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?" These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system, and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS



can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?" As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?": AquaLog is trying to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask “what” and “how” questions, but the implementation of temporal structures and causality (“why” and “how” questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives (’s) and negative words such as “not” and “never” (or implicit negatives such as “only”, “except” and “other than”). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like “Name 30 different people ...”.

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., “she”, “they”) and possessive determiners (e.g., “his”, “theirs”) are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., “is there any [...] who/that/which [...]?”.


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase “who was the author” are converted into “who wrote”. That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how the query was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like “French researchers” a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question “which researchers have publications in KMi related to social aspects?”, where the linguistic triples obtained are 〈researchers, have publications, KMi〉, 〈which is/are, related, social aspects〉. However, the relation “have publications” refers to “publication has author” in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.
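The expansion described above can be sketched as follows. This is a minimal illustration only: the triple fields and the relation chain registered for “have publications” are hypothetical, as AquaLog's internal data structures are not shown in this paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: Optional[str]  # None marks an implicit relation
    value: str

# Hypothetical mapping from a linguistic relation with no direct ontology
# match to the chain of ontology triples connecting the same terms.
RELATION_CHAINS = {
    ("researchers", "have publications", "KMi"): [
        Triple("researchers", "has author", "publications"),
        Triple("publications", "located in", "KMi"),
    ],
}

def expand(triple: Triple) -> list:
    """Return ontology-compliant triples for one linguistic triple,
    splitting it when the relation maps to a chain in the ontology."""
    key = (triple.subject, triple.relation, triple.value)
    return RELATION_CHAINS.get(key, [triple])

linguistic = [
    Triple("researchers", "have publications", "KMi"),
    Triple("publications", "related", "social aspects"),
]
ontology_triples = [t for lt in linguistic for t in expand(lt)]
print(len(ontology_triples))  # three triples instead of two
```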

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is no direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question “who collaborates with IBM?”, where there is no “collaboration” relation between a person and an organization, but there is a relation between a “person” and a “grant”, and between a “grant” and an “organization”. The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term “grant” (see also the “meal/course” example in Section 7.3 using the wine ontology).
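A query-expansion step of this kind amounts to a path search over the ontology's schema graph. The sketch below, using a hypothetical schema (not AquaLog code), shows a breadth-first search recovering the person–grant–organization chain:

```python
from collections import deque

# Hypothetical ontology schema: (concept, relation, concept) edges.
SCHEMA = [
    ("person", "holds", "grant"),
    ("grant", "funded-by", "organization"),
    ("person", "works-in", "organization-unit"),
]

def find_path(start, goal):
    """Breadth-first search for a chain of schema relations linking two
    concepts that have no direct relation in the ontology."""
    adjacency = {}
    for s, r, o in SCHEMA:
        adjacency.setdefault(s, []).append((r, o))
        adjacency.setdefault(o, []).append((r, s))  # traverse both ways
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None

print(find_path("person", "organization"))
# [('person', 'holds', 'grant'), ('grant', 'funded-by', 'organization')]
```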

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered. A problem arises, for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question “who is the director in KMi?”. In the ontology this may be represented as “Enrico Motta – has job title – KMi director”, where “KMi director” is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to “what is the job title of Enrico?”, so that we would get “KMi-director” as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., “semantic web papers”, “research department”). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a “large company” may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
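A first step toward such a mechanism is to segment a candidate term against the ontology's vocabulary; when more than one known label remains, an implicit relation can be hypothesized between the parts. A minimal sketch, with illustrative concept labels:

```python
# Illustrative vocabulary of ontology concept labels.
CONCEPTS = {"semantic web", "paper", "research", "department"}

def segment_compound(term: str) -> list:
    """Greedily split a multi-word term into the longest known concept
    labels; a result with more than one part suggests a nominal compound
    with an implicit relation between the parts."""
    words = term.lower().split()
    parts, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):  # longest match first
            candidate = " ".join(words[i:j])
            if candidate in CONCEPTS or j == i + 1:
                parts.append(candidate)
                i = j
                break
    return parts

print(segment_compound("semantic web papers"))  # ['semantic web', 'papers']
```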

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example “list all applicants who live in California and Arizona”. In this case the word “and” can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for “and” to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences “which academic works with Peter who has interests in the semantic web?” and “which academics work with Peter who has interests in the semantic web?”. The former example is truly ambiguous: either “Peter” or the “academic” could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb “has” is in the third person singular; therefore the second part, “has interests in the semantic web”, must be linked to “Peter” for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods based on the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like “what are the most successful projects?” or “what are the five key projects in KMi?”. To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question “what are the cool projects for a PhD student?” is asked, the system would not be able to solve the linguistic ambiguity of the words “cool project”. The first time, some help from the user is needed, who may select “all projects which belong to a research area in which there are many publications”. A mapping is therefore created between “cool projects”, “research area” and “publications”, so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: “what are the cool projects for a professor?”. What he means by saying “cool projects” is “funding opportunity”. Therefore, the two mappings entail two different contexts, depending on the users' community.
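One way to realize such community-dependent mappings is a lexicon keyed by (community, phrase). The sketch below is illustrative only; its names and structure are assumptions, not AquaLog's implementation:

```python
from collections import defaultdict

class JargonStore:
    """Community-sensitive jargon lexicon: the same phrase may resolve
    to different interpretations in different user communities."""

    def __init__(self):
        self._maps = defaultdict(dict)  # community -> phrase -> meaning

    def learn(self, community, phrase, interpretation):
        self._maps[community][phrase] = interpretation

    def resolve(self, community, phrase):
        return self._maps[community].get(phrase)  # None -> ask the user

store = JargonStore()
store.learn("phd-student", "cool projects",
            "projects in a research area with many publications")
store.learn("professor", "cool projects", "funding opportunity")

print(store.resolve("professor", "cool projects"))  # funding opportunity
```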

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about “coolness” in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and the RSS algorithms that handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and “openness” is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
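The last requirement, recognizing when two instances co-refer, can be approximated with the kind of string metrics AquaLog already uses for term mapping. A deliberately rough sketch (a real matcher would also compare instance properties, not just labels):

```python
import difflib

def same_individual(label_a: str, label_b: str,
                    threshold: float = 0.85) -> bool:
    """Guess whether two instance labels from different sources refer
    to the same individual, using a normalized string similarity."""
    a, b = label_a.lower().strip(), label_b.lower().strip()
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

print(same_individual("Enrico Motta", "enrico motta"))  # True
print(same_individual("Enrico Motta", "Peter Scott"))   # False
```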

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and on its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle “what is the capital of Italy?”, “print the capital of Italy”, and “Could you please tell me the capital of Italy?”. The shallowness of the pattern matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me”, instead of giving a failure it could get the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.
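Such a fallback could be sketched as stripping the question scaffolding and keeping the content terms for a subsequent ontology path search. The scaffold patterns and stop-word list below are illustrative, not taken from AquaLog:

```python
import re

# Illustrative scaffolding patterns for unrecognized question structures.
SCAFFOLD = re.compile(
    r"^(could you please tell me|what is|who is|print)\s+", re.IGNORECASE)

def fallback_terms(question: str) -> list:
    """Extract candidate ontology terms after removing question
    scaffolding and stop words; a crude domain-independent analog
    of early pattern-matching NLIDBs."""
    stripped = SCAFFOLD.sub("", question.strip().rstrip("?"))
    stop = {"the", "of", "a", "an", "in"}
    return [w for w in stripped.split() if w.lower() not in stop]

print(fallback_terms("Could you please tell me the capital of Italy?"))
# ['capital', 'Italy']
```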

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires a “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, such as “who” being the same as “person”, are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE, because the word “neighborhood” is unknown. However, AquaLog would not necessarily know the term “neighborhood” either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “identifying the semantic type of the entity sought by the question”; (2) “determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week”, it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA.”

Here, point (iv) refers to the extraction of multiple relationships between entities and of general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated, individual entities from text; therefore, methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed with “what is”, “who”, “where”, “tell me”, etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question “what do penguins eat?” as food, because “it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding”. All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that “in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label”. Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

“the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches.”

In AquaLog the vocabulary problem is solved (the ontology–triples mapping) not only by looking at the labels, but also by looking at the lexically related words and at the ontology (the context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT InfoLab.

AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the Object "Guernsey" and the Value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "languages" and a location "Guernsey".
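The two triple models can be contrasted with a small sketch (the class names are ours, purely illustrative):

```python
from dataclasses import dataclass

@dataclass
class StartAssertion:
    """START-style object-property-value: the property is a noun."""
    obj: str
    prop: str
    value: str

@dataclass
class AquaLogTriple:
    """AquaLog-style triple: a relation links a term to another term/value."""
    term: str
    relation: str
    value: str

# "what languages are spoken in Guernsey?"
start = StartAssertion(obj="Guernsey", prop="languages", value="French")
aqualog = AquaLogTriple(term="languages", relation="are spoken", value="Guernsey")
```

The practical difference is where the linguistic material ends up: START nominalizes the question into a property name, while AquaLog keeps the verbal relation ("are spoken") to be matched later against ontology relations.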

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence" and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for the relationship between terms even if the relationship is not explicit, whereas DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
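A rule of the kind PiQASso applies — read the expected answer type off the wh-word, and for "how" questions off the adjective that follows — might be sketched like this (a simplified illustration, not PiQASso's actual taxonomy or code):

```python
def expected_answer_type(question):
    """Guess a coarse answer type from the wh-word (simplified sketch)."""
    words = question.lower().rstrip("?").split()
    wh = words[0]
    if wh == "who":
        return "person/organization"
    if wh == "when":
        return "time"
    if wh == "where":
        return "location"
    if wh == "how" and len(words) > 1:
        # "how many"/"how much" -> quantity; "how long"/"how old" -> time
        if words[1] in ("many", "much"):
            return "quantity"
        if words[1] in ("long", "old"):
            return "time"
    # "what <noun>" would need the noun's semantic type (WNSense in PiQASso)
    return "unknown"

print(expected_answer_type("How many projects are there?"))  # quantity
```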

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, the mapping process converting the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
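The disambiguation behaviour described above — match the user's term against ontology labels and consult the user only when ambiguity survives — can be sketched with a standard-library string metric. AquaLog combines string metrics with WordNet and ontology context; this fragment shows only the string-similarity step, over invented labels:

```python
from difflib import SequenceMatcher

# Hypothetical ontology labels, for illustration only.
ONTOLOGY_LABELS = ["researcher", "research-area", "project", "publication"]

def candidate_matches(term, labels, threshold=0.6):
    """Rank ontology labels by string similarity to the user's term."""
    scored = [(SequenceMatcher(None, term, label).ratio(), label)
              for label in labels]
    return [label for score, label in sorted(scored, reverse=True)
            if score >= threshold]

matches = candidate_matches("researchers", ONTOLOGY_LABELS)
if len(matches) == 1:
    choice = matches[0]      # unambiguous: proceed silently
elif matches:
    choice = matches[0]      # ambiguous: AquaLog would consult the user here
else:
    choice = None            # no match: fall back to lexically related words
```

In AquaLog the WordNet lexical expansion and the ontology context (the other terms of the triple) would be tried before the user is asked; the threshold value here is an arbitrary illustration.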

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog the classification is based on the triple format; in open-domain systems it is based on the expected answer (person, location, ...) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved; it groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?"

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Question type and NL query | Linguistic triple | Ontology triple (a)

Wh-generic term
- "Who are the researchers in the semantic web research area?" | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
- "What are the phd students working for buddy space?" | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member / has-project-leader, buddyspace〉
- "Name the planet stories that are related to akt" | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term
- "Show me the job title of Peter Scott" | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
- "What are the projects of Vanessa?" | 〈which is, projects, vanessa〉 | 〈project, has-project-member / has-project-leader, vanessa〉

Wh-unknown relation
- "Are there any projects about semantic web?" | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
- "Where is the Knowledge Media Institute?" | 〈location, knowledge media institute〉 | No relations found in the ontology

Description
- "Who are the academics?" | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
- "What is an ontology?" | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative
- "Is Liliana a phd student in the dip project?" | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
- "Has Martin Dzbor any interest in ontologies?" | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
- "Does anybody work in semantic web on the dotkom project?" | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member / has-project-leader, dot-kom〉
- "Tell me all the planet news written by researchers in akt" | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member / has-project-leader, akt〉

Wh-3 terms (first clause)
- "Which projects on information extraction are sponsored by epsrc?" | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
- "Is there any publication about magpie in akt?" | 〈publication, magpie, akt〉 | 〈publication, mentions-project / has-key-publication / has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
- "What are the contact details from the KMi academics in compendium?" | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address / has-email-address / has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
- "Does anyone has interest in ontologies and is a member of akt?" | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
- "Which academics work in akt or in dotcom?" | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
- "Which KMi academics work in the akt project sponsored by epsrc?" | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
- "Which KMi academics working in the akt project are sponsored by epsrc?" | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
- "Give me the publications from phd students working in hypermedia" | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person / has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
- "What researchers who work in compendium have interest in hypermedia?" | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
- "What is the homepage of Peter who has interest in semantic web?" | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
- "Which are the projects of academics that are related to the semantic web?" | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
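All of the rows above follow the same pattern: the terms of a linguistic triple are re-anchored to ontology entities and relations. A minimal sketch of that mapping step might look like this; the lexicon and relation table below are invented for illustration, whereas AquaLog derives these mappings from the ontology, string metrics, WordNet and its learning component:

```python
# Hypothetical lexicon mapping query terms to ontology entry points.
TERM_MAP = {
    "researchers": "researcher",
    "academics": "academic-staff-member",
    "semantic web research area": "semantic-web-area",
}
# Hypothetical table of ontology relations holding between two entities.
RELATIONS = {
    ("researcher", "semantic-web-area"): "has-research-interest",
    ("academic-staff-member", "akt"): "has-project-member",
}

def to_ontology_triple(ling_triple):
    """Re-anchor a linguistic (term, relation, term) triple to the ontology."""
    subj = TERM_MAP.get(ling_triple[0], ling_triple[0])
    obj = TERM_MAP.get(ling_triple[2], ling_triple[2])
    # Prefer a relation the ontology actually defines between the two entities;
    # otherwise keep the linguistic relation for later similarity matching.
    relation = RELATIONS.get((subj, obj), ling_triple[1])
    return (subj, relation, obj)

# "Who are the researchers in the semantic web research area?"
print(to_ontology_triple(("researchers", None, "semantic web research area")))
# -> ('researcher', 'has-research-interest', 'semantic-web-area')
```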

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105 77

Fig. 5. Example of GATE annotations and linguistic triples for basic queries.

of questions. This is achieved by running, through the use of GATE JAPE transducers, a set of JAPE grammars that we wrote for AquaLog. JAPE is an expressive regular-expression-based rule language offered by GATE⁵ (see Section 4.2 for an example of the JAPE grammars written for AquaLog). These JAPE grammars consist of a set of phases that run sequentially, and each phase is defined as a set of pattern rules which allow us to recognize regular expressions using previous annotations in documents. In other words, the JAPE grammars' power lies in their ability to regard the data stored in the GATE annotation graphs as simple sequences, which can be matched deterministically by using regular expressions.⁶ Examples of the annotations obtained after the use of our JAPE grammars can be seen in Figs. 5–7.

Currently the linguistic component, through the JAPE grammars, dynamically identifies around 14 different question types, or intermediate representations (examples can be seen in Appendix A of this paper), including basic queries requiring an affirmation/negation or a description as an answer, or the big set of queries constituted by a wh-question, like "are there any phd students in dotkom?", where the relation is implicit or unknown, or "which is the job title of John?", where no information about the type of the expected answer is provided, etc.

⁴ … "research interest in semantic services in KMi", the relation of the query is a combination of the verb "has" and the noun "research interest".
⁵ http://gate.ac.uk/sale/tao/index.html
⁶ LC binaries and JAPE grammars can be downloaded from http://kmi.open.ac.uk/technologies/aqualog

Fig. 6. Example of GATE annotations and linguistic triples for basic queries with clauses.

Categories not only tell us the kind of solution that needs to be achieved, but they also give an indication of the most likely common problems that the Linguistic Component and Relation Similarity Service will need to deal with to understand this particular NL query, and in consequence this guides the process of creating the equivalent intermediate representation, or Query-Triple. For instance, in the previous example, "what are the research areas covered by the akt project?", thanks to the category the Linguistic Component knows that it has to create one triple that represents an explicit relationship between an explicit generic term and a second term. It gets the annotations for the query terms, relations and nouns, preprocesses them, and generates the following Query-Triple: ⟨research areas, covered, akt project⟩, in which "research areas" has correctly been identified as the query term (instead of "what").

For the intermediate representation we use the triple-based data model rather than logic, mainly because at this stage we do

not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate input for the RSS. An alternative would be to use more complex linguistic parsing but, as discussed by Katz et al. [33,29], although parse trees (as produced, for example, by the Stanford NLP parser [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We deal with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that at this stage all the terms are still strings, or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries

There are several different types of basic queries we can mention: the affirmative/negative query type, i.e. those queries that require an affirmation or negation as an answer, e.g. "is John Domingue a phd student?", "is Motta working in ibm?". Another big set of queries are those constituted by a "wh-question", such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also, imperative commands like list, give, tell, name, etc. are treated as wh-queries (see an example table of query types in Appendix A).

In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in "who are the academics involved in the semantic web?" the generic triple will be of the form ⟨generic term, relation, second term⟩, in our example ⟨academics, involved, semantic web⟩. A query with an equivalent triple representation is "which technologies has KMi produced?", where the triple will be ⟨technologies, has produced, KMi⟩. However, a query like "are there any phd students in akt?" has another equivalent representation, where the relation is implicit or unknown: ⟨phd students, akt⟩. Other queries may provide little or no information about the type of the expected answer, e.g. "what is the job title of John?", or they can be just a generic enquiry about someone or something, e.g. "who is Vanessa?", "what is an ontology?" (see Fig. 5).
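At this point the Query-Triple is purely lexical: every slot is a string, and the relation slot is empty when the relation is implicit or unknown. A minimal sketch of this data model follows; the class and field names are invented for illustration (AquaLog itself is a Java system, and nothing below is its actual code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriple:
    """Lexical intermediate representation: plain strings only,
    with no correspondence to the ontology at this stage."""
    term1: str
    relation: Optional[str]  # None when the relation is implicit or unknown
    term2: str

    def is_implicit(self) -> bool:
        # Implicit-relation category, as in "are there any phd students in akt"
        return self.relation is None

# Triples mirroring the wh-query categories discussed in the text
generic = QueryTriple("academics", "involved", "semantic web")
implicit = QueryTriple("phd students", None, "akt")
```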

4.1.2. Linguistic procedure for basic queries with clauses (three-term queries)

Let us consider the request "List all the projects in the knowledge media institute about the semantic web", where both "in the knowledge media institute" and "about the semantic web" are modifiers (i.e. they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, either through the use of the ontology or in an interactive way by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS, through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in "are there any planet news written by researchers from akt?" we get the terms "planet news", "researchers" and "akt" (see Fig. 6).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the basic queries with clauses as explained in Section 5.2.


4.1.3. Linguistic procedure for combination of queries

A query can also be a composition of two explicit relationships between terms, or a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt?", the two linguistic triples will be ⟨planet news, written, researchers⟩ and ⟨which are, working, akt⟩.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force for generating an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web?". This query will generate two Query-Triples, ⟨funded (projects, epsrc)⟩ and ⟨about (projects, semantic web)⟩, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned on a second query, as in "which researchers wrote publications related to social aspects?", where the second clause modifies one of the previous terms. In this example, and at this stage, the ambiguity cannot be solved by linguistic procedures, therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. "what is the web address of Peter who works for akt?" or "which are the projects led by Enrico which are related to multimedia technologies?" (see Fig. 7).
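For the first combination pattern, resolving each Query-Triple yields a list of answers, and the conjunction operator maps onto a set operation. A minimal sketch (the function name and the project names are invented for illustration, not drawn from the actual KB):

```python
def combine(answers1, answers2, operator):
    """Combine the answer lists of two resolved Query-Triples:
    'and' keeps entities satisfying both triples, 'or' merges them."""
    s1, s2 = set(answers1), set(answers2)
    if operator == "and":
        return sorted(s1 & s2)
    if operator == "or":
        return sorted(s1 | s2)
    raise ValueError(f"unknown operator: {operator}")

# "which projects are funded by epsrc and are about semantic web"
funded_by_epsrc = ["AKT", "Magpie", "X-Media"]      # invented answer list
about_semantic_web = ["AKT", "Dotkom", "Magpie"]    # invented answer list
both = combine(funded_by_epsrc, about_semantic_web, "and")
```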

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do", "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping of Query-Triples into the ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

4.2. The use of JAPE grammars in AquaLog: an illustrative example

Consider the basic query in Fig. 5, "what are the research areas covered by the akt project?", as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "research areas" and "akt project", the relations "are" and "covered by", and the query term "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e. before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the (((RB)ADJ)|NOUN|CC) part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as are, for example, the macro LIST used by the rule RuleQU1 and the macro VERB used by the rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g. the last pattern identifies words assigned to the VB POS tags.
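Outside JAPE, the shape of RuleNP can be imitated with an ordinary regular expression over coarse POS tags. A sketch under the simplified tag set used in the text (the single-letter coding is ours, not GATE's):

```python
import re

# Coarse tags coded as single letters so a regex can match the sequence,
# mirroring the JAPE pattern (DET)* (((RB)* ADJ) | NOUN | CC)* NOUN
TAG_CODE = {"DET": "D", "RB": "R", "ADJ": "A", "NOUN": "N", "CC": "C"}
NP_PATTERN = re.compile(r"D*(?:R*A|N|C)*N")

def is_noun_phrase(tags) -> bool:
    """True if the whole POS-tag sequence matches the noun-phrase pattern."""
    try:
        coded = "".join(TAG_CODE[t] for t in tags)
    except KeyError:          # a tag outside the simplified tag set
        return False
    return NP_PATTERN.fullmatch(coded) is not None
```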

Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags are assigned in the "category" feature of the "Token" annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side, e.g. :rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QUWhich IS_ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QUWhich, IS_ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
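The second, classifying phase can be pictured as a lookup from sequences of annotation types to question types. The pattern tuples and the type names below are illustrative only, not AquaLog's actual identifiers:

```python
# Illustrative phase-two classification: the sequence of annotation types
# produced by the first grammar is matched against query-type patterns,
# in the spirit of the "QUWhich IS_ARE TERM RELATION TERM" JAPE pattern.
PATTERNS = {
    ("QU", "IS_ARE", "TERM", "RELATION", "TERM"): "wh-generic term",
    ("QU", "IS_ARE", "TERM", "TERM"): "wh-unknown relation",
}

def classify(annotation_types):
    """Return the query type whose pattern matches the annotation sequence."""
    return PATTERNS.get(tuple(annotation_types), "unclassified")
```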

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism, explained in Section 5.5.

An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user term. Moreover, relation names can be considered "second class citizens". The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: in the case of concepts, labels often capture the meaning of the semantic entity, while in relations the meaning is given by the type of the domain and the range, rather than by the name (typically vaguely defined, e.g. "related to"), with the exception of the cases in which the relations are presented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. in the query "is Enrico working in ibm?",

where "Enrico" is mapped into "Enrico-motta" in the KB, but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we will show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let us consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type wh-query (represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that "research areas" is actually a "research area" in the target KB and that "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses", and therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple ⟨research areas, covered, akt⟩, all the subclasses of the generic term "research areas" need to be considered. To clarify this, consider a case in which we are talking about a generic term "person": the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".
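The relation search just described (expand the subclasses of the generic term and the superclasses of the second term, since relations are inherited through the subsumption hierarchy) can be sketched over a toy taxonomy. The class names, hierarchy and relation table below are invented for illustration and are not the actual AKT ontology:

```python
# Toy taxonomy and relation table (illustrative only)
SUBCLASSES = {"research-area": ["research-area", "semantic-web-area"]}
SUPERCLASSES = {"project": ["project", "activity"]}
RELATIONS = {  # (class1, class2) -> relation names defined between them
    ("research-area", "activity"): ["addresses-generic-area-of-interest"],
    ("research-area", "project"): ["uses-resource"],
}

def candidate_relations(cls1, cls2):
    """Collect every relation linking a subclass of cls1 to a superclass
    of cls2; the classes themselves are included in the expansion."""
    found = []
    for sub in SUBCLASSES.get(cls1, [cls1]):
        for sup in SUPERCLASSES.get(cls2, [cls2]):
            found += RELATIONS.get((sub, sup), [])
    return sorted(found)
```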

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further

discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, eventual aliases, or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen for the Fig. 10 example.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into ⟨person, secretary, kmi⟩ following purely linguistic criteria, while the ontology may be organized in terms of ⟨secretary, works-in-unit, kmi⟩. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person". Therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

Fig. 12. Example of AquaLog in action for relations formed by a concept.

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. The translation process from lexical to ontological triples can therefore go ahead without misinterpretations.


Note that some additional lexical features help in the translation or put a restriction on the answer presented in the Onto-Triple: whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?"; and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the previous examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

For example, to handle the query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (in this case, the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, ⟨project, sponsored, epsrc⟩, and one in which the relation is implicit, ⟨project, semantic web⟩. The answer is a combination of the list of projects related to the semantic web and the list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, it is first necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("are there any projects on the semantic web sponsored by epsrc?") and Fig. 14 ("are there any projects by researchers working in akt?"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".
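The attachment heuristic (attach the trailing modifier to the closest term that is a class, i.e. a non-ground term, rather than an instance) can be sketched as a right-to-left scan over the candidate terms. The helper and variable names here are ours, not AquaLog's:

```python
def attach_modifier(candidate_terms, ground_instances):
    """Return the term the trailing modifier should attach to: the closest
    preceding term that is not a ground term (instance) in the ontology."""
    for term in reversed(candidate_terms):
        if term not in ground_instances:
            return term
    return candidate_terms[0]  # fall back to the query term

# "which planet news stories are written by researchers in akt"
terms = ["planet news stories", "researchers"]  # terms before "in akt"
attachment = attach_modifier(terms, ground_instances={"akt"})
```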

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of queries, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let us consider the query "who are the researchers in akt who have interest in knowledge reuse?". By checking the ontology, or if needed by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in such a way that it can determine that "akt" is a "project", and therefore that it is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a concept in the ontology which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of the list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by the use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous, even for a human reader.⁷ In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publications", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples ⟨researchers, wrote, publications⟩ and ⟨publications, related, social aspects⟩. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc". Consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
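The role of the main (lexical) verb can be pictured as splitting the token sequence into nominal group and predicate; the clause then attaches according to which side it falls on. A sketch with invented helper names:

```python
def split_at_main_verb(tokens, main_verb):
    """Split a question into its nominal group (before the main verb)
    and its predicate (the main verb onwards)."""
    i = tokens.index(main_verb)
    return tokens[:i], tokens[i:]

q1 = "which academics work in the akt project sponsored by epsrc".split()
nominal1, predicate1 = split_at_main_verb(q1, "work")
# "sponsored" falls inside the predicate, so the clause attaches to the
# predicate's noun, "akt project"

q2 = "which academics working in the akt project are sponsored by epsrc".split()
nominal2, predicate2 = split_at_main_verb(q2, "are")
# here "sponsored" belongs to the main verb group itself, so the clause
# attaches back to the subject of the nominal group, "academics"
```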

4 Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concepts, names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (i.e., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (i.e., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement for obtaining ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate if one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB, and may even change with ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
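A minimal sketch of this metric-plus-threshold matching, using Python's difflib ratio as a stand-in for the Jaro-based metrics of [13] (the ontology vocabulary and the 0.6 threshold are illustrative, not AquaLog's actual values):

```python
from difflib import SequenceMatcher

# Hypothetical ontology vocabulary; AquaLog reads these from the target KB.
ONTOLOGY_TERMS = ["researcher", "academic-staff-member", "project", "works-for"]

def similarity(a: str, b: str) -> float:
    """Stand-in string metric (the real system mainly uses Jaro-based metrics)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_mappings(query_term: str, threshold: float = 0.6):
    """Ontology terms whose score clears a task-dependent threshold, best first."""
    scored = [(t, similarity(query_term, t)) for t in ONTOLOGY_TERMS]
    return sorted((ts for ts in scored if ts[1] >= threshold),
                  key=lambda ts: -ts[1])

print(candidate_mappings("researchers"))
```

As in AquaLog, terms with dissimilar labels ("researcher" vs. "academic") fall below any sensible threshold, which is exactly where the WordNet and learning fallbacks take over.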

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so it can make use of hyponyms and hypernyms. Currently the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of people who have an interest in ontologies and the people who are interested in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by epsrc?" the second triple 〈akt project sponsored epsrc〉 must be resolved and the instance representing the "akt project sponsored by epsrc" identified to get the list of academics



Fig. 16. Example of jargon mapping through user-interaction.

required for the first triple 〈KMi academics work akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?" the second triple 〈researchers working semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈homepage researchers〉.
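The three linking mechanisms can be sketched over a toy triple store (the KB contents, names and resolver below are invented for illustration; AquaLog queries the real KB through its plug-in mechanism):

```python
# Toy KB of (subject, relation, object) assertions; all names are invented.
KB = {
    ("alice", "has-interest", "ontologies"),
    ("bob", "has-interest", "knowledge-reuse"),
    ("carol", "works-in", "akt"),
    ("akt", "sponsored-by", "epsrc"),
}

def resolve(relation: str, obj: str) -> set:
    """Subjects s such that (s, relation, obj) holds in the KB."""
    return {s for (s, r, o) in KB if r == relation and o == obj}

# (1) And/or linking: fuse the answers of two independent triples.
interested = (resolve("has-interest", "ontologies")
              | resolve("has-interest", "knowledge-reuse"))

# (2)/(3) Conditional link: resolve the second triple first, then feed its
# answer into the first triple as its object.
academics = set()
for project in resolve("sponsored-by", "epsrc"):   # the sponsored project(s)
    academics |= resolve("works-in", project)

print(interested, academics)
```

The conditional cases differ only in whether the intermediate answer fills a term or a whole triple; both reduce to resolving the dependent triple first.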

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language-universe and the reduced ontology-language-universe.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user-refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.



Fig. 17. The learning mechanism architecture.

normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets), instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for some matching results. Subsequently these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.
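The matching-then-learning flow just described can be sketched as follows (the dictionary schema and the equality-based context check are our simplifications; the real CheckContext walks the subclass hierarchy rather than comparing classes for equality):

```python
# Toy learned lexicon: word -> stored mapping and context (names invented).
LEXICON = {
    "collaborates": {"relation": "has-affiliation-to-unit",
                     "context": ("people", "organization-unit")},
}

def match(word, arg1_class, arg2_class, ask_user):
    """Reuse a learned mapping when the stored context covers the question's
    arguments; otherwise fall back to user disambiguation and learn the result."""
    entry = LEXICON.get(word)
    if entry and entry["context"] == (arg1_class, arg2_class):
        return entry["relation"]                     # valid substitution
    relation = ask_user(word)                        # user picks a relation...
    LEXICON[word] = {"relation": relation,           # ...and the LM records it
                     "context": (arg1_class, arg2_class)}
    return relation

# Known word in a known context: answered without bothering the user.
print(match("collaborates", "people", "organization-unit",
            ask_user=lambda w: "unused"))
```

The `ask_user` callback stands in for the disambiguation dialogue; once a word is learned, repeat questions in the same context skip it.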

Of course, the matching method's movement through the ontology is the opposite of the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem, therefore, is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
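Pictured as a record, the stored entry might look like this (the dataclass and field names are ours; the paper specifies what is stored, not a schema):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class LearnedMapping:
    """One row of the learned lexicon (illustrative schema)."""
    word: str                      # key field used by the matching method
    ontology_relation: str         # the mapping chosen by the user
    context_args: Tuple[str, str]  # the two arguments of the triple
    ontology: str                  # name of the (pluggable) ontology
    user: Optional[str] = None     # None = learned in a generic session

m = LearnedMapping(word="collaborates",
                   ontology_relation="has-affiliation-to-unit",
                   context_args=("person", "knowledge media institute"),
                   ontology="akt-reference-ontology",
                   user="some-user-id")
print(m.word, "->", m.ontology_relation)
```

Keeping the user field optional mirrors the distinction, discussed in Section 6.4, between session-bound and generic access to the learned material.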

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, checking whether they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.
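The generalization can be sketched as a climb through the class hierarchy (the hierarchy and relation signature below are invented; the real algorithm backtracks through the ontology's own superclass links and relation definitions):

```python
# Invented class hierarchy: child -> parent.
SUPERCLASS = {"phd-student": "researcher", "researcher": "people",
              "people": "thing", "kmi": "organization-unit",
              "organization-unit": "organization", "organization": "thing"}

# Invented domain for the relation being generalized.
RELATION_DOMAIN = {"has-affiliation-to-unit": "people"}

def ancestors(cls):
    """cls followed by all its superclasses, bottom-up."""
    chain = [cls]
    while chain[-1] in SUPERCLASS:
        chain.append(SUPERCLASS[chain[-1]])
    return chain

def generalize(relation, arg_class):
    """Highest superclass of arg_class that can still handle the relation,
    i.e. that still falls under the relation's domain."""
    domain = RELATION_DOMAIN[relation]
    best = arg_class
    for c in ancestors(arg_class):
        if domain in ancestors(c):   # c is the domain itself or below it
            best = c
    return best

print(generalize("has-affiliation-to-unit", "phd-student"))
```

Storing the highest class that still handles the relation ("people" here, rather than "phd-student") is what lets later questions with sibling arguments reuse the mapping.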

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User's disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on its specific characteristics.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing, since it can be mapped to both a person and an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced to only the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability



across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or when the ontology is populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved through the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because in that case the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user's help to disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether or not to use the LM, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing, and as a result a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate whether a question can be easily reformulated so as to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto.
11 http://semanticweb.kmi.open.ac.uk.


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology, and the way it is populated, only 33.33% (23 of 69) will produce a result (a non-empty answer) for the user. These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8; however, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, e.g., when a multiple relation is used, as in "what are the challenges and objectives [...]?"; (2) questions annotated wrongly, e.g., tokens not recognized as a verb by GATE ("present results") and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers such as a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as in "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels (using string metrics and WordNet synonyms), not by meaning; e.g., for "what is John Domingue working on?" we map into the triple 〈organization works-for john-domingue〉 instead of 〈research-area have-interest john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab: people could be ranked according to citation impact, formal status in the department, etc. Therefore we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)     (40 iterations)                     (16 iterations)

services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a line of work for future versions of the system.

7.2.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure) or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude: if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can improve, in the best case, from 37.77% of queries answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peter"s present in the KB).

Fig 21 Learning mechanism performance

aoHatbea

spoin

205 (9 of 45) 0 (0 of 45)666 (2 of 45) 0 (0 of 45)(40 iterations) (16 iterations)

urthermore with the use of the LM the number of queries thateed 2 or 3 iterations are dramatically reduced

Note that the LM improves the performance of the systemven for the first iteration (from 3777 queries that require noteration to 40) Thanks to the use of the notion of contexto find similar but not identical learned queries (explained inection 6) the LM can help to disambiguate a query even if it is

he first time the query is presented to the system These numbersan seem to the user like a small improvement in performanceowever we should take into account that the set of queries (total5) is also quite small logically the performance of the systemor the 1st iteration of the LM improves if there is an incrementn the number of queries

7.3. Portability across ontologies

In order to evaluate portability we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correct translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto



V Lopez et al Web Semantics Science Service

the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintage-year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "meal-course", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessert-course". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).
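The configuration effort described here amounts to a handful of settings. A hypothetical sketch of such a configuration, purely illustrative (AquaLog's actual file format and property names are not shown in the text):

```properties
# Illustrative only: not AquaLog's real configuration syntax.
ontology.server.name = WebOnto
ontology.server.location = http://plainmoor.open.ac.uk:3000/webonto
ontology.name = wine-and-food

# Associations for "where"/"when" questions (no association fits "who" here)
question.where.concept = region
question.when.concept = vintage-year

# Manually added synonyms to complement WordNet
synonyms.meal-course = starter, entree, food, snack, supper
synonyms.dessert-course = pudding
```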

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties on the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology wines are represented as individuals. Meal-courses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of meal-courses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but it is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with meal-courses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of meal-course (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instance definitions in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 meal-course
  ((has-food fowl) (has-drink Riesling)))
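The "direct path, two triples in length" requirement can be illustrated as a bounded graph search over instances like those of Table 3. This is a sketch of the requirement only, not AquaLog's implementation; the triple list and the find_path helper are ours:

```python
from collections import deque

# Hypothetical miniature KB as a directed labeled graph of
# (subject, relation, object) triples, mirroring Table 3.
TRIPLES = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def neighbours(node):
    """Traverse edges in both directions, as in an undirected path search."""
    for s, r, o in TRIPLES:
        if s == node:
            yield r, o
        if o == node:
            yield r, s

def find_path(a, b, max_relations=2):
    """Return a path of at most `max_relations` edges between two query
    terms, or None. AquaLog needs such a short path to answer a query."""
    queue = deque([(a, [])])
    seen = {a}
    while queue:
        node, path = queue.popleft()
        if node == b:
            return path
        if len(path) == max_relations:
            continue
        for rel, nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None

print(find_path("pizza", "marietta-zinfadel"))  # two-triple path via new-course7
print(find_path("pizza", "riesling"))           # no short path: None
```

With the instances of Table 3 in place, "pizza" and "marietta-zinfadel" are bridged by a two-relation path through the mediating meal-course instance, which is exactly the condition under which AquaLog can answer.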




Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does not present complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries                          (47 of 68)  69.11%
  Total number of successful queries without conceptual failure  (12 of 68)  17.64%
  Total number of queries with only conceptual failure        (35 of 68)  51.47%

AquaLog failures:
  Total number of failures (same query can fail for more than one reason)  (21 of 68)  30.88%
  RSS failure                                                 (15 of 68)  22.05%
  Service failure                                             (0 of 68)   0%
  Data model failure                                          (0 of 68)   0%
  Linguistic failure                                          (9 of 68)   13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated                 (14 of 21)  66.66% of failures
  With conceptual failure                                     (9 of 14)   64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question for which the RSS can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of possible extensions needed for the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "meal-course" through an "ad hoc" relation: "meal-course" is the mediating concept between "meal" and all kinds of food: 〈meal, course, meal-course〉, 〈meal-course, has-food, non-red-meat〉. Moreover, the concept "wine" is related to food only through the concept "meal-course" and its subclasses (e.g., "dessert course"): 〈meal-course, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a meal-course based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, made-from-fruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it takes the direct relation "made-from-fruit" instead of going through the mediating concept "meal-course". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
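The triple reclassification envisaged here can be sketched as a schema-level rewrite: when no direct relation links two query concepts, look for a mediating concept and split the query triple in two. A minimal illustration (concept and relation names are simplified from the wine ontology; this is not the actual RSS code):

```python
# Schema-level relations of a hypothetical wine-ontology fragment:
# (domain, relation, range). Names are illustrative.
SCHEMA = [
    ("meal", "course", "meal-course"),
    ("meal-course", "has-drink", "wine"),
    ("meal-course", "has-food", "food"),
]

def direct_relations(a, b):
    """Relations directly linking two concepts (direction ignored, as a
    query triple does not fix subject/object order)."""
    return [r for d, r, rng in SCHEMA if {d, rng} == {a, b}]

def expand_via_mediator(a, b):
    """If no direct relation links concepts a and b, split the query triple
    into two triples through a mediating concept, as the text suggests a
    future RSS should do dynamically."""
    if direct_relations(a, b):
        return [(a, direct_relations(a, b)[0], b)]
    for d1, r1, rng1 in SCHEMA:
        mediator = rng1 if d1 == a else (d1 if rng1 == a else None)
        if mediator is None:
            continue
        for r2 in direct_relations(mediator, b):
            return [(a, r1, mediator), (mediator, r2, b)]
    return None

# "which wines should be served with a meal?" has no direct wine-meal
# relation, but it can be mapped through the mediating "meal-course":
print(expand_via_mediator("wine", "meal"))
```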

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, a lack of appropriate reasoning services or a lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.

8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, the genitive ('s), and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to implicitly denote entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]?".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; thus three triples instead of two are needed.
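The consolidation/pre-processing step described above (collapsing "who was the author" to "who wrote" before building the minimal set of Query-Triples) can be sketched as a small rewrite table. The patterns, relation names and triple shape below are illustrative, not AquaLog's actual code:

```python
import re

# Illustrative normalisation table: surface phrasings that collapse to the
# same relation before Query-Triples are built (entries are hypothetical).
REWRITES = [
    (re.compile(r"\bwho (was|is) the author of\b"), "who wrote"),
    (re.compile(r"\bwho (was|is) the director of\b"), "who directs"),
]

def normalise(question):
    q = question.lower()
    for pattern, replacement in REWRITES:
        q = pattern.sub(replacement, q)
    return q

def query_triple(question):
    """Build a single generic-query triple <?person, relation, object> from
    the normalised form: a toy stand-in for the Linguistic Component."""
    q = normalise(question)
    m = re.match(r"who (\w+) (.+)", q)
    if m:
        return ("?person", m.group(1), m.group(2).rstrip("?"))
    return None

# Both formulations yield the same minimal Query-Triple:
print(query_triple("Who wrote the magpie paper?"))
print(query_triple("Who was the author of the magpie paper?"))
```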

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the path in the ontology can only be found by expanding the query with the additional term "grant" (see also the "meal-course" example in Section 7.3, using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered, for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e. "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase. The configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
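One possible mechanism, sketched below under strong simplifying assumptions (a toy vocabulary and a crude lemmatiser, both invented for illustration), is to check whether each part of an unrecognised multi-word term maps to an ontology entity on its own, and if so emit a triple with an implicit relation to be resolved later:

```python
# Hypothetical ontology vocabulary: which strings name a known class or
# instance (purely illustrative, not a real ontology).
ONTOLOGY_TERMS = {"researcher": "class", "france": "instance",
                  "department": "class", "research": "class"}

def lemma(word):
    """Crude lemmatiser for the sketch: strip plural 's' and map a couple
    of demonyms; a real system would use WordNet or similar."""
    demonyms = {"french": "france"}
    w = word.lower().rstrip("s")
    return demonyms.get(w, w)

def interpret_term(term):
    """Return either a direct ontology mapping (entity kind), or, for a
    nominal / adjective-noun compound, a triple with an implicit relation
    left for the RSS to resolve."""
    parts = [lemma(p) for p in term.split()]
    if len(parts) == 1:
        return ONTOLOGY_TERMS.get(parts[0])
    if all(p in ONTOLOGY_TERMS for p in parts):
        head, modifier = parts[-1], parts[0]
        return (head, "?implicit-relation", modifier)
    return None

# "French researchers" is not one ontology term but a compound: an
# implicit relationship between a class and an instance must be created.
print(interpret_term("French researchers"))
```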

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with peter who has interests in the semantic web?" and "which academics work with peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
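The "answer for both" strategy suggested above for ambiguous connectives can be sketched with set operations over a toy knowledge base (the data and function names are illustrative):

```python
# Toy KB: applicant -> states of residence (invented data).
LIVES_IN = {
    "alice": {"california"},
    "bob": {"arizona"},
    "carol": {"california", "arizona"},  # e.g. dual residence
}

def applicants_living_in(states, connective):
    """Answer 'list all applicants who live in X and Y' under both
    readings: 'and' as conjunction (intersection of constraints) or as
    disjunction (union)."""
    states = set(states)
    if connective == "conjunction":
        return {a for a, s in LIVES_IN.items() if states <= s}
    return {a for a, s in LIVES_IN.items() if states & s}

# Lacking domain knowledge, a system can simply present both answers:
print(applicants_living_in(["california", "arizona"], "conjunction"))  # {'carol'}
print(applicants_living_in(["california", "arizona"], "disjunction"))
```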


8.4. Reasoning mechanisms and services

A future AquaLog version should include ranking services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunities". Therefore, the two mappings entail two different contexts, depending on the users' community.
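A community-aware jargon store of this kind reduces to keying learned mappings on the pair (phrase, community) rather than on the phrase alone. A minimal sketch, with invented class and relation names:

```python
# Sketch of a community-aware jargon store: the same phrase maps to
# different ontology terms depending on the user's community
# (all names are illustrative, not AquaLog's internals).
class JargonLearner:
    def __init__(self):
        self.mappings = {}  # (phrase, community) -> ontology interpretation

    def learn(self, phrase, community, interpretation):
        self.mappings[(phrase, community)] = interpretation

    def resolve(self, phrase, community):
        return self.mappings.get((phrase, community))

lm = JargonLearner()
# A PhD student and a professor attach different meanings to "cool projects":
lm.learn("cool projects", "phd-students",
         ["project", "has-research-area", "area-with-many-publications"])
lm.learn("cool projects", "professors",
         ["project", "has-funding", "funding-opportunity"])

print(lm.resolve("cool projects", "phd-students"))
print(lm.resolve("cool projects", "professors"))
print(lm.resolve("cool projects", "visitors"))  # unknown community: ask the user
```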

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.

This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and the RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may need to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation of the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
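This last requirement, deciding whether two instances from different sources denote the same individual, can be approximated in a first pass with the kind of string metrics AquaLog already uses elsewhere. A minimal sketch using a single standard-library metric (the threshold and the normalisation are illustrative choices, not PowerAqua's actual method):

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalised string similarity; ontology-based systems typically
    combine several string metrics, here reduced to one stdlib ratio."""
    norm = lambda s: s.lower().replace("-", " ").replace("_", " ").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def same_individual(label_a, label_b, threshold=0.85):
    """Crude check for whether two instance labels from different
    ontologies may denote the same individual. A real system would also
    compare the instances' relations, not just their labels."""
    return similarity(label_a, label_b) >= threshold

print(same_individual("Enrico_Motta", "enrico motta"))    # True
print(same_individual("Enrico Motta", "John Domingue"))   # False
print(same_individual("Enrico Motta", "E. Motta"))
```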

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to


what extent AquaLog’s novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 A detailed overview of the state of the art for these systems can be found in Ref. [3].

Most of the early NLIDB systems were built with a particular database in mind; thus, they could not easily be modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog’s portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user’s request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name; the same rule will thus handle “what is the capital of Italy?”, “print the capital of Italy” and “Could you please tell me the capital of Italy?”. The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me”, instead of giving a failure it could get the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.
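
The fallback idea just described can be sketched as follows. This is our own toy illustration, not AquaLog code: the scaffolding regex, the triple shape and the `resolve` helper are hypothetical.

```python
# Illustrative sketch (not AquaLog's implementation): strip question
# scaffolding, keep the content terms, build an underspecified triple,
# and resolve its relation by searching the ontology for a statement
# linking the two terms.
import re

SCAFFOLDING = re.compile(
    r"^(could you please tell me|what is|print)\s+", re.IGNORECASE)

def fallback_triple(question: str):
    """Reduce a query to <term1, ?relation, term2>."""
    core = SCAFFOLDING.sub("", question.rstrip("?").strip().lower())
    # Naive split on "of": "the capital of italy" -> ("capital", "italy").
    parts = [p.strip(" the") for p in core.split(" of ")]
    if len(parts) == 2:
        return (parts[0], "?relation", parts[1])
    return None

def resolve(triple, ontology):
    """Find any ontology statement whose subject and object match."""
    subj, _, obj = triple
    for (s, rel, o) in ontology:
        if s == subj and o == obj:
            return (s, rel, o)
    return None

# A toy ontology as (subject, relation, object) statements.
toy = [("capital", "is-capital-of", "italy")]
t = fallback_triple("Could you please tell me the capital of Italy?")
print(resolve(t, toy))  # -> ('capital', 'is-capital-of', 'italy')
```

The point of the sketch is that the triple, not the surface pattern, carries the query: once the scaffolding is gone, the same code path serves every phrasing.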

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN’s PARLANCE, IBM’s LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.


between the user’s question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user’s question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user’s query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.
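
The triple-centred pipeline can be sketched roughly as follows. This is a minimal sketch under our own assumptions: the toy question pattern, synonym table and KB are illustrative stand-ins, not AquaLog’s actual components.

```python
# Sketch of the pipeline described above: the linguistic front end
# emits a query triple; a similarity service rewrites it into ontology
# vocabulary; the answer is read directly off the KB triples.
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    relation: str
    obj: str

def linguistic_front_end(question: str) -> Triple:
    # Hypothetical shallow parse: "who works in akt" -> <person, works, akt>.
    words = question.lower().rstrip("?").split()
    if words[0] == "who":
        return Triple("person", words[1], words[-1])
    raise ValueError("unsupported question form in this sketch")

def relation_similarity_service(t: Triple) -> Triple:
    # Map linguistic terms onto ontology vocabulary; a toy dictionary
    # stands in for string metrics + WordNet here.
    synonyms = {"works": "works-for", "person": "researcher"}
    return Triple(synonyms.get(t.subject, t.subject),
                  synonyms.get(t.relation, t.relation),
                  t.obj)

def answer(t: Triple, kb) -> list:
    # Read the answer off matching KB statements (relation and object).
    return [s for (s, r, o) in kb if r == t.relation and o == t.obj]

kb = [("vanessa-lopez", "works-for", "akt"),
      ("enrico-motta", "works-for", "akt")]
ontology_triple = relation_similarity_service(
    linguistic_front_end("who works in akt?"))
print(answer(ontology_triple, kb))  # -> ['vanessa-lopez', 'enrico-motta']
```

Because the same triple structure flows through both stages, only the mapping tables are domain-specific, which is the portability argument made in the text.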

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained automatically through a learning mechanism. Also, all the domain-dependent configuration parameters, like “who” being the same as “person”, are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, an interactive service that tries to disambiguate using the semantics in the ontology or, otherwise, by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user’s query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE because the word “neighborhood” is unknown. AquaLog would not necessarily know the term “neighborhood” either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
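
The behaviour described in the last example can be sketched like this. The schema, instances and interactive stand-in are our own toy constructions, not AquaLog internals.

```python
# Sketch: when the query relation ("neighborhood") has no match, fall
# back to the relations the ontology defines for the subject's class;
# if several remain, defer to the user (here a stand-in picks first).
schema = {  # toy ontology: class -> {relation: range class}
    "city": {"has-district": "district", "has-mayor": "person"},
}
instances = {"chicago": "city"}
facts = [("chicago", "has-district", "lincoln park"),
         ("chicago", "has-district", "hyde park")]

def ask_user(options):
    # Stand-in for AquaLog's interactive feedback dialogue.
    return options[0]

def interpret(subject: str, relation: str):
    cls = instances[subject]
    if relation in schema[cls]:          # exact match on the relation name
        candidates = [relation]
    else:                                # unknown word: consider every
        candidates = list(schema[cls])   # relation defined for the class
    rel = candidates[0] if len(candidates) == 1 else ask_user(candidates)
    return [o for (s, r, o) in facts if s == subject and r == rel]

print(interpret("chicago", "neighborhood"))
# -> ['lincoln park', 'hyde park']
```

The lexicon-free step is the key contrast with PRECISE: the ontology structure itself supplies the candidate interpretations when the word is unknown.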

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “Identifying the semantic type of the entity sought by the question”; (2) “Determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, . . .); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week”, it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain-independent IE is expected to bring about a breakthrough in QA.”

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.’s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake and river. In AquaLog, the triple contains information not only about the answer expected, or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed with “what is”, “who”, “where”, “tell me”, etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in Ref. [23], FALCON identifies the expected answer type of the question “what do penguins eat?” as food, because “it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding”. All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that “in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label”. Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, the system chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

“the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches.”

In AquaLog, the vocabulary problem (mapping the query onto ontology triples) is solved not only by looking at the labels, but also by looking at lexically related words and at the ontology itself (the context given by the triple terms and relationships); as a last resort, ambiguity is resolved by asking the user.
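
The layered matching strategy just described can be sketched as follows. The label list, the one-entry synonym table (standing in for WordNet) and the 0.75 threshold are all our own toy assumptions.

```python
# Sketch of layered term matching: exact label, then lexically related
# words, then a string metric, and finally deferring to the user.
from difflib import SequenceMatcher

LABELS = ["researcher", "has-research-interest", "kmi-planet-news-item"]
SYNONYMS = {"scientist": "researcher"}  # toy stand-in for WordNet

def match(term: str):
    if term in LABELS:
        return term, "exact"
    if term in SYNONYMS:
        return SYNONYMS[term], "synonym"
    # String-metric layer: pick the closest label above a threshold.
    best = max(LABELS, key=lambda l: SequenceMatcher(None, term, l).ratio())
    if SequenceMatcher(None, term, best).ratio() > 0.75:
        return best, "string-metric"
    return None, "ask-user"   # last resort: interactive disambiguation

print(match("scientist"))    # -> ('researcher', 'synonym')
print(match("researchers"))  # -> ('researcher', 'string-metric')
```

Each layer only fires when the cheaper ones fail, so user interaction is genuinely the last resort, mirroring the order given in the text.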

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the


Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites; systems such as START [31], REXTOR [32] and AnswerBus [54] aim instead to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog’s relational data model (triple-based) is somewhat similar to the approach adopted by START, called “object-property-value”. The difference is that, instead of properties, we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], “what languages are spoken in Guernsey?”: for START, the property is “languages” between the object “Guernsey” and the value “French”; for AquaLog, the question is translated into a relation “are spoken” between a term “language” and a location “Guernsey”.

The system described in Litkowski [36], called DIMAP, extracts “semantic relation triples” after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer questions. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that “characterizes the entity’s role in the sentence”, and a governing word, which is “the word in the sentence that the discourse entity stood in relation to”. The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable, corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a “coarse-grained question taxonomy” consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: “who”, “when”, “where”. In other cases additional information is needed. For example, with “how” questions the


category is found from the adjective following “how” (“how many” or “how much” for quantity, “how long” or “how old” for time, etc.). The type of a “what 〈noun〉” question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form “what 〈verb〉” have the same answer type as the object of the verb. “What is” questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, “it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense”. Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an “is-a” sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example “man first walked on the moon in 1969” is presented, in which “1969” depends on “in”, which in turn depends on “moon”. Attardi et al. propose short-circuiting the “in” by adding a direct link between “moon” and “1969”, following a rule that says that “whenever two relevant nodes are linked through an irrelevant one, a link is added between them”.

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the “explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web”. As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations, or triples, and the mapping process converts the elements of the triple into entry points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a “federation” of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

“since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations”.

The knowledge-based approach described in Ref. [12] is to “augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users’ questions which are outside the scope of the prewritten text”. It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, “the grammar is to a large extent general purpose; it can be used for other domains and for other applications”), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then it is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as “what are the homepages of researchers who have an interest in the Semantic Web?”, we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term “researchers” on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], “open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation”.

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format; open-domain systems based on the expected answer (person, location, . . .) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, . . .). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.); questions are also distinguished by the triples they generate (number of triples, explicit versus implicit relationships or query terms) and by how they should be resolved, so that different questions which can be represented by equivalent triples are grouped together. For instance, the query “Who works in akt?” is the same as “Which are the researchers involved in the akt project?”.

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA system, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog’s requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match in the ontology. Moreover, the performance of the system improves over time, in response to the particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)
Who are the researchers in the semantic web research area? | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space? | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple
What is the homepage of Peter who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
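The rows above all share the same shape: a subject, an optional relation and an object. As a minimal sketch (the `Triple` class and the chosen example values are illustrative, not AquaLog code), one Appendix A row can be encoded as:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Triple:
    """A subject-relation-object triple; the relation may be unknown."""
    subject: str
    relation: Optional[str]
    obj: str

# Linguistic triple for "are there any projects about semantic web?":
# the relation is implicit, so it is left unknown at the linguistic stage.
ling = Triple("projects", None, "semantic web")

# Ontology-compliant triple produced once the terms are mapped onto the KB.
onto = Triple("project", "addresses-generic-area-of-interest", "semantic-web-area")

print(ling.relation)  # None
print(onto.relation)  # addresses-generic-area-of-interest
```

The job of the Relation Similarity Service, described in Section 5, is precisely to turn the first structure into the second.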

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The answer bus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



Fig. 7. Example of GATE annotations and linguistic triples for combination of queries.

not have to worry about getting the representation completely right. The role of the intermediate representation is simply to provide an easy and sufficient way to manipulate the input for the RSS. An alternative would be to use more complex linguistic parsing but, as discussed by Katz et al. [33,29], although parse trees (as produced, for example, by the Stanford NLP parser [34]) capture syntactic relations and dependencies, they are often time-consuming and difficult to manipulate.

4.1. Classification of questions

Here we present the different kinds of questions and the rationale behind distinguishing these categories. We deal with basic queries (Section 4.1.1), basic queries with clauses (Section 4.1.2) and combinations of queries (Section 4.1.3). We considered these three main groups of queries based on the number of triples needed to generate an equivalent representation of the query.

It is important to emphasize that at this stage all the terms are still strings, or arrays of strings, without any correspondence with the ontology. This is because the analysis is completely domain independent and is entirely based on the GATE analysis of the English language. The Query-Triple is only a formal, simplified way of representing the NL query. Each triple represents an explicit relationship between terms.

4.1.1. Linguistic procedure for basic queries

There are several different types of basic queries. We can mention the affirmative/negative query type, i.e. those queries that require an affirmation or negation as an answer, e.g. “is John Domingue a phd student?”, “is Motta working in ibm?”. Another big set of queries are those constituted by a “wh-question”, such as the ones starting with what, who, when, where, are there any, does anybody/anyone or how many. Also, imperative commands like list, give, tell, name, etc. are treated as wh-queries (see an example table of query types in Appendix A).

In fact, wh-queries are categorized depending on the equivalent intermediate representation. For example, in “who are the academics involved in the semantic web?” the generic triple will be of the form 〈generic term, relation, second term〉, in our example 〈academics, involved, semantic web〉. A query with an equivalent triple representation is “which technologies has KMi produced?”, where the triple will be 〈technologies, has produced, KMi〉. However, a query like “are there any phd students in akt?” has another equivalent representation, where the relation is implicit or unknown: 〈phd students, akt〉. Other queries may provide little or no information about the type of the expected answer, e.g. “what is the job title of John?”, or they can be just a generic enquiry about someone or something, e.g. “who is Vanessa?”, “what is an ontology?” (see Fig. 5).
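The grouping of queries by the number (and arity) of their Query-Triples, described at the start of Section 4.1, can be sketched as follows. The tuple encodings and the `query_group` function are illustrative assumptions, not AquaLog internals:

```python
def query_group(triples):
    """Classify a query by the Query-Triples it produced: one 3-term triple
    is a basic query, one 4-term triple is a basic query with a clause, and
    more than one triple is a combination of queries."""
    if len(triples) == 1:
        return "basic" if len(triples[0]) == 3 else "basic-with-clause"
    return "combination"

# "who are the academics involved in the semantic web?"
basic = [("academics", "involved", "semantic web")]
# "are there any planet news written by researchers from akt?" (three terms)
clause = [("planet news", "written", "researchers", "akt")]
# "are there any planet news written by researchers working in akt?"
combo = [("planet news", "written", "researchers"), ("which are", "working", "akt")]

print(query_group(basic))   # basic
print(query_group(clause))  # basic-with-clause
print(query_group(combo))   # combination
```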

4.1.2. Linguistic procedure for basic queries with clauses (three-terms queries)

Let us consider the request “List all the projects in the knowledge media institute about the semantic web”, where both “in knowledge media institute” and “about semantic web” are modifiers (i.e. they modify the meaning of other syntactic constituents). The problem here is to identify the constituent to which each modifier has to be attached. The Relation Similarity Service (RSS) is responsible for resolving this ambiguity, either through the use of the ontology or in an interactive way, by asking for user feedback. The linguistic component's task is therefore to pass the ambiguity problem to the RSS, through the intermediate representation, as part of the translation process.

We must remember that all these queries are always represented by just one Query-Triple at the linguistic stage. In the case that modifiers are present, the triple will consist of three terms instead of two; for instance, in “are there any planet news written by researchers from akt?” we get the terms “planet news”, “researchers” and “akt” (see Fig. 6).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the basic queries with clauses as explained in Section 5.2.


4.1.3. Linguistic procedure for combination of queries

A query can also be a composition of two explicit relationships between terms, or a composition of two basic queries. In this case the intermediate representation usually consists of two triples, one triple per relationship; for instance, in “are there any planet news written by researchers working in akt?” the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an “and” or “or” conjunction operator, as in “which projects are funded by epsrc and are about semantic web?”. This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned on a second query, as in “which researchers wrote publications related to social aspects?”, where the second clause modifies one of the previous terms. In this example, and at this stage, the ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. “what is the web address of Peter who works for akt?” or “which are the projects led by Enrico which are related to multimedia technologies?” (see Fig. 7).

Once the intermediate representation is created, prepositions and auxiliary words such as “in”, “about”, “of”, “at”, “for”, “by”, “the”, “does”, “do”, “did” are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping from Query-Triples into the ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.

Fig. 8. Set of JAPE rules used in the query example to generate annotations.
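A minimal sketch of this clean-up step, assuming the word list quoted above and a whitespace tokenization (the function name is invented, not AquaLog code):

```python
# Prepositions and auxiliary words whose semantics AquaLog does not exploit
# when mapping Query-Triples to Onto-Triples.
STOPWORDS = {"in", "about", "of", "at", "for", "by", "the", "does", "do", "did"}

def strip_auxiliaries(term: str) -> str:
    """Remove prepositions/auxiliary words from one Query-Triple term."""
    kept = [w for w in term.split() if w.lower() not in STOPWORDS]
    return " ".join(kept)

print(strip_auxiliaries("written by"))          # written
print(strip_auxiliaries("in the akt project"))  # akt project
```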

4.2. The use of JAPE grammars in AquaLog: illustrative example

Consider the basic query in Fig. 5, “what are the research areas covered by the akt project?”, as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns “research areas” and “akt project”, the relations “are” and “covered by”, and the query term “what” (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e. before “→”) identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET) part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the ((RB)ADJ)|NOUN|CC part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as are, for example, the macro LIST used by the rule RuleQU1 or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g. the last pattern identifies words assigned to the VB POS tags.


Fig. 9. Set of JAPE rules used in the query example to classify the queries.

POS tags are assigned in the “category” feature of the “Token” annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side, e.g. rel.REL = {rule = “REL1”} is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as “wh-generic term”. The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, “QUWhich IS ARE TERM RELATION TERM”, is marked in bold. This pattern relies on the macros QUWhich, IS ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.
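JAPE rules of this kind are essentially regular expressions over annotations. As a rough analogue only (the simplified tag set and the single rule below are assumptions, not the actual AquaLog grammar), the noun-phrase pattern can be mimicked by an ordinary regular expression over POS tags:

```python
import re

# Encode each token's POS tag as a single letter so the noun-phrase rule can
# be written as a plain regular expression, in the spirit of RuleNP: optional
# determiners, then adverbs/adjectives/nouns/conjunctions in any order,
# ending with a noun. Simplified sketch only.
TAG_LETTER = {"DET": "D", "RB": "R", "ADJ": "A", "NOUN": "N", "CC": "C"}
NP_RULE = re.compile(r"D*[RANC]*N")

def is_noun_phrase(tags):
    letters = "".join(TAG_LETTER[t] for t in tags)
    return bool(NP_RULE.fullmatch(letters))

# "the akt project" -> DET NOUN NOUN
print(is_noun_phrase(["DET", "NOUN", "NOUN"]))  # True
# an adjective followed by a determiner is not a noun phrase
print(is_noun_phrase(["ADJ", "DET"]))           # False
```

Extending the covered language subset then amounts to editing the pattern, which is the portability argument made above for the JAPE grammars.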

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of the Learning Mechanism explained in Section 5.5.


An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user's term. Moreover, relation names can be considered “second class citizens”. The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: labels in the case of concepts often capture the meaning of the semantic entity, while in relations the meaning is given by the type of the domain and the range, rather than by the name (typically vaguely defined, e.g. “related to”), with the exception of the cases in which the relations are represented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like “is there any researcher called Thompson?” would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. in the query “is Enrico working in ibm?”,


where “Enrico” is mapped into “Enrico-motta” in the KB, but “ibm” is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.
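As a toy stand-in for the instance-mapping step just described, the sketch below uses Python's stdlib `difflib` ratio in place of the string distance metrics surveyed in [13]; the KB labels and the 0.5 threshold are invented for the sketch:

```python
from difflib import SequenceMatcher

# Hypothetical instance labels standing in for the KMi knowledge base.
KB_INSTANCES = ["enrico-motta", "john-domingue", "vanessa-lopez"]

def map_instance(term, threshold=0.5):
    """Map a user term onto the most similar KB instance label, or return
    None when nothing is similar enough (the unrecognized-term case)."""
    scored = [(SequenceMatcher(None, term.lower(), label).ratio(), label)
              for label in KB_INSTANCES]
    score, label = max(scored)
    return label if score >= threshold else None

print(map_instance("Enrico"))  # enrico-motta
print(map_instance("ibm"))     # None -> triggers the indirect negative answer
```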

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. AquaLog also scans the answers for words denoting instances that the system “knows about”, i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction: intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

Fig. 11. Example of AquaLog additional information.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let us consider a basic question like “what are the research areas covered by akt?” (see Figs. 10 and 11). The query is classified by the Linguistic component as a basic generic-type one (a wh-query represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that “research areas” is actually a “research area” in the target KB and that “akt” is a “project”, through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links “research areas”, or any of its subclasses, to “projects” or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses", and therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, let's consider a case in which we are talking about a generic term "person": the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen for the Fig. 10 example.
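The control flow just described can be sketched with a toy example. The mini-ontology, relation names and synonym table below are hypothetical stand-ins (AquaLog itself queries the target KB and WordNet); the sketch only illustrates the idea of filtering candidate relations by the verb and its synonyms, and falling back to user choice when that fails:

```python
# Hypothetical stand-ins for the target KB and the WordNet lookup.
RELATIONS = {  # (subject class, object class) -> relations linking them
    ("research-area", "project"): ["addresses-generic-area-of-interest",
                                   "uses-resource"],
}
SYNONYMS = {"covered": {"addresses", "covers"}}

def interpret(subject_class, verb, object_class):
    """Map a query triple onto an ontology relation; return the candidate
    list instead when the verb cannot discriminate between them."""
    words = {verb} | SYNONYMS.get(verb, set())
    candidates = RELATIONS.get((subject_class, object_class), [])
    # keep relations whose hyphen-separated name shares a token with the verb
    matches = [r for r in candidates if words & set(r.split("-"))]
    if len(matches) == 1:
        return matches[0]   # unambiguous: answer directly
    return candidates       # ambiguous (or no match): ask the user to choose

print(interpret("research-area", "covered", "project"))
```

Here the synonym "addresses" singles out "addresses-generic-area-of-interest"; with a verb such as "has", both candidates survive and the user would be asked to choose.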

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases, the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (or sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic" or "student") to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

Fig. 12. Example of AquaLog in action for relations formed by a concept.

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases, AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, …) and a course. Therefore, the translation process from lexical to ontological triples can go ahead without misinterpretations.


Note that some additional lexical features which help in the translation, or put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases, the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?"; and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications, or query types, determine whether the verb is going to be part of the first triple, as in the latter example, or part of the second triple, as in the former example. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the previous examples), refers to.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class, or non-ground term, in the ontology (e.g., in this case, the class "project"). Hence, the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to the semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, first it is necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple, as for example "are written", which is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by eprsc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by eprsc" is attached to the query term in the first triple, i.e., "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e., "researchers".
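Assuming a hypothetical set of ontology classes, the attachment heuristic can be sketched as follows: scan the query terms from right to left and attach the modifier to the closest term that maps to a class (a non-ground term) rather than an instance. The term names are illustrative only:

```python
# Hypothetical non-ground terms (classes) in the target ontology.
ONTOLOGY_CLASSES = {"project", "researcher", "planet-news-story"}

def attach_modifier(terms, modifier):
    """Attach `modifier` to the closest preceding term that is a class."""
    for term in reversed(terms):        # right-to-left: closest term first
        if term in ONTOLOGY_CLASSES:    # non-ground term found
            return (term, modifier)
    return (terms[0], modifier)         # fallback: first query term

# "are there any projects on the semantic web sponsored by epsrc"
print(attach_modifier(["project", "semantic-web"], "sponsored by epsrc"))
# "which planet news stories are written by researchers in akt"
print(attach_modifier(["planet-news-story", "researcher"], "in akt"))
```

In the first query "semantic-web" is an instance, so the modifier climbs back to the class "project"; in the second, "researcher" is itself a class and captures the modifier.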

5.3. RSS procedure for combinations of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.


Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of query, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project", and therefore that it is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case, the relation term "researchers" is mapped into a "concept" in the ontology which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g., in the query "which academic works with Peter who has an interest in the semantic web?". In this case, since "academic" and "peter" are, respectively, a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7

In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases, user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
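This use of the main verb can be mimicked in a deliberately minimal sketch: once the question has been split into a nominal group and a predicate around the lexical verb, the trailing clause is attached to the term preceding it in the predicate, or to the head of the nominal group when the clause is the whole predicate. The segmentation below is supplied by hand, since the sketch does not include the linguistic component's parser:

```python
def attach_clause(nominal_group, predicate, clause):
    """Attach the trailing clause using the main-verb split.
    nominal_group / predicate: term lists on each side of the lexical verb;
    clause: the trailing modifier clause, as a term list."""
    if predicate == clause:
        # the clause *is* the predicate: it modifies the subject
        return nominal_group[0]
    # the clause sits inside the predicate: attach to the term before it
    i = predicate.index(clause[0])
    return predicate[i - 1] if i > 0 else nominal_group[0]

# (1) "which academics | work | in the akt project sponsored by epsrc"
print(attach_clause(["academics"], ["akt-project", "sponsored-by-epsrc"],
                    ["sponsored-by-epsrc"]))
# (2) "which academics working in the akt project | are sponsored | by epsrc"
print(attach_clause(["academics", "akt-project"], ["sponsored-by-epsrc"],
                    ["sponsored-by-epsrc"]))
```

Query (1) attaches the clause to "akt-project" inside the predicate; in query (2) the clause exhausts the predicate and so falls back to the subject, "academics".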

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

7 Note that there would be no ambiguity if the user reformulated the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change with ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism. Differences between ontologies need to be handled by specifying this information in a configuration file.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (e.g., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement for obtaining ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.
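The Jaro-style matching with thresholds described above can be sketched as below. This is a textbook implementation of the Jaro metric, not the CMU library AquaLog actually uses [13], and the 0.85 threshold is illustrative; AquaLog tunes separate thresholds for concept names, relations and instances:

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]: proportion of matching characters found
    within a sliding window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match2 = [False] * len2
    matched1 = []                      # indices of matched chars in s1
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not match2[j] and s2[j] == c:
                match2[j] = True
                matched1.append(i)
                break
    m = len(matched1)
    if m == 0:
        return 0.0
    s2_matched = [c for c, used in zip(s2, match2) if used]
    transpositions = sum(s1[i] != c for i, c in zip(matched1, s2_matched)) / 2
    return (m / len1 + m / len2 + (m - transpositions) / m) / 3

def best_match(term, ontology_terms, threshold=0.85):
    """Closest ontology term, if it clears the task-specific threshold."""
    best = max(ontology_terms, key=lambda t: jaro(term, t))
    return best if jaro(term, best) >= threshold else None

print(best_match("reserch-area", ["research-area", "project", "person"]))
```

A misspelled "reserch-area" still clears the threshold against "research-area", while a dissimilar label such as "academic" versus "researcher" would not, which is exactly the failure mode discussed above.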

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking, e.g., in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of people who have an interest in ontologies and the people who have an interest in knowledge reuse.

(2) Conditional link to a term, for example in "which KMi academics work in the akt project sponsored by eprsc?": the second triple, 〈akt project, sponsored, eprsc〉, must be resolved and the instance representing the "akt project sponsored by eprsc" identified, in order to get the list of academics required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple, for instance in "What is the homepage of the researchers working on the semantic web?": the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.
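The three linking mechanisms can be sketched over a toy fact base; the facts and relation names below are invented for illustration, whereas in AquaLog the triples are resolved against the target KB through the inference engine:

```python
# Toy fact base of (subject, relation, object) statements; names invented.
FACTS = [
    ("alice", "has-interest", "ontologies"),
    ("bob", "has-interest", "knowledge-reuse"),
    ("akt", "sponsored-by", "epsrc"),
    ("alice", "works-in", "akt"),
]

def resolve(rel, obj):
    """Answer one triple: all subjects standing in `rel` to `obj`."""
    return {s for s, r, o in FACTS if r == rel and o == obj}

def or_link(triple1, triple2):
    """Mechanism (1): fuse the result lists of two independent triples."""
    return resolve(*triple1) | resolve(*triple2)

def conditional_link(rel1, triple2):
    """Mechanisms (2)/(3): resolve the second triple first, then feed each
    of its answers into the first triple as its object."""
    return {s for obj in resolve(*triple2) for s in resolve(rel1, obj)}

print(or_link(("has-interest", "ontologies"),
              ("has-interest", "knowledge-reuse")))
print(conditional_link("works-in", ("sponsored-by", "epsrc")))
```

The conditional link first identifies the project sponsored by epsrc and only then asks who works in it, mirroring the resolution order described for mechanisms (2) and (3).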

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case, it becomes necessary to learn the new terms employed by the user and disambiguate them, in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses, analogous9 to a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case, the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is normally represented just by its lower and upper bounds (the maximally general and maximally specific hypothesis sets), instead of by listing all consistent hypotheses. In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future also be applicable to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Fig. 17. The learning mechanism architecture.

9 Although making use of a concept borrowed from machine learning studies, we want to stress that the learning mechanism is not a machine learning technique in the classical sense.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method, we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered and it is learned, together with the relation and the other information, in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks whether they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
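Under a hypothetical class hierarchy, the two opposite movements can be sketched as follows: a GetContext-style climb from the arguments to the highest classes that still handle the relation, and a CheckContext-style test that new arguments fall under a stored context (all class and relation names below are invented for illustration):

```python
# Hypothetical hierarchy (child -> parent) and relation declarations.
PARENT = {"phd-student": "researcher", "researcher": "people",
          "kmi": "organization-unit"}
HANDLES = {("people", "organization-unit"): {"has-affiliation-to-unit"}}

def ancestors(node):
    """The node itself plus all its superclasses, bottom-up."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def get_context(arg1, arg2, relation):
    """GetContext-style: the highest pair of classes above the arguments
    that still handles the relation."""
    for c1 in reversed(ancestors(arg1)):        # most generic first
        for c2 in reversed(ancestors(arg2)):
            if relation in HANDLES.get((c1, c2), set()):
                return (c1, c2)
    return None

def check_context(stored, arg1, arg2):
    """CheckContext-style: do the new arguments fall under the stored pair?"""
    return stored[0] in ancestors(arg1) and stored[1] in ancestors(arg2)

ctx = get_context("researcher", "kmi", "has-affiliation-to-unit")
print(ctx)
print(check_context(ctx, "phd-student", "kmi"))
```

Learning climbs from ("researcher", "kmi") up to ("people", "organization-unit"), and a later question about a PhD student then matches the stored context without further user interaction.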

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem, therefore, is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology, these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments: "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.
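The record just listed can be sketched as a small learned-lexicon store; the field names and values below are illustrative, and real AquaLog keeps this table in a database and applies the context generalization described in Section 6.3:

```python
# Illustrative learned-lexicon table; real AquaLog stores this in a database.
LEXICON = []

def learn(word, mapping, arg1, arg2, ontology, user):
    """Record a user-confirmed mapping together with its context."""
    LEXICON.append({"word": word, "mapping": mapping,
                    "context": (arg1, arg2), "ontology": ontology,
                    "user": user})

def match(word, arg1, arg2, ontology):
    """Return the mappings recorded for this word under the same context.
    The user constraint is deliberately bypassed here, as in a generic
    user session."""
    return [r["mapping"] for r in LEXICON
            if r["word"] == word and r["context"] == (arg1, arg2)
            and r["ontology"] == ontology]

learn("collaborates", "has-affiliation-to-unit",
      "person", "knowledge-media-institute", "akt-ontology", "user-a")
print(match("collaborates", "person", "knowledge-media-institute",
            "akt-ontology"))
```

Because the context is part of the key, a second mapping of "collaborates" learned under different arguments (say, "organization") would coexist without clashing with this one.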

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple arguments and which, at the same time, can handle the same relation. In our case, we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument, by definition), and goes through all the connected superclasses of the other argument, checking whether they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract, but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in our case, we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is first recognized and then worked out differently, depending on its specific characteristics.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing, since it can be mapped to both a person and an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case, the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way, the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate theugmentation of the user-profile In fact through a specific aux-liary ontology that describes a series of userrsquos profiles it woulde possible to infer connections between the type of mappingnd the type of user Namely it would be possible to correlate aarticular jargon to a set of users Moreover through a reasoningervice this correlation could become dynamic being contin-ally extended or diminished consistently with the relationsetween userrsquos choices and userrsquos information For example

f the LM detects that a number of registered users all char-cterized as PhD students keep employing the same jargon itould extend the same mappings to all the other registered PhDtudents

o(A

Fig 20 Architecture to suppo

m matching example
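The user-scoped access described above can be illustrated with a minimal sketch (the field names and sample mappings are invented for illustration): a logged-in user's lookups are constrained by the user field, while the generic user has no such constraint and therefore retrieves every recorded meaning, including undesired ones.

```python
# Minimal sketch (hypothetical field names) of user-scoped lookup: a logged-in
# user only sees mappings tied to their profile, while the generic user has no
# constraint on the user field and retrieves every recorded meaning.
learned_mappings = [
    {"user_term": "boss", "onto_entity": "project-leader",
     "user_id": "phd-student-1"},
    {"user_term": "boss", "onto_entity": "head-of-department",
     "user_id": "admin-1"},
]

def lookup(term, user_id=None):
    """Return learned ontology entities for a term, optionally per user."""
    return [m["onto_entity"] for m in learned_mappings
            if m["user_term"] == term
            and (user_id is None or m["user_id"] == user_id)]

lookup("boss", user_id="admin-1")  # one profile-specific mapping
lookup("boss")                     # generic user: all recorded meanings
```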

7. Evaluation

AquaLog allows users who have a question in mind, and who know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn a specialized vocabulary or to know the structure of the knowledge base. But, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions; therefore, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).


We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., there is a lack of an appropriate ontology relation or term to map with, or the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.
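The counting rule stated in the note above (a conceptual failure is only computed when no other kind of failure occurred) can be sketched as a small helper. This is an illustrative reconstruction of the rule as described, not code from the evaluation:

```python
# Sketch of the counting rule described above: a conceptual failure is only
# recorded for a query when no other kind of failure occurred on that query.
FAILURES = {"linguistic", "data_model", "rss", "conceptual", "service"}

def recorded_failures(observed):
    """Return the failure categories counted for one query."""
    observed = set(observed) & FAILURES
    if "conceptual" in observed and observed != {"conceptual"}:
        observed.discard("conceptual")  # masked by the other failures
    return observed

recorded_failures({"linguistic", "conceptual"})  # conceptual is not counted
recorded_failures({"conceptual"})                # counted when it stands alone
```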

The improvement achieved through the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology: a valid answer is one that is considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because in that case the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user's help to disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether to use the LM or not, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing and, as a result, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities; however, it is quite difficult to devise a sublanguage which is sufficiently expressive, therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information, and we therefore asked them not to pose questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure; therefore, only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage.

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (where it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries                 (40 of 69)   58%
    (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Successful queries without conceptual failure      (33 of 69)   47.82%
  Queries with only a conceptual failure             (7 of 69)    10.14%

AquaLog failures (the same query can fail for more than one reason)
  Total number of failures                           (29 of 69)   42%
  RSS failure                                        (8 of 69)    11.59%
  Service failure                                    (10 of 69)   14.49%
  Data model failure                                 (0 of 69)    0%
  Linguistic failure                                 (16 of 69)   23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated        (19 of 29)   65.5% of failures
  With conceptual failure                            (7 of 19)    36.84%
  No conceptual failure but ontology not populated   (6 of 19)    31.57%

Bold values represent the final results (%).

Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, only 33.33% (23 of 69) will produce a result (a non-empty answer) for the user. These limitations on coverage will not be obvious to the user and, as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations of one ontology can be overcome by another.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, e.g., tokens not recognized as a verb by GATE, such as "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".

• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.

• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by meaning; e.g., for "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, and in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and an "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) failure to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".

• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab: people could be ranked according to citation impact, formal status in the department, etc. Therefore we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures.
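The second RSS failure mode above (relation labels matched by string similarity rather than by meaning) is easy to reproduce. In the sketch below, Python's `difflib` stands in for the string metric algorithms AquaLog actually uses, and the candidate relation labels are illustrative: "contact" is lexically closest to "employment-contract-type", even when a relation such as "has-email" is what the questioner intends.

```python
from difflib import SequenceMatcher

# Why purely syntactic label matching misfires: difflib stands in here for the
# string metrics AquaLog uses, and the candidate labels are illustrative.
def best_relation(word, candidates):
    """Pick the candidate relation label most lexically similar to `word`."""
    return max(candidates, key=lambda c: SequenceMatcher(None, word, c).ratio())

candidates = ["employment-contract-type", "has-email", "works-for"]
best_relation("contact", candidates)  # picks "employment-contract-type"
```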


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total user iterations          (43 iterations)     (40 iterations)                     (16 iterations)

Three kinds of services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can easily be reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can, in the best of cases, improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%) even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.

Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small. Logically, the performance of the system for the first iteration of the LM improves as the number of queries increases.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correct translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection.

12 http://plainmoor.open.ac.uk:3000/webonto



In the second instance, a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack" or "supper", or "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface, in contrast to very expressive query languages that require lots of computational resources to process (such as SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in local or remote servers.

13 http://www.thewinesociety.com


The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))
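The effect of the added instances can be illustrated with a toy triple store (a sketch, not AquaLog's implementation): once an instance such as new-course7 exists, a two-triple path connects a food to a wine, which satisfies the "two relations or less" requirement stated above.

```python
# Toy triple store (a sketch, not AquaLog's code) showing how the instances of
# Table 3 create the short direct path AquaLog needs: the query terms "pizza"
# and a wine are linked through new-course7 in exactly two triples.
triples = [
    ("new-course7", "hasfood", "pizza"),
    ("new-course7", "hasdrink", "mariettazinfadel"),
    ("new-course8", "hasfood", "fowl"),
    ("new-course8", "hasdrink", "riesling"),
]

def two_triple_answers(term, answer_relation):
    """Find values linked to `term` through one mediating instance."""
    courses = {s for s, p, o in triples if o == term}
    return [o for s, p, o in triples if s in courses and p == answer_relation]

two_triple_answers("pizza", "hasdrink")  # → ['mariettazinfadel']
```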



Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, it does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not give an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (where it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries                 (47 of 68)   69.11%
  Successful queries without conceptual failure      (12 of 68)   17.64%
  Queries with only a conceptual failure             (35 of 68)   51.47%

AquaLog failures (the same query can fail for more than one reason)
  Total number of failures                           (21 of 68)   30.88%
  RSS failure                                        (15 of 68)   22%
  Service failure                                    (0 of 68)    0%
  Data model failure                                 (0 of 68)    0%
  Linguistic failure                                 (9 of 68)    13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated        (14 of 21)   66.66% of failures
  With conceptual failure                            (9 of 14)    64.28%

Bold values represent the final results (%).

7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misinterpreting "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.

• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.

• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations will have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".

• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can easily be reformulated by the user into a question for which the RSS can generate a correct ontology mapping.


That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". Also, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, to the lack of appropriate reasoning services or of sufficient mapping mechanisms to interpret a query, and to the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...); all of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.
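The coarse question typing described above might be sketched as follows. The patterns are illustrative only (AquaLog's real classification is done with JAPE grammars over GATE annotations, not these regexes):

```python
import re

# Illustrative sketch (not AquaLog's JAPE grammars) of coarse question typing:
# queries are first classified as wh-questions, commands, or
# affirmative/negative questions before any further processing.
WH_WORDS = ("who", "what", "which", "where", "when", "how many")

def question_type(query):
    q = query.lower().strip()
    if q.startswith(WH_WORDS):
        return "wh"
    if re.match(r"(name|list|give|tell)\b", q):
        return "command"
    return "affirmative/negative"

question_type("Who works on the AKT project?")  # 'wh'
question_type("Name all the projects in KMi")   # 'command'
question_type("Is Enrico a professor?")         # 'affirmative/negative'
```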

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words such as "not" and "never" (or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]?".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions

which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is in the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.
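The rewriting step above can be sketched roughly as follows. The triple class, the mediating term and the relation names ("has author", "in") are illustrative assumptions, not AquaLog's actual data structures:

```python
from typing import List, NamedTuple

class Triple(NamedTuple):
    term1: str
    relation: str
    term2: str

# Linguistic triples for "which researchers have publications in KMi
# related to social aspects?"
linguistic = [
    Triple("researchers", "have publications", "KMi"),
    Triple("which is/are", "related", "social aspects"),
]

def expand(triples: List[Triple]) -> List[Triple]:
    """Hypothetical RSS step: 'have publications' corresponds to the ontology
    relation 'has author' on 'publications', so a mediating 'publications'
    term is introduced and three triples replace the original two."""
    out: List[Triple] = []
    for t in triples:
        if t.relation == "have publications":
            out.append(Triple(t.term1, "has author", "publications"))
            out.append(Triple("publications", "in", t.term2))
        else:
            out.append(Triple("publications", t.relation, t.term2))
    return out

print(len(expand(linguistic)))  # 3
```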

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant" and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
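A minimal sketch of this kind of bridging, searching for a chain of direct relations through a mediating concept, is shown below. The toy schema and its relation names are invented for the "who collaborates with IBM?" example:

```python
from collections import deque

# Toy ontology schema: direct relations between classes (names hypothetical).
schema = {
    "person": {"has-grant": "grant"},
    "grant": {"funded-by": "organization"},
    "organization": {},
}

def bridge(src, dst):
    """Breadth-first search for a chain of direct relations connecting two
    classes; returns the mediating path or None if no chain exists."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        cls, path = queue.popleft()
        if cls == dst:
            return path
        for rel, target in schema.get(cls, {}).items():
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(cls, rel, target)]))
    return None

print(bridge("person", "organization"))
# [('person', 'has-grant', 'grant'), ('grant', 'funded-by', 'organization')]
```

Expanding the query with the mediating term ("grant") then amounts to turning each hop of the returned path into an extra triple.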

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered. For example, one of the terms in the query may be represented not as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is

the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
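The agreement heuristic discussed above could be sketched as follows. The rule table and the candidate encoding are invented for illustration (and, as noted, AquaLog does not actually implement this check):

```python
def attach_clause(verb, candidates):
    """Agreement heuristic sketch: a third-person-singular verb form can only
    attach its clause to a singular antecedent.
    candidates: list of (term, is_plural) pairs, in sentence order.
    Returns the agreeing terms, or all candidates if agreement cannot decide
    (i.e. the question stays ambiguous and the user must be asked)."""
    singular = verb in {"has", "is", "does"} or (
        verb.endswith("s") and not verb.endswith("ss"))
    matches = [t for t, plural in candidates if plural != singular]
    return matches or [t for t, _ in candidates]

# "which academics work with Peter who has interests in the semantic web?"
print(attach_clause("has", [("academics", True), ("Peter", False)]))
# -> ['Peter']: only the singular antecedent agrees with "has"

# "which academic works with Peter who has interests ...?" stays ambiguous:
print(attach_clause("has", [("academic", False), ("Peter", False)]))
# -> ['academic', 'Peter']
```

In the second call the heuristic cannot decide, mirroring the paper's observation that user interaction would still be needed for the truly ambiguous variant.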


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to resolve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
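A community-scoped jargon store of the kind suggested here could be as simple as a table keyed by (community, phrase); the community names and mappings below are hypothetical:

```python
# Learned jargon mappings, keyed by user community so the same phrase can
# carry a different interpretation for different groups (entries invented).
learned = {
    ("phd-student", "cool projects"):
        "projects in research areas with many publications",
    ("professor", "cool projects"):
        "projects with funding opportunities",
}

def interpret(community, phrase):
    """Return the stored interpretation for this community, or None so the
    system can fall back to asking the user (and then store the answer)."""
    return learned.get((community, phrase.lower()))

print(interpret("professor", "cool projects"))
# 'projects with funding opportunities'
print(interpret("phd-student", "hot topics"))
# None -> ask the user, then add a new entry
```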

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are resource selection, heterogeneous vocabularies and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can automatically find: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA. (ii) Low-level IE is often a necessary component in handling many types of questions. (iii) A robust natural language shallow parser provides a structural basis for handling questions. (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open

domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontologyriples mapping) not only by looking at the labels but also byooking at the lexically related words and the ontology (con-ext of the triple terms and relationships) and in the last resortmbiguity is solved by asking the user
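The fallback chain just described (exact labels, then approximate matches and lexically related words, then the user) can be sketched roughly as follows; the ontology labels, synonym table and cutoff below are hypothetical stand-ins, not AquaLog's actual data or code:

```python
from difflib import get_close_matches

# Toy ontology labels and a hand-written synonym list; in AquaLog the
# lexically related words come from WordNet and the fuzzy matching
# from dedicated string-metric algorithms.
LABELS = ["researcher", "project", "publication", "semantic-web-area"]
SYNONYMS = {"paper": "publication", "scientist": "researcher"}

def map_term(term, ask_user=None):
    """Resolve a query term to an ontology entity: exact label first,
    then an approximate string match, then a lexically related word,
    and as a last resort by asking the user to disambiguate."""
    t = term.lower()
    if t in LABELS:                      # 1. exact label match
        return t
    close = get_close_matches(t, LABELS, n=2, cutoff=0.8)
    if len(close) == 1:                  # 2. unambiguous fuzzy match
        return close[0]
    if t in SYNONYMS:                    # 3. lexically related word
        return SYNONYMS[t]
    if close and ask_user:               # 4. still ambiguous: consult the user
        return ask_user(term, close)
    return None

print(map_term("researchers"))  # -> "researcher" (fuzzy match)
print(map_term("paper"))        # -> "publication" (synonym)
```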

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web, but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somehow similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START, the property is "languages" between the Object "Guernsey" and the Value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.
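To illustrate this contrast, here is a minimal sketch of a query triple whose relation may be implicit, with the triple category (rather than the discourse entity) selecting the downstream processing; the category labels and handler strategies are invented for the example:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class QueryTriple:
    """Intermediate representation: a relation between two terms,
    where the relation may be left implicit (None)."""
    subject: str
    relation: Optional[str]
    obj: str

def categorize(t: QueryTriple) -> str:
    """Toy rule: the category of the triple, not the discourse entity,
    selects the later processing (labels are illustrative only)."""
    return "unknown-relation" if t.relation is None else "basic"

# Each category drives a different processing strategy downstream.
HANDLERS = {
    "basic": lambda t: f"match relation '{t.relation}' in the ontology",
    "unknown-relation": lambda t: "search all relations linking the two terms",
}

t = QueryTriple("projects", None, "semantic web")
print(HANDLERS[categorize(t)](t))  # -> search all relations linking the two terms
```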

PiQASso [5] uses a "coarse-grained question taxonomy", consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
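Rules of the PiQASso kind described above can be approximated by a simple lookup; the tables below cover only the examples cited in the text and are in no way the system's actual rule set:

```python
# Answer-type rules in the style described for PiQASso: the expected
# answer type comes from the wh-word, and for "how" questions from the
# adjective that follows it. Tables are illustrative, not exhaustive.
HOW_ADJ = {"many": "quantity", "much": "quantity",
           "long": "time", "old": "time"}
WH_WORD = {"who": "person/organization", "when": "time", "where": "location"}

def answer_type(question: str) -> str:
    """Return a coarse expected answer type for a question."""
    words = question.lower().strip("?").split()
    if not words:
        return "unknown"
    first = words[0]
    if first == "how" and len(words) > 1:
        return HOW_ADJ.get(words[1], "unknown")
    return WH_WORD.get(first, "unknown")

print(answer_type("Who wrote this paper?"))   # -> person/organization
print(answer_type("How many triples?"))       # -> quantity
print(answer_type("How old is the system?"))  # -> time
```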

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple^a
Who are the researchers in the semantic web research area | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉
Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉




Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple
What is the homepage of Peter who has interest in semantic web | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

^a Using the KMi ontology.
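As a reading aid for the table above, the first wh-generic row can be traced programmatically; the mapping dictionaries below are hypothetical stand-ins that merely mimic the KMi ontology labels shown in the table, not AquaLog's actual data structures:

```python
# Sketch of how one Appendix A row can be read: each element of the
# linguistic triple is mapped to an entry point in the ontology, and the
# relation is then looked up between the mapped subject and object.
TERM_MAP = {"researchers": "researcher",
            "semantic web research area": "semantic-web-area"}
RELATION_MAP = {("researcher", "semantic-web-area"): "has-research-interest"}

def to_ontology_triple(linguistic):
    """Convert a wh-generic linguistic triple into an ontology triple."""
    _focus, term, value = linguistic   # e.g. ("person/organization", ...)
    subj = TERM_MAP[term]
    obj = TERM_MAP[value]
    rel = RELATION_MAP[(subj, obj)]
    return (subj, rel, obj)

row = ("person/organization", "researchers", "semantic web research area")
print(to_ontology_triple(row))
# -> ('researcher', 'has-research-interest', 'semantic-web-area')
```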

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text REtrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text REtrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text REtrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text REtrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

• AquaLog: An ontology-driven question answering system for organizational semantic intranets
  • Introduction
  • AquaLog in action: illustrative example
  • AquaLog architecture
    • AquaLog triple approach
    • Linguistic component: from questions to query triples
      • Classification of questions
      • Linguistic procedure for basic queries
      • Linguistic procedure for basic queries with clauses (three-terms queries)
      • Linguistic procedure for combination of queries
      • The use of JAPE grammars in AquaLog: illustrative example
    • Relation similarity service: from triples to answers
      • RSS procedure for basic queries
      • RSS procedure for basic queries with clauses (three-term queries)
      • RSS procedure for combination of queries
      • Class similarity service
      • Answer engine
    • Learning mechanism for relations
      • Architecture
      • Context definition
      • Context generalization
      • User profiles and communities
  • Evaluation
    • Correctness of an answer
    • Query answering ability
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Reformulation of questions
      • Performance
    • Portability across ontologies
      • AquaLog assumptions over the ontology
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Similarity failures and reformulation of questions
  • Discussion: limitations and future lines
    • Question types
    • Question complexity
    • Linguistic ambiguities
    • Reasoning mechanisms and services
    • New research lines
  • Related work
    • Closed-domain natural language interfaces
    • Open-domain QA systems
    • Open-domain QA systems using triple representations
    • Ontologies in question answering
    • Conclusions of existing approaches
  • Conclusions
  • Acknowledgements
  • Examples of NL queries and equivalent triples
  • References


4.1.3. Linguistic procedure for combination of queries

Nevertheless, a query can be a composition of two explicit relationships between terms, or a query can be a composition of two basic queries. In this case, the intermediate representation usually consists of two triples, one triple per relationship. For instance, in "are there any planet news written by researchers working in akt?", the two linguistic triples will be 〈planet news, written, researchers〉 and 〈which are, working, akt〉.

Not only are these complex queries classified depending on the kind of triples they generate, but the resultant triple categories are also the driving force to generate an answer by combining the triples in an appropriate way.

There are three ways in which queries can be combined. Firstly, queries can be combined by using an "and" or "or" conjunction operator, as in "which projects are funded by epsrc and are about semantic web?". This query will generate two Query-Triples, 〈funded (projects, epsrc)〉 and 〈about (projects, semantic web)〉, and the subsequent answer will be a combination of both lists obtained after resolving each triple.

Secondly, a query may be conditioned to a second query, as in "which researchers wrote publications related to social aspects?", where the second clause modifies one of the previous terms. In this example, and at this stage, the ambiguity cannot be solved by linguistic procedures; therefore the term to be modified by the second clause remains uncertain.

Finally, we can have combinations of two basic patterns, e.g. "what is the web address of Peter who works for akt?" or "which are the projects led by Enrico which are related to multimedia technologies?" (see Fig. 7).
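The decomposition and combination just described can be sketched in a few lines. The triple layout and the toy answer lists below are illustrative assumptions, not AquaLog's actual data structures:

```python
from typing import NamedTuple

class QueryTriple(NamedTuple):
    """Intermediate representation: <subject, relation, object>."""
    subject: str
    relation: str
    obj: str

# "which projects are funded by epsrc and are about semantic web?"
# decomposes into one Query-Triple per relationship:
t1 = QueryTriple("projects", "funded", "epsrc")
t2 = QueryTriple("projects", "about", "semantic web")

def combine(answers1, answers2, operator):
    """Combine the answer lists of two triples under an and/or conjunction."""
    if operator == "and":
        return sorted(set(answers1) & set(answers2))
    return sorted(set(answers1) | set(answers2))

# Hypothetical answer lists obtained after resolving each triple:
funded_by_epsrc = ["akt", "dotkom", "magpie"]
about_semantic_web = ["akt", "magpie", "irs"]

print(combine(funded_by_epsrc, about_semantic_web, "and"))  # ['akt', 'magpie']
```

The "or" operator would instead return the union of both lists.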

Once the intermediate representation is created, prepositions and auxiliary words such as "in", "about", "of", "at", "for", "by", "the", "does", "do", "did" are removed from the Query-Triples, because in the current AquaLog implementation their semantics are not exploited to provide any additional information to help in the mapping between Query-Triples and ontology-compliant triples (Onto-Triples).

The resultant Query-Triple is the input for the Relation Similarity Service, which processes the combinations of queries as explained in Section 5.3.

4.2. The use of JAPE grammars in AquaLog: illustrative example

Consider the basic query in Fig. 5, "what are the research areas covered by the akt project?", as an illustrative example of the use of JAPE grammars to classify the query and obtain the correct annotations to create a valid triple. Two JAPE grammar files were written for AquaLog. The first one in the GATE execution list annotates the nouns "researchers" and "akt projects", the relations "are" and "covered by", and query terms such as "what" (see Fig. 8). For instance, consider the rule RuleNP in Fig. 8. The pattern on the left hand side (i.e., before "→") identifies noun phrases. Noun phrases are word sequences that start with zero or more determiners (identified by the (DET)* part of the pattern). Determiners can be followed by zero or more adverbs and adjectives, nouns or coordinating conjunctions, in any order (identified by the (((RB)*ADJ)|NOUN|CC)* part of the pattern). A noun phrase mandatorily finishes with a noun (NOUN). RB, ADJ, NOUN and CC are macros and act as placeholders for other rules, as are, for example, the macro LIST used by the rule RuleQU1 or the macro VERB used by rule RuleREL1 in Fig. 8. The macro VERB contains the disjunction of seven patterns, which means that the macro will fire if a word satisfies any of these seven patterns; e.g., the last pattern identifies words assigned to the VB POS tags. POS tags are assigned in the "category" feature of the "Token" annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side; e.g., rel.REL = {rule = "REL1"} is referenced by the macro RELATION in the second grammar (Fig. 9).

Fig. 8. Set of JAPE rules used in the query example to generate annotations.

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9; the pattern fired for that query, "QUWhich IS ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QUWhich, IS ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.

Fig. 9. Set of JAPE rules used in the query example to classify the queries.
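The flavour of such a pattern can be approximated outside GATE as a regular expression over POS-tag sequences. The sketch below is not AquaLog's JAPE source; the tag names and the tiny matcher are illustrative assumptions:

```python
import re

# A JAPE-style noun-phrase pattern approximated as a regex over POS tags:
# zero or more determiners, then adverbs/adjectives, nouns or coordinating
# conjunctions in any order, ending mandatorily with a noun.
NP_PATTERN = re.compile(r"(?:DET )*(?:(?:RB )*ADJ |NOUN |CC )*NOUN")

def is_noun_phrase(tagged_tokens):
    """tagged_tokens: list of (word, pos) pairs, e.g. [('the', 'DET'), ...]."""
    tag_string = " ".join(pos for _, pos in tagged_tokens)
    return NP_PATTERN.fullmatch(tag_string) is not None

print(is_noun_phrase([("the", "DET"), ("research", "NOUN"), ("areas", "NOUN")]))  # True
print(is_noun_phrase([("covered", "VB"), ("by", "IN")]))                          # False
```

Extending the covered language fragment then amounts to editing the regular expression, which is the portability argument made above.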

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of the Learning Mechanism explained in Section 6.

An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user's term. Moreover, relation names can be considered "second class citizens". The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: concept labels often capture the meaning of the semantic entity, while a relation's meaning is given by the type of its domain and its range rather than by its name (typically vaguely defined, e.g. "related to"), with the exception of the cases in which relations are represented in some ontologies as a concept (e.g., has-author modeled as a concept Author in a given ontology). Therefore, the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facilities for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. in the query "is Enrico working in ibm?", where "Enrico" is mapped into "Enrico-motta" in the KB but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.
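Mapping a proper name onto a KB instance can be sketched as a thresholded similarity search. AquaLog mainly uses Jaro-based metrics (Section 5.4); the normalized edit distance and the toy instance names below are stand-ins for illustration:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(query: str, instance: str) -> float:
    """Best normalized similarity between the query and any token of the
    instance name (instance names assumed hyphen-separated, 'Enrico-motta')."""
    scores = []
    for token in instance.lower().split("-"):
        longest = max(len(query), len(token))
        scores.append(1.0 - edit_distance(query.lower(), token) / longest)
    return max(scores)

def map_to_instance(name, kb_instances, threshold=0.8):
    """Return the best-matching instance, or None if nothing clears the bar."""
    best = max(kb_instances, key=lambda i: name_similarity(name, i))
    return best if name_similarity(name, best) >= threshold else None

kb = ["Enrico-motta", "John-domingue", "Vanessa-lopez"]
print(map_to_instance("Enrico", kb))    # Enrico-motta
print(map_to_instance("Thompson", kb))  # None -> question not accepted
```

The None result corresponds to the "Thompson" case above, where the partial solution for affirmative/negative questions takes over.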

Moreover, since every item in the Onto-Triple is an entry point in the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction: intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let us consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type one (a wh-query represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that "research areas" is actually a "research area" in the target KB and that "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym for "covered" is "addresses", and it therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, consider a case in which we are talking about a generic term "person": the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further

discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, eventual aliases, or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by these methods, then the user is asked to choose from the current list of candidates.
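The matching cascade just described can be sketched as follows; the synonym table is a toy stand-in for the WordNet lookup, and the candidate labels come from the example above:

```python
# Try the relation name itself, then its synonyms, against the candidate
# ontology relations; if nothing matches, fall back to asking the user.
SYNONYMS = {"covered": {"addresses", "covers"}}  # toy stand-in for WordNet

def choose_relation(query_relation, candidates):
    terms = {query_relation} | SYNONYMS.get(query_relation, set())
    for candidate in candidates:
        tokens = set(candidate.lower().split("-"))
        if terms & tokens:   # the name or a synonym appears in the label
            return candidate
    return None              # caller falls back to user feedback

candidates = ["addresses-generic-area-of-interest", "uses-resource"]
print(choose_relation("covered", candidates))
# addresses-generic-area-of-interest
```

A None result corresponds to the case where the user is asked to choose from the candidate list.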

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a

"person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one formed by the binary relation "secretary" between the terms "person" and "research institute" to a generic one in which the relation is unknown or implicit between the terms "secretary", which is more specific than "person", and "research institute".

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore, the translation process from lexical to ontological triples can go ahead without misinterpretations.

Fig. 12. Example of AquaLog in action for relations formed by a concept.
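The re-classification step for the "secretary" example can be sketched as follows; the ontology fragment and the function shape are illustrative assumptions:

```python
# When the term in the relation slot of a Query-Triple turns out to be a
# class in the ontology (a subclass of the first term), the triple is
# re-classified: the class becomes the subject and the relation is left
# implicit (unknown). Toy ontology fragment:
SUBCLASS_OF = {"secretary": "person", "academic": "person"}

def reclassify(triple, ontology_classes):
    subject, relation, obj = triple
    if relation in ontology_classes and SUBCLASS_OF.get(relation) == subject:
        # the relation term is really a concept: relation becomes implicit
        return (relation, None, obj)
    return triple

triple = ("person", "secretary", "kmi")
print(reclassify(triple, {"person", "secretary", "research-institute"}))
# ('secretary', None, 'kmi')
```

The implicit-relation triple is then resolved against the ontology, e.g. via 〈secretary, works-in-unit, kmi〉.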


Note that some additional lexical features which help in the translation, or which put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and of the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the other examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of the list of projects related to the semantic web and the list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, it is first necessary to get the list of researchers related to akt and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by epsrc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".
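The attachment heuristic can be sketched in a few lines; the term lists and class flags below are toy data:

```python
# Attach the trailing modifier to the closest (rightmost) preceding term
# that maps to a class, i.e. a non-ground term, in the ontology.
def attach_modifier(terms, is_class):
    """terms: query terms in order of appearance before the modifier.
    is_class: ontology lookup, True when the term maps to a class."""
    for term in reversed(terms):
        if is_class(term):
            return term
    return None

classes = {"projects", "planet news stories", "researchers"}
is_class = lambda t: t in classes

# "are there any projects on the semantic web sponsored by epsrc?"
print(attach_modifier(["projects", "semantic web"], is_class))            # projects
# "which planet news stories are written by researchers in akt?"
print(attach_modifier(["planet news stories", "researchers"], is_class))  # researchers
```

The two calls reproduce the Fig. 13 and Fig. 14 attachments discussed above.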

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of query, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let us consider the query "who are the researchers in akt who have an interest in knowledge reuse?". By checking the ontology, or if needed by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in such a way that it can determine that "akt" is a "project" and therefore not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation term "researchers" is mapped into a concept in the ontology which is a subclass of "person"; therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by the use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7

In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publications", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes one of identifying whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc". Consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
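A minimal sketch of this decision rule, under the simplifying assumption that the clause's verb and the sentence's main verb have already been identified:

```python
# If the ambiguous clause's verb is itself the main (lexical) verb of the
# sentence, the clause modifies the query subject; otherwise the clause
# falls inside the main verb's predicate and modifies the second term.
def attach_clause(main_verb, clause_verb, subject, second_term):
    return subject if clause_verb == main_verb else second_term

# (1) "which academics work in the akt project sponsored by epsrc?"
print(attach_clause("work", "sponsored", "academics", "akt project"))
# akt project
# (2) "which academics working in the akt project are sponsored by epsrc?"
print(attach_clause("are sponsored", "are sponsored", "academics", "akt project"))
# academics
```

In a real system both verbs would come from the parse; here they are passed in directly for illustration.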

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds; thresholds are set up depending on whether the task is looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and of open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g. "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its coverage of nominal compounds (e.g. WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually, or it can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement for obtaining the ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.

8 Naming conventions vary depending on the KR used to represent the KB and may even change between ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
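The candidate-discovery step can be sketched as follows; the relation table is a toy stand-in for a real ontology:

```python
# Given the ontology mapping of the *other* term of the triple, collect
# every class that a valid ontology relationship connects to it; these are
# the candidate classes that could complete the triple, offered to the user.
RELATIONS = [
    ("researcher", "works-on", "project"),
    ("organization", "funds", "project"),
    ("publication", "produced-in", "project"),
]

def candidate_classes(mapped_term):
    candidates = set()
    for domain, _, range_ in RELATIONS:
        if domain == mapped_term:
            candidates.add(range_)
        if range_ == mapped_term:
            candidates.add(domain)
    return sorted(candidates)

# If one term of the triple maps to "project", the unmapped term must be one
# of the classes related to "project"; the user picks the right one.
print(candidate_classes("project"))
# ['organization', 'publication', 'researcher']
```

Once the user confirms a candidate, the learning mechanism can record the mapping for future occasions, as described above.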

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed, and it contains the methods which take the Onto-Triple as input and infer the required answer to the user's query. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect) but also how triples can be linked with each other. AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?", the result will be a fusion of the instances of people who have an interest in ontologies and of the people who have an interest in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by epsrc?", the second triple, 〈akt project, sponsored, epsrc〉, must be resolved and the instance representing the "akt project sponsored by epsrc" identified to get the list of academics

required for the first triple, 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?", the second triple, 〈researchers, working, semantic web〉, must be resolved and the list of researchers obtained prior to generating an answer for the first triple, 〈 , homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.
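The three mechanisms can be sketched as follows; the knowledge base, triples and resolver below are toy stand-ins:

```python
# And/or linking merges the answer lists of two triples; conditional linking
# resolves the second triple first and feeds its result into the first one.
KB = {
    ("person", "interest", "ontologies"): ["enrico", "vanessa"],
    ("person", "interest", "knowledge reuse"): ["victoria", "enrico"],
    ("researchers", "working", "semantic web"): ["michele"],
    ("homepage", "michele"): ["http://example.org/michele"],
}

def resolve(triple):
    return KB.get(triple, [])

def and_or_link(t1, t2, operator="or"):
    a, b = set(resolve(t1)), set(resolve(t2))
    return sorted(a & b) if operator == "and" else sorted(a | b)

def conditional_link(first_partial, second):
    # resolve the second triple, then complete and resolve the first one
    return [answer
            for binding in resolve(second)
            for answer in resolve(first_partial + (binding,))]

print(and_or_link(("person", "interest", "ontologies"),
                  ("person", "interest", "knowledge reuse")))
# ['enrico', 'vanessa', 'victoria']
print(conditional_link(("homepage",), ("researchers", "working", "semantic web")))
# ['http://example.org/michele']
```

The conditional link to a term works analogously, except that the second triple's resolution pins down a single instance rather than a list to iterate over.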

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally express with a number of different formulations: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to a version space.
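The learn/match cycle can be sketched as follows; the context fields follow the description above (question arguments, ontology name, user information), while the class layout and data are illustrative assumptions:

```python
# A learned jargon term is stored with the ontology relation it maps to and
# the context under which it was learned; matching only reuses the mapping
# when the new context is consistent with a stored one.
class LearningMechanism:
    def __init__(self):
        self.db = {}   # jargon term -> list of (relation, context) records

    def learn(self, term, relation, context):
        self.db.setdefault(term, []).append((relation, context))

    def match(self, term, context, ignore_user=True):
        for relation, stored in self.db.get(term, []):
            consistent = (stored["ontology"] == context["ontology"]
                          and stored["arguments"] == context["arguments"]
                          and (ignore_user or stored["user"] == context["user"]))
            if consistent:
                return relation
        return None   # matching failed: ask the user, then learn

lm = LearningMechanism()
ctx = {"arguments": ("person", "organization"), "ontology": "kmi", "user": "u1"}
lm.learn("collaborates", "works-for", ctx)

print(lm.match("collaborates", {**ctx, "user": "u2"}))        # works-for
print(lm.match("collaborates", {**ctx, "ontology": "wine"}))  # None
```

The `ignore_user` flag illustrates the point made below about bypassing the user-constraint within a generic user session.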

A version space is an inductive learning technique proposedy Mitchell [42] in order to represent the subset of all hypotheses

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

Fig. 17. The learning mechanism architecture.

normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context-definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation-name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session, or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); if instead the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
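The two traversal directions can be illustrated with a small sketch. The toy taxonomy, class names and method bodies below are a reader's reconstruction of the idea, not AquaLog's actual code:

```python
# Toy taxonomy: each class maps to its direct superclass (None at the root).
taxonomy = {
    "phd-student": "people", "people": None,
    "research-institute": "organization-unit", "organization-unit": None,
}
# Relations known to hold between (subject class, object class) pairs.
relations = {("people", "organization-unit"): {"works-for"}}

def superclasses(cls):
    """Yield cls and all of its ancestors, bottom-up."""
    while cls is not None:
        yield cls
        cls = taxonomy.get(cls)

def get_context(arg1, arg2, relation):
    """Learning direction: climb from the arguments to the highest
    classes that still admit the relation (cf. the GetContext method)."""
    best = None
    for c1 in superclasses(arg1):
        for c2 in superclasses(arg2):
            if relation in relations.get((c1, c2), set()):
                best = (c1, c2)  # later matches are higher in the chain
    return best

def check_context(arg1, arg2, stored):
    """Matching direction: test whether the new arguments fall under the
    stored context classes (cf. the CheckContext method)."""
    s1, s2 = stored
    return s1 in superclasses(arg1) and s2 in superclasses(arg2)
```

With this data, `get_context("phd-student", "research-institute", "works-for")` generalizes to `("people", "organization-unit")`, and `check_context` later accepts any pair of arguments subsumed by those classes.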

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let us consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let us now imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two mappings separated, while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,



these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
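In a relational store, one learned entry might look like the following sketch. The field names are hypothetical; the paper specifies only which pieces of information are kept, not their representation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LearnedMapping:
    """One row of the learned lexicon, as described in the text.
    The new word is the key used to access the lexicon while matching."""
    word: str            # user jargon, e.g. "collaborates"
    ontology_term: str   # the mapped relation, e.g. "has-affiliation-to-unit"
    context_arg1: str    # first argument of the triple, e.g. "person"
    context_arg2: str    # second argument of the triple
    ontology_name: str   # the pluggable ontology this entry applies to
    user: str            # user details, enabling per-user sessions

entry = LearnedMapping(
    word="collaborates",
    ontology_term="has-affiliation-to-unit",
    context_arg1="person",
    context_arg2="knowledge-media-institute",
    ontology_name="akt-kmi-ontology",
    user="user-1",
)
```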

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.
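To see how the generalized contexts keep the two jargon mappings apart, consider this minimal sketch (data layout and the flattened class hierarchy are illustrative):

```python
# Two learned mappings for the same word, disambiguated by their
# generalized contexts (class names follow the example in the text).
learned = [
    {"word": "collaborate", "relation": "has-affiliation-to-unit",
     "context": ("people", "organization-unit")},
    {"word": "collaborate", "relation": "works-in-the-same-project",
     "context": ("organization", "organization-unit")},
]

# Instance/class -> generalized class, flattened here for brevity.
is_a = {
    "person": "people",
    "edinburgh-department-of-informatics": "organization-unit",
    "knowledge-media-institute": "organization-unit",
    "research-lab": "organization",
}

def match(word, arg1, arg2):
    """Return the stored relations whose context covers both arguments."""
    ctx = (is_a.get(arg1, arg1), is_a.get(arg2, arg2))
    return [e["relation"] for e in learned
            if e["word"] == word and e["context"] == ctx]
```

Here `match("collaborate", "person", "edinburgh-department-of-informatics")` reuses the first mapping even though that exact question was never seen, while `match("collaborate", "research-lab", "knowledge-media-institute")` selects the second.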

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User's disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is first recognized and then worked out differently, depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing, since it can be mapped to both a person and an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is instead recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information, and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material, having no constraints on the user field;

Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.
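That envisaged propagation rule could look roughly like this (user names, profiles and the support threshold are entirely hypothetical):

```python
# Sketch: once enough users sharing a profile have independently
# confirmed a jargon mapping, extend it to the rest of that profile.
profiles = {"alice": "phd-student", "bob": "phd-student",
            "carol": "phd-student", "dave": "lecturer"}
mappings = {("alice", "collaborates"): "works-in-the-same-project",
            ("bob", "collaborates"): "works-in-the-same-project"}

def propagate(word, relation, profile, min_support=2):
    """Copy the (word -> relation) mapping to every user of the profile
    if at least min_support of them already learned it themselves."""
    users = [u for u, p in profiles.items() if p == profile]
    support = sum(mappings.get((u, word)) == relation for u in users)
    if support >= min_support:
        for u in users:
            mappings.setdefault((u, word), relation)

propagate("collaborates", "works-in-the-same-project", "phd-student")
# carol now inherits the mapping; dave (a lecturer) does not.
```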

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability





across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

- Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
- Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
- RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
- Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
- Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved through the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, with an empty database. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single, specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal.11 Therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities; however, it is quite difficult to devise a sublanguage which is sufficiently expressive, therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was to provide information about the nature of the possible extensions needed to the ontology and the linguistic components; i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).
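The percentages in Table 1 follow from the raw counts, apparently truncated (not rounded) to two decimal places; as a quick arithmetic check:

```python
import math

def pct(n, total):
    """Percentage truncated to two decimals, matching the table's style."""
    return math.floor(10000 * n / total) / 100

assert pct(40, 69) == 57.97  # successful queries, quoted as ~58%
assert pct(33, 69) == 47.82  # successful without conceptual failure
assert pct(7, 69) == 10.14   # conceptual failure only
assert pct(16, 69) == 23.18  # linguistic failures
assert pct(10, 33) == 30.30  # mapped correctly but not populated
```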

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious for the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

- Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE (e.g., "present results") and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


- Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
- RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of the type "how long" are not implemented in this version; (4) non-recognition of concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
- Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     6.66% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)     (40 iterations)                     (16 iterations)



services have been identified: ranking, similarity/comparative and proportions/percentages, which remain a future line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (a typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude: if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM in the best of cases we can improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user remain quite frequent (35.55%), even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries requiring no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even the first time that query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small. Logically, the performance of the system for the first iteration of the LM improves if there is an increase in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is no direct relation between wines and desserts: there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, and no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin could be used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location, and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto


7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all the frequent synonyms.

The questions used in this evaluation were based on the Wine Society's "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software-evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require a lot of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web, in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data.

13 httpwwwthewinesocietycom

TE

(

Agents on the World Wide Web 5 (2007) 72ndash105 93

ins provide an API (or ontology interface) that allows us touery and access the values of the ontology properties in theocal or remote servers
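The triple-based pipeline described above can be pictured with a small sketch. The `QueryTriple` class and the toy parser below are illustrative assumptions of ours (AquaLog's actual Linguistic Component is built on GATE and JAPE grammars), not the system's code:

```python
# Illustrative sketch, not AquaLog's implementation: a NL query is first
# turned into one or two intermediate Query-Triples, which are later
# mapped onto ontology entities by lightweight lookup rather than by
# expressive-logic reasoning.
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriple:
    subject: str             # linguistic term, e.g. "projects"
    relation: Optional[str]  # may be None when the relation is implicit
    obj: str                 # e.g. "EPSRC"

def parse_query(question: str) -> list:
    # Toy stand-in for the Linguistic Component: it handles only the
    # pattern "which/what X <verb phrase> Y", just to show the triple shape.
    words = question.strip("?").split()
    if words and words[0].lower() in ("which", "what") and len(words) >= 4:
        return [QueryTriple(words[1], " ".join(words[2:-1]), words[-1])]
    return []

triples = parse_query("which projects are funded by EPSRC?")
```

The point is only the shape of the intermediate representation: a subject term, an optional linguistic relation and an object term, which the similarity services later map onto ontology entities.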

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement for obtaining an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
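A rough sketch of this answerability condition, assuming the populated KB is simply a collection of subject/relation/object triples traversable in either direction (the function name and graph layout are ours, not AquaLog's API):

```python
# Sketch of the "short direct path" requirement described above: a query
# is answerable only if the two mapped terms are linked by at most two
# relations in the populated ontology graph. Data is illustrative.
def within_two_relations(kb, a, b):
    """kb: iterable of (subject, relation, object) triples.
    Edges are traversed in both directions, since a query term may
    appear as either subject or object."""
    adj = {}
    for s, r, o in kb:
        adj.setdefault(s, set()).add(o)
        adj.setdefault(o, set()).add(s)
    first = adj.get(a, set())
    if b in first:          # one relation apart
        return True
    # two relations apart, via one intermediate node
    return any(b in adj.get(n, set()) for n in first)

kb = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
]
# "pizza" and "marietta-zinfadel" are two relations apart via new-course7
```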

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog for obtaining answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu.

When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use

restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable

system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))


Fig. 22. OWL example of the wine and food ontology.

The ontology does not present a complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (failure to complete the Onto-Triple is due to an ontology conceptual failure):
  Total number of successful queries                               (47 of 68)  69.11%
  Total number of successful queries without conceptual failure    (12 of 68)  17.64%
  Total number of queries with only conceptual failure             (35 of 68)  51.47%

AquaLog failures:
  Total number of failures (same query can fail for more than one reason)  (21 of 68)  30.88%
  RSS failure                                                      (15 of 68)  22.05%
  Service failure                                                  (0 of 68)   0%
  Data model failure                                               (0 of 68)   0%
  Linguistic failure                                               (9 of 68)   13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated                      (14 of 21)  66.66% of failures
  With conceptual failure                                          (9 of 14)   64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures

The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that AquaLog does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
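The proposed reclassification can be sketched as follows; the schema, relation names and function below are hypothetical stand-ins for the behaviour described above, not the actual RSS:

```python
# Hedged sketch of the proposed RSS extension: when no direct relation
# links the two query terms, look for a mediating concept M that has
# relations to both, and replace the single failing triple with two
# triples through M. Schema and names are illustrative.
def expand_via_mediator(schema, term1, term2):
    """schema: iterable of (concept, relation, concept) class-level triples.
    Returns a pair of triples through a mediating concept, or None."""
    out = {}  # concept -> set of (relation, target) pairs
    for s, r, o in schema:
        out.setdefault(s, set()).add((r, o))
    for mediator, links in out.items():
        to1 = [r for (r, o) in links if o == term1]
        to2 = [r for (r, o) in links if o == term2]
        if to1 and to2:
            return ((mediator, to1[0], term1), (mediator, to2[0], term2))
    return None

schema = {
    ("mealcourse", "has-drink", "wine"),
    ("mealcourse", "has-food", "pork"),
    ("meal", "course", "mealcourse"),
}
pair = expand_via_mediator(schema, "wine", "pork")
# pair links "wine" and "pork" through the mediating concept "mealcourse"
```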

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?", where AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" are converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the path in the ontology graph can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered: for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question


"who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds: for example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
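One possible mechanism, sketched here under the assumption that compound detection amounts to looking candidate multi-word terms up in the ontology vocabulary (the tokenization and term set are toy examples, not AquaLog's code):

```python
# Illustrative sketch of nominal-compound detection: before treating
# adjacent nouns as separate query terms, check whether their
# concatenation names a known ontology term (e.g. "smoked salmon").
def merge_compounds(tokens, ontology_terms):
    merged, i = [], 0
    while i < len(tokens):
        # prefer the longest known multi-word term starting at position i
        for j in range(len(tokens), i + 1, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in ontology_terms:
                merged.append(candidate)
                i = j
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

terms = {"smoked salmon", "research department"}
```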

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
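The "answer for both" option mentioned above for ambiguous "and" can be sketched as follows (the data and function names are illustrative, not part of AquaLog):

```python
# Minimal sketch of the fallback for ambiguous "and": when domain
# knowledge cannot decide between conjunction and disjunction, compute
# both readings and let the user (or a heuristic) choose.
def both_readings(candidates_a, candidates_b):
    return {
        "conjunction": sorted(set(candidates_a) & set(candidates_b)),
        "disjunction": sorted(set(candidates_a) | set(candidates_b)),
    }

lives_in_california = ["ann", "bob"]
lives_in_arizona = ["bob", "eve"]
readings = both_readings(lives_in_california, lives_in_arizona)
# conjunction: applicants in both states; disjunction: in either state
```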


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
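A minimal sketch of such community-scoped jargon mappings (the class and its structure are our assumption, not AquaLog's Learning Mechanism):

```python
# Sketch of a community-scoped jargon store: a learned substitution for
# a phrase like "cool projects" is keyed by the user's community, so the
# same jargon can resolve differently for different user groups.
class JargonLexicon:
    def __init__(self):
        self._maps = {}  # (community, phrase) -> ontology terms

    def learn(self, community, phrase, ontology_terms):
        self._maps[(community, phrase)] = list(ontology_terms)

    def resolve(self, community, phrase):
        # returns None when the phrase is unknown for this community,
        # i.e. user help would be requested again
        return self._maps.get((community, phrase))

lex = JargonLexicon()
lex.learn("phd-students", "cool projects", ["research area", "publications"])
lex.learn("professors", "cool projects", ["funding opportunity"])
```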

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms for handling ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
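Such a fallback could look roughly like this; the vocabulary and function below are hypothetical illustrations of the idea, not a planned implementation:

```python
# Hedged sketch of the fallback described above: when the question's
# structure is not recognized, keep only the tokens that name something
# in the ontology and form a bare triple from them, leaving the relation
# to be discovered via an ontology path search. Vocabulary is a toy set.
def fallback_triple(question, known_terms):
    tokens = question.lower().strip("?").split()
    hits = [t for t in tokens if t in known_terms]
    if len(hits) >= 2:
        return (hits[0], None, hits[1])  # relation filled in later
    return None

known = {"capital", "italy"}
triple = fallback_triple("Could you please tell me the capital of Italy?", known)
```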

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Symantec's Q&A, NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the process of mapping onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, such as "who" being the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.
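The paper does not show AquaLog's configuration schema, but a domain-dependent mapping such as "who" onto person/organization classes could plausibly live in an XML file along these lines; every element and attribute name below is invented for illustration.

```xml
<!-- Hypothetical configuration sketch; AquaLog's actual XML schema is not
     given in the paper, so all names here are illustrative assumptions. -->
<aqualog-config>
  <!-- Domain-dependent mappings from question words to candidate classes -->
  <question-word word="who">
    <maps-to class="person"/>
    <maps-to class="organization"/>
  </question-word>
  <question-word word="where">
    <maps-to class="location"/>
  </question-word>
</aqualog-config>
```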

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
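The fallback behaviour described here (reasoning over the relations defined for a class when a query word such as "neighborhood" has no match) can be sketched roughly as follows. The toy ontology, function names and threshold are illustrative assumptions, not AquaLog's actual implementation, and `difflib.SequenceMatcher` merely stands in for its string metric algorithms.

```python
from difflib import SequenceMatcher

# Toy ontology fragment: class -> {relation name: range class}. Invented for
# illustration; AquaLog's real similarity services work over an actual KB.
ONTOLOGY = {
    "city": {"has-neighborhood": "district", "located-in": "country"},
    "person": {"works-for": "organization", "has-interest": "research-area"},
}

def string_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp ratio, standing in for AquaLog's string metrics."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_relation(term_class: str, relation_word: str, threshold: float = 0.6):
    """Map a query relation onto a relation defined for the given class.

    If no lexical match clears the threshold, return all candidate relations,
    leaving disambiguation to a later phase (or to the user). This mirrors the
    idea that knowing which relations are defined for 'city' can be enough to
    interpret "what are some of the neighborhoods of Chicago" even when
    'neighborhood' itself is an unknown word.
    """
    candidates = list(ONTOLOGY.get(term_class, {}))
    candidates.sort(key=lambda r: string_similarity(r, relation_word), reverse=True)
    if candidates and string_similarity(candidates[0], relation_word) >= threshold:
        return [candidates[0]]
    return candidates  # unresolved: every relation defined for the class

print(match_relation("city", "neighborhoods"))  # → ['has-neighborhood']
```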

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit "dormant"; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and to general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").
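A minimal sketch of this kind of keyword expansion is shown below; the synonym table and suffix rules are toy stand-ins for WordNet lookups and real morphological analysis, and deliberately keep the "who" ambiguity unresolved.

```python
# Illustrative keyword expansion; the lexicon and rules are invented toys.
SYNONYMS = {
    "person": {"human", "individual"},
    "who": {"person", "organization"},  # ambiguity deliberately kept
}

SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", "")]

def morphological_variants(word: str) -> set:
    """Very naive stemming: strip common English suffixes."""
    variants = {word}
    for suffix, repl in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            variants.add(word[: -len(suffix)] + repl)
    return variants

def expand(keyword: str) -> set:
    """Expand a keyword with its synonyms, plus variants of each term."""
    expanded = set()
    for term in {keyword} | SYNONYMS.get(keyword, set()):
        expanded |= morphological_variants(term)
    return expanded

print(sorted(expand("who")))  # → ['organization', 'person', 'who']
```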

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected, or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open

domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point; triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem (ontology-triples mapping) is solved not only by looking at the labels, but also by looking at the lexically related words and at the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.
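The TAP-style anchoring described above (mapping a search term to nodes via rdfs:label, then collecting triples in their vicinity) might be sketched as follows; the triple store, the property names beyond rdf:type/rdfs:label, and the radius parameter are illustrative assumptions, not the actual TAP KB.

```python
# Toy triple store; only rdf:type and rdfs:label are properties named by the
# paper, the rest of the data is invented for illustration.
TRIPLES = [
    ("Guernsey", "rdf:type", "Island"),
    ("Guernsey", "rdfs:label", "guernsey"),
    ("Guernsey", "official-language", "French"),
    ("French", "rdf:type", "Language"),
]

def denotations(term: str):
    """Candidate nodes whose rdfs:label matches the search term."""
    return [s for s, p, o in TRIPLES if p == "rdfs:label" and o == term.lower()]

def vicinity(node: str, radius: int = 1):
    """Collect triples reachable within `radius` hops of the anchor node."""
    frontier, collected = {node}, set()
    for _ in range(radius):
        new = {t for t in TRIPLES if t[0] in frontier or t[2] in frontier}
        collected |= new
        frontier |= {t[0] for t in new} | {t[2] for t in new}
    return collected

anchor = denotations("Guernsey")[0]
print(len(vicinity(anchor)))  # triples collected around the anchor node
```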

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the


web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites; there are also systems, such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey", for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during its life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories; for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the


category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
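The quoted PiQASso rule can be captured in a few lines; the dependency edges and relevance set below simply encode the "man first walked on the moon in 1969" example and are not PiQASso's actual data structures.

```python
# Sketch of the rule: whenever two relevant nodes are linked through an
# irrelevant one, add a direct link between them (one-hop short-circuit).
def short_circuit(edges, relevant):
    augmented = set(edges)
    for a, mid in edges:
        for mid2, b in edges:
            if mid == mid2 and a in relevant and b in relevant and mid not in relevant:
                augmented.add((a, b))
    return augmented

# '1969' depends on 'in', which in turn depends on 'moon' (from Ref. [5]).
deps = {("1969", "in"), ("in", "moon"), ("moon", "walked")}
result = short_circuit(deps, relevant={"1969", "moon", "walked", "man"})
print(("1969", "moon") in result)  # → True: direct link across the irrelevant 'in'
```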

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations, or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format, open-domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved. It groups together different questions that can be represented by equivalent triples; for instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)
Who are the researchers in the semantic web research area | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉
Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉


Appendix A (Continued)

Wh-patterns
  Query: "What is the homepage of Peter who has interest in semantic web?"
  Linguistic triple: 〈which-is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triple: 〈which-is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

  Query: "Which are the projects of academics that are related to the semantic web?"
  Linguistic triple: 〈which-is, projects, academics〉 〈which-is, related, semantic web〉
  Ontology triple: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


POS tags are assigned in the "category" feature of the "Token" annotation used in GATE to encode the information about the analyzed question. Any word sequence identified by the left hand side of a rule can be referenced in its right hand side, e.g. rel.REL = rule = "REL1" is referenced by the macro RELATION in the second grammar (Fig. 9).

The second grammar file, based on the previous annotations (rules), classifies the example query as "wh-generic term". The set of JAPE rules used to classify the query can be seen in Fig. 9. The pattern fired for that query, "QUWhich IS ARE TERM RELATION TERM", is marked in bold. This pattern relies on the macros QUWhich, IS ARE, TERM and RELATION.

Thanks to this architecture, which takes advantage of the JAPE grammars, although we can still only deal with a subset of natural language, it is possible to extend this subset in a relatively easy way by updating the regular expressions in the JAPE grammars. This design choice ensures the easy portability of the system with respect to both ontologies and natural languages.

Fig. 9. Set of JAPE rules used in the query example to classify the queries.

5. Relation similarity service: from triples to answers

The RSS is the backbone of the question-answering system. The RSS component is invoked after the NL query has been transformed into a term-relation form and classified into the appropriate category. The RSS is the main component responsible for generating an ontology-compliant logical query. Essentially, the RSS tries to make sense of the input query by looking at the structure of the ontology and the information stored in the target KBs, as well as by using string similarity matching, generic lexical resources such as WordNet, and a domain-dependent lexicon obtained through the use of a Learning Mechanism (explained in Section 6).

An important aspect of the RSS is that it is interactive. In other words, when ambiguity arises between two or more possible terms or relations, the user will be required to interpret the query.

Relation and concept names are identified and mapped within the ontology through the RSS and the Class Similarity Service (CSS), respectively (the CSS is a component which is part of the Relation Similarity Service). Syntactic techniques like string algorithms are not useful when the equivalent ontology terms have labels dissimilar from the user term. Moreover, relation names can be considered "second class citizens". The rationale behind this is that mapping relations based on their names (labels) is more difficult than mapping concepts: labels in the case of concepts often capture the meaning of the semantic entity, while in relations the meaning is given by the type of the domain and the range rather than by the name (typically vaguely defined, e.g. "related-to"), with the exception of the cases in which relations are presented in some ontologies as a concept (e.g. has-author modeled as a concept Author in a given ontology). Therefore the similarity services should use the ontology semantics to deal with these situations.

Proper names, instead, are normally mapped into instances by means of string metric algorithms. The disadvantage of such an approach is that, without some facility for dealing with unrecognized terms, some questions are not accepted. For instance, a question like "is there any researcher called Thompson?" would not be accepted if Thompson is not identified as an instance in the current KB. However, a partial solution is implemented for affirmative/negative types of questions, where we make sense of questions in which only one of two instances is recognized, e.g. in the query "is Enrico working in ibm?", where "Enrico" is mapped into "Enrico-Motta" in the KB but "ibm" is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve the ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point into the knowledge base or ontology, they are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system "knows about", i.e. which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we show a few examples which are illustrative of the way the RSS works.
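The interactive behaviour described above (resolve automatically when possible, otherwise defer to the user) can be sketched in a few lines. The class and function names below are illustrative, not AquaLog's actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class QueryTriple:
    """Intermediate term-relation form of a question,
    e.g. ('research areas', 'covered', 'akt')."""
    subject: str
    relation: str
    obj: str

def resolve_ambiguity(candidates: List[str],
                      ask_user: Optional[Callable[[List[str]], str]] = None
                      ) -> Optional[str]:
    """Return the single candidate if unambiguous; otherwise fall back
    to user feedback, mirroring the interactive behaviour of the RSS."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    if ask_user is not None:
        return ask_user(candidates)
    return None

# A canned 'user' that always picks the first option, for demonstration.
picked = resolve_ambiguity(
    ["addresses-generic-area-of-interest", "uses-resource"],
    ask_user=lambda opts: opts[0])
print(picked)  # addresses-generic-area-of-interest
```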

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let's consider a basic question like "what are the research areas covered by akt?" (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type wh-query (one represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. The first step for the RSS is then to identify that "research areas" is actually a "research-area" in the target KB and that "akt" is a "project", through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links "research areas", or any of its subclasses, to "projects", or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are "addresses-generic-area-of-interest" and "uses-resource". Using the WordNet lexicon, AquaLog identifies that a synonym of "covered" is "addresses", and therefore suggests to the user that the question could be interpreted in terms of the relation "addresses-generic-area-of-interest". Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term "research areas" need to be considered. To clarify this, consider a case in which we are talking about a generic term "person"; the possible relation could then be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept "projects".

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, and eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, the user is asked to choose from the current list of candidates.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.
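The relation search just described (find relations linking the subject class or its subclasses to the object class or its superclasses, then filter by WordNet synonyms) can be sketched as follows. The toy hierarchy, relation signatures and synonym table are invented for illustration only:

```python
from typing import Dict, List, Set, Tuple

PARENT: Dict[str, str] = {            # toy subsumption hierarchy
    "kmi-project": "project",
    "research-area": "area-of-interest",
}
RELATIONS: List[Tuple[str, str, str]] = [  # (domain, relation, range)
    ("research-area", "addresses-generic-area-of-interest", "project"),
    ("research-area", "uses-resource", "project"),
]
SYNONYMS = {"covered": {"addresses"}}      # e.g. supplied by WordNet

def ancestors(cls: str) -> Set[str]:
    """The class itself plus all its superclasses."""
    out = {cls}
    while cls in PARENT:
        cls = PARENT[cls]
        out.add(cls)
    return out

def candidate_relations(subj: str, obj: str) -> List[str]:
    """Relations whose domain subsumes subj and whose range subsumes obj
    (inherited relations apply to subclasses)."""
    return [name for dom, name, rng in RELATIONS
            if dom in ancestors(subj) and rng in ancestors(obj)]

def rank_by_synonyms(cands: List[str], verb: str) -> List[str]:
    """Prefer candidates whose label contains the verb or a synonym."""
    syns = SYNONYMS.get(verb, set()) | {verb}
    return [c for c in cands if any(s in c for s in syns)] or cands

cands = candidate_relations("research-area", "kmi-project")
print(rank_by_synonyms(cands, "covered"))
# ['addresses-generic-area-of-interest']
```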

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst.?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query, and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst.?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (or sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person". Therefore the triple is re-classified: from a generic one formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary", which is more specific than "person", and "research institute".

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation from all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore the translation process from lexical to ontological triples can go ahead without misinterpretations.

Fig. 12. Example of AquaLog in action for relations formed by a concept.
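The re-classification step (a linguistic "relation" such as "secretary" turning out to be a concept in the KB) reduces to a membership test against the ontology's class names. A minimal sketch, with hypothetical names:

```python
from typing import Optional, Set, Tuple

Triple = Tuple[str, Optional[str], str]

def reclassify(triple: Triple, ontology_classes: Set[str]) -> Triple:
    """If the linguistic relation (e.g. 'secretary') is actually a concept
    in the KB, demote the triple to one with an implicit (None) relation
    between the more specific concept and the original object, as the RSS
    does for 'who is the secretary in kmi'."""
    subj, rel, obj = triple
    if rel in ontology_classes:
        return (rel, None, obj)   # relation left implicit / unknown
    return triple

print(reclassify(("person", "secretary", "kmi"), {"secretary", "person"}))
# ('secretary', None, 'kmi')
```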

Note that some additional lexical features which help in the translation, or which put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" for the concept person.
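The special treatment of "is" as a taxonomic rather than an ad hoc relation amounts to walking the subsumption hierarchy. A sketch, assuming a simple instance-of table and parent map (both hypothetical representations, not AquaLog's internal model):

```python
from typing import Dict

def interpret_is_relation(instance: str, cls: str,
                          instance_of: Dict[str, str],
                          parent: Dict[str, str]) -> bool:
    """For queries like 'is John Domingue a phd student?' the relation
    'is' is resolved as taxonomic membership: climb from the instance's
    class through 'subclass-of' links looking for the target class."""
    c = instance_of.get(instance)
    while c is not None:
        if c == cls:
            return True
        c = parent.get(c)
    return False

# John Domingue is modelled here (hypothetically) as a senior researcher,
# so the taxonomic check for 'phd-student' fails.
print(interpret_is_relation(
    "john-domingue", "phd-student",
    {"john-domingue": "senior-researcher"},
    {"senior-researcher": "researcher", "researcher": "person"}))
# False
```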

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of the basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say that, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end and after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (and "in akt" in the previous examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case, the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to the semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, it is first necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by epsrc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.
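The attachment heuristic (attach the trailing clause to the closest preceding term that is a class, i.e. non-ground, in the ontology) can be sketched as follows; `is_ground` is a hypothetical predicate that would, in practice, query the KB:

```python
from typing import Callable, List, Tuple

def attach_modifier(terms: List[str], modifier: str,
                    is_ground: Callable[[str], bool]) -> Tuple[str, str]:
    """Attach the trailing modifier clause to the closest preceding term
    that is a class (non-ground term) in the ontology, per the RSS
    heuristic; fall back to the first term if all are ground."""
    for term in reversed(terms):
        if not is_ground(term):
            return (term, modifier)
    return (terms[0], modifier)

# 'are there any projects on the semantic web sponsored by epsrc':
# 'semantic web' is ground (an instance), 'projects' is a class.
print(attach_modifier(["projects", "semantic web"], "sponsored by epsrc",
                      is_ground=lambda t: t == "semantic web"))
# ('projects', 'sponsored by epsrc')
```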

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Depending on the type of query, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in such a way that it can determine that "akt" is a "project" and therefore not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation term "researchers" is mapped into a concept in the ontology which is a subclass of "person", and therefore the second clause, "who have an interest in knowledge reuse", is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

However, there can be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publications", is mapped into a concept in our KB; the adopted heuristic is therefore that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, this heuristic is frequently not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes one of identifying whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" and the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Fig. 15. Example of context disambiguation.
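The main-verb criterion can be approximated as below. This is a simplified sketch on plain strings (AquaLog works on parsed triples, not raw text), and the function name is illustrative:

```python
def attach_clause(question: str, main_verb: str, clause_verb: str) -> str:
    """Main-verb heuristic (simplified): if the clause verb IS the main
    verb, the clause modifies the subject head of the nominal group;
    otherwise it modifies the closest term preceding the clause verb in
    the predicate."""
    if clause_verb in main_verb:
        # e.g. 'which academics working in akt ARE SPONSORED by epsrc'
        return question.split()[1]            # head noun after 'which'
    before, _, _ = question.partition(clause_verb)
    return before.strip().split()[-1]         # closest preceding term

q1 = "which academics work in the akt project sponsored by epsrc"
q2 = "which academics working in the akt project are sponsored by epsrc"
print(attach_clause(q1, "work", "sponsored"))          # project
print(attach_clause(q2, "are sponsored", "sponsored")) # academics
```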

4 Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances; see Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (i.e., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (i.e., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate if one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet so it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change with ontologies; e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.
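The Jaro-based matching with task-dependent thresholds described above is performed by a Java API in AquaLog itself. Purely for illustration, a self-contained Python version of the Jaro metric might look like the sketch below; the threshold values and the `accept` helper are invented, as the actual thresholds are not given here:

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]: 1.0 for identical strings, 0.0 for no matches."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters match if equal and within half the longer length (minus one).
    window = max(len(s), len(t)) // 2 - 1
    s_hit = [False] * len(s)
    t_hit = [False] * len(t)
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Half-transpositions: matched characters appearing in a different order.
    s_seq = [s[i] for i in range(len(s)) if s_hit[i]]
    t_seq = [t[j] for j in range(len(t)) if t_hit[j]]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3

# Hypothetical task-dependent thresholds (illustrative values only).
THRESHOLDS = {"concept": 0.90, "relation": 0.85, "instance": 0.80}

def accept(term: str, candidate: str, task: str) -> bool:
    return jaro(term.lower(), candidate.lower()) >= THRESHOLDS[task]
```

Note that, as the text observes, a pair like "researcher"/"academic" scores far below any reasonable threshold, which is exactly why the learning mechanism and WordNet are needed as complements.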

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take as an input the Onto-Triple and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect) but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse", the result will be a fusion of the instances of people who have an interest in ontologies and the people who have an interest in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by epsrc", the second triple ⟨akt project, sponsored, epsrc⟩ must be resolved and the instance representing the "akt project sponsored by epsrc" identified, to get the list of academics required for the first triple ⟨KMi academics, work, akt project⟩.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web", the second triple ⟨researchers, working, semantic web⟩ must be resolved and the list of researchers obtained prior to generating an answer for the first triple ⟨homepage, researchers⟩.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.
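The first two integration mechanisms can be pictured as set operations over resolved triples. The toy sketch below is our own illustration (the KB contents and function names are invented): and/or linking is a union, and a conditional link resolves the second triple first and feeds its answer into the first.

```python
# Toy KB, invented for illustration: (subject_class, relation, object) -> instances.
KB = {
    ("person", "has-interest", "ontologies"): {"alice", "bob"},
    ("person", "has-interest", "knowledge-reuse"): {"bob", "carol"},
    ("project", "sponsored-by", "epsrc"): {"akt"},
    ("person", "works-in", "akt"): {"alice", "dave"},
}

def resolve(subject: str, relation: str, obj: str) -> set:
    """Look up the instances answering a single Onto-Triple."""
    return KB.get((subject, relation, obj), set())

# (1) And/or linking: the answer is the fusion (union) of both triples' answers,
# as in "who has an interest in ontologies or in knowledge reuse".
who = (resolve("person", "has-interest", "ontologies")
       | resolve("person", "has-interest", "knowledge-reuse"))

# (2) Conditional link to a term: resolve the second triple first, then feed
# its answer into the first triple (cf. the "akt project sponsored by epsrc" example).
academics = set()
for project in resolve("project", "sponsored-by", "epsrc"):
    academics |= resolve("person", "works-in", project)
```

Mechanism (3) works the same way as (2), except that the intermediate answer is a list of instances feeding the subject position of the first triple rather than its object.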

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for some matching results. Subsequently these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is the opposite of the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

Fig. 17. The learning mechanism architecture.
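The learn/match cycle above can be sketched minimally as follows. This is our own reading of the mechanism, with a toy taxonomy and invented record fields; the real system stores these records in a database and can tighten or loosen the context constraints:

```python
from dataclasses import dataclass

# Toy taxonomy standing in for the pluggable ontology: class -> superclass.
SUPER = {"researcher": "people", "kmi": "organization-unit",
         "people": None, "organization-unit": None}

@dataclass
class Mapping:
    word: str        # the user's jargon, e.g. "collaborates"
    relation: str    # the ontology relation it was disambiguated to
    arg1: str        # generalized class of the first triple argument
    arg2: str        # generalized class of the second triple argument
    ontology: str
    user: str

DB: list[Mapping] = []   # stands in for the learned-lexicon database table

def is_subclass_or_instance(cls, ancestor):
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUPER.get(cls)
    return False

def learn(word, relation, arg1, arg2, ontology, user):
    DB.append(Mapping(word, relation, arg1, arg2, ontology, user))

def match(word, arg1, arg2, ontology, user=None):
    """CheckContext-style lookup: the question's arguments must fall under
    the stored context; the user constraint can be bypassed (user=None)."""
    for m in DB:
        if (m.word == word and m.ontology == ontology
                and (user is None or m.user == user)
                and is_subclass_or_instance(arg1, m.arg1)
                and is_subclass_or_instance(arg2, m.arg2)):
            return m.relation
    return None

# A learned jargon mapping, already generalized to its highest valid classes.
learn("collaborates", "has-affiliation-to-unit",
      "people", "organization-unit", "akt-ontology", "anonymous")
```

A later question whose arguments are "researcher" and "kmi" then matches against the stored context, while the same jargon asked over a different ontology does not.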

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separated mappings while still providing some kind of generalization. This is achieved through the definition of the question context as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field in order to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument while checking if they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by ⟨people⟩ and ⟨organization-unit⟩, while in the second case the context will be ⟨organization⟩ and ⟨organization-unit⟩. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.
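The backtracking generalization can be sketched as a walk up the superclass chain that stops as soon as a class can no longer handle the relation. The ontology fragments and names below are toy stand-ins of our own, not the AKT ontology itself:

```python
# Toy ontology fragments, invented for illustration: class -> superclass.
SUPERCLASS = {"kmi-researcher": "researcher", "researcher": "academic",
              "academic": "people", "people": None}
# Classes able to act as the relevant argument of each relation.
CAN_HANDLE = {"has-research-member": {"people", "academic", "researcher",
                                      "kmi-researcher"}}

def get_context(arg_class: str, relation: str) -> str:
    """GetContext-style generalization: backtrack up the superclass chain,
    keeping the highest class that can still handle the relation."""
    highest = arg_class
    parent = SUPERCLASS.get(arg_class)
    while parent is not None and parent in CAN_HANDLE[relation]:
        highest = parent
        parent = SUPERCLASS.get(parent)
    return highest
```

Storing `get_context("researcher", "has-research-member")`, i.e. "people", rather than "researcher" is what lets the later question about "the people working in buddyspace" reuse the same mapping.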

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User's disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project"), where the first generic argument ("who/which") needs more processing since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it could extend the same mappings to all the other registered PhD students.

Fig. 20. Architecture to support a users' community.
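The PhD-student example suggests a simple grouping rule. The sketch below is speculative (this propagation is explicitly future work in the paper), and all record fields, names and the threshold are invented; it assumes one row per (user, mapping) pair:

```python
from collections import Counter

# Hypothetical learned-lexicon rows: (jargon, ontology relation, user, profile).
LEARNED = [
    ("collaborates", "works-in-the-same-project", "u1", "phd-student"),
    ("collaborates", "works-in-the-same-project", "u2", "phd-student"),
    ("collaborates", "works-in-the-same-project", "u3", "phd-student"),
    ("collaborates", "has-affiliation-to-unit", "u4", "secretary"),
]

def shared_jargon(records, profile, min_users=3):
    """Mappings used by at least min_users distinct users with the same
    profile: candidates to extend to every other user of that profile."""
    counts = Counter()
    for word, relation, user, p in records:
        if p == profile:
            counts[(word, relation)] += 1
    return [mapping for mapping, n in counts.items() if n >= min_users]
```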

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, with the database empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing, and as a result a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components; i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions. A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure.

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).
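The percentages in Table 1 follow directly from the raw counts over the 69 questions; they appear to be truncated, rather than rounded, to two decimal places, which the following check reproduces:

```python
import math

def pct(k: int, n: int = 69) -> float:
    """Percentage truncated (not rounded) to two decimals, as in Table 1."""
    return math.floor(k / n * 10000) / 100

# Reproduce the table's figures from the raw counts.
assert pct(33) == 47.82   # successful queries without conceptual failure
assert pct(7) == 10.14    # queries with only conceptual failure
assert pct(16) == 23.18   # linguistic failures
assert pct(10) == 14.49   # service failures
assert pct(8) == 11.59    # RSS failures
```

The headline figure is the one value reported as an integer: 40/69 ≈ 57.97%, quoted as "approximately 58%".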

Therefore, only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8; however, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem; in fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [ ]"; (2) questions annotated wrongly, such as tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".

• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.

• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation being selected through syntactic matching between labels (through the use of string metrics and WordNet synonyms), not by meaning; e.g., in "what is John Domingue working on" we map into the triple ⟨organization, works-for, john-domingue⟩ instead of ⟨research-area, have-interest, john-domingue⟩, or in "how can I contact Vanessa", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".

• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of services have been identified, namely ranking, similarity/comparative, and proportions/percentages, which remain a future line of work for future versions of the system.
services have been identified: ranking, similarity/comparative and proportions/percentages, which remain a line of work for future versions of the system.

Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total iterations               43                  40                                  16

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42). On many occasions questions can easily be reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means that 65.51% of the failures (19 of 29) can easily be avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude: if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of cases we can improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.

Furthermore, with the use of the LM the number of queries that need two or three iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increase in the number of queries.
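The paper describes the Learning Mechanism only at this level of detail. As a hedged illustration, the core idea (cache the relation a user selected for an ambiguous linguistic triple, keyed by a context such as the argument classes, and relax that context to match similar but not identical queries) might be sketched as follows; all names here are ours, not AquaLog's:

```python
# Hypothetical sketch of a learning mechanism: it caches the ontology relation
# a user chose for an ambiguous linguistic relation, keyed by the argument
# classes ("context"), so similar but not identical queries reuse the mapping.
class LearningMechanism:
    def __init__(self):
        # (linguistic_relation, subject_class, object_class) -> ontology_relation
        self.mappings = {}

    def learn(self, ling_rel, subj_class, obj_class, onto_rel):
        self.mappings[(ling_rel, subj_class, obj_class)] = onto_rel

    def lookup(self, ling_rel, subj_class, obj_class):
        # Exact context hit first, then relax the context: same wording and
        # subject class, any object class (a similar, not identical, query).
        exact = self.mappings.get((ling_rel, subj_class, obj_class))
        if exact:
            return exact
        for (rel, s, _o), onto_rel in self.mappings.items():
            if rel == ling_rel and s == subj_class:
                return onto_rel
        return None  # no learned mapping: a user iteration is needed

lm = LearningMechanism()
lm.learn("works", "person", "project", "has-project-member")
# A similar query about a different object class is still disambiguated:
print(lm.lookup("works", "person", "kmi-project"))  # -> has-project-member
```

This is only one way to realize the "context" notion the authors mention; the real component is described in Section 6 of the paper.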

7.3. Portability across ontologies

In order to evaluate portability we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto

the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

13 http://www.thewinesociety.com

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL developed for the Sesame RDF platform and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instance definitions in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink riesling)))

Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does

not present complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.
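The "direct paths, two triples in length" idea behind the instances added for this evaluation (Table 3) can be illustrated with a toy join; the triples and names below are illustrative stand-ins, not the actual KB:

```python
# Illustrative KB of (subject, relation, object) triples mirroring the
# mealcourse instances added for the evaluation (names are examples only).
kb = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def answer_two_triple(food):
    """'Which wine goes with <food>?': join two triples through the
    mediating mealcourse instance, i.e., a path of exactly two relations."""
    courses = [s for (s, r, o) in kb if r == "has-food" and o == food]
    return [o for (s, r, o) in kb if r == "has-drink" and s in courses]

print(answer_two_triple("pizza"))  # -> ['marietta-zinfadel']
```

Without such instances the two query terms are not connected by any short path, which is exactly why the unpopulated ontology yields no answer.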

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries                              (47 of 68)   69.11%
  Total number of successful queries without conceptual failure   (12 of 68)   17.64%
  Total number of queries with only conceptual failure            (35 of 68)   51.47%

AquaLog failures:
  Total number of failures (same query can fail for more than one reason)   (21 of 68)   30.88%
  RSS failure                                                     (15 of 68)   22%
  Service failure                                                 (0 of 68)    0%
  Data model failure                                              (0 of 68)    0%
  Linguistic failure                                              (9 of 68)    13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated                     (14 of 21)   66.66% of failures
  With conceptual failure                                         (9 of 14)    64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

- Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
- Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.
- RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
- Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.
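The RSS failures above stem from purely lexical relation matching. A toy sketch of that behaviour, using difflib's string similarity plus a small synonym table standing in for WordNet (the threshold and lexicon are our assumptions, not AquaLog's), reproduces the characteristic mis-mapping described earlier:

```python
from difflib import SequenceMatcher

# Toy synonym lexicon standing in for WordNet lookups.
SYNONYMS = {"entree": {"mealcourse"}, "pudding": {"dessertcourse"}}

def pick_relation(ling_rel, candidates, threshold=0.4):
    """Select the ontology relation whose label is lexically closest to the
    linguistic relation; return None when nothing clears the threshold."""
    def score(label):
        if label in SYNONYMS.get(ling_rel, set()):
            return 1.0
        return SequenceMatcher(None, ling_rel, label).ratio()
    best = max(candidates, key=score)
    return best if score(best) >= threshold else None

# Lexical matching can pick a wrong but similar-looking label: the same
# failure mode as "contact" being mapped to "employment-contract-type".
print(pick_relation("contact", ["employment-contract-type", "has-email"]))
# -> employment-contract-type
```

The point of the sketch is that no meaning is consulted: a label that merely looks similar wins, which is precisely why the paper argues for semantics-aware relation selection.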

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal course mealcourse〉, 〈mealcourse has-food nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse has-drink wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is madefromfruit dessert〉 and 〈dessert pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions"; or, for instance, in the basic generic query "what should we drink with a starter of seafood?" AquaLog tries to map the term "seafood" into an instance, whereas "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.
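The triple reclassification called for here (rewriting a failed direct triple as two triples through a mediating concept) can be sketched as follows; the schema entries mirror the wine-ontology examples above, while the expansion routine itself is our illustration, not the published RSS:

```python
# Ontology schema as (domain, relation, range) entries, following the
# mediating-concept structure of the wine and food ontology.
schema = [
    ("mealcourse", "has-food", "dessert"),
    ("mealcourse", "has-drink", "wine"),
]

def expand_triple(subj, obj):
    """No direct relation exists between subj and obj: look for a mediating
    concept m with relations m->subj and m->obj, and return two triples."""
    for (d1, r1, rng1) in schema:
        for (d2, r2, rng2) in schema:
            if d1 == d2 and rng1 == subj and rng2 == obj:
                return [(d1, r1, subj), (d1, r2, obj)]
    return None  # no mediating concept found

# "what can I serve with a dessert of pie?": route wine and dessert
# through the mediating concept "mealcourse".
print(expand_triple("dessert", "wine"))
# -> [('mealcourse', 'has-food', 'dessert'), ('mealcourse', 'has-drink', 'wine')]
```

A production version would of course search the full ontology graph rather than a flat list, but the dynamic two-triple rewrite is the essential move the authors propose.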

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, the lack of appropriate reasoning services or of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.

8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require an understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".
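The proposed count feature need be no more than an extra slot on the intermediate triple; a minimal sketch (AquaLog's actual triple structure is not reproduced here, so the field names are ours):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriple:
    """Hypothetical intermediate Query-Triple with an optional count slot."""
    subject: str
    relation: str
    obj: str
    count: Optional[int] = None  # e.g. "Name 30 different people ..." -> 30

q = QueryTriple("people", "work-in", "kmi", count=30)
print(q.count)  # -> 30
```

The answer-retrieval stage would then truncate the instance list to `count` items when the slot is set.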


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, while DIMAP has a triple per discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers have publications KMi〉 and 〈which is/are related social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈research publications〉, 〈publications KMi〉 and 〈publications related social aspects〉; three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant" and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is represented not as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question

"who is the director in KMi?". In the ontology this may be represented as "Enrico Motta – has-job-title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so that we get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
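A first-cut mechanism for the compound detection described above could simply try every split point and test whether both halves map to ontology classes; the class list and the singularization rule below are our toy assumptions, not AquaLog's:

```python
# Toy ontology class vocabulary (hyphenated multi-word class names).
classes = {"researcher", "project", "paper", "semantic-web", "akt"}

def split_compound(term):
    """Try to split a noun-noun compound into (modifier, head) where both
    halves map to ontology classes; plural heads are naively singularized."""
    words = term.lower().split()
    for i in range(1, len(words)):
        mod, head = "-".join(words[:i]), "-".join(words[i:])
        head = head.rstrip("s")  # crude singular form
        if mod in classes and head in classes:
            # An implicit relation holds between the two classes,
            # so a new triple would be generated here.
            return mod, head
    return None

print(split_compound("semantic web papers"))  # -> ('semantic-web', 'paper')
```

Real systems need morphology and ontology lookups far beyond `rstrip("s")`, but the split-and-test shape is the essence of the mechanism the authors call for.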

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
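The person-and-number heuristic, which the text notes is not implemented in AquaLog, is straightforward to sketch with toy morphology: if the main noun is plural but the subordinate verb is singular, the relative clause must attach to the other, singular antecedent.

```python
def attach_clause(main_noun_plural, clause_verb):
    """Decide whether 'who <verb> ...' modifies the plural main noun or a
    singular proper-noun antecedent, using verb agreement (toy morphology:
    an -s ending marks third person singular)."""
    verb_singular = clause_verb.endswith("s") and not clause_verb.endswith("ss")
    if main_noun_plural and verb_singular:
        return "proper-noun"   # "which academics work with Peter who has ..." -> Peter
    if not main_noun_plural and not verb_singular:
        return "main-noun"     # a plural clause verb cannot attach to Peter
    return "ambiguous"         # "which academic works with Peter who has ..."

print(attach_clause(True, "has"))   # -> proper-noun
print(attach_clause(False, "has"))  # -> ambiguous
```

As the paper observes, relying on this heuristic presumes well-formed input, which the collected questions did not always provide; hence AquaLog asks the user instead.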


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge, as, for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system will not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.
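The fallback described here, looking for any ontology path between the recognized terms, is a plain breadth-first search over the relation graph; the relations below are made up for illustration:

```python
from collections import deque

# Ontology relations as labeled edges (concept, relation, concept);
# the names are illustrative, not from a real KB.
edges = [
    ("professor", "works-on", "project"),
    ("project", "in-area", "research-area"),
    ("research-area", "has-publication", "publication"),
]

def find_path(start, goal):
    """Breadth-first search for a relation path between two recognized terms."""
    graph = {}
    for s, r, o in edges:
        graph.setdefault(s, []).append((r, o))
        graph.setdefault(o, []).append((r, s))  # traverse edges in both directions
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return None  # the terms are not connected at all

print(find_path("professor", "publication"))
# -> ['works-on', 'in-area', 'has-publication']
```

Showing such a path to the user is weaker than answering the question, but it surfaces the vocabulary the ontology actually supports, which is exactly the reformulation hint the text envisages.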

8.5. New research lines

While the current version of AquaLog is portable from oneomain to the other (the system being agnostic to the domainf the underlying ontology) the scope of the system is lim-ted by the amount of knowledge encoded in the ontology One

ay to overcome this limitation is to extend AquaLog in theirection of open question answering ie allowing the sys-em to combine knowledge from the wide range of ontologiesutonomously created and maintained on the semantic web

ibap

Agents on the World Wide Web 5 (2007) 72ndash105 97

his new re-implementation of AquaLog currently under devel-pment is called PowerAqua [37] In a heterogeneous openomain scenario it is not possible to determine in advance whichntologies will be relevant to a particular query Moreover it isften the case that queries can only be solved by composingnformation derived from multiple and autonomous informationources Hence to perform QA we need a system which is ableo locate and aggregate information in real time without makingny assumptions about the ontological structure of the relevantnformation

AquaLog bridges between the terminology used by the usernd the concepts used in the underlying ontology PowerAquaill function in the same way as AquaLog does with the essen-

ial difference that the terms of the question will need to beynamically mapped to several ontologies AquaLogrsquos linguisticomponent and RSS algorithms to handle ambiguity within onentology in particular regarding modifier attachment will beully reused Moreover in both AquaLog and PowerAqua theres no pre-determined assumption of the ontological structureequired for complex reasoning therefore the AquaLog evalua-ion and analysis presented here highlighted many of the issueso take into account when building an ontology-based QA

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KB/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers and, as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
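Such a fallback could be sketched as follows (a minimal toy of our own, not AquaLog's actual code): given the two recognized terms, search the KB's triples for a connecting path, ignoring the unparsed phrasing of the question. The example KB and relation names below are invented for illustration.

```python
# Toy fallback: breadth-first search over (subject, relation, object)
# triples, treating edges as undirected, returning the relations used.
from collections import deque

def find_path(triples, start, goal, max_hops=3):
    """Return the list of relations on a shortest path from start to
    goal, or None if no path exists within max_hops links."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        node, rels = frontier.popleft()
        if node == goal:
            return rels
        if len(rels) >= max_hops:
            continue
        for s, r, o in triples:
            for nxt in ((o,) if s == node else (s,) if o == node else ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, rels + [r]))
    return None

kb = [
    ("Italy", "has-capital", "Rome"),
    ("Rome", "located-in", "Italy"),
    ("Rome", "population", "2.7M"),
]
# The terms "capital" and "Italy" fail full parsing, but a path exists:
print(find_path(kb, "Italy", "Rome"))  # ['has-capital']
```

The point of the sketch is that even when the sentence structure is not understood, the ontology's semantics can still connect the recognized terms.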

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.
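As a rough illustration of this kind of path-based semantic similarity (a toy of our own, not FAQ Finder's actual metric), terms can be scored by the number of is-a links separating them in a hand-made hypernym hierarchy standing in for WordNet:

```python
# Toy path-based similarity: terms are more similar when fewer is-a
# links separate them via their lowest common ancestor.
HYPERNYM = {  # child -> parent, a miniature stand-in for WordNet
    "dog": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal",
    "mammal": "animal",
}

def ancestors(term):
    """The is-a chain from a term up to the hierarchy root."""
    chain = [term]
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain

def path_similarity(a, b):
    """1 / (1 + links on the shortest connecting path); 0 if the two
    terms share no ancestor in the hierarchy."""
    ca, cb = ancestors(a), ancestors(b)
    for i, node in enumerate(ca):
        if node in cb:
            return 1.0 / (1 + i + cb.index(node))
    return 0.0

print(path_similarity("dog", "cat"))  # 0.2: dog->canine->mammal<-feline<-cat
```

A word absent from the hierarchy scores zero against everything, which mirrors the coverage problem noted above.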

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
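The contrast between a lexicon-only mapping and a similarity-based fallback can be sketched as follows (a toy of our own; the names and the use of `difflib` are illustrative assumptions, not either system's actual implementation):

```python
# Toy contrast: a strict lexicon rejects unknown words outright,
# while an approximate match against ontology vocabulary can still
# recover a candidate mapping.
import difflib

LEXICON = {"city": "City", "state": "State"}           # known words only
ONTOLOGY_TERMS = ["City", "State", "neighborhood-of"]  # ontology vocabulary

def strict_map(word):
    """PRECISE-style: tokens absent from the lexicon are rejected,
    so the question is not semantically tractable."""
    return LEXICON.get(word)  # None means "cannot be handled"

def similarity_map(word, cutoff=0.6):
    """AquaLog-style fallback: fuzzy-match the word against the
    ontology's own vocabulary."""
    hits = difflib.get_close_matches(word, ONTOLOGY_TERMS, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(strict_map("neighborhoods"))      # None
print(similarity_map("neighborhoods"))  # 'neighborhood-of'
```

In the real system, string metrics are only one of several strategies (WordNet and the ontology structure being the others), but the sketch shows why unknown vocabulary need not be fatal.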

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain-independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...] Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...] A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."
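The TAP-style lookup described above can be sketched roughly as follows (our own simplified toy; the node names, popularity scores and graph are invented for illustration):

```python
# Toy TAP-style denotation lookup: map a search term to candidate SW
# nodes via their labels, disambiguate by popularity, then collect
# triples in the chosen node's vicinity.
GRAPH = [  # (subject, predicate, object)
    ("node:Paris_FR", "rdfs:label", "Paris"),
    ("node:Paris_TX", "rdfs:label", "Paris"),
    ("node:Paris_FR", "rdf:type", "City"),
    ("node:Paris_FR", "capital-of", "node:France"),
]
POPULARITY = {"node:Paris_FR": 0.9, "node:Paris_TX": 0.1}  # e.g. corpus frequency

def denotations(term):
    """All nodes whose rdfs:label matches the search term."""
    return [s for s, p, o in GRAPH if p == "rdfs:label" and o == term]

def select(term):
    """Pick the most popular candidate denotation, or None."""
    cands = denotations(term)
    return max(cands, key=lambda n: POPULARITY.get(n, 0)) if cands else None

def vicinity(node):
    """Triples adjacent to the selected node, the starting point for search."""
    return [(s, p, o) for s, p, o in GRAPH if node in (s, o)]

node = select("Paris")  # most popular of the two "Paris" nodes
```

The vicinity step is exactly where the "proximity reflects mutual relevance" intuition from the quotation comes in.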

In AquaLog, the vocabulary problem (ontology-triples mapping) is solved not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START, from the MIT infolab, focuses on questions about geography. AquaLog's relational data model (triple-based) is somehow similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages", between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".
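The contrast between the two triple styles can be made concrete with a small sketch (the class and field names below are ours, chosen for illustration; neither system exposes such an API):

```python
# Two representations of "what languages are spoken in Guernsey?"
from dataclasses import dataclass

@dataclass
class StartAssertion:
    """START-style object-property-value assertion."""
    obj: str
    prop: str
    value: str

@dataclass
class AquaLogTriple:
    """AquaLog-style triple: a relation holding between two terms."""
    term1: str
    relation: str
    term2: str

# START ties the property to an object and a stored value; AquaLog
# keeps the verb phrase itself as the relation between the terms.
start = StartAssertion(obj="Guernsey", prop="languages", value="French")
aqua = AquaLogTriple(term1="language", relation="are spoken", term2="Guernsey")
```

Keeping the surface relation ("are spoken") in the triple is what later lets the RSS map it against relation names in the ontology.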

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of "what 〈noun〉" questions is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLIDB systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

[Table excerpt: the query "Name the planet stories that are related to akt" yields the linguistic triple 〈planet stories, related to, akt〉 and the ontology triple 〈kmi-planet-news-item, mentions-project, akt〉.]

Both open-domain QA systems and AquaLog classify queries: in AquaLog based on the triple format; in open-domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
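This grouping of equivalent phrasings can be illustrated with a toy normalizer (our own sketch, not AquaLog's implementation; the synonym table below stands in for the lexicon, WordNet lookups and learning mechanism the real system uses):

```python
# Toy normalizer: equivalent phrasings collapse into one query triple,
# and hence into one question category.
SYNONYMS = {
    "who": "person",          # asking point -> expected class
    "researchers": "person",
    "involved in": "works",   # relation-level equivalence
    "akt project": "akt",
}

def normalize(triple):
    """Map each element of a linguistic triple to its canonical form."""
    return tuple(SYNONYMS.get(t, t) for t in triple)

# Two phrasings of the same question reduce to the same triple:
q1 = normalize(("who", "works", "akt"))
q2 = normalize(("researchers", "involved in", "akt project"))
```

Since `q1 == q2`, both questions would be processed by the same triple category, regardless of the wh-word used to ask them.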

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog’s requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together.

At run time, AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are therefore evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
  Query: Who are the researchers in the semantic web research area?
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉

  Query: What are the phd students working for buddy space?
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term
  Query: Show me the job title of Peter Scott
  Linguistic triple: 〈which-is, job title, peter scott〉
  Ontology triple: 〈which-is, has-job-title, peter-scott〉

  Query: What are the projects of Vanessa?
  Linguistic triple: 〈which-is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
  Query: Are there any projects about semantic web?
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  Query: Where is the Knowledge Media Institute?
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
  Query: Who are the academics?
  Linguistic triple: 〈who/what-is, academics〉
  Ontology triple: 〈who/what-is, academic-staff-member〉

  Query: What is an ontology?
  Linguistic triple: 〈who/what-is, ontology〉
  Ontology triple: 〈who/what-is, ontologies〉

Affirmative/negative
  Query: Is Liliana a phd student in the dip project?
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  Query: Has Martin Dzbor any interest in ontologies?
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
  Query: Does anybody work in semantic web on the dotkom project?
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

  Query: Tell me all the planet news written by researchers in akt
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
  Query: Which projects on information extraction are sponsored by epsrc?
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
  Query: Is there any publication about magpie in akt?
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
  Query: What are the contact details from the KMi academics in compendium?
  Linguistic triple: 〈which-is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which-is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
  Query: Does anyone has interest in ontologies and is a member of akt?
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
  Query: Which academics work in akt or in dotcom?
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
  Query: Which KMi academics work in the akt project sponsored by epsrc?
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which-is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

  Query: Which KMi academics working in the akt project are sponsored by epsrc?
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  Query: Give me the publications from phd students working in hypermedia
  Linguistic triples: 〈which-is, publications, phd students〉 〈which-is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
  Query: What researchers who work in compendium have interest in hypermedia?
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
  Query: What is the homepage of Peter who has interest in semantic web?
  Linguistic triples: 〈which-is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which-is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

  Query: Which are the projects of academics that are related to the semantic web?
  Linguistic triples: 〈which-is, projects, academics〉 〈which-is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

Note: all examples use the KMi ontology.

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL’02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: The 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004. http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



where “Enrico” is mapped into “Enrico-motta” in the KB, but “ibm” is not found. The answer in this case is an indirect negative answer, namely the place where Enrico Motta is working.

In any non-trivial natural language system it is important to deal with the various sources of ambiguity and the possible ways of treating them. Some sentences are syntactically (structurally) ambiguous and, although general world knowledge does not resolve this ambiguity, within a specific domain it may happen that only one of the interpretations is possible. The key issue here is to determine some constraints derived from the domain knowledge and to apply them in order to resolve ambiguity [14]. Where the ambiguity cannot be resolved by domain knowledge, the only reasonable course of action is to get the user to choose between the alternative readings.

Moreover, since every item in the Onto-Triple is an entry point in the knowledge base or ontology, the items are also clickable, giving the user the possibility to get more information about them. Also, AquaLog scans the answers for words denoting instances that the system “knows about”, i.e., which are represented in the knowledge base, and then adds hyperlinks to these words/phrases, indicating that the user can click on them and navigate through the information. In fact, the RSS is designed to provide justifications for every step of the user interaction. Intermediate results obtained through the process of creating the Onto-Triple are stored in auxiliary classes, which can be accessed within the triple.

Note that the category to which each Onto-Triple belongs can be modified by the RSS during its life cycle, in order to satisfy the appropriate mappings of the triple within the ontology.

In what follows we will show a few examples which are illustrative of the way the RSS works.

Fig. 10. Example of AquaLog in action for basic generic-type queries.

5.1. RSS procedure for basic queries

Here we describe how the RSS deals with basic queries. For example, let’s consider a basic question like “what are the research areas covered by akt?” (see Figs. 10 and 11). The query is classified by the Linguistic Component as a basic generic-type query (one wh-query, represented by a triple formed by an explicit binary relationship between two defined terms), as shown in Section 4.1.1. Then, the first step for the RSS is to identify that “research areas” is actually a “research area” in the target KB and “akt” is a “project”, through the use of string distance metrics and WordNet synonyms if needed. Whenever a successful match is found, the problem becomes one of finding a relation which links “research areas”, or any of its subclasses, to “projects” or any of its superclasses.

By analyzing the taxonomy and relationships in the target KB, AquaLog finds that the only relations between both terms are “addresses-generic-area-of-interest” and “uses-resource”. Using the WordNet lexicon, AquaLog identifies that a synonym for “covered” is “addresses”, and therefore suggests to the user that the question could be interpreted in terms of the relation “addresses-generic-area-of-interest”. Having done this, the answer to the query is provided. It is important to note that, in order to make sense of the triple 〈research areas, covered, akt〉, all the subclasses of the generic term “research areas” need to be considered. To clarify this, let’s consider a case in which we are talking about a generic term “person”: then the possible relation could be defined only for researchers, students or academics, rather than for people in general. Due to the inheritance of relations through the subsumption hierarchy, a similar type of analysis is required for all the super-classes of the concept “projects”.
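The matching step just described can be sketched as follows. This is a minimal illustration, not AquaLog's implementation: AquaLog combines string metrics [13] with WordNet [21], whereas here a hand-coded synonym table and Python's standard difflib stand in for both:

```python
import difflib

# Toy stand-in for the lexical resources AquaLog consults; the synonym
# table below is invented purely for illustration.
SYNONYMS = {"covered": {"addresses", "covers", "treats"}}

def best_relation(user_relation: str, candidate_relations: list) -> str:
    """Rank the candidate ontology relations linking the two matched classes."""
    def score(candidate: str) -> float:
        # Compare the user's relation against each token of the candidate
        # name, e.g. "addresses-generic-area-of-interest" -> "addresses", ...
        tokens = candidate.split("-")
        direct = max(difflib.SequenceMatcher(None, user_relation, t).ratio()
                     for t in tokens)
        synonym = 1.0 if SYNONYMS.get(user_relation, set()) & set(tokens) else 0.0
        return max(direct, synonym)
    return max(candidate_relations, key=score)

# <research areas, covered, akt>: the KB offers two linking relations.
candidates = ["addresses-generic-area-of-interest", "uses-resource"]
print(best_relation("covered", candidates))  # addresses-generic-area-of-interest
```

If no candidate scored above some threshold, a system following this scheme would fall back to asking the user, as the paper describes.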

Whenever multiple relations are possible candidates for interpreting the query, if the ontology does not provide ways to further discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, or eventual aliases or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

Fig. 11. Example of AquaLog additional information screen from Fig. 10 example.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query “who is the secretary in Knowledge Media Inst.?”, as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query and generate the correct logical query. The procedure is as follows. The question “who is the secretary in Knowledge Media Inst.?” is classified as a “basic generic type” by the Linguistic Component. The first step is to identify that KMi is a “research institute”, which is also part of an “organization” in the target KB, and that “who” could be pointing to a “person” (sometimes an “organization”). Similarly to the previous example, the problem becomes one of finding a relation which links a “person” (or any of its subclasses, like “academic”, “students”, etc.) to a “research institute” (or an “organization”). But in this particular case there is also a successful match for “secretary” in the KB, in which “secretary” is not a relation but a subclass of “person”; therefore the triple is re-classified from a generic one, formed by a binary relation “secretary” between the terms “person” and “research institute”, to a generic one in which the relation is unknown or implicit between the terms “secretary” (which is more specific than “person”) and “research institute”.

If we take the example question presented in Ref. [14], “who takes the database course?”, it may be that we are asking for “lecturers who teach databases” or for “students who study databases”. In these cases AquaLog will ask the user to select the appropriate relation among all the possible relations between a generic person (lecturers, students, secretaries, etc.) and a course. Therefore, the translation process from lexical to ontological triples can go ahead without misinterpretations.

Fig. 12. Example of AquaLog in action for relations formed by a concept.
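The re-classification step above can be sketched roughly as follows (an illustration under our own simplifying assumptions; the tiny concept set and the tuple format are invented, not AquaLog's internals):

```python
# If the would-be relation of a triple actually matches a concept in the KB,
# the triple is re-classified: the matched concept becomes the (more specific)
# subject, and the relation is left implicit for the RSS to discover.
CONCEPTS = {"secretary", "person", "research-institute"}

def reclassify(triple):
    subject, relation, obj = triple
    if relation in CONCEPTS:          # e.g. "secretary" is a concept, not a relation
        return (relation, None, obj)  # relation now unknown/implicit
    return triple

# <person, secretary, kmi>  ->  <secretary, ?, kmi>
print(reclassify(("person", "secretary", "kmi")))
```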

Note that some additional lexical features which help in the translation, or which put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by “is”. The latter means that the relation cannot be an “ad hoc” relation between terms, but is a relation of type “subclass-of” between terms related in the taxonomy, as in the query “is John Domingue a PhD student?”.

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation “is the secretary”: the noun “secretary” may correspond to a concept in the target ontology, while, e.g., the relation “is the job title” may correspond to the slot “has-job-title” for the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance “are there any projects in KMi related to natural language?” or “are there any projects from researchers related to natural language?”; and one where the clause is at the end, after the relation, as in “which projects are headed by researchers in akt?” or “what are the contact details for researchers in akt?”. The different classifications, or query types, determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate the term in the sentence to which the last part of the sentence, “related to natural language” (and “in akt” in the previous examples), refers.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query “are there any projects on the semantic web sponsored by epsrc?”, the RSS uses a heuristic which suggests attaching the last part of the sentence, “sponsored by epsrc”, to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case, the class “project”). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query “which planet news stories are written by researchers in akt?” and apply the same heuristic to disambiguate the modifier “in akt”, the RSS will select the non-ground term “researchers” as the correct attachment for “akt”. Therefore, to answer this query, it is first necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, “are written” is translated into “owned-by” or “has-author”.

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 (“Are there any projects on semantic web sponsored by epsrc?”) and Fig. 14 (“Are there any projects by researchers working in AKT?”). For the former, the modifier “sponsored by epsrc” is attached to the query term in the first triple, i.e. “project”. For the latter, the modifier “working in akt” is attached to the second term in the first triple, i.e. “researchers”.
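A rough sketch of this attachment heuristic, under our own simplifying assumptions (the term lists and function name are invented for illustration; real attachment would consult the ontology itself):

```python
# Attach the modifier clause to the closest term that is a class
# (non-ground term) in the ontology, scanning right to left.
CLASSES = {"project", "researcher"}            # non-ground terms (classes)
INSTANCES = {"akt", "epsrc", "semantic-web"}   # ground terms (instances)

def attach_modifier(terms, modifier):
    """terms: query terms in sentence order; returns (head term, modifier)."""
    for term in reversed(terms):               # closest term to the clause first
        if term in CLASSES:
            return (term, modifier)
    return (terms[-1], modifier)               # fall back to the last term

# "are there any projects on the semantic web sponsored by epsrc?"
print(attach_modifier(["project", "semantic-web"], "sponsored by epsrc"))
# "which planet news stories are written by researchers in akt?"
print(attach_modifier(["planet-news", "researcher"], "in akt"))
```

In the first call the instance “semantic-web” is skipped and the clause attaches to the class “project”; in the second the closest class is “researcher”, matching the behaviour described for Figs. 13 and 14.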

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of queries, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let’s consider the query “who are the researchers in akt who have interest in knowledge reuse?”. By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that “akt” is a “project”, and therefore it is not an instance of the classes “persons” or “organizations” or any of their subclasses. Moreover, in this case the relation “researchers” is mapped into a “concept” in the ontology, which is a subclass of “person”, and therefore the second clause “who have an interest in knowledge reuse” is without any doubt assumed to be related to “researchers”. The solution in this case will be a combination of a list of researchers who work for akt and have interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics, and/or by the context or semantics in the ontology, e.g. in the query “which academic works with Peter who has interest in semantic web?”. In this case, since “academic” and “peter” are respectively a subclass and an instance of “person”, the sentence is truly ambiguous, even for a human reader.

n fact it can be understood as a combination of the resultingists of the two questions ldquowhich academic works with peterrdquond ldquowhich academic has an interest in semantic webrdquo or ashe relative query ldquowhich academic works with peter where theeter we are looking for has an interest in the semantic webrdquo Inuch cases user feedback is always required

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publication", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers wrote publications〉 and 〈publications related social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes identifying whether the second clause, "sponsored by epsrc", is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" from the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
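The role of the main verb in deciding the attachment can be sketched as follows. This is an illustrative reconstruction of the heuristic just described, not AquaLog's GATE-based processing; the token lists and helper names are assumptions.

```python
# Sketch of the main-verb attachment heuristic described above.
# Tokenization and term names are simplified assumptions for illustration.

def attachment(tokens, main_verb, modifier, subject, second_term):
    """Return the term a trailing modifier clause attaches to.

    The main (lexical) verb splits the question into a nominal group
    (before it) and a predicate (after it).  If the second term and the
    modifier both fall inside the predicate, the modifier attaches to the
    second term; otherwise it attaches back to the subject of the
    previous clause.
    """
    predicate = tokens[tokens.index(main_verb) + 1:]
    if second_term in predicate and modifier in predicate:
        return second_term
    return subject

# (1) "which academics work in the akt project sponsored by epsrc"
q1 = ["which", "academics", "work", "in", "the", "akt-project",
      "sponsored-by-epsrc"]
print(attachment(q1, "work", "sponsored-by-epsrc", "academics", "akt-project"))
# -> akt-project

# (2) "which academics working in the akt project are sponsored by epsrc"
q2 = ["which", "academics", "working", "in", "the", "akt-project",
      "are-sponsored", "by-epsrc"]
print(attachment(q2, "are-sponsored", "by-epsrc", "academics", "akt-project"))
# -> academics
```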

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB, and may even change with ontologies, e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism. Differences between ontologies need to be handled by specifying this information in a configuration file.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (e.g., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement for obtaining ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.
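The thresholded matching idea can be pictured with a small sketch. AquaLog mainly uses Jaro-based metrics from the API in Ref. [13]; the code below substitutes Python's difflib similarity as a stand-in for those metrics, and the per-task thresholds are invented values, not the system's actual settings.

```python
from difflib import SequenceMatcher

# Illustrative thresholds per matching task; the real system tunes separate
# thresholds for concepts, relations and instances, but these values are invented.
THRESHOLDS = {"concept": 0.85, "relation": 0.75, "instance": 0.90}

def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1] (difflib stand-in for the Jaro metrics)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_mappings(term, ontology_terms, task):
    """Return (ontology term, score) pairs passing the task threshold, best first."""
    scored = [(t, similarity(term, t)) for t in ontology_terms]
    passing = [(t, s) for t, s in scored if s >= THRESHOLDS[task]]
    return sorted(passing, key=lambda ts: ts[1], reverse=True)

print(candidate_mappings("researchers",
                         ["researcher", "research-area", "academic"],
                         "concept"))
```

Note that, exactly as the text observes, no string metric will bridge "researchers" and "academic"; that gap is what the lexicon and the learning mechanism are for.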

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so it can make use of hyponyms and hypernyms. Currently the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed, and it contains the methods which take the Onto-Triple as input and infer the required answer to the user's query. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect) but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking, e.g., in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of people who have an interest in ontologies and of the people who have an interest in knowledge reuse.

(2) Conditional link to a term, for example in "which KMi academics work in the akt project sponsored by eprsc?" the second triple 〈akt project sponsored eprsc〉 must be resolved and the instance representing the "akt project sponsored by eprsc" identified, to get the list of academics required for the first triple 〈KMi academics work akt project〉.

(3) Conditional link to a triple, for instance in "What is the homepage of the researchers working on the semantic web?" the second triple 〈researchers working semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈homepage researchers〉.
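The first of these mechanisms can be pictured with a toy sketch; the knowledge-base stub, its contents and the helper names are invented for the example and are not AquaLog's internals.

```python
# Toy KB stub: triples of the form (subject class, relation, object) mapped
# to the instances answering them.  Contents are invented for illustration.
KB = {
    ("person", "has-interest", "ontologies"): ["anna", "bob"],
    ("person", "has-interest", "knowledge-reuse"): ["bob", "carla"],
}

def resolve(triple):
    """Return the instances the toy KB holds for one triple."""
    return KB.get(triple, [])

def and_or_link(t1, t2, operator="or"):
    """Mechanism (1): fuse (or) or intersect (and) the answers of two triples."""
    a1, a2 = set(resolve(t1)), set(resolve(t2))
    return sorted(a1 | a2) if operator == "or" else sorted(a1 & a2)

# "who has an interest in ontologies or in knowledge reuse"
print(and_or_link(("person", "has-interest", "ontologies"),
                  ("person", "has-interest", "knowledge-reuse")))
# -> ['anna', 'bob', 'carla']
```

The conditional-link mechanisms (2) and (3) differ only in what the resolved second triple feeds into: an instance slot of the first triple, or its answer list.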

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them, in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned it is recorded in a database, together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.


Fig. 17. The learning mechanism architecture.

Moreover, a version space is normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Thus, when a question with a similar context is prompted and the RSS cannot disambiguate the relation name, the database is scanned for matching results. Subsequently these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result we can have a valid onto-triple substitution, which triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification.
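The CheckContext direction can be sketched roughly as follows; the toy hierarchy and the helper names are illustrative assumptions, not the actual GetContext/CheckContext code.

```python
# Toy subclass hierarchy (child -> parent), assumed for illustration.
SUPERCLASS = {"researcher": "academic", "academic": "people",
              "kmi": "organization-unit", "organization-unit": "organization"}

def is_subclass_of(cls, ancestor):
    """True if `cls` equals `ancestor` or inherits from it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUPERCLASS.get(cls)
    return False

def check_context(args, stored_context):
    """CheckContext direction (matching): are both question arguments
    subclasses (or instances) of the stored context classes?"""
    return all(is_subclass_of(a, c) for a, c in zip(args, stored_context))

print(check_context(("researcher", "kmi"), ("people", "organization-unit")))
# -> True
```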

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16), and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology these correspond to two classes, or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.
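The stored record can be pictured as follows; the field and function names are assumptions based on the description above, not AquaLog's actual schema.

```python
from dataclasses import dataclass

# Illustrative layout of one learned jargon mapping (field names assumed).
@dataclass
class LearnedMapping:
    word: str               # the user's jargon term (the lexicon key)
    ontology_relation: str  # the relation it was disambiguated to
    context_args: tuple     # the two arguments delimiting the context
    ontology: str           # name of the (pluggable) ontology
    user: str               # user details, enabling per-user sessions

lexicon = {}

def learn(m: LearnedMapping):
    """Record a user-disambiguated mapping, keyed by the jargon word."""
    lexicon.setdefault(m.word, []).append(m)

learn(LearnedMapping("collaborates", "has-affiliation-to-unit",
                     ("person", "knowledge-media-institute"),
                     "akt-ontology", "user-1"))
```

Keying on the word while keeping the context arguments in the record is what lets two users attach different meanings to the same jargon, as the example in the text requires.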

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, checking whether they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User's disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification. Namely, the question is first recognized and then worked out differently depending on its specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("who/which") needs more processing since it can be mapped to both a person and an organization, or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Combinations of questions are also taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned-lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and who know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn a specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved by the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single, specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether or not to use the LM, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing, and as a result a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities; however, it is quite difficult to devise a sublanguage which is sufficiently expressive. Therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple, or failure to complete the Onto-Triple due to an ontology conceptual failure:
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33, i.e. 30.30%, have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but KB not populated (6 of 19): 31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations in coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [ ]"; (2) questions annotated wrongly, e.g., tokens not recognized as a verb by GATE, as in "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation being selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning; e.g., in "what is John Domingue working on?" we map into the triple 〈organization works-for john-domingue〉 instead of 〈research-area have-interest john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" not being implemented in this version; (4) failure to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab; people could be ranked according to citation impact, formal status in the department, etc. Therefore we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of

92 V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total iterations | 43 | 40 | 16

services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and services failures together account for 37.67% (the total number of failures, including the RSS failures, is 42). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure) or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%) even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g. if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correct translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto


7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query languages SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web, in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com
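The label-based matching described earlier (string metrics plus WordNet synonyms) can be illustrated with a minimal Python sketch. The synonym table and relation labels below are invented for illustration, and `difflib` merely stands in for AquaLog's actual string-metric algorithms:

```python
from difflib import SequenceMatcher

# Hand-coded stand-in for a WordNet synonym lookup (illustrative only).
SYNONYMS = {"contact": {"email", "phone"}}

def similarity(a: str, b: str) -> float:
    """String-metric score between a linguistic label and an ontology label."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_relation(linguistic_rel: str, candidates: list[str]) -> str:
    """Pick the candidate ontology relation whose label scores highest
    against the linguistic relation or any of its synonyms."""
    terms = {linguistic_rel} | SYNONYMS.get(linguistic_rel, set())
    return max(candidates,
               key=lambda c: max(similarity(t, c) for t in terms))

# With a synonym table, "contact" maps to "has-email"; with pure string
# matching it would drift towards "employment-contract-type" instead.
print(best_relation("contact", ["employment-contract-type", "has-email"]))
```

This also illustrates why purely syntactic matching fails on queries like "how can I contact Vanessa?": without semantic knowledge, the highest-scoring label is not necessarily the intended relation.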

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation we introduced a new set of instances of mealcourse (e.g. see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink Riesling)))
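The effect of populating the ontology with such instances can be seen with a toy triple store: once the mealcourse instances exist, a food-and-wine question reduces to a join across a two-triple path. The triples mirror Table 3; the helper functions are our sketch, not AquaLog code:

```python
# Minimal in-memory triple store populated as in Table 3 (identifiers
# lower-cased for simplicity).
TRIPLES = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def objects(subject: str, relation: str) -> set[str]:
    return {o for s, r, o in TRIPLES if s == subject and r == relation}

def subjects(relation: str, obj: str) -> set[str]:
    return {s for s, r, o in TRIPLES if r == relation and o == obj}

def drinks_for(food: str) -> set[str]:
    # Two hops: food <-has-food- mealcourse -has-drink-> wine
    return {w for course in subjects("has-food", food)
              for w in objects(course, "has-drink")}

print(drinks_for("pizza"))   # {'marietta-zinfadel'}
```

The join works precisely because the new instances create a direct path of two relations between the query terms, which is the requirement stated above.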




Fig. 22. OWL example of the wine and food ontology.

not present a complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple, or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure:
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%

AquaLog failures:
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8; however, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g. not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g. "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS



can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal course mealcourse〉, 〈mealcourse has-food nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g. "dessert course"): 〈mealcourse has-drink wine〉. Therefore queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is madefromfruit dessert〉 and 〈dessert pie〉; it fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
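The proposed RSS extension — splitting a failing triple into two triples through a mediating concept — could be sketched as follows. The schema and relation names are simplified assumptions based on the wine ontology discussion above, not AquaLog's actual data structures:

```python
# Toy schema: each entry is (subject-class, relation, object-class).
SCHEMA = {
    ("mealcourse", "has-drink", "wine"),
    ("mealcourse", "has-food", "dessert"),
    ("meal", "course", "mealcourse"),
}

def expand(term_a: str, term_b: str):
    """When no direct relation links term_a and term_b, look for one
    mediating concept relating to both and split the query triple in two.
    Returns the pair of schema triples, or None if no mediator exists."""
    for s1, r1, o1 in SCHEMA:
        if o1 != term_a:
            continue
        for s2, r2, o2 in SCHEMA:
            if s2 == s1 and o2 == term_b:
                return (s1, r1, o1), (s2, r2, o2)
    return None

print(expand("wine", "dessert"))
# -> (('mealcourse', 'has-drink', 'wine'), ('mealcourse', 'has-food', 'dessert'))
```

The rewritten pair of triples can then be answered by the ordinary two-hop mechanism, without the user having to reformulate "dessert" into "dessert course" manually.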

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g. "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?", where AquaLog is trying to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, the lack of appropriate reasoning services or of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog is related to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that hold little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, the genitive ('s), and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e. "is there any [ ] who/that/which [ ]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions

which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers have publications KMi〉 and 〈which is/are related social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈research publications〉, 〈publications KMi〉 and 〈publications related social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
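Such query expansion amounts to searching the ontology schema for a short chain of relations between the two concepts. A sketch under assumed relation names for the person-grant-organization example (this is our illustration, not an implemented AquaLog component):

```python
from collections import deque

# Hypothetical schema edges: concept -> [(relation, neighbouring concept)].
EDGES = {
    "person": [("holds", "grant")],
    "grant": [("funded-by", "organization")],
}

def relation_path(start: str, goal: str, max_hops: int = 2):
    """Breadth-first search over the schema graph for a chain of
    relations (at most max_hops long) linking two concepts."""
    queue = deque([(start, [])])
    while queue:
        concept, path = queue.popleft()
        if concept == goal:
            return path
        if len(path) < max_hops:
            for rel, nxt in EDGES.get(concept, []):
                queue.append((nxt, path + [rel]))
    return None

print(relation_path("person", "organization"))   # ['holds', 'funded-by']
```

A path like this would let the system answer "who collaborates with IBM?" by introducing "grant" as an intermediate query term, exactly the kind of expansion the text describes as future work.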

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e. "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
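A first approximation to such a compound-detection mechanism, assuming a known flat list of ontology class labels and a deliberately crude singulariser, might look like this (our illustration, not a described AquaLog component):

```python
# Assumed ontology class labels; a real system would query the ontology.
ONTOLOGY_CLASSES = {"french", "researcher", "publication", "department"}

def normalise(word: str) -> str:
    # Crude singulariser for the sketch: "researchers" -> "researcher".
    return word.lower().rstrip("s")

def split_compound(term: str):
    """Return the ontology classes hidden inside a multi-word term
    (signalling a nominal compound), or None for a simple term."""
    matches = [normalise(w) for w in term.split()
               if normalise(w) in ONTOLOGY_CLASSES]
    return matches if len(matches) > 1 else None

print(split_compound("french researchers"))   # ['french', 'researcher']
print(split_compound("researchers"))          # None
```

Detecting two class matches inside one term is what would trigger the dynamic creation of an extra triple for the implicit relation between them.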

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is

the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
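The agreement heuristic described above (and explicitly not implemented in AquaLog) can be sketched with a toy rule; the verb test below is a rough morphological approximation of the third person singular, and all names are hypothetical:

```python
def attach_clause(subject: str, subject_is_plural: bool,
                  nearest_noun: str, clause_verb: str) -> str:
    """Decide what a trailing "who ..." clause modifies, using the
    number of the clause verb (rough morphological test)."""
    verb_is_3rd_singular = clause_verb in {"has", "is", "does"} or (
        clause_verb.endswith("s") and not clause_verb.endswith("ss"))
    if subject_is_plural and verb_is_3rd_singular:
        return nearest_noun        # "academics ... Peter who has" -> Peter
    if not subject_is_plural and verb_is_3rd_singular:
        return "ambiguous"         # both candidates are singular
    return subject                 # plural clause verb -> plural subject

# "which academics work with Peter who has interests in the semantic web?"
print(attach_clause("academics", True, "Peter", "has"))   # Peter
```

As the text notes, such a rule buys precision only when sentences are well-formed; with the syntactically incorrect questions observed in the evaluation, falling back to user interaction is the more robust choice.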


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge, as, for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a

PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
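One way to realize such community-dependent mappings is to key the learned jargon by community as well as phrase; the structure below is our sketch (with invented names), not AquaLog's actual learning mechanism:

```python
# Learned jargon, scoped per user community: the same phrase can map
# to different ontology terms depending on who asks.
jargon: dict[tuple[str, str], list[str]] = {}

def learn(community: str, phrase: str, mapping: list[str]) -> None:
    jargon[(community, phrase)] = mapping

def interpret(community: str, phrase: str):
    # Returns None when no mapping is known yet, i.e. ask the user.
    return jargon.get((community, phrase))

learn("phd-students", "cool projects", ["research-area", "publications"])
learn("professors", "cool projects", ["funding-opportunity"])

print(interpret("professors", "cool projects"))   # ['funding-opportunity']
```

Scoping the lookup key by community is what keeps the professor's "cool projects" (funding opportunities) from overwriting the PhD students' reading of the same phrase.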

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors. This may help the user reformulate the question about “coolness” in a way the system can support.
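A minimal sketch of such a fallback, assuming the ontology is available as a labelled graph (the toy relations below are invented for illustration):

```python
# Hypothetical sketch of the fallback idea: when a question cannot be
# classified, search for a path in the ontology graph between the
# recognized terms. The toy graph below is illustrative only.
from collections import deque

# ontology as adjacency list: concept -> [(relation, concept), ...]
GRAPH = {
    "professor": [("works-on", "project")],
    "project":   [("has-member", "phd-student")],
}

def find_path(start, goal):
    """Breadth-first search returning the relation path start -> goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in GRAPH.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None

# relate the two terms recognized in "cool projects for a professor"
print(find_path("professor", "phd-student"))
```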

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog’s linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and “openness” is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
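The resource-selection step could, for instance, rely on an inverted index from labels to ontologies. The following sketch illustrates the idea with invented toy data and makes no claim about PowerAqua’s actual design:

```python
# Sketch of run-time resource selection: index ontology labels in advance
# so that candidate ontologies for a query can be found without scanning
# each one. Data and names are assumptions for illustration only.
from collections import defaultdict

ONTOLOGIES = {
    "academia": ["professor", "project", "publication"],
    "geography": ["city", "country", "capital"],
}

def build_index(ontologies):
    index = defaultdict(set)
    for name, labels in ontologies.items():
        for label in labels:
            index[label].add(name)
    return index

def candidate_ontologies(query_terms, index):
    """Return every ontology containing at least one query term."""
    hits = set()
    for term in query_terms:
        hits |= index.get(term.lower(), set())
    return hits

index = build_index(ONTOLOGIES)
print(candidate_ontologies(["capital", "Italy"], index))  # {'geography'}
```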

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers and, as pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to


what extent AquaLog’s novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus, they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user’s request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle “what is the capital of Italy?”, “print the capital of Italy”, “Could you please tell me the capital of Italy?”. The shallowness of the pattern-matching approach would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me”, then instead of giving a failure it could get the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.
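A minimal sketch of the shallow rule described by Androutsopoulos, together with the triple-based fallback proposed above; the data and function names are illustrative, not an actual implementation:

```python
# Sketch of the pattern-matching rule from Ref. [3] plus the triple-based
# fallback suggested for the next AquaLog version. Illustrative only.
import re

CAPITALS = {"italy": "Rome", "france": "Paris"}  # stand-in knowledge base

def pattern_rule(question):
    """If 'capital of <country>' appears, return the matching capital."""
    match = re.search(r"capital\s+of\s+(\w+)", question, re.IGNORECASE)
    if match and match.group(1).lower() in CAPITALS:
        return CAPITALS[match.group(1).lower()]
    return None

def fallback_triple(question):
    """Ignore unparsed structure; keep the known terms and build a triple
    with an unknown relation to be resolved against the ontology."""
    words = re.findall(r"\w+", question.lower())
    terms = [w for w in words if w == "capital" or w in CAPITALS]
    return ("?", terms[0], terms[1]) if len(terms) == 2 else None

assert pattern_rule("Could you please tell me the capital of Italy?") == "Rome"
assert fallback_triple("the capital ... Italy?") == ("?", "capital", "italy")
```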

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN’s PARLANCE, IBM’s LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user’s question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user’s question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user’s query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, such as “who” being the same as “person”, are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.
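The domain-dependent parameters mentioned above might, for instance, be declared as follows; the element and attribute names are hypothetical and do not reproduce AquaLog’s actual configuration schema.

```xml
<!-- Hypothetical sketch: not AquaLog's real configuration format -->
<configuration>
  <question-word-mappings>
    <mapping word="who"   class="person"/>
    <mapping word="where" class="location"/>
  </question-word-mappings>
</configuration>
```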

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user’s query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE because the word “neighborhood” is unknown. However, AquaLog would not necessarily know the term “neighborhood”, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
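The behaviour described here can be sketched as follows, assuming a toy ontology in which the class city defines a handful of relations (all names invented for illustration):

```python
# Sketch: when a relation word such as "neighborhood" is unknown, fall
# back to the relations the ontology defines for the class of the known
# term. Toy data; not AquaLog's actual data structures.
ONTOLOGY = {
    "city": ["has-district", "has-population", "located-in"],
}
INSTANCES = {"chicago": "city"}

def candidate_relations(unknown_word, known_instance):
    """Ignore the unknown word; list relations defined for the
    class of the recognized instance."""
    cls = INSTANCES.get(known_instance.lower())
    return ONTOLOGY.get(cls, [])

# "what are some of the neighborhoods of Chicago?"
print(candidate_relations("neighborhood", "Chicago"))
```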

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “Identifying the semantic type of the entity sought by the question”; (2) “Determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week”, it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA”

where point (iv) refers to the extraction of multiple relationships between entities and general event information like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.’s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by “what is”, “who”, “where”, “tell me”, etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
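The triple-based classification can be illustrated with a minimal sketch; the normalization rules and field names are assumptions rather than AquaLog’s real categories, and the pair of equivalent questions (“Who works in akt?” / “Which are the researchers involved in the akt project?”) is the one used in this paper.

```python
# Illustrative sketch: classifying questions by the triple they generate
# rather than by the answer target. Field names are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    generic_term: str   # the ground element whose instances form the answer
    relation: str
    argument: str

def to_triple(question):
    """Toy normalizer: equivalent formulations map onto the same triple."""
    q = question.lower().rstrip("?")
    if q in ("who works in akt",
             "which are the researchers involved in the akt project"):
        return Triple("person", "works-in", "akt")
    return None

# different surface forms, same triple category
assert to_triple("Who works in akt?") == to_triple(
    "Which are the researchers involved in the akt project?")
```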

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question “what do penguins eat?” as food, because “it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding”. All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that “in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label”. Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

“the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches.”

In AquaLog, the vocabulary problem (the ontology triples mapping) is solved not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the


web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog’s relational data model (triple-based) is somehow similar to the approach adopted by START, called “object-property-value”. The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], “what languages are spoken in Guernsey?”, for START the property is “languages” between the Object “Guernsey” and the value “French”; for AquaLog it will be translated into a relation “are spoken” between a term “language” and a location “Guernsey”.

The system described in Litkowski [36], called DIMAP, extracts “semantic relation triples” after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that “characterizes the entity’s role in the sentence”, and a governing word, which is “the word in the sentence that the discourse entity stood in relation to”. The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for a discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a “coarse-grained question taxonomy” consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: “who”, “when”, “where”. In other cases additional information is needed. For example, with “how” questions the


category is found from the adjective following “how” (“how many” or “how much”) for quantity, “how long” or “how old” for time, etc. The type of a “what ⟨noun⟩” question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form “what ⟨verb⟩” have the same answer type as the object of the verb. “What is” questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, “it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense”. Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an “is-a” sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example “man first walked on the moon in 1969” is presented, in which “1969” depends on “in”, which in turn depends on “moon”. Attardi et al. propose short-circuiting the “in” by adding a direct link between “moon” and “1969”, following a rule that says that “whenever two relevant nodes are linked through an irrelevant one, a link is added between them”.
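The link-adding rule quoted above can be sketched as a small graph rewrite; the edge representation and the relevance test are illustrative assumptions, not PiQASso’s actual implementation:

```python
# Sketch of the rule from Attardi et al. [5]: whenever two relevant nodes
# are linked through an irrelevant one, add a direct link between them.
def short_circuit(edges, relevant):
    """edges: set of (head, dependent) pairs; returns edges plus added links."""
    added = set(edges)
    for a, mid in edges:
        for m2, b in edges:
            if mid == m2 and mid not in relevant \
                    and a in relevant and b in relevant:
                added.add((a, b))
    return added

# "man first walked on the moon in 1969": "1969" depends on "in",
# which depends on "moon"; the rule adds a direct moon -> 1969 link.
edges = {("moon", "in"), ("in", "1969")}
print(short_circuit(edges, relevant={"moon", "1969"}))
```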

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the “explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web”. As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a “federation” of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: “since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations”.

The knowledge-based approach described in Ref. [12] is to “augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users’ questions which are outside the scope of the prewritten text”. It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, “the grammar is to a large extent general purpose; it can be used for other domains and for other applications”), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge


Table example: the query “Name the planet stories that are related to akt” is translated into the linguistic triple ⟨planet stories, related to, akt⟩ and the ontology-compliant triple ⟨kmi-planet-news-item, mentions-project, akt⟩.


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as “what are the homepages of researchers who have an interest in the Semantic Web?”, we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term “researchers” on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], “open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation”.

Both open-domain QA systems and AquaLog classifyueries in AquaLog based on the triple format in open domainased on the expected answer (person location) or hierar-hies of question types based on types of answer sought (whathy who how where ) The AquaLog classification includesot only information about the answer expected (a list ofnstances an assertion etc) but also depending on the tripleshey generate (number of triples explicit versus implicit rela-ionships or query terms) and how they should be resolvedt groups different questions that can be presented by equiv-lent triples For instance the query ldquoWho works in aktrdquo ishe same as ldquoWhich are the researchers involved in the aktrojectrdquo
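The triple-based classification sketched above can be illustrated with a toy example. This is a hypothetical Python sketch, not AquaLog's actual code; the category names and the `Triple` structure are illustrative simplifications.

```python
# Toy sketch of triple-based query classification (hypothetical names and
# categories): questions are grouped by the intermediate triples they
# generate, not by the expected answer type as in open-domain systems.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    term1: str
    relation: str  # "?" marks an implicit/unknown relation
    term2: str

def classify(triples):
    """Assign a query type from the triples a question generates."""
    if len(triples) > 1:
        return "wh-combination"
    return "wh-generic" if triples[0].relation != "?" else "wh-unknown-relation"

# "Who works in akt?" and "Which are the researchers involved in the akt
# project?" generate equivalent triples, so both fall into the same class:
assert classify([Triple("person/organization", "works", "akt")]) == "wh-generic"
assert classify([Triple("researchers", "involved", "akt project")]) == "wh-generic"
# "Are there any projects about semantic web?" has an implicit relation:
assert classify([Triple("projects", "?", "semantic web")]) == "wh-unknown-relation"
```

The point of the sketch is only that linguistically different questions collapse to the same triple signature, so one resolution strategy serves them all.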

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together. At run time, AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term

  "Who are the researchers in the semantic web research area?"
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉

  "What are the phd students working for buddy space?"
  Linguistic triple: 〈phd students, working, buddyspace〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Appendix A (Continued)

Wh-unknown term

  "Show me the job title of Peter Scott"
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉

  "What are the projects of Vanessa?"
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation

  "Are there any projects about semantic web?"
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  "Where is the Knowledge Media Institute?"
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description

  "Who are the academics?"
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉

  "What is an ontology?"
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative

  "Is Liliana a phd student in the dip project?"
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  "Has Martin Dzbor any interest in ontologies?"
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms

  "Does anybody work in semantic web on the dotkom project?"
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

  "Tell me all the planet news written by researchers in akt"
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)

  "Which projects on information extraction are sponsored by eprsc?"
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)

  "Is there any publication about magpie in akt?"
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause)

  "What are the contact details from the KMi academics in compendium?"
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)

  "Does anyone has interest in ontologies and is a member of akt?"
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)

  "Which academics work in akt or in dotcom?"
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned

  "Which KMi academics work in the akt project sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

  "Which KMi academics working in the akt project are sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  "Give me the publications from phd students working in hypermedia"
  Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause

  "What researchers who work in compendium have interest in hypermedia?"
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

Wh-patterns

  "What is the homepage of Peter who has interest in semantic web?"
  Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

  "Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.
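The pattern the table illustrates, in which one linguistic triple may fan out into several candidate ontology triples, can be represented compactly. The following sketch uses names from the KMi rows above; the dictionary and function are made-up illustrative structures, not AquaLog's internals.

```python
# Illustrative sketch: one linguistic triple from the parser may map to
# several candidate ontology triples, as in the appendix rows above.
mappings = {
    ("phd students", "working", "buddyspace"): [
        ("phd-student", "has-project-member", "buddyspace"),
        ("phd-student", "has-project-leader", "buddyspace"),
    ],
    ("projects", None, "semantic web"): [  # None = unknown/implicit relation
        ("project", "addresses-generic-area-of-interest", "semantic-web-area"),
    ],
}

def ontology_triples(linguistic_triple):
    """Return the candidate ontology triples for a linguistic triple."""
    return mappings.get(linguistic_triple, [])

# An ambiguous relation such as "working" yields more than one candidate;
# the answer is drawn from the union of the bindings of each candidate.
assert len(ontology_triples(("phd students", "working", "buddyspace"))) == 2
```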

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327-330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29-81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13-16, 2001, pp. 599-607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43-51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225-249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, second ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275-300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306-315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1-19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157-166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10-22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004. http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12-15, 2003, pp. 149-157.
[47] RDF. http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC), NIST Special Publication, 2003, pp. 255-500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24-27, 2002.


Fig. 11. Example of AquaLog additional information screen from the Fig. 10 example.

To discriminate between them, string matching is used to determine the most likely candidate, using the relation name, the learning mechanism, eventual aliases, or synonyms provided by lexical resources such as WordNet [21]. If no relations are found by using these methods, then the user is asked to choose from the current list of candidates.

A typical situation the RSS has to cope with is one in which the structure of the intermediate query does not match the way the information is represented in the ontology. For instance, the query "who is the secretary in Knowledge Media Inst?", as shown in Fig. 12, may be parsed into 〈person, secretary, kmi〉 following purely linguistic criteria, while the ontology may be organized in terms of 〈secretary, works-in-unit, kmi〉. In these cases the RSS is able to reason about the mismatch, re-classify the intermediate query, and generate the correct logical query. The procedure is as follows. The question "who is the secretary in Knowledge Media Inst?" is classified as a "basic generic type" by the Linguistic Component. The first step is to identify that KMi is a "research institute", which is also part of an "organization" in the target KB, and that "who" could be pointing to a "person" (sometimes an "organization"). Similarly to the previous example, the problem becomes one of finding a relation which links a "person" (or any of its subclasses, like "academic", "students", ...) to a "research institute" (or an "organization"). But in this particular case there is also a successful match for "secretary" in the KB, in which "secretary" is not a relation but a subclass of "person"; therefore the triple is re-classified from a generic one, formed by a binary relation "secretary" between the terms "person" and "research institute", to a generic one in which the relation is unknown or implicit between the terms "secretary" (which is more specific than "person") and "research institute".

Fig. 12. Example of AquaLog in action for relations formed by a concept.

If we take the example question presented in Ref. [14], "who takes the database course?", it may be that we are asking for "lecturers who teach databases" or for "students who study databases". In these cases AquaLog will ask the user to select the appropriate relation among all the possible relations between a generic person (lecturers, students, secretaries, ...) and a course. Therefore the translation process from lexical to ontological triples can go ahead without misinterpretations.
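The re-classification step described above can be sketched as follows. The KB tables and the function are hypothetical illustrations, not AquaLog's internals.

```python
# Sketch of RSS re-classification: if the would-be relation of a triple
# ("secretary") actually matches a concept in the KB rather than a relation,
# the triple is re-classified so the relation becomes implicit between the
# more specific concept and the remaining term. All names are illustrative.
kb_concepts = {"secretary": "person", "researcher": "person"}  # concept -> superclass
kb_relations = {"works-in-unit", "has-research-interest"}

def reclassify(term1, relation, term2):
    if relation in kb_concepts and relation not in kb_relations:
        # "?" marks an implicit relation, to be discovered in the ontology
        return (relation, "?", term2)
    return (term1, relation, term2)

# "who is the secretary in Knowledge Media Inst?" is parsed as
# <person, secretary, kmi> and re-classified to <secretary, ?, kmi>:
assert reclassify("person", "secretary", "kmi") == ("secretary", "?", "kmi")
```

The implicit relation left in the re-classified triple is then resolved against the ontology, e.g. to works-in-unit in the example above.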

Note that some additional lexical features which help in the translation, or put a restriction on the answer presented in the Onto-Triple, are whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" of the concept person.

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end, after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (and "in akt" in the previous examples), refers to.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case, the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to semantic web and a list of projects sponsored by epsrc (Fig. 13).

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, it is first necessary to get the list of researchers related to akt and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple; for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any project on semantic web sponsored by eprsc") and Fig. 14 ("Are there any project by researcher working in AKT"). For the former, the modifier "sponsored by eprsc" is attached to the query term in the first triple, i.e. "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e. "researchers".

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of query, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let us consider the query "who are the researchers in akt who have interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in such a way that it can determine that "akt" is a "project", and therefore it is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a "concept" in the ontology, which is a subclass of "person", and therefore the second clause "who have an interest in knowledge reuse" is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g. in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous, even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects", the second term "publication" is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc"; (2) "which academics working in the akt project are sponsored by epsrc". In both examples the second term "akt project" is an instance, so the problem becomes to identify whether the second clause "sponsored by epsrc" is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" and the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
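As a toy illustration of this main-verb heuristic (ours, not AquaLog's GATE-based implementation), attachment can be decided by comparing the positions of the main verb and of the candidate term:

```python
def attach_second_clause(tokens, main_verb, second_term, subject):
    """Attach the trailing clause to `second_term` when the main verb
    precedes it (i.e., the term falls inside the predicate); otherwise
    attach it to the subject of the previous clause."""
    verb_pos = tokens.index(main_verb)
    term_pos = tokens.index(second_term)
    return second_term if verb_pos < term_pos else subject

# In q1 "work" precedes "project", so the clause attaches to the akt project;
# in q2 the main verb "are" (sponsored) follows it, so it attaches to "academics".
q1 = "which academics work in the akt project sponsored by epsrc".split()
q2 = "which academics working in the akt project are sponsored by epsrc".split()
```

Here `attach_second_clause(q1, "work", "project", "academics")` yields "project", while the same call on q2 with main verb "are" yields "academics", mirroring the two readings discussed above.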

5.4. Class similarity service

String algorithms are used to find mappings between the ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on String Distance Metrics for Name-Matching Tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concepts, names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change with ontologies, e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism. Differences between ontologies need to be handled by specifying this information in a configuration file.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in its handling of nominal compounds (e.g., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually, or it can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so that it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.
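The threshold-based lexical matching described above can be sketched as follows. This is a minimal illustration of ours, using Python's difflib ratio as a stand-in for the Jaro-based metrics of the SecondString API [13]; the threshold value is invented:

```python
from difflib import SequenceMatcher

def best_mapping(term, ontology_terms, threshold=0.6):
    """Return the ontology term whose label is most similar to `term`,
    or None when no candidate clears the task-specific threshold."""
    def similarity(a, b):
        # Stand-in metric; AquaLog mainly uses Jaro-based metrics [13].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    score, best = max((similarity(term, t), t) for t in ontology_terms)
    return best if score >= threshold else None
```

With a permissive threshold, `best_mapping("researcher", ["research-staff-member", "publication"])` returns "research-staff-member"; a stricter threshold makes the same call return None, which is when WordNet synonyms or the learning mechanism take over.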

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take the Onto-Triple as input and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triples' information to generate an answer. These mechanisms are:

(1) And/or linking, e.g., in the query "who has an interest in ontologies or in knowledge reuse" the result will be a fusion of the instances of the people who have an interest in ontologies and the people who are interested in knowledge reuse.

(2) Conditional link to a term, for example in "which KMi academics work in the akt project sponsored by epsrc", the second triple 〈akt project, sponsored, epsrc〉 must be resolved and the instance representing the "akt project sponsored by epsrc" identified, in order to get the list of academics required for the first triple 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple, for instance in "What is the homepage of the researchers working on the semantic web", the second triple 〈researchers, working, semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈?, homepage, researchers〉.

The solution is converted into English using simple template techniques before presenting it to the user. Future AquaLog versions will be integrated with a natural language generation tool.
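The three linking mechanisms can be sketched operationally as follows. This is a toy model of ours: `resolve` stands in for the inference engine, and the knowledge base content is invented:

```python
# Toy knowledge base mapping triples to answer sets (invented data).
kb = {
    ("people", "interest", "ontologies"): {"anna"},
    ("people", "interest", "knowledge-reuse"): {"bob"},
    ("project", "sponsored", "epsrc"): {"akt"},
    ("academics", "work", "akt"): {"carol"},
}

def resolve(triple):
    return kb.get(triple, set())

def and_or_link(triple_a, triple_b):
    """Mechanism (1): fuse the answer sets of the two triples."""
    return resolve(triple_a) | resolve(triple_b)

def conditional_link(first, second, slot):
    """Mechanisms (2)/(3): resolve `second` first, then substitute each
    of its answers into the open `slot` of `first`."""
    answers = set()
    for value in resolve(second):
        answers |= resolve(tuple(value if part == slot else part for part in first))
    return answers
```

Here `and_or_link` fuses the two instance sets, while `conditional_link(("academics", "work", "?"), ("project", "sponsored", "epsrc"), "?")` resolves the sponsored project before retrieving its academics.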

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally express with a number of different formulations: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and to decide whether to learn a new mapping between his/her natural-language-universe and the reduced ontology-language-universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user-refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context-definition and provide additional personalization features.

Fig. 17. The learning mechanism architecture.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation-name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.
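The matching step just described might look like the following sketch; the record fields and helper names are illustrative, not AquaLog's actual API:

```python
from types import SimpleNamespace

# Hypothetical sketch of the matching method's control flow.
def match_relation(word, arguments, db, check_context):
    """Return a learned ontology relation for `word` if some stored
    record's context is consistent with the question's arguments,
    else None (meaning user disambiguation is needed)."""
    for record in db.get(word, []):
        if check_context(arguments, record.context):
            return record.relation   # valid onto-triple substitution
    return None

# Toy database: "collaborates" was learned with context (person, organization-unit).
db = {"collaborates": [SimpleNamespace(relation="has-affiliation-to-unit",
                                       context=("person", "organization-unit"))]}
same = lambda args, ctx: args == ctx   # stand-in for the CheckContext method
```

With the toy database above, a question whose arguments match the stored context gets the learned relation back; a different context, or an unknown word, falls through to user disambiguation.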

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
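The opposite movements of the two methods can be caricatured over a toy class hierarchy; `superclass` and `relations` here are invented stand-ins for the ontology, not AquaLog's actual data structures:

```python
# Toy hierarchy: phd-student < researcher < people; "works-for" is
# defined for researcher and people (invented data).
superclass = {"phd-student": "researcher", "researcher": "people"}
relations = {"people": {"works-for"}, "researcher": {"works-for"}}

def get_context(cls, relation):
    """Learning: climb to the highest superclass that still handles `relation`."""
    while superclass.get(cls) and relation in relations.get(superclass[cls], set()):
        cls = superclass[cls]
    return cls

def check_context(cls, stored_cls):
    """Matching: is `cls` equal to, or a subclass of, the stored context class?"""
    while cls is not None:
        if cls == stored_cls:
            return True
        cls = superclass.get(cls)
    return False
```

So learning a mapping for a "phd-student" argument stores the generalized context "people", and any later question whose argument is a subclass of "people" passes the check.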

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let us consider the question "Who collaborates with the knowledge media institute" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Now let us imagine a user who asks the system the same question, "who collaborates with the knowledge media institute", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
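The shape of such a record might be sketched as follows; the field names and values are illustrative, not AquaLog's actual database schema:

```python
from dataclasses import dataclass

# Hypothetical shape of a learned-mapping record (ours, for illustration).
@dataclass(frozen=True)
class LearnedMapping:
    word: str                # key field used by the matching method
    ontology_relation: str   # the relation the user selected
    argument1: str           # first context argument (class or instance)
    argument2: str           # second context argument
    ontology: str            # name of the referring ontology
    user: str                # user details for the session

record = LearnedMapping("collaborates", "has-affiliation-to-unit",
                        "person", "knowledge-media-institute",
                        "akt-reference-ontology", "generic")
```

The `word` field acts as the lookup key during matching, while the remaining fields encode the constraints that delimit the version space.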

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the edinburgh department of informatics", we would not get an appropriate matching, even though the mapping makes sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, checking whether they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification of its own but relies on the RSS classification. Namely, the question is first recognized and then worked out differently depending on its specific features.

Fig. 18. User's disambiguation example.

For example, we can have a generic-question learning ("who is involved in the AKT project"), where the first generic argument ("who/which") needs more processing, since it can be mapped to both a person and an organization; or an unknown-relation learning ("is there any project about the semantic web"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is instead recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it could extend the same mappings to all the other registered PhD students.

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved through the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user's help to disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether or not to use the LM, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore, the knowledge base is dynamically populated and constantly changing, and as a result a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions. A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure.

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, when it fails to complete the Onto-Triple, the failure is due to an ontology conceptual failure):
  Total number of successful queries (40 of 69): 58% (of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures (the same query can fail for more than one reason):
  Total number of failures (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of the failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure, but ontology not populated (6 of 19): 31.57%

Bold values represent the final results (%).

Therefore, only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology, and to how it is populated, only 33.33% (23 of 69) will produce a result for the user (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so that the limitations in one ontology can be overcome by another ontology.

212 Analysis of AquaLog failures The identified AquaLogimitations are explained in Section 8 However here we presenthe common failures we encountered during our evaluation Fol-owing our classification

Linguistic failures accounted for 2318 of the total numberof questions (16 of 69) This was by far the most commonproblem In fact we expected this considering the few lin-guistic restrictions imposed on the evaluation compared tothe complexity of natural language Errors were due to (1)questions not recognized or classified like when a multiplerelation is used eg in ldquowhat are the challenges and objec-tives [ ]rdquo (2) questions annotated wrongly like tokens not

recognized as a verb by GATE eg ldquopresent resultsrdquo andtherefore relations annotated wrongly or because of the useof parenthesis in the middle of the query or special names likeldquodotkomrdquo or numbers like a year eg ldquoin 1995rdquo


• Data model failure. Intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels (through the use of string metrics and WordNet synonyms), not by the meaning, e.g. in "what is John Domingue working on" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa", where, due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures.


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)     (40 iterations)                     (16 iterations)

Three kinds of services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a future line of work for future versions of the system.

7.2.1.3. Reformulation of questions
As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and services failures together account for 37.67% of failures (the total number of failures, including the RSS failures, is 42). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance
As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44% of queries. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and among all the other "Peter's" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the 1st iteration of the LM improves if there is an increment in the number of queries.
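A toy sketch of such a learning mechanism (purely illustrative; the real LM, described in Section 6, is more sophisticated): learned relation mappings are stored under a context key, here simply the pair of ontology classes involved, so a later query over the same classes can reuse an earlier disambiguation even when the wording differs. All names below are invented for the example.

```python
class LearningMechanism:
    """Memoizes user-confirmed relation choices, keyed by context."""

    def __init__(self):
        self.learned = {}  # context -> {linguistic word: ontology relation}

    def record(self, word, context, relation):
        self.learned.setdefault(context, {})[word] = relation

    def suggest(self, word, context):
        mappings = self.learned.get(context, {})
        if word in mappings:        # identical query wording seen before
            return mappings[word]
        if mappings:                # similar but not identical query:
            # same class context, different wording -> propose a learned relation
            return next(iter(mappings.values()))
        return None                 # no help available; ask the user

lm = LearningMechanism()
ctx = ("person", "project")                     # classes of the query terms
lm.record("works", ctx, "has-project-member")   # learned from a user iteration
print(lm.suggest("collaborates", ctx))          # reused for a new wording
```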

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correct translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto


7.3.2. Conclusions and discussion

7.3.2.1. Results
We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does


the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintage-year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user in respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology
In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web, in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
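The "two relations or less" requirement can be phrased as a reachability check over the graph of triples. A minimal sketch (illustrative code, not AquaLog's; treating relations as traversable in both directions is an assumption made for the example, and the triples mirror the kind of mealcourse instances introduced for the evaluation, cf. Table 3):

```python
from collections import defaultdict

def within_two_relations(triples, source, target):
    """True if target is reachable from source by following at most
    two relations in the directed labeled graph of triples."""
    graph = defaultdict(set)
    for subj, _relation, obj in triples:
        graph[subj].add(obj)
        graph[obj].add(subj)  # sketch: relations traversable either way
    one_hop = graph[source]
    two_hop = set().union(*(graph[n] for n in one_hop)) if one_hop else set()
    return target in one_hop | two_hop

triples = [
    ("new-course7", "hasfood", "pizza"),
    ("new-course7", "hasdrink", "mariettazinfadel"),
]
# pizza -> new-course7 -> mariettazinfadel: a two-relation path exists
print(within_two_relations(triples, "pizza", "mariettazinfadel"))
```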

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g. see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))



Fig. 22. OWL example of the wine and food ontology.

not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple, or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure:
  Total number of successful queries: 69.11% (47 of 68)
  Total number of successful queries without conceptual failure: 17.64% (12 of 68)
  Total number of queries with only conceptual failure: 51.47% (35 of 68)

AquaLog failures:
  Total number of failures (same query can fail for more than one reason): 30.88% (21 of 68)
  RSS failure: 22% (15 of 68)
  Service failure: 0% (0 of 68)
  Data model failure: 0% (0 of 68)
  Linguistic failure: 13.23% (9 of 68)

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: 64.28% (9 of 14)

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures
The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g. not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g. "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system, and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions
All the questions that fail due to only RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to be able to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of the possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal course mealcourse〉, 〈mealcourse has-food nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g. "dessert course"): 〈mealcourse has-drink wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is madefromfruit dessert〉 and 〈dessert pie〉; it fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g. "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?" AquaLog is trying to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.
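The proposed extension, dynamically rewriting a failed direct triple into two triples through a mediating concept, could be sketched as follows (an illustration of the idea, not the planned RSS implementation; the relation declarations follow the wine ontology examples above):

```python
def expand_via_mediator(triple, ontology_relations):
    """Rewrite a triple with no direct ontology relation, e.g.
    (wine, served-with, pork), into two triples through a mediating
    concept, if one exists.
    ontology_relations: set of (domain, relation, range) declarations."""
    term1, _ling_rel, term2 = triple
    for d1, r1, range1 in ontology_relations:
        for d2, r2, range2 in ontology_relations:
            # a mediator is a shared domain reaching both query terms
            if d1 == d2 and range1 == term1 and range2 == term2:
                mediator = d1
                return [(mediator, r1, term1), (mediator, r2, term2)]
    return None  # no mediating concept found

relations = {
    ("mealcourse", "has-drink", "wine"),
    ("mealcourse", "has-food", "pork"),
}
print(expand_via_mediator(("wine", "served-with", "pork"), relations))
```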

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, the genitive ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e. "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers have publications KMi〉, 〈which is/are related social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈research(ers) publications〉, 〈publications KMi〉 and 〈publications related social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g. "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
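One possible shape for such a mechanism, shown here as a deliberately naive sketch (the two-word heuristic and the "?implicit-relation" placeholder are invented for illustration; a real component would consult POS tags and the ontology hierarchy):

```python
def split_compound(compound, ontology_terms):
    """If a multi-word term is a combination of known ontology terms,
    emit an implicit triple between them (the relation itself is left
    for the RSS to resolve against the ontology)."""
    words = compound.split()
    if len(words) == 2 and all(w.lower() in ontology_terms for w in words):
        modifier, head = words
        return (head, "?implicit-relation", modifier)
    return None  # not a recognizable compound; treat as a single term

terms = {"french", "researchers", "semantic", "web", "papers"}
print(split_compound("French researchers", terms))
# -> ('researchers', '?implicit-relation', 'French')
```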

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
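The agreement heuristic just described can be captured in a few lines (a simplification for illustration; a real implementation would take the number features from the parser's morphological analysis):

```python
def clause_attachment(subject_number, verb_number):
    """Resolve attachment of a subordinate clause ('who has/have
    interests...') via verb agreement.
    subject_number: 'sg' or 'pl' for the main subject ('academic(s)');
    verb_number:    'sg' or 'pl' for the clause verb ('has'/'have').
    The nearest noun ('Peter') is assumed singular."""
    if subject_number == verb_number:
        # 'which academic works with Peter who has interests...':
        # both readings agree, so the question is truly ambiguous
        return "ambiguous"
    # 'which academics work with Peter who has interests...':
    # a plural subject cannot govern a singular clause verb,
    # so the clause must attach to the nearest (singular) noun
    return "nearest-noun"

print(clause_attachment("pl", "sg"))  # -> 'nearest-noun'
print(clause_attachment("sg", "sg"))  # -> 'ambiguous'
```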


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge, as, for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].
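As a rough illustration, a ranking service reduced to its simplest form might look like this (the criterion name, the project names and the citation counts are invented for the example; the paper deliberately leaves the choice of criterion open):

```python
def rank_projects(triples, criterion="has-citation-count", top=5):
    """Rank KB subjects by a numeric property found in the triples."""
    scores = {s: int(o) for s, r, o in triples if r == criterion}
    return sorted(scores, key=scores.get, reverse=True)[:top]

# Invented example data: (subject, relation, value) triples
kb = [
    ("project-a", "has-citation-count", "120"),
    ("project-b", "has-citation-count", "95"),
    ("project-c", "has-citation-count", "40"),
]
print(rank_projects(kb, top=2))  # -> ['project-a', 'project-b']
```

The hard part, as the text notes, is not the sorting but deciding which ontology relation should play the role of the criterion without hard-wiring domain knowledge.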

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

hD studentrdquo is asked the system would not be able to solve theinguistic ambiguity of the words ldquocool projectrdquo The first timeome help from the user is needed who may select ldquoall projectshich belong to a research area in which there are many publica-

ionsrdquo A mapping is therefore created between ldquocool projectsrdquoresearch areardquo and ldquopublicationsrdquo so that the next time theearning Mechanism is called it will be able to recognize thispecific user jargon

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
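The community-dependent jargon idea amounts to keying learned mappings on the user's community as well as on the phrase. A minimal sketch (all names are illustrative, taken from the scenario above):

```python
# Learned jargon mappings, keyed by (community, phrase) -- illustrative only.
# The same phrase resolves differently depending on the user's community.
jargon = {
    ("phd-students", "cool projects"): ("research-area", "publications"),
    ("professors",   "cool projects"): ("funding-opportunity",),
}

def interpret(community, phrase):
    """Return the ontology terms a community's jargon maps to, if learned."""
    return jargon.get((community, phrase))

print(interpret("phd-students", "cool projects"))
print(interpret("professors", "cool projects"))
```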

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors. This may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e. allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario, it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name; the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
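A minimal sketch of such a pattern-matching rule, assuming a toy capital lookup table, shows both why the rule generalizes across phrasings and why it stays shallow:

```python
import re

# Toy data standing in for a real gazetteer of countries and capitals.
CAPITALS = {"Italy": "Rome", "France": "Paris"}

def answer(request):
    """One rule: 'capital' followed (eventually) by a known country name."""
    match = re.search(r"capital\b.*?\b(" + "|".join(CAPITALS) + r")\b", request)
    if match:
        return CAPITALS[match.group(1)]
    return None  # pattern not recognized: an early NLIDB would simply fail

for q in ("What is the capital of Italy?",
          "Print the capital of Italy",
          "Could you please tell me the capital of France?"):
    print(q, "->", answer(q))
```

The same rule covers all three phrasings, but anything that does not mention "capital" before a known country name falls straight through, which is the brittleness the paragraph above criticizes.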

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Symantec's Q&A, NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.
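The idea of one triple-shaped intermediate representation carried through the whole pipeline can be sketched as follows; the lexicon and relation names are our own assumptions, not AquaLog's vocabulary:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

# Hypothetical similarity service: rewrites user wording into ontology
# terms, turning a linguistic triple into an ontology-compliant one.
LEXICON = {"works in": "has-project-member", "researchers": "researcher"}

def to_ontology_triple(t: Triple) -> Triple:
    return Triple(LEXICON.get(t.subject, t.subject),
                  LEXICON.get(t.relation, t.relation),
                  LEXICON.get(t.obj, t.obj))

linguistic = Triple("researchers", "works in", "akt")  # from the NL front end
print(to_ontology_triple(linguistic))
```

The same data structure flows from the front end to the answer-extraction step, which is what keeps the back end thin and portable.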

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, such as "who" being the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.
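Such a domain-dependent configuration could, for instance, be read from an XML file along these lines; the element and attribute names are hypothetical, since the paper does not specify AquaLog's actual schema:

```python
import xml.etree.ElementTree as ET

# Invented schema illustrating "who" = "person" style parameters.
CONFIG = """
<configuration>
  <synonym word="who" concept="person"/>
  <synonym word="where" concept="location"/>
</configuration>
"""

def load_synonyms(xml_text):
    """Parse word -> ontology concept mappings from the config file."""
    root = ET.fromstring(xml_text)
    return {s.get("word"): s.get("concept") for s in root.iter("synonym")}

print(load_synonyms(CONFIG))  # {'who': 'person', 'where': 'location'}
```

Keeping these mappings in data rather than code is exactly what makes reconfiguring for a new ontology cheap.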

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE, because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
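This behaviour can be sketched as follows: when the relation term is unknown, enumerate the relations defined for the class of the known instance, and let similarity services or the user pick among them. The toy ontology and relation names are invented:

```python
# Toy instance-to-class and class-to-relations tables.
CLASS_OF = {"Chicago": "city"}
RELATIONS = {"city": ["has-district", "has-mayor", "has-population"],
             "person": ["works-for"]}

def candidate_relations(unknown_term, known_instance):
    """Even if unknown_term ("neighborhood") has no match, the relations
    defined for the class of the known instance bound the search space;
    string/WordNet similarity or user feedback then picks among them."""
    cls = CLASS_OF.get(known_instance)
    return RELATIONS.get(cls, [])

print(candidate_relations("neighborhood", "Chicago"))
```

A lexicon-only system like PRECISE has no such fallback: an unknown word ends the interpretation outright.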

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

ldquo(i) IE can provide solid support for QA (ii) Low-level IEis often a necessary component in handling many types ofquestions (iii) A robust natural language shallow parser pro-vides a structural basis for handling questions (iv) High-leveldomain independent IE is expected to bring about a break-through in QArdquo

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
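Grouping differently worded questions under one triple can be sketched as a crude normalization step; the synonym table and string handling below are deliberately simplistic placeholders for AquaLog's linguistic component:

```python
def to_triple(question):
    """Very rough normalization: reduce different surface forms of the
    same query to one (generic-term, relation, instance) triple."""
    q = question.lower().rstrip("?")
    # Toy lexicon: relation phrasings mapped to (ground term, relation).
    synonyms = {"works in": ("person", "works-in"),
                "involved in": ("person", "works-in")}
    for phrase, (ground, relation) in synonyms.items():
        if phrase in q:
            instance = q.split(phrase, 1)[1].strip()
            instance = instance.replace("the ", "").replace(" project", "")
            return (ground, relation, instance)
    return None

# Two surface forms, one triple, hence one processing strategy.
print(to_triple("Who works in akt?"))
print(to_triple("Which are the researchers involved in the akt project?"))
```

Both questions reduce to the same triple, so the system answers them with the same strategy regardless of the wh-word used.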

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open

domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
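The layered strategy above, labels first, then lexically related words, then the user, can be sketched with difflib standing in for AquaLog's string metrics and a small dictionary standing in for WordNet:

```python
import difflib

# Illustrative data: ontology labels and a toy WordNet-like synonym map.
LABELS = ["researcher", "project", "publication"]
LEXICAL = {"academic": "researcher", "paper": "publication"}

def map_term(term, ask_user=lambda term, options: options[0]):
    close = difflib.get_close_matches(term, LABELS, n=1, cutoff=0.8)
    if close:
        return close[0]               # direct or fuzzy label match
    if term in LEXICAL:
        return LEXICAL[term]          # lexically related word
    return ask_user(term, LABELS)     # last resort: user feedback

print(map_term("reseacher"))  # typo still matches the label "researcher"
print(map_term("paper"))      # resolved through the lexical layer
```

Only terms that survive neither layer reach the user, which keeps the interaction cost low.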

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START, the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
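The quoted short-circuiting rule can be expressed directly over a head-dependency map, using toy data for the "moon in 1969" example:

```python
def short_circuit(deps, relevant):
    """deps: dict mapping each node to its head in the dependency tree.
    Adds a direct link whenever two relevant nodes are connected
    through an irrelevant intermediate node."""
    added = {}
    for node, head in deps.items():
        grandhead = deps.get(head)
        if grandhead and node in relevant and grandhead in relevant \
                and head not in relevant:
            added[node] = grandhead
    return added

# "man first walked on the moon in 1969": "1969" -> "in" -> "moon".
deps = {"1969": "in", "in": "moon"}
relevant = {"1969", "moon"}
print(short_circuit(deps, relevant))  # {'1969': 'moon'}
```

The added link makes the moon-1969 relation explicit for the relation matching filter, without touching the rest of the parse.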

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

[Table excerpt: the query "Name the planet stories that are related to akt" yields the linguistic triple 〈planet stories, related to, akt〉 and the ontological triple 〈kmi-planet-news-item, mentions-project, akt〉.]

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format; open domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QAystem on the SW when compared to other kind of QA sys-ems is that it can use the domain knowledge provided by thentology to cope with words apparently not found in the KBnd to cope with ambiguity (mapping vocabulary or modifierttachment)
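The triple-based classification can be illustrated with a minimal sketch (illustrative Python, not AquaLog's actual Java implementation; the lookup table and all names are hypothetical stand-ins for what the linguistic component computes):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryTriple:
    """AquaLog-style <subject, relation, object> intermediate representation."""
    subject: str   # generic term or class, e.g. "person/organization"
    relation: str
    obj: str

def to_triple(question: str) -> QueryTriple:
    # Hypothetical toy mapping; AquaLog derives this via its linguistic
    # component (GATE + JAPE grammars), not via a lookup table.
    table = {
        "who works in akt":
            QueryTriple("person/organization", "works", "akt"),
        "which are the researchers involved in the akt project":
            QueryTriple("person/organization", "works", "akt"),
    }
    return table[question.lower().rstrip("?")]

# Two different surface forms reduce to the same triple, so they can be
# treated as the same query class and resolved identically.
q1 = to_triple("Who works in akt?")
q2 = to_triple("Which are the researchers involved in the akt project?")
assert q1 == q2
```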

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together. At run time, AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are therefore evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (Dot.Kom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. Dot.Kom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
  "Who are the researchers in the semantic web research area?"
    Linguistic triple: 〈person/organization, researchers, semantic web research area〉
    Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
  "What are the phd students working for buddy space?"
    Linguistic triple: 〈phd students, working, buddy space〉
    Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term
  "Show me the job title of Peter Scott"
    Linguistic triple: 〈which is, job title, peter scott〉
    Ontology triple: 〈which is, has-job-title, peter-scott〉
  "What are the projects of Vanessa?"
    Linguistic triple: 〈which is, projects, vanessa〉
    Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
  "Are there any projects about semantic web?"
    Linguistic triple: 〈projects, semantic web〉
    Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
  "Where is the Knowledge Media Institute?"
    Linguistic triple: 〈location, knowledge media institute〉
    Ontology triple: no relations found in the ontology

Description
  "Who are the academics?"
    Linguistic triple: 〈who/what is, academics〉
    Ontology triple: 〈who/what is, academic-staff-member〉
  "What is an ontology?"
    Linguistic triple: 〈who/what is, ontology〉
    Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
  "Is Liliana a phd student in the dip project?"
    Linguistic triple: 〈liliana, phd student, dip project〉
    Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
  "Has Martin Dzbor any interest in ontologies?"
    Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
    Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
  "Does anybody work in semantic web on the dotkom project?"
    Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
    Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
  "Tell me all the planet news written by researchers in akt"
    Linguistic triple: 〈planet news, written, researchers, akt〉
    Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
  "Which projects on information extraction are sponsored by epsrc?"
    Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
    Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
  "Is there any publication about magpie in akt?"
    Linguistic triple: 〈publication, magpie, akt〉
    Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
  "What are the contact details from the KMi academics in compendium?"
    Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
    Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
  "Does anyone has interest in ontologies and is a member of akt?"
    Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
    Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
  "Which academics work in akt or in dotcom?"
    Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
    Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
  "Which KMi academics work in the akt project sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
  "Which KMi academics working in the akt project are sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
  "Give me the publications from phd students working in hypermedia"
    Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
    Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
  "What researchers who work in compendium have interest in hypermedia?"
    Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
    Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
  "What is the homepage of Peter who has interest in semantic web?"
    Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
    Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
  "Which are the projects of academics that are related to the semantic web?"
    Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
    Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


Note that some additional lexical features, which help in the translation or put a restriction on the answer, are represented in the Onto-Triple: whether the relation is passive or reflexive, or whether the relation is formed by "is". The latter means that the relation cannot be an "ad hoc" relation between terms, but is a relation of type "subclass-of" between terms related in the taxonomy, as in the query "is John Domingue a PhD student?".

Also, a particular lexical feature is whether the relation is formed by verbs or by a combination of verbs and nouns. This feature is important because, in the case that the relation contains nouns, the RSS has to make additional checks to map the relation in the ontology. For example, consider the relation "is the secretary": the noun "secretary" may correspond to a concept in the target ontology, while, e.g., the relation "is the job title" may correspond to the slot "has-job-title" for the concept person.
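These checks can be sketched roughly as follows (a toy taxonomy and slot table standing in for the target ontology; all names are hypothetical, and the real RSS consults the actual KB rather than hard-coded dictionaries):

```python
# Toy taxonomy and slot table standing in for the target ontology.
SUBCLASS_OF = {"phd-student": "student", "student": "person"}
SLOTS = {"person": ["has-job-title"]}
CONCEPTS = set(SUBCLASS_OF) | set(SUBCLASS_OF.values()) | {"secretary"}

def is_subclass(sub, sup):
    # Walk the taxonomy upwards from `sub` looking for `sup`.
    while sub in SUBCLASS_OF:
        sub = SUBCLASS_OF[sub]
        if sub == sup:
            return True
    return sub == sup

def interpret_relation(relation_words, domain):
    # A relation formed by "is" alone is taxonomic, not an ad hoc slot:
    # "is John Domingue a PhD student?" -> a subclass-of/instance-of check.
    if relation_words == ["is"]:
        return ("subclass-of", None)
    # Nouns inside the relation may name a concept ("is the secretary")
    # or a slot of the domain concept ("is the job title" -> has-job-title).
    for word in relation_words:
        if word in CONCEPTS:
            return ("concept", word)
        slot = "has-" + word.replace(" ", "-")
        if slot in SLOTS.get(domain, []):
            return ("slot", slot)
    return ("ad-hoc", None)

assert interpret_relation(["is"], "person") == ("subclass-of", None)
assert interpret_relation(["is", "the", "secretary"], "person") == ("concept", "secretary")
assert interpret_relation(["is", "the", "job title"], "person") == ("slot", "has-job-title")
assert is_subclass("phd-student", "person")
```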

5.2. RSS procedure for basic queries with clauses (three-term queries)

When we described the Linguistic Component in Section 4.1.2, we discussed that, in the case of the basic queries in which there is a modifier clause, the triple generated is not in the form of a binary relation but of a ternary relation. That is to say, from one Query-Triple composed of three terms, two Onto-Triples are generated by the RSS. In these cases the ambiguity can also be related to the way the triples are linked.

Depending on the position of the modifier clause (consisting of a preposition plus a term) and the relation within the sentence, we can have two different classifications: one in which the clause is related to the first term by an implicit relation, for instance "are there any projects in KMi related to natural language?" or "are there any projects from researchers related to natural language?", and one where the clause is at the end and after the relation, as in "which projects are headed by researchers in akt?" or "what are the contact details for researchers in akt?". The different classifications or query types determine whether the verb is going to be part of the first triple, as in the latter examples, or part of the second triple, as in the former examples. However, at a linguistic level it is not possible to disambiguate which term in the sentence the last part of the sentence, "related to natural language" (or "in akt" in the other examples), refers to.

Fig. 13. Example of AquaLog in action for queries with a modifier clause in the middle.

The RSS analysis relies on its knowledge of the particular application domain to solve such modifier attachment. Alternatively, the RSS may attempt to resolve the modifier attachment by using heuristics.

For example, to handle the previous query "are there any projects on the semantic web sponsored by epsrc?", the RSS uses a heuristic which suggests attaching the last part of the sentence, "sponsored by epsrc", to the closest term that is represented by a class or non-ground term in the ontology (e.g., in this case the class "project"). Hence the RSS will create the following two Onto-Triples: a generic one, 〈project, sponsored, epsrc〉, and one in which the relation is implicit, 〈project, semantic web〉. The answer is a combination of a list of projects related to the semantic web and a list of projects sponsored by epsrc (Fig. 13).

However, if we consider the query "which planet news stories are written by researchers in akt?" and apply the same heuristic to disambiguate the modifier "in akt", the RSS will select the non-ground term "researchers" as the correct attachment for "akt". Therefore, to answer this query, first it is necessary to get the list of researchers related to akt, and then to get a list of planet news stories for each of the researchers. It is important to notice that a relation in the linguistic triple may be mapped into more than one relation in the Onto-Triple, as, for example, "are written" is translated into "owned-by" or "has-author".

The use of this heuristic to attach the modifier to the non-ground term can be appreciated by comparing the ontology triples created in Fig. 13 ("Are there any projects on semantic web sponsored by epsrc?") and Fig. 14 ("Are there any projects by researchers working in AKT?"). For the former, the modifier "sponsored by epsrc" is attached to the query term in the first triple, i.e., "project". For the latter, the modifier "working in akt" is attached to the second term in the first triple, i.e., "researchers".
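The attachment heuristic can be sketched as follows (a simplified illustration, not AquaLog's code; the class/instance tagging of each term is assumed to come from the ontology lookup): the modifier is attached to the closest preceding term that maps to a class (non-ground term) rather than an instance.

```python
# Terms of the query in sentence order, each tagged by what it maps to in
# the ontology: a "class" (non-ground term) or an "instance" (ground term).
def attach_modifier(terms, modifier):
    """Attach `modifier` to the closest preceding non-ground (class) term."""
    for term, kind in reversed(terms):
        if kind == "class":
            return term
    raise ValueError("no non-ground term to attach %r to" % modifier)

# "Are there any projects on the semantic web sponsored by epsrc?"
q1 = [("projects", "class"), ("semantic web", "instance")]
assert attach_modifier(q1, "sponsored by epsrc") == "projects"

# "Which planet news stories are written by researchers in akt?"
q2 = [("planet news stories", "class"), ("researchers", "class")]
assert attach_modifier(q2, "in akt") == "researchers"
```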

5.3. RSS procedure for combination of queries

Complex queries generate more than one intermediate representation, as shown in the linguistic analysis in Section 4.1.3. Depending on the type of queries, the triples are combined following different criteria.

Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

To clarify the kind of ambiguity we deal with, let us consider the query "who are the researchers in akt who have interest in knowledge reuse?". By checking the ontology or, if needed, by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project" and therefore it is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a "concept" in the ontology, which is a subclass of "person", and therefore the second clause "who have an interest in knowledge reuse" is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have an interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g., in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter?", where the peter we are looking for has an interest in the semantic web. In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term, "publications", is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 and 〈publications, related, social aspects〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term, "akt project", is an instance, so the problem becomes to identify whether the second clause "sponsored by epsrc" is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" and the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc"; consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
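The lexical-verb cue can be sketched as a toy decision rule (an illustration only; the main verb, clause verb and noun groups are assumed to come from the parser, and all argument values are hypothetical):

```python
def attach_trailing_clause(subject, predicate_term, clause_verb, main_verb):
    """Decide what a trailing clause like "sponsored by epsrc" modifies.

    If the sentence's main (lexical) verb is the clause's own verb, the
    clause predicates over the subject noun group; otherwise the main verb
    already separated subject and predicate, and the clause modifies the
    predicate term.
    """
    if clause_verb in main_verb.split():
        return subject
    return predicate_term

# (1) "which academics WORK in the akt project sponsored by epsrc?"
assert attach_trailing_clause("academics", "akt project",
                              "sponsored", "work") == "akt project"
# (2) "which academics working in the akt project ARE SPONSORED by epsrc?"
assert attach_trailing_clause("academics", "akt project",
                              "sponsored", "are sponsored") == "academics"
```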

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

However, the use of string metrics and open-domain lexico-semantic resources [45], such as WordNet, to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (e.g., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in the use of nominal compounds (e.g., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate whether one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so it can make use of hyponyms and hypernyms. Currently, the component of AquaLog which deals with WordNet does not perform any sense disambiguation.

7 Note that there will be no ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web?", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.
8 Naming conventions vary depending on the KR used to represent the KB and may even change with ontologies (e.g., an ontology can have slots such as "variant name" or "pretty name"). AquaLog deals with differences between KRs by means of the plug-in mechanism. Differences between ontologies need to be handled by specifying this information in a configuration file.
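The combination of string similarity and synonym expansion can be sketched as follows (a minimal illustration: `difflib.SequenceMatcher` stands in for the Jaro-based metrics of the CMU API [13], and a hand-written synonym table stands in for WordNet; the threshold value and all names are hypothetical):

```python
from difflib import SequenceMatcher

# Toy synonym table standing in for WordNet lookups.
SYNONYMS = {"researcher": {"investigator", "research worker"}}

def similarity(a, b):
    # Stand-in for the Jaro-based metrics of the CMU string-metrics API.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_term(term, ontology_terms, threshold=0.8):
    """Return ontology terms whose label scores above the task-specific
    threshold against the query term or one of its synonyms."""
    candidates = {term} | SYNONYMS.get(term, set())
    hits = []
    for onto in ontology_terms:
        # Ontology names often use hyphens ("municipal-unit"); normalize.
        label = onto.replace("-", " ")
        score = max(similarity(c, label) for c in candidates)
        if score >= threshold:
            hits.append((onto, round(score, 2)))
    return sorted(hits, key=lambda h: -h[1])

print(match_term("researcher", ["researcher", "academic-staff-member"]))
```

As the paper notes, such a matcher cannot bridge dissimilar labels like "researcher" and "academic" on its own; that case falls through to the learning mechanism and user feedback.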

5 Answer engine

The Answer Engine is a component of the RSS It is invokedhen the Onto-Triple is completed It contains the methodshich take as an input the Onto-Triple and infer the required

nswer to the userrsquos queries In order to provide a coherentnswer the category of each triple tells the answer engine notnly about how the triple must be resolved (or what answer toxpect) but also how triples can be linked with each other Fornstance AquaLog provides three mechanisms (depending onhe triple categories) for operationally integrating the triplersquosnformation to generate an answer These mechanisms are

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?", the result will be a fusion of the instances of people who have an interest in ontologies and the people who are interested in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by eprsc?", the second triple 〈akt project, sponsored, eprsc〉 must be resolved and the instance representing the "akt project sponsored by eprsc" identified, to get the list of academics




Fig. 16. Example of jargon mapping through user-interaction.

required for the first triple 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?", the second triple 〈researchers, working, semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈 homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.
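The three linking mechanisms above can be sketched over a toy knowledge base. The KB contents, names and the `resolve` helper below are invented for illustration; they are not AquaLog's internal representation.

```python
# Toy KB: relation -> list of (subject, object) pairs (invented for illustration)
KB = {
    "has-interest": [("alice", "ontologies"), ("bob", "knowledge-reuse")],
    "works-on":     [("alice", "semantic-web"), ("carol", "semantic-web")],
    "has-homepage": [("alice", "http://example.org/alice"),
                     ("carol", "http://example.org/carol")],
}

def resolve(relation, obj):
    """Subjects related to the given object by the given relation."""
    return {s for s, o in KB[relation] if o == obj}

# (1) And/or linking: fuse the answers of two triples
union = resolve("has-interest", "ontologies") | resolve("has-interest", "knowledge-reuse")

# (3) Conditional link to a triple: resolve the second triple first,
# then use its answers as subjects of the first one
researchers = resolve("works-on", "semantic-web")
homepages = {o for s, o in KB["has-homepage"] if s in researchers}

print(union)      # people interested in ontologies or knowledge reuse
print(homepages)  # homepages of the semantic-web researchers
```

Mechanism (2), the conditional link to a term, works the same way as (3) except that the resolved second triple yields a single instance that is substituted into the first triple.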

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by, and limited to, the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case, it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping of the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases, the user


is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version, the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case, the training examples are given by the user refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters. Moreover, a version space is

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.



Fig. 17. The learning mechanism architecture.

normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's one. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
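The matching side (CheckContext) can be sketched as follows. The toy taxonomy, the learned entry and the function names are illustrative assumptions, not AquaLog's actual code: a learned mapping is reused only if both arguments of the new question fall under the stored context classes.

```python
# Toy taxonomy: child -> parent (invented for illustration)
PARENT = {"researcher": "people", "phd-student": "people",
          "kmi": "organization-unit", "informatics": "organization-unit"}

def is_subclass_of(cls, ancestor):
    """True if cls equals ancestor or is a descendant of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False

# Learned lexicon entry: word -> (ontology relation, context = argument classes)
LEARNED = {"collaborates": ("has-affiliation-to-unit",
                            ("people", "organization-unit"))}

def check_context(word, arg1, arg2):
    """Matching method: reuse a learned mapping only if both arguments
    are consistent with the stored context (CheckContext)."""
    if word not in LEARNED:
        return None
    relation, (c1, c2) = LEARNED[word]
    if is_subclass_of(arg1, c1) and is_subclass_of(arg2, c2):
        return relation
    return None

print(check_context("collaborates", "researcher", "kmi"))          # mapping reused
print(check_context("collaborates", "researcher", "semantic-web")) # context fails
```

When the context check fails, the user is asked to disambiguate and the learning method records a new entry, as described above.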

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem, therefore, is to maintain the two separated mappings while still providing some kind of generalization. This is achieved through the definition of the question context as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,


these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field in order to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case, we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for a future reuse is the new word, its mapping in the ontology, the two context arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we


Fig. 18. User's disambiguation example.

asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case, the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way, the version space is reduced only to the cases we are interested in.
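The record stored for the concept case is thus the unknown-relation record plus the extra flag. A minimal sketch of the two record shapes (the field names and values are invented, not AquaLog's actual schema):

```python
def learned_record(word, mapping, context, ontology, user, is_concept=False):
    """Shape of a learned-lexicon entry; the extra flag marks the exception
    where the learned 'relation' is in fact a concept."""
    return {"word": word, "mapping": mapping, "context": context,
            "ontology": ontology, "user": user, "is_concept": is_concept}

normal = learned_record("collaborates", "has-affiliation-to-unit",
                        ("people", "organization-unit"), "akt", "anonymous")
concept = learned_record("publications", "publication",
                         ("people", None), "akt", "anonymous", is_concept=True)

# During matching, the query is restricted by the flag,
# reducing the version space to the concept cases only
matches = [r for r in (normal, concept) if r["is_concept"]]
print([r["word"] for r in matches])
```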

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session, where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material, having no constraints on the user field:


Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.


Fig. 20. Architecture to support a users' community.


7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability




across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered as an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore, the knowledge base is dynamically populated and constantly changing, and as a result of this a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries (40 of 69): 58%; from this, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures:
  Total number of failures (same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated (19 of 29): 65.5% of failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure but not populated (6 of 19): 31.57%

Bold values represent the final results (%).


the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology, and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, such as tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total                          (43 iterations)     (40 iterations)                     (16 iterations)


services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a future line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity, and seems reasonably natural. The linguistic and service failures together account for 37.67% of failures (the total number of failures, including the RSS failures, is 42%). On many occasions, questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because, in this version of AquaLog, the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peter"s present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The ldquowine and food ontologyrdquo was translated into OCML andtored in the WebOnto server12 so that the AquaLog WebOntolugin was used For the configuration of AquaLog with the new

ntology it was enough to modify the configuration files spec-fying the ontology server name and location and the ontologyame A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto

the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintage year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "meal course", like "starter", "entree", "food", "snack" or "supper", or like "pudding" for the ontology class "dessert course". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society's "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

13 http://www.thewinesociety.com

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology wines are represented as individuals. Meal courses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of meal courses linking specific wines with food in the ontology.

Fig. 22. OWL example of the wine and food ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but it is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with meal courses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of meal course (e.g. see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 meal-course
  ((has-food fowl) (has-drink Riesling)))

7.3.2. Conclusions and discussion
7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does not present a complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.
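The requirement of a short direct path between query terms, described in the assumptions above, amounts to a bounded graph search. A minimal sketch follows; the graph fragment and relation names are invented for illustration (loosely modeled on the Table 3 instances), and AquaLog itself works through its ontology-server plugins rather than an in-memory dictionary:

```python
from collections import deque

# Toy directed labeled graph: node -> list of (relation, target) pairs.
graph = {
    "pizza": [("food-of-course", "new-course7")],
    "new-course7": [("has-drink", "marietta-zinfadel")],
}

def path_within(graph, start, goal, max_hops=2):
    """BFS: is there a directed path of at most max_hops relations?"""
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if node == goal:
            return True
        if hops < max_hops:
            for _rel, nxt in graph.get(node, []):
                frontier.append((nxt, hops + 1))
    return False

# Two relations link the query terms through the mediating course instance,
# so the query is answerable; restricted to one hop, it is not.
print(path_within(graph, "pizza", "marietta-zinfadel"))  # -> True
```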

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure)
  Total number of successful queries: 47 of 68 (69.11%)
  Total number of successful queries without conceptual failure: 12 of 68 (17.64%)
  Total number of queries with only conceptual failure: 35 of 68 (51.47%)

AquaLog failures
  Total number of failures (same query can fail for more than one reason): 21 of 68 (30.88%)
  RSS failures: 15 of 68 (22.05%)
  Service failures: 0 of 68 (0%)
  Data model failures: 0 of 68 (0%)
  Linguistic failures: 9 of 68 (13.23%)

AquaLog easily reformulated questions
  Total number of queries easily reformulated: 14 of 21 (66.66% of failures)
  With conceptual failure: 9 of 14 (64.28%)

Bold values represent the final results (%).
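The percentages in Table 4 can be re-derived from the raw counts; they appear to be truncated (not rounded) to two decimal places, which the following quick check confirms:

```python
def trunc2(x):
    """Truncate a percentage to two decimal places, as Table 4 appears to do."""
    return int(x * 100) / 100

TOTAL = 68
assert trunc2(47 / TOTAL * 100) == 69.11   # successful queries
assert trunc2(12 / TOTAL * 100) == 17.64   # successful without conceptual failure
assert trunc2(35 / TOTAL * 100) == 51.47   # only conceptual failure
assert trunc2(21 / TOTAL * 100) == 30.88   # failures
assert trunc2(15 / TOTAL * 100) == 22.05   # RSS failures
assert trunc2(14 / 21 * 100) == 66.66      # failures removed by reformulation
print("Table 4 percentages reproduced")
```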


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next subsection, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "meal course" through an "ad hoc" relation: "meal course" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "meal course" and its subclasses (e.g. "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a meal course based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "meal course". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g. "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?" AquaLog is trying to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e. "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is in the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉, 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉, 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "meal course" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered, for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e. "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
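One way such a mechanism might look: treat the modifier of a nominal compound as a candidate ontology value and materialize the implicit relation as an extra triple. Everything below (the lexicon, the relation name and the helper) is our own hypothetical illustration, not part of AquaLog:

```python
# Hypothetical compound-noun expansion: "French researchers" hides an
# implicit relationship between a class and a value.
known_classes = {"researcher"}
value_relations = {"France": "has-nationality"}   # value -> implied relation
adjective_forms = {"French": "France"}            # tiny demonym lexicon

def expand_compound(modifier, head):
    """If the modifier maps to an ontology value and the head to a class,
    return the implicit triple; otherwise leave the compound untouched."""
    value = adjective_forms.get(modifier, modifier)
    if head.rstrip("s") in known_classes and value in value_relations:
        return (head, value_relations[value], value)
    return None

print(expand_compound("French", "researchers"))
# -> ('researchers', 'has-nationality', 'France')
```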

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with peter who has interests in the semantic web" and "which academics work with peter who has interests in the semantic web". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
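The agreement heuristic the authors describe (but did not implement) is simple to state in code; the representation below is our own sketch:

```python
# Sketch of clause attachment by verb agreement: a relative clause whose
# verb is third person singular can only attach to a singular antecedent.
def attach_clause(candidates, verb_is_singular):
    """candidates: nouns in surface order as (word, is_singular) pairs.
    Return the nouns whose grammatical number agrees with the clause verb."""
    return [w for w, singular in candidates if singular == verb_is_singular]

# "which academics work with peter who has interests in the semantic web":
# "has" is singular, so the clause attaches to "peter", not "academics".
print(attach_clause([("academics", False), ("peter", True)], verb_is_singular=True))
# -> ['peter']
```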


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge, as, for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the user's community.
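Community-scoped jargon mappings could be kept as simply as a table keyed by community; the phrases and target terms below follow the running example, while the layout and names are our own illustration:

```python
# The same jargon phrase resolves differently depending on the community
# of the user who taught it to the system.
jargon = {
    ("phd-student", "cool projects"): ("research-area", "has-publications"),
    ("professor", "cool projects"): ("project", "funding-opportunity"),
}

def interpret(community, phrase):
    """Look up the community-specific reading of a jargon phrase."""
    return jargon.get((community, phrase))

print(interpret("phd-student", "cool projects"))
print(interpret("professor", "cool projects"))
```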

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e. allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
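The resource-selection step could start from something as plain as an inverted index from terms to the ontologies that mention them, built offline and consulted at query time; the ontology names and term sets below are invented for illustration:

```python
from collections import defaultdict

# Offline: index which ontologies mention which terms.
ontologies = {
    "wine-onto": {"wine", "mealcourse", "dessert"},
    "akt-onto": {"project", "researcher", "publication"},
}
index = defaultdict(set)
for name, terms in ontologies.items():
    for term in terms:
        index[term].add(name)

def select_ontologies(query_terms):
    """At query time: return every ontology mentioning some query term."""
    return set().union(*(index.get(t, set()) for t in query_terms))

print(select_ontologies({"wine", "dessert"}))  # -> {'wine-onto'}
```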

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers, and, as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
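The shallowness of such a rule can be made concrete in a few lines. The sketch below is purely illustrative (the rule format, function name and mini gazetteer are not from AquaLog or from Ref. [3]); it shows how one pattern handles several phrasings while ignoring sentence structure entirely.

```python
import re

# Hypothetical mini gazetteer; a real NLIDB would consult the database.
CAPITALS = {"italy": "Rome", "france": "Paris"}

def pattern_match(question: str):
    """Shallow fallback: ignore sentence structure, extract key terms.

    Implements the rule style described by Androutsopoulos: the word
    "capital" followed by a country name triggers a capital lookup.
    """
    m = re.search(r"capital\s+of\s+(\w+)", question.lower())
    if m and m.group(1) in CAPITALS:
        return CAPITALS[m.group(1)]
    return None  # the shallow rule fails silently on anything else

# The same rule handles all three phrasings:
for q in ("What is the capital of Italy?",
          "Print the capital of Italy",
          "Could you please tell me the capital of Italy?"):
    assert pattern_match(q) == "Rome"
```

The same shallowness that makes the rule robust to phrasing also causes the "bad failures" noted above: any question not matching the template returns nothing.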

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.
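The idea of a semantic similarity score based on connections through a lexical taxonomy can be sketched without WordNet itself. The toy hypernym graph and the inverse-path-length score below are invented for illustration; FAQ Finder's actual metrics operate over the full WordNet graph.

```python
from collections import deque

# A toy hypernym graph standing in for WordNet; each word maps to its
# direct hypernyms. The entries are invented for this example.
HYPERNYMS = {
    "dog": ["canine"], "canine": ["mammal"],
    "cat": ["feline"], "feline": ["mammal"],
    "mammal": ["animal"],
}

def path_distance(a: str, b: str) -> int:
    """Shortest path between two words in the undirected hypernym graph."""
    graph = {}
    for child, parents in HYPERNYMS.items():
        for p in parents:
            graph.setdefault(child, set()).add(p)
            graph.setdefault(p, set()).add(child)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == b:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return -1  # no connection found in the taxonomy

def similarity(a: str, b: str) -> float:
    """Closer words in the taxonomy score higher; unconnected words score 0."""
    d = path_distance(a, b)
    return 0.0 if d < 0 else 1.0 / (1 + d)
```

The `return -1` branch is exactly the failure mode discussed above: a query word with no connection in the lexical resource yields no evidence at all.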

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.
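A configuration of this kind can be tiny. The XML fragment and element names below are hypothetical (the actual AquaLog schema is not reproduced here); the sketch only shows the shape of a domain-dependent mapping such as "who" → "person" kept outside the code.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment; element and attribute names are
# invented to illustrate externalized, domain-dependent parameters.
CONFIG = """
<pragmatics>
  <synonym word="who"   class="person"/>
  <synonym word="who"   class="organization"/>
  <synonym word="where" class="location"/>
</pragmatics>
"""

def load_mappings(xml_text: str) -> dict:
    """Read word-to-class mappings; a word may map to several classes."""
    mappings = {}
    for node in ET.fromstring(xml_text).findall("synonym"):
        mappings.setdefault(node.get("word"), []).append(node.get("class"))
    return mappings

mappings = load_mappings(CONFIG)
# "who" deliberately maps to two classes: the ambiguity is kept and
# resolved later by the relation similarity service or user feedback.
```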

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE, because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases, this information is all AquaLog needs to interpret the query.
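The structural reasoning just described can be sketched with a dictionary-backed mini ontology. The class, relation and instance names below are invented; the point is only that an unknown relation word leaves the relations defined for the known entity's class as candidate interpretations.

```python
# Invented toy ontology: class -> {relation: range}.
ONTOLOGY = {
    "city": {"has-district": "district", "has-mayor": "person"},
}
INSTANCES = {"chicago": "city"}

def candidate_relations(entity: str):
    """Relations defined (in the toy ontology) for the entity's class.

    Even when the relation word in the question ("neighborhood") has no
    match, these candidates can be ranked by string similarity or
    offered to the user for disambiguation.
    """
    cls = INSTANCES.get(entity.lower())
    return sorted(ONTOLOGY.get(cls, {})) if cls else []
```

For "what are some of the neighborhoods of Chicago?", `candidate_relations("Chicago")` narrows the interpretation to the relations of class `city`, which is often all the system needs.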

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA"

where point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake or river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
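The triple-shaped classification can be illustrated with a small data structure. The class and category names below are invented for the sketch and do not come from AquaLog's implementation; the point is that two differently worded questions reduce to triples of the same shape.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LinguisticTriple:
    """Illustrative linguistic triple: the generic (ground) term carries
    the expected answer type; the relation may be implicit (None)."""
    generic_term: str
    relation: Optional[str]
    second_term: str

def triple_category(t: LinguisticTriple) -> str:
    """Group questions by the shape of their triple, not by wh-word."""
    return "wh-generic" if t.relation else "wh-unknown-relation"

# "Who works in akt?" and "Which are the researchers involved in the akt
# project?" reduce to triples of the same shape, hence the same category.
q1 = LinguisticTriple("person/organization", "works", "akt")
q2 = LinguisticTriple("researchers", "involved", "akt")
assert triple_category(q1) == triple_category(q2) == "wh-generic"

# "Are there any projects about semantic web?" leaves the relation implicit.
q3 = LinguisticTriple("projects", None, "semantic web")
assert triple_category(q3) == "wh-unknown-relation"
```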

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology-triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
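The label-based lookup shared by TAP and AquaLog can be caricatured with a dictionary-backed triple store. The URIs and data below are invented, and a real implementation would query an RDF store rather than a Python list; the sketch only shows the two steps: match a search term against `rdfs:label`, then collect the triples in the matched node's vicinity.

```python
# Invented toy triple store: (subject, predicate, object).
TRIPLES = [
    ("kmi:vanessa", "rdfs:label", "Vanessa Lopez"),
    ("kmi:vanessa", "rdf:type", "kmi:researcher"),
    ("kmi:vanessa", "kmi:has-research-interest", "kmi:semantic-web-area"),
]

def nodes_for_label(term: str):
    """Map a search term to nodes via rdfs:label (case-insensitive)."""
    return [s for s, p, o in TRIPLES
            if p == "rdfs:label" and o.lower() == term.lower()]

def vicinity(node: str):
    """Triples in which the node takes part (distance 1 in the graph)."""
    return [t for t in TRIPLES if node in (t[0], t[2])]

# The matched node is a starting point; its surrounding triples supply
# the context used to interpret (or disambiguate) the query term.
```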

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites; systems such as START [31], REXTOR [32] and AnswerBus [54] aim instead to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages", between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described by Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during its life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
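That short-circuiting rule is simple enough to sketch directly. The edge representation below (parent, child) and the function name are our own; the data reproduces the "man first walked on the moon in 1969" dependency example from Ref. [5].

```python
def short_circuit(edges, relevant):
    """Apply the PiQASso-style rule: whenever two relevant nodes are
    linked through an irrelevant middle node, add a direct link."""
    added = set(edges)
    for a, mid in edges:
        for mid2, b in edges:
            if (mid == mid2 and mid not in relevant
                    and a in relevant and b in relevant):
                added.add((a, b))
    return added

# "1969" depends on "in", which in turn depends on "moon":
edges = {("moon", "in"), ("in", "1969")}
relevant = {"moon", "1969"}          # "in" is the irrelevant middle node
assert ("moon", "1969") in short_circuit(edges, relevant)
```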

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain: they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations"

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB; for a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions on existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.); questions are also grouped depending on the triples they generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved, so that different questions that can be represented by equivalent triples fall together. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries, where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term

  "Who are the researchers in the semantic web research area?"
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉

  "What are the phd students working for buddy space?"
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term

  "Show me the job title of Peter Scott"
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉

  "What are the projects of Vanessa?"
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation

  "Are there any projects about semantic web?"
  Linguistic triple: 〈projects, ?, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  "Where is the Knowledge Media Institute?"
  Linguistic triple: 〈location, ?, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description

  "Who are the academics?"
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉

  "What is an ontology?"
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative

  "Is Liliana a phd student in the dip project?"
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  "Has Martin Dzbor any interest in ontologies?"
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms

  "Does anybody work in semantic web on the dotkom project?"
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

  "Tell me all the planet news written by researchers in akt"
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)

  "Which projects on information extraction are sponsored by eprsc?"
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)

  "Is there any publication about magpie in akt?"
  Linguistic triple: 〈publication, ?, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)

  "What are the contact details from the KMi academics in compendium?"
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)

  "Does anyone has interest in ontologies and is a member of akt?"
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)

  "Which academics work in akt or in dotcom?"
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned

  "Which KMi academics work in the akt project sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

  "Which KMi academics working in the akt project are sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  "Give me the publications from phd students working in hypermedia"
  Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause

  "What researchers who work in compendium have interest in hypermedia?"
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Q-patterns

  "What is the homepage of Peter who has interest in semantic web?"
  Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

  "Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

All ontology triples use the KMi ontology.
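For readers who want to experiment with these mappings, the rows above translate naturally into small data structures. The following Python sketch is ours, not part of AquaLog (the class and field names are invented); it encodes two rows and shows how an unresolved relation, as in the wh-unknown-relation queries, can be detected:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Triple:
    """A subject-relation-object triple; None marks a slot the RSS must fill."""
    subject: str
    relation: Optional[str]
    obj: str


@dataclass(frozen=True)
class QueryMapping:
    """One appendix row: NL query -> linguistic triple(s) -> ontology triple(s)."""
    question: str
    linguistic: List[Triple]
    ontology: List[Triple]


rows = [
    QueryMapping(
        "Who are the researchers in the semantic web research area?",
        [Triple("person/organization", "researchers", "semantic web research area")],
        [Triple("researcher", "has-research-interest", "semantic-web-area")],
    ),
    QueryMapping(
        "Are there any projects about semantic web?",
        [Triple("projects", None, "semantic web")],  # wh-unknown relation
        [Triple("project", "addresses-generic-area-of-interest", "semantic-web-area")],
    ),
]

# Rows whose linguistic triple lacks a relation are the ones the RSS must
# disambiguate against the ontology:
unresolved = [r for r in rows if any(t.relation is None for t in r.linguistic)]
```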

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


Fig. 14. Example of AquaLog in action for queries with a modifier clause at the end.

Depending on the type of queries, the triples are combined following different criteria.

To clarify the kind of ambiguity we deal with, let's consider the query "who are the researchers in akt who have interest in knowledge reuse?". By checking the ontology, or if needed by using user feedback, the RSS module can map the terms in the query either to instances or to classes in the ontology, in a way that it can determine that "akt" is a "project" and therefore it is not an instance of the classes "persons" or "organizations" or any of their subclasses. Moreover, in this case the relation "researchers" is mapped into a "concept" in the ontology, which is a subclass of "person", and therefore the second clause "who have an interest in knowledge reuse" is without any doubt assumed to be related to "researchers". The solution in this case will be a combination of a list of researchers who work for akt and have interest in knowledge reuse (see Fig. 15).

Fig. 15. Example of context disambiguation.

However, there could be other situations where the disambiguation cannot be resolved by use of linguistic knowledge and/or heuristics and/or the context or semantics in the ontology, e.g., in the query "which academic works with Peter who has interest in semantic web?". In this case, since "academic" and "peter" are respectively a subclass and an instance of "person", the sentence is truly ambiguous, even for a human reader.7 In fact, it can be understood as a combination of the resulting lists of the two questions "which academic works with peter?" and "which academic has an interest in semantic web?", or as the relative query "which academic works with peter, where the peter we are looking for has an interest in the semantic web?". In such cases user feedback is always required.

To interpret a query, different features play different roles. For example, in a query like "which researchers wrote publications related to social aspects?", the second term "publication" is mapped into a concept in our KB; therefore the adopted heuristic is that the concept has to be instantiated using the second part of the sentence, "related to social aspects". In conclusion, the answer will be a combination of the generated triples 〈researchers, wrote, publications〉 〈publications, related, akt〉. However, frequently this heuristic is not enough to disambiguate, and other features have to be used.

It is also important to underline the role that a triple element's features can play. For instance, an important feature of a relation is whether it contains the main verb of the sentence, namely the lexical verb. We can see this if we consider the following two queries: (1) "which academics work in the akt project sponsored by epsrc?"; (2) "which academics working in the akt project are sponsored by epsrc?". In both examples the second term "akt project" is an instance, so the problem becomes to identify if the second clause "sponsored by epsrc" is related to "akt project" or to "academics". In the first example the main verb is "work", which separates the nominal group "which academics" and the predicate "akt project sponsored by epsrc"; thus "sponsored by epsrc" is related to "akt project". In the second, the main verb is "are sponsored", whose nominal group is "which academics working in akt" and whose predicate is "sponsored by epsrc". Consequently, "sponsored by epsrc" is related to the subject of the previous clause, which is "academics".
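This attachment decision can be summarized in a few lines of code. The sketch below is our own simplification of the heuristic just described, not AquaLog's implementation (which works on full parse information): the second clause modifies the subject when it carries the sentence's main lexical verb, and the nearer instance term otherwise.

```python
def attach_second_clause(main_verb: str, clause: str,
                         subject: str, instance_term: str) -> str:
    """Decide which term a trailing clause modifies.

    If the sentence's main verb occurs inside the clause, the clause is the
    predicate of the subject; otherwise the clause is embedded in the
    predicate and modifies the nearer instance term.
    """
    clause_words = clause.split()
    if any(word in clause_words for word in main_verb.split()):
        return subject
    return instance_term


# Query (1): main verb "work" is outside "sponsored by epsrc",
# so the clause modifies "akt project".
# Query (2): main verb "are sponsored" shares "sponsored" with the clause,
# so the clause modifies "academics".
```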

5.4. Class similarity service

String algorithms are used to find mappings between ontology term names and pretty names8 and any of the terms inside the linguistic triples (or their WordNet synonyms) obtained from the user's query. They are based on string distance metrics for name-matching tasks, using an open-source API from Carnegie Mellon University [13]. This comprises a number of string distance metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics and hybrid methods. AquaLog combines the use of these metrics (it mainly uses the Jaro-based metrics) with thresholds. Thresholds are set up depending on the task of looking for concept names, relations or instances. See Ref. [13] for an experimental comparison between the different metrics.

7 Note that there will not be ambiguity if the user reformulates the question as "which academic works with Peter and has an interest in semantic web", where "has an interest in semantic web" would be correctly attached to "academic" by AquaLog.

8 Naming conventions vary depending on the KR used to represent the KB and may even change with ontologies, e.g., an ontology can have slots such as "variant name" or "pretty name". AquaLog deals with differences between KRs by means of the plug-in mechanism; differences between ontologies need to be handled by specifying this information in a configuration file.

However, the use of string metrics and open-domain lexico-semantic resources [45] such as WordNet to map the generic term of the linguistic triple into a term in the ontology may not be enough. String algorithms fail when the terms to compare have dissimilar labels (i.e., "researcher" and "academic"), and a generic thesaurus like WordNet may not include all the synonyms and is limited in the use of nominal compounds (i.e., WordNet contains the term "municipality" but not an ontology term like "municipal-unit"). Therefore, an additional combination of methods may be used in order to obtain the possible candidates in the ontology. For instance, a lexicon can be generated manually or can be built through a learning mechanism (a similar, simplified approach to the learning mechanism for relations explained in Section 6). The only requirement to obtain ontology classes that can potentially map an unmapped linguistic term is the availability of the ontology mapping for the other term of the triple. In this way, through the ontology relationships that are valid for the ontology-mapped term, we can identify a set of possible candidate classes that can complete the triple. User feedback is required to indicate if one of the candidate terms is the one we are looking for, so that AquaLog, through the use of the learning mechanism, will be able to learn it for future occasions.

The WordNet component uses the Java implementation (JWNL) of WordNet 1.7.1 [29]. The next implementation of AquaLog will be coupled with the new version of WordNet, so it can make use of hyponyms and hypernyms. Currently the component of AquaLog which deals with WordNet does not perform any sense disambiguation.
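To make the matching step concrete, the sketch below implements the classic Jaro similarity and applies task-dependent thresholds, in the spirit of the approach described above. It is a self-contained illustration, not AquaLog's code: AquaLog uses the SecondString API [13], and the threshold values here are invented for illustration.

```python
def jaro(s: str, t: str) -> float:
    """Classic Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(ls, lt) // 2 - 1  # max distance for two chars to "match"
    s_hit = [False] * ls
    t_hit = [False] * lt
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(i + window + 1, lt)):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count half-transpositions: matched characters that are out of order.
    k = half = 0
    for i in range(ls):
        if s_hit[i]:
            while not t_hit[k]:
                k += 1
            if s[i] != t[k]:
                half += 1
            k += 1
    trans = half // 2
    m = float(matches)
    return (m / ls + m / lt + (m - trans) / m) / 3.0


# Illustrative thresholds only; AquaLog's real values are task-tuned.
THRESHOLDS = {"concept": 0.80, "relation": 0.75, "instance": 0.85}


def best_match(term, candidates, task="concept"):
    """Return the candidate most similar to term, or None below the threshold."""
    if not candidates:
        return None
    score, best = max((jaro(term.lower(), c.lower()), c) for c in candidates)
    return best if score >= THRESHOLDS[task] else None
```

For instance, `best_match("researchers", ["researcher", "academic-staff-member"])` returns `"researcher"`, while a term with no sufficiently close label falls through to WordNet or, ultimately, the learning mechanism.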

5.5. Answer engine

The Answer Engine is a component of the RSS. It is invoked when the Onto-Triple is completed. It contains the methods which take as an input the Onto-Triple and infer the required answer to the user's queries. In order to provide a coherent answer, the category of each triple tells the answer engine not only how the triple must be resolved (or what answer to expect), but also how triples can be linked with each other. For instance, AquaLog provides three mechanisms (depending on the triple categories) for operationally integrating the triple's information to generate an answer. These mechanisms are:

(1) And/or linking: e.g., in the query "who has an interest in ontologies or in knowledge reuse?" the result will be a fusion of the instances of people who have an interest in ontologies and the people who are interested in knowledge reuse.

(2) Conditional link to a term: for example, in "which KMi academics work in the akt project sponsored by eprsc?" the second triple 〈akt project, sponsored, eprsc〉 must be resolved and the instance representing the "akt project sponsored by eprsc" identified, to get the list of academics required for the first triple 〈KMi academics, work, akt project〉.

(3) Conditional link to a triple: for instance, in "What is the homepage of the researchers working on the semantic web?" the second triple 〈researchers, working, semantic web〉 must be resolved and the list of researchers obtained prior to generating an answer for the first triple 〈…, homepage, researchers〉.

The solution is converted into English using simple template techniques before being presented to the user. Future AquaLog versions will be integrated with a natural language generation tool.

6. Learning mechanism for relations

Since the universe of discourse we are working within is determined by and limited to the particular ontology used, normally there will be a number of discrepancies between the natural language questions prompted by the user and the set of terms recognized in the ontology. External resources like WordNet generally help in making sense of unknown terms, giving a set of synonyms and semantically related words which could be detected in the knowledge base. However, in quite a few cases the RSS fails in the production of a genuine Onto-Triple because of user-specific jargon found in the linguistic triple. In such a case it becomes necessary to learn the new terms employed by the user and disambiguate them in order to produce an adequate mapping to the classes of the ontology. A quite common and highly generic example in our departmental ontology is the relation works-for, which users normally relate with a number of different expressions: is working, works, collaborates, is involved (see Fig. 16). In all these cases the user is asked to disambiguate the relation (choosing from the set of ontology relations consistent with the question's arguments) and decide whether to learn a new mapping between his/her natural-language universe and the reduced ontology-language universe.

Fig. 16. Example of jargon mapping through user-interaction.

6.1. Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses, analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses that are consistent with the observed training examples. In our case the training examples are given by the user-refinement (disambiguation) of the query, while the hypotheses are structured in terms of the context parameters.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

Fig. 17. The learning mechanism architecture.

Moreover, a version space is normally represented just by its lower and upper bounds (the maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context-definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation-name, the database is scanned for some matching results. Subsequently, these results will be context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user-constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); instead, if the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's one. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let's consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let's imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem, therefore, is to maintain the two separated mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple.

these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.
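In code, such a learned-lexicon entry can be pictured as a small record; the following is an illustrative sketch only (the field and function names are ours, not AquaLog's):

```python
# Sketch of a learned-lexicon record; one record is stored per learned
# word-to-relation mapping. Field names are illustrative, not AquaLog's.
def make_learned_record(word, ontology_relation, arg1, arg2, ontology, user):
    return {
        "word": word,                  # key field used to access the lexicon
        "mapping": ontology_relation,  # relation the word maps to in the ontology
        "context": (arg1, arg2),       # the two context-arguments of the triple
        "ontology": ontology,          # name of the pluggable ontology
        "user": user,                  # user details, for per-user jargon
    }

# Example record for the mapping discussed in the text.
record = make_learned_record(
    "collaborate", "has-affiliation-to-unit",
    "person", "knowledge-media-institute", "akt-ontology", "user-01")
```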

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case, we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument, by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is firstly recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person or an organization, or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case, the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data that are stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way, the version space is reduced only to the cases we are interested in.
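At matching time, the stored context is applied in the opposite direction to the learning step; the following sketch illustrates the idea (the record layout, the toy ancestor table and the helper name are our assumptions):

```python
# Sketch of the matching-time check (CheckContext): a stored mapping is
# valid for a new question only if both question arguments fall under the
# stored (generalized) context classes. Data below is illustrative only.
ANCESTORS = {  # instance/class -> set of classes it falls under
    "john": {"person", "people"},
    "edinburgh-informatics": {"organization-unit", "organization"},
}

LEARNED = [
    {"word": "collaborate", "mapping": "has-affiliation-to-unit",
     "context": ("people", "organization-unit"),
     "is_concept": False},  # extra flag for the concept-case exception
]

def check_context(word, arg1, arg2):
    """Return the stored ontology mapping if the context matches, else None."""
    for rec in LEARNED:
        c1, c2 = rec["context"]
        if (rec["word"] == word
                and c1 in ANCESTORS.get(arg1, {arg1})
                and c2 in ANCESTORS.get(arg2, {arg2})):
            return rec["mapping"]
    return None  # matching fails: user disambiguation is needed

print(check_context("collaborate", "john", "edinburgh-informatics"))
# "has-affiliation-to-unit": the generalized context lets a new question
# with different (but subsumed) arguments reuse the learned mapping.
```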

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. This latter has the roughest access to the learned material: having no constraints on the user field,


Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.
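The difference between a logged-in session and the generic user can be pictured as a filter on the user field; a minimal sketch with made-up records:

```python
# Sketch of per-user access to the learned lexicon: a logged-in session
# constrains the lookup by the user field, while the generic user gets
# every stored mapping (and hence, possibly, undesired meanings).
# Records and user names below are illustrative only.
LEARNED = [
    {"word": "collaborate", "mapping": "has-affiliation-to-unit", "user": "anna"},
    {"word": "collaborate", "mapping": "works-in-the-same-project", "user": "ben"},
]

def lookup(word, user=None):
    """user=None models the generic user: no constraint on the user field."""
    return [r["mapping"] for r in LEARNED
            if r["word"] == word and (user is None or r["user"] == user)]

print(lookup("collaborate", user="anna"))  # ['has-affiliation-to-unit']
print(lookup("collaborate"))               # both stored mappings are returned
```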

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it could extend the same mappings to all the other registered PhD students.


Fig. 20. Architecture to support a users' community.


7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn a specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability



across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure. However, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology (footnote 10) and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal (footnote 11). Therefore, the knowledge base is dynamically populated and constantly changing, and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, where it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries: (40 of 69) 58% (from this, 10 of 33 (30.30%) have no answer because the ontology is not populated with data)
  Total number of successful queries without conceptual failure: (33 of 69) 47.82%
  Total number of queries with only conceptual failure: (7 of 69) 10.14%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason): (29 of 69) 42%
  RSS failure: (8 of 69) 11.59%
  Service failure: (10 of 69) 14.49%
  Data model failure: (0 of 69) 0%
  Linguistic failure: (16 of 69) 23.18%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (19 of 29) 65.5% of failures
  With conceptual failure: (7 of 19) 36.84%
  No conceptual failure but not populated: (6 of 19) 31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious for the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.
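The percentages discussed above and reported in Table 1 follow directly from the raw counts; they can be reproduced in a few lines (note that a couple of the paper's figures appear truncated rather than rounded, e.g., 47.82 and 23.18 where plain rounding gives 47.83 and 23.19):

```python
# Reproducing the evaluation percentages from the raw counts in the text.
TOTAL = 69  # total number of collected questions

def pct(n, total=TOTAL):
    """Percentage of n over total, rounded to two decimals."""
    return round(100 * n / total, 2)

assert pct(40) == 57.97      # successful queries, reported as ~58%
assert pct(33) == 47.83      # without conceptual failure (paper prints 47.82)
assert pct(7) == 10.14       # only conceptual failure
assert pct(16) == 23.19      # linguistic failures (paper prints 23.18)
assert pct(23) == 33.33      # queries with a non-empty answer
assert pct(19, 29) == 65.52  # failures avoidable by reformulation (paper: 65.5)
```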

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, such as tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of the type "how long" are not implemented in this version; (4) non-recognition of concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab: people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   | Without the LM     | LM first iteration (from scratch) | LM second iteration
0 iterations with user         | 37.77% (17 of 45)  | 40% (18 of 45)                    | 64.44% (29 of 45)
1 iteration with user          | 35.55% (16 of 45)  | 35.55% (16 of 45)                 | 35.55% (16 of 45)
2 iterations with user         | 20% (9 of 45)      | 20% (9 of 45)                     | 0% (0 of 45)
3 iterations with user         | 6.66% (3 of 45)    | 4.44% (2 of 45)                   | 0% (0 of 45)
Total                          | (43 iterations)    | (40 iterations)                   | (16 iterations)


services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions, questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology-portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server (footnote 12), so that the AquaLog WebOnto plugin could be used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto





the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams (footnote 13), as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section, we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog, we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
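This "two relations or less" requirement can be checked mechanically; the following is a sketch of the constraint over an ad hoc triple store of our own (treating edges as undirected, as an assumption):

```python
# Sketch of AquaLog's answerability requirement: the two query terms must
# be connected by a direct path of at most two triples in the populated
# ontology graph. The triple store below is an ad hoc example of ours.
from collections import deque

TRIPLES = [  # (subject, relation, object)
    ("course-1", "has-food", "pizza"),
    ("course-1", "has-drink", "zinfandel"),
]

def within_two_relations(a, b):
    """True if a and b are linked by at most two triples (edges undirected)."""
    graph = {}
    for s, _, o in TRIPLES:
        graph.setdefault(s, set()).add(o)
        graph.setdefault(o, set()).add(s)
    frontier, seen = deque([(a, 0)]), {a}
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return True
        if dist < 2:  # stop expanding beyond two relations
            for nxt in graph.get(node, set()) - seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return False

print(within_two_relations("pizza", "zinfandel"))  # True: via course-1
```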

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition rests on restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way, the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore, we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))
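With instances like those in Table 3 in place, a food-to-wine question reduces to following two triples through the mediating mealcourse instance. A minimal sketch (a Python stand-in for the OCML data; this is not AquaLog code):

```python
# How the Table 3 instances make food/wine questions answerable: the
# mediating mealcourse instance supplies a two-triple path from the food
# to the drink. The dictionary mirrors the Table 3 data.
COURSES = {
    "new-course7": {"hasfood": "pizza", "hasdrink": "mariettazinfadel"},
    "new-course8": {"hasfood": "fowl", "hasdrink": "Riesling"},
}

def wines_for(food):
    """Follow food -> mealcourse -> drink through the instance data."""
    return [c["hasdrink"] for c in COURSES.values() if c["hasfood"] == food]

print(wines_for("pizza"))  # ['mariettazinfadel']
```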


Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, it does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure)
  Total number of successful queries                              (47 of 68)  69.11%
  Total number of successful queries without conceptual failure   (12 of 68)  17.64%
  Total number of queries with only conceptual failure            (35 of 68)  51.47%

AquaLog failures
  Total number of failures (same query can fail for more than one reason)  (21 of 68)  30.88%
  RSS failure                                                     (15 of 68)  22.05%
  Service failure                                                 (0 of 68)   0%
  Data model failure                                              (0 of 68)   0%
  Linguistic failure                                              (9 of 68)   13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated                     (14 of 21)  66.66% of failures
  With conceptual failure                                         (9 of 14)   64.28%

Bold values represent the final results (%).
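As a quick sanity check on Table 4, the percentages follow from the raw counts over 68 questions; they appear to be truncated, not rounded, to two decimals. A minimal sketch:

```python
import math

# Table 4 reports counts out of 68 questions; the printed percentages
# appear to be truncated (not rounded) to two decimal places.
def pct(count, total=68):
    return math.floor(10000 * count / total) / 100

assert pct(47) == 69.11  # successful queries
assert pct(12) == 17.64  # successful without conceptual failure
assert pct(35) == 51.47  # only conceptual failure
assert pct(21) == 30.88  # total failures
assert pct(15) == 22.05  # RSS failures
assert pct(9)  == 13.23  # linguistic failures
assert pct(14, total=21) == 66.66  # easily reformulated, as % of failures
```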


7.3.2.2. Analysis of AquaLog failures
The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions
All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog is trying to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion: limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands ("Name all ..."). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how-long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple per discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is in the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
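The query expansion described above essentially amounts to a small graph search over the ontology schema: when no direct relation links two query terms, look for a chain of relations through a mediating concept. A minimal sketch of the idea, with invented triple names (this is not AquaLog's actual RSS code):

```python
from collections import deque

# Hypothetical schema triples: no direct person--organization relation,
# but person--grant and grant--organization exist (the "IBM" example).
schema = [
    ("person", "has-grant", "grant"),
    ("grant", "funded-by", "organization"),
    ("person", "works-in", "unit"),
]

def relation_path(src, dst, max_hops=2):
    """Breadth-first search for a chain of relations linking two concepts."""
    frontier = deque([(src, [])])
    seen = {src}
    while frontier:
        node, path = frontier.popleft()
        if node == dst:
            return path
        if len(path) == max_hops:
            continue
        for s, r, o in schema:
            if s == node and o not in seen:
                seen.add(o)
                frontier.append((o, path + [(s, r, o)]))
    return None

print(relation_path("person", "organization"))
# -> [('person', 'has-grant', 'grant'), ('grant', 'funded-by', 'organization')]
```

The mediating concept ("grant" here, "mealcourse" in the wine ontology) surfaces as the intermediate node of the returned path.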

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is: how can we make these criteria as ontology independent as possible, and available to any application that may need them? An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

ssociation of meta-relations to domain relationsFor example if the question ldquowhat are the cool projects for a

hD studentrdquo is asked the system would not be able to solve theinguistic ambiguity of the words ldquocool projectrdquo The first timeome help from the user is needed who may select ldquoall projectshich belong to a research area in which there are many publica-

ionsrdquo A mapping is therefore created between ldquocool projectsrdquoresearch areardquo and ldquopublicationsrdquo so that the next time theearning Mechanism is called it will be able to recognize thispecific user jargon

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
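The community-dependent mappings described above can be pictured as a lookup keyed on both the jargon phrase and the asker's community. The data model below is invented for illustration and is not the actual Learning Mechanism:

```python
# Illustrative store of learned jargon mappings, keyed by community.
# The same phrase maps to different ontology terms per user community.
jargon = {
    ("phd-student", "cool projects"): ["project", "research-area", "publication"],
    ("professor",   "cool projects"): ["project", "funding-opportunity"],
}

def interpret(community, phrase):
    """Return the learned ontology terms, or None so the system can
    fall back to asking the user (and then store the new mapping)."""
    return jargon.get((community, phrase))

print(interpret("professor", "cool projects"))  # -> ['project', 'funding-opportunity']
```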

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries, identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers, and, as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
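The fallback just described (strip the unrecognized politeness wrapper, keep the content terms, then search the ontology for a path between them) can be sketched with a couple of regular expressions; the wrapper patterns and stop-word list below are invented for illustration, not AquaLog's actual grammars:

```python
import re

# Invented politeness/command wrappers; a real fallback would need a
# larger, configurable list.
WRAPPERS = [
    r"^could you please tell me\s+",
    r"^what is\s+",
    r"^print\s+",
]

def fallback_terms(question):
    """Strip a recognized wrapper, then keep the remaining content words
    as candidate terms for an ontology path search."""
    q = question.strip().rstrip("?").lower()
    for w in WRAPPERS:
        q = re.sub(w, "", q)
    stop = {"the", "of", "a", "an"}
    return [t for t in q.split() if t not in stop]

print(fallback_terms("Could you please tell me the capital of Italy?"))
# -> ['capital', 'italy']
```

All three phrasings of the capital question reduce to the same pair of terms, which a path search can then try to connect.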

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, like "who" is the same as "person", are specified in a XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "identifying the semantic type of the entity sought by the question"; (2) "determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) low-level IE is often a necessary component in handling many types of questions; (iii) a robust natural language shallow parser provides a structural basis for handling questions; (iv) high-level domain-independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected, or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
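The grouping of differently worded questions into one triple category can be illustrated with a minimal pattern-based classifier. The patterns and the category name below are our own stand-ins; AquaLog's real classification is richer and is driven by the triples themselves.

```python
import re

# Illustrative patterns only; not AquaLog's actual categories.
PATTERNS = [
    (re.compile(r"^(who|what|which|where)\b.*\b(is|are)\b", re.I), "wh-generic"),
    (re.compile(r"^(tell me|give me|name|list|show me)\b", re.I), "wh-generic"),
]

def triple_category(question):
    """Map a question to a triple category, ignoring surface phrasing."""
    for pattern, category in PATTERNS:
        if pattern.search(question):
            return category
    return "unclassified"

# Differently phrased requests for the same information share a category:
print(triple_category("Who are the researchers in the semantic web area?"))
print(triple_category("Tell me the researchers in the semantic web area"))
# both -> wh-generic
```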

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point; triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
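A minimal sketch of this layered strategy (label lookup, then lexically related words, then string metrics, and the user as a last resort) is shown below. The label list, synonym table and cutoff are invented for illustration, and stdlib helpers stand in for WordNet and the string-metric algorithms AquaLog actually uses.

```python
from difflib import get_close_matches

# Invented sample data, standing in for ontology labels and WordNet.
ONTOLOGY_LABELS = ["academic-staff-member", "phd-student", "project", "publication"]
SYNONYMS = {"academics": "academic-staff-member", "paper": "publication"}

def map_term(term):
    if term in ONTOLOGY_LABELS:              # 1. direct label match
        return term
    if term in SYNONYMS:                     # 2. lexically related word
        return SYNONYMS[term]
    close = get_close_matches(term, ONTOLOGY_LABELS, n=1, cutoff=0.6)
    if close:                                # 3. string-metric match
        return close[0]
    return None                              # 4. last resort: ask the user

print(map_term("paper"))         # -> publication
print(map_term("publications"))  # -> publication
```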

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites; the goal of systems such as START [31], REXTOR [32] and AnswerBus [54] is also to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?", for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.
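To make the comparison concrete, the two triple levels discussed in this section can be modelled roughly as follows. The class and field names are our own and mirror the Appendix A examples, not AquaLog's internal representation.

```python
from dataclasses import dataclass

@dataclass
class LinguisticTriple:
    """Triple obtained from the question by shallow parsing."""
    subject: str    # e.g. "phd students"
    relation: str   # e.g. "working"
    obj: str        # e.g. "buddyspace"
    category: str   # in AquaLog the triple category drives the processing

@dataclass
class OntologyTriple:
    """Ontology-compliant counterpart produced by the similarity services."""
    concept: str    # e.g. "phd-student"
    relation: str   # e.g. "has-project-member"
    instance: str   # e.g. "buddyspace"

# Appendix A example: "What are the phd students working for buddy space?"
lt = LinguisticTriple("phd students", "working", "buddyspace", "wh-generic")
ot = OntologyTriple("phd-student", "has-project-member", "buddyspace")
print(lt.relation, "->", ot.relation)
```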

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who" or "where", the answer type is a person or an organization. Categories can often be determined directly from a wh-word ("who", "when", "where"). In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved; it groups different questions that can be presented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area? | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space? | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉

What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Agents on the World Wide Web 5 (2007) 72ndash105 103

Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.

[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.

[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.

37] V Lopez E Motta V Uren PowerAqua fishing the semantic web inProceedings of the 3rd European Semantic Web Conference MonteNegro2006

38] V Lopez E Motta Ontology driven question answering in AquaLog inProceedings of the 9th International Conference on Applications of NaturalLanguage to Information Systems Manchester England 2004

39] P Martin DE Appelt BJ Grosz FCN Pereira TEAM an experimentaltransportable naturalndashlanguage interface IEEE Database Eng Bull 8 (3)(1985) 10ndash22

40] D Mc Guinness F van Harmelen OWL Web Ontology LanguageOverview W3C Recommendation 10 2004 httpwwww3orgTRowl-features

41] D Mc Guinness Question answering on the semantic web IEEE IntellSyst 19 (1) (2004)

s and

[[

[

[

[[

[

[

[

[[

V Lopez et al Web Semantics Science Service

42] TM Mitchell Machine Learning McGraw-Hill New York 199743] D Moldovan S Harabagiu M Pasca R Mihalcea R Goodrum R Girju

V Rus LASSO a tool for surfing the answer net in Proceedings of theText Retrieval Conference (TREC-8) November 1999

45] M Pasca S Harabagiu The informative role of Wordnet in open-domainquestion answering in 2nd Meeting of the North American Chapter of theAssociation for Computational Linguistics (NAACL) 2001

46] AM Popescu O Etzioni HA Kautz Towards a theory of natural lan-guage interfaces to databases in Proceedings of the 2003 InternationalConference on Intelligent User Interfaces Miami FL USA January

12ndash15 2003 pp 149ndash157

47] RDF httpwwww3orgRDF48] K Srihari W Li X Li Information extraction supported question-

answering in T Strzalkowski S Harabagiu (Eds) Advances inOpen-Domain Question Answering Kluwer Academic Publishers 2004

[

Agents on the World Wide Web 5 (2007) 72ndash105 105

49] V Tablan D Maynard K Bontcheva GATEmdashA Concise User GuideUniversity of Sheffield UK httpgateacuk

50] W3C OWL Web Ontology Language Guide httpwwww3orgTR2003CR-owl-guide-0030818

51] R Waldinger DE Appelt et al Deductive question answering frommultiple resources in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

52] WebOnto project httpplainmooropenacukwebonto53] M Wu X Zheng M Duan T Liu T Strzalkowski Question answering

by pattern matching web-proofing semantic form proofing NIST Special

Publication in The 12th Text Retrieval Conference (TREC) 2003 pp255ndash500

54] Z Zheng The answer bus question answering system in Proceedings ofthe Human Language Technology Conference (HLT2002) San Diego CAMarch 24ndash27 2002

  • AquaLog An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action illustrative example
    • AquaLog architecture
      • AquaLog triple approach
        • Linguistic component from questions to query triples
          • Classification of questions
            • Linguistic procedure for basic queries
            • Linguistic procedure for basic queries with clauses (three-terms queries)
            • Linguistic procedure for combination of queries
              • The use of JAPE grammars in AquaLog illustrative example
                • Relation similarity service from triples to answers
                  • RSS procedure for basic queries
                  • RSS procedure for basic queries with clauses (three-term queries)
                  • RSS procedure for combination of queries
                  • Class similarity service
                  • Answer engine
                    • Learning mechanism for relations
                      • Architecture
                      • Context definition
                      • Context generalization
                      • User profiles and communities
                        • Evaluation
                          • Correctness of an answer
                          • Query answering ability
                            • Conclusions and discussion
                              • Results
                              • Analysis of AquaLog failures
                              • Reformulation of questions
                              • Performance
                                  • Portability across ontologies
                                    • AquaLog assumptions over the ontology
                                    • Conclusions and discussion
                                      • Results
                                      • Analysis of AquaLog failures
                                      • Similarity failures and reformulation of questions
                                        • Discussion limitations and future lines
                                          • Question types
                                          • Question complexity
                                          • Linguistic ambiguities
                                          • Reasoning mechanisms and services
                                          • New research lines
                                            • Related work
                                              • Closed-domain natural language interfaces
                                              • Open-domain QA systems
                                              • Open-domain QA systems using triple representations
                                              • Ontologies in question answering
                                              • Conclusions of existing approaches
                                                • Conclusions
                                                • Acknowledgements
                                                • Examples of NL queries and equivalent triples
                                                • References
Page 14: AquaLog: An ontology-driven question answering system for ...technologies.kmi.open.ac.uk/aqualog/aqualog_journal.pdfAquaLog is a portable question-answering system which takes queries

s and

sIlatps

Ftihss〈Ha

filqbspsowpeiwCt

5

ottNMdebt

ldquoldquoA

aldquobh

rtRm

stbhgoWttcastotttoftoo

(Astp

5

wwaaoei

V Lopez et al Web Semantics Science Service

onrdquo the sentence is truly ambiguous even for a human reader7

n fact it can be understood as a combination of the resultingists of the two questions ldquowhich academic works with peterrdquond ldquowhich academic has an interest in semantic webrdquo or ashe relative query ldquowhich academic works with peter where theeter we are looking for has an interest in the semantic webrdquo Inuch cases user feedback is always required

To interpret a query different features play different rolesor example in a query like ldquowhich researchers wrote publica-

ions related to social aspectsrdquo the second term ldquopublicationrdquos mapped into a concept in our KB therefore the adoptedeuristic is that the concept has to be instantiated using theecond part of the sentence ldquorelated to social aspectsrdquo In conclu-ion the answer will be a combination of the generated triplesresearchers wrote publications〉 〈publications related akt〉owever frequently this heuristic is not enough to disambiguate

nd other features have to be usedIt is also important to underline the role that a triple elementrsquos

eatures can play For instance an important feature of a relations whether it contains the main verb of the sentence namely theexical verb We can see this if we consider the following twoueries (1) ldquowhich academics work in the akt project sponsoredy epsrcrdquo (2) ldquowhich academics working in the akt project areponsored by epsrcrdquo In both examples the second term ldquoaktrojectrdquo is an instance so the problem becomes to identify if theecond clause ldquosponsored by epsrcrdquo is related to ldquoakt projectrdquor to ldquoacademicsrdquo In the first example the main verb is ldquoworkrdquohich separates the nominal group ldquowhich academicsrdquo and theredicate ldquoakt project sponsored by epsrcrdquo thus ldquosponsored bypsrcrdquo is related to ldquoakt projectrdquo In the second the main verbs ldquoare sponsoredrdquo whose nominal group is ldquowhich academicsorking in aktrdquo and whose predicate is ldquosponsored by epsrcrdquoonsequently ldquosponsored by epsrcrdquo is related to the subject of

he previous clause which is ldquoacademicsrdquo

4 Class similarity service

String algorithms are used to find mappings between ontol-gy term names and pretty names8 and any of the terms insidehe linguistic triples (or its WordNet synonyms) obtained fromhe userrsquos query They are based on String Distance Metrics forame-Matching Tasks using an open-source API from Carnegieellon University [13] This comprises a number of string

istance metrics proposed by different communities includingdit-distance metrics fast heuristic string comparators token-ased distance metrics and hybrid methods AquaLog combineshe use of these metrics (it mainly uses the Jaro-based met-

7 Note that there will not be ambiguity if the user reformulates the question aswhich academic works with Peter and has an interest in semantic webrdquo wherehas an interest in semantic webrdquo would be correctly attached to ldquoacademicrdquo byquaLog8 Naming conventions vary depending on the KR used to represent the KBnd may even change with ontologiesmdasheg an ontology can have slots such asvariant namerdquo or ldquopretty namerdquo AquaLog deals with differences between KRsy means of the plug-in mechanism Differences between ontologies need to beandled by specifying this information in a configuration file

ti

(

(

Agents on the World Wide Web 5 (2007) 72ndash105 85

ics) with thresholds Thresholds are set up depending on theask of looking for concepts names relations or instances Seeef [13] for a experimental comparison between the differentetricsHowever the use of string metrics and open-domain lexico-

emantic resources [45] such as WordNet to map the genericerm of the linguistic triple into a term in the ontology may note enough String algorithms fail when the terms to compareave dissimilar labels (ie ldquoresearcherrdquo and ldquoacademicrdquo) andeneric thesaurus like WordNet may not include all the syn-nyms and is limited in the use of nominal compounds (ieordNet contains the term ldquomunicipalityrdquo but not an ontology

erm like ldquomunicipal-unitrdquo) Therefore an additional combina-ion of methods may be used in order to obtain the possibleandidates in the ontology For instance a lexicon can be gener-ted manually or can be built through a learning mechanism (aimilar simplified approach to the learning mechanism for rela-ions explained in Section 6) The only requirement to obtainntology classes that can potentially map an unmapped linguis-ic term is the availability of the ontology mapping for the othererm of the triple In this way through the ontology relationshipshat are valid for the ontology mapped term we can identify a setf possible candidate classes that can complete the triple Usereedback is required to indicate if one of the candidate terms ishe one we are looking for so that AquaLog through the usef the learning mechanism will be able to learn it for futureccasions

The WordNet component is using the Java implementationJWNL) of WordNet 171 [29] The next implementation ofquaLog will be coupled with the new version on WordNet

o it can make use of Hyponyms and Hypernyms Currentlyhe component of AquaLog which deals with WordNet does noterform any sense disambiguation

5 Answer engine

The Answer Engine is a component of the RSS It is invokedhen the Onto-Triple is completed It contains the methodshich take as an input the Onto-Triple and infer the required

nswer to the userrsquos queries In order to provide a coherentnswer the category of each triple tells the answer engine notnly about how the triple must be resolved (or what answer toxpect) but also how triples can be linked with each other Fornstance AquaLog provides three mechanisms (depending onhe triple categories) for operationally integrating the triplersquosnformation to generate an answer These mechanisms are

1) Andor linking eg in the query ldquowho has an interest inontologies or in knowledge reuserdquo the result will be afusion of the instances of people who have an interest inontologies and the people who are interested in semanticweb

2) Conditional link to a term for example in ldquowhich KMi

academics work in the akt project sponsored by eprscrdquothe second triple 〈akt project sponsored eprsc〉 must beresolved and the instance representing the ldquoakt project spon-sored by eprscrdquo identified to get the list of academics

86 V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105

appi

(

tvt

6

dmnoWgcfTttpqowl

iodl

6

dTtmu

ttbdvtcdv

bthat are consistent with the observed training examples In ourcase the training examples are given by the user-refinement (dis-ambiguation) of the query while the hypotheses are structuredin terms of the context parameters Moreover a version space is

Fig 16 Example of jargon m

required for the first triple 〈KMi academics work aktproject〉

3) Conditional link to a triple for instance in ldquoWhat is thehomepage of the researchers working on the semantic webrdquothe second triple 〈researchers working semantic web〉 mustbe resolved and the list of researchers obtained prior togenerating an answer for the first triple 〈 homepageresearchers〉

The solution is converted into English using simple templateechniques before presenting them to the user Future AquaLogersions will be integrated with a natural language generationool

Learning mechanism for relations

Since the universe of discourse we are working within isetermined by and limited to the particular ontology used nor-ally there will be a number of discrepancies between the

atural language questions prompted by the user and the setf terms recognized in the ontology External resources likeordNet generally help in making sense of unknown terms

iving a set of synonyms and semantically related words whichould be detected in the knowledge base However in quite aew cases the RSS fails in the production of a genuine Onto-riple because of user-specific jargon found in the linguistic

riple In such a case it becomes necessary to learn the newerms employed by the user and disambiguate them in order toroduce an adequate mapping of the classes of the ontology A

uite common and highly generic example in our departmentalntology is the relation works-for which users normally relateith a number of different expressions is working works col-

aborates is involved (see Fig 16) In all these cases the userwt

ng through user-interaction

s asked to disambiguate the relation (choosing from the set ofntology relations consistent with the questionrsquos arguments) andecide whether to learn a new mapping between hisher natural-anguage-universe and the reduced ontology-language-universe

1 Architecture

The learning mechanism (LM) in AquaLog consists of twoifferent methods the learning and the matching (see Fig 17)he latter is called whenever the RSS cannot relate a linguistic

riple to the ontology or the knowledge base while the for-er is always called after the user manually disambiguates an

nrecognized term (and this substitution gives a positive result)When a new item is learned it is recorded in a database

ogether with the relation it refers to and a series of constraintshat will determine its reuse within similar contexts As wille explained below the notion of context is crucial in order toeliver a feasible matching of the recorded words In the currentersion the context is defined by the arguments of the questionhe name of the ontology and the user information This set ofharacteristics constitutes a representation of the context andefines a structured space of hypothesis analogue9 to that of aersion space

A version space is an inductive learning technique proposedy Mitchell [42] in order to represent the subset of all hypotheses

9 Although making use of a concept borrowed from machine learning studiese want to stress the fact that the learning mechanism is not a machine learning

echnique in the classical sense

V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105 87

mec

niomit6iof

tswtsdebmd

wiiptsfooei

os

hfoInR

6

du

wtldquowtldquots

ldquortte

pidr

Fig 17 The learning

ormally represented just by its lower and upper bounds (max-mally general and maximally specific hypothesis sets) insteadf listing all consistent hypotheses In the case of our learningechanism these bounds are partially inferred through a general-

zation process (Section 63) that will in future be applicable alsoo the user-related and community derived knowledge (Section4) The creation of specific algorithms for combining the lex-cal and the community dimensions will increase the precisionf the context-definition and provide additional personalizationeatures

Thus when a question with a similar context is prompted ifhe RSS cannot disambiguate the relation-name the database iscanned for some matching results Subsequently these resultsill be context-proved in order to check their consistency with

he stored version spaces By tightening and loosening the con-traints of the version space the learning mechanism is able toetermine when to propose a substitution and when not to Forxample the user-constraint is a feature that is often bypassedecause we are inside a generic user session or because weight want to have all the results of all the users from a single

atabase queryBefore the matching method we are always in a situation

here the onto-triple is incomplete the relation is unknown or its a concept If the new word is found in the database the contexts checked to see if it is consistent with what has been recordedreviously If this gives a positive result we can have a valid onto-riple substitution that triggers the inference engine (this lattercans the knowledge base for results) instead if the matchingails a user disambiguation is needed in order to complete thento-triple In this case before letting the inference engine workut the results the context is drawn from the particular questionntered and it is learned together with the relation and the other

nformation in the version space

Of course the matching methodrsquos movement through thentology is opposite to the learning methodrsquos one The lattertarting from the arguments tries to go up until it reaches the

tsid

hanism architecture

ighest valid classes possible (GetContext method) while theormer takes the two arguments and checks if they are subclassesf what has been stored in the database (CheckContext method)t is also important to notice that the Learning Mechanism doesot have a question classification on its own but relies on theSS classification

2 Context definition

As said above the notion of context is fundamental in order toeliver a feasible substitution service In fact two people couldse the same jargon but mean different things

For example letrsquos consider the question ldquoWho collaboratesith the knowledge media instituterdquo (Fig 16) and assume that

he RSS is not able to solve the linguistic ambiguity of the wordcollaboraterdquo The first time some help is needed from the userho selects ldquohas-affiliation-to-unitrdquo from a list of possible rela-

ions in the ontology A mapping is therefore created betweencollaboraterdquo and ldquohas-affiliation-to-unitrdquo so that the next timehe learning mechanism is called it will be able to recognize thispecific user jargon

Letrsquos imagine a user who asks the system the same questionwho collaborates with the knowledge media instituterdquo but iseferring to other research labs or academic units involved withhe knowledge media institute In fact if asked to choose fromhe list of possible ontology relations heshe would possiblynter ldquoworks-in-the-same-projectrdquo

The problem therefore is to maintain the two separated map-ings while still providing some kind of generalization Thiss achieved through the definition of the question context asetermined by its coordinates in the ontology In fact since theeferring (and pluggable) ontology is our universe of discourse

he context must be found within this universe In particularince we are dealing with triples and in the triple what we learns usually the relation (that is the middle item) the context iselimited by the two arguments of the triple In the ontology

8 s and Agents on the World Wide Web 5 (2007) 72ndash105

tt

kldquoatmilot

6

leEat

rshcattgmtowa

ltapdbtoic

tmbasiaa

t

agm

dond

immrwtcb

raiqrtaqtmrv

6

sawhsa

8 V Lopez et al Web Semantics Science Service

hese correspond to two classes or two instances connected byhe relation

Therefore in the question ldquowho collaborates with thenowledge media instituterdquo the context of the mapping fromcollaboratesrdquo to ldquohas-affiliation-to-unitrdquo is given by the tworguments ldquopersonrdquo (in fact in the ontology ldquowhordquo is alwaysranslated into ldquopersonrdquo or ldquoorganizationrdquo) and ldquoknowledge

edia instituterdquo What is stored in the database for future reuses the new word (which is also the key field in order to access theexicon during the matching method) its mapping in the ontol-gy the two context-arguments the name of the ontology andhe user details

3 Context generalization

This kind of recorded context is quite specific and does notet other questions benefit from the same learned mapping Forxample if afterwards we asked ldquowho collaborates with thedinburgh department of informaticsrdquo we would not get anppropriate matching even if the mapping made sense again inhis case

In order to generalize these results the strategy adopted is toecord the most generic classes in the ontology which corre-pond to the two triplersquos arguments and at the same time canandle the same relation In our case we would store the con-epts ldquopeoplerdquo and ldquoorganization-unitrdquo This is achieved throughbacktracking algorithm in the Learning Mechanism that takes

he relation identifies its type (the type already corresponds tohe highest possible class of one argument by definition) andoes through all the connected superclasses of the other argu-ent while checking if they can handle that same relation with

he given type Thus only the highest suitable classes of an ontol-gyrsquos branch are kept and all the questions similar to the onese have seen will fall within the same set since their arguments

re subclasses or instances of the same conceptsIf we go back to the first example presented (ldquowho col-

aborates with the knowledge media instituterdquo) we can seehat the difference in meaning between ldquocollaboraterdquo rarr ldquohas-ffiliation-to-unitrdquo and ldquocollaboraterdquo rarr ldquoworks-in-the-same-rojectrdquo is preserved because the two mappings entail twoifferent contexts Namely in the first case the context is giveny 〈people〉 and 〈organization-unit〉 while in the second casehe context will be 〈organization〉 and 〈organization-unit〉 Anyther matching could not mistake the two since what is learneds abstract but still specific enough to rule out the differentases

A similar example can be found in the question ldquowho arehe researchers working in aktrdquo (see Fig 18) the context of the

apping from ldquoworkingrdquo to ldquohas-research-memberrdquo is giveny the highest ontology classes which correspond to the tworguments ldquoresearcherrdquo and ldquoaktrdquo as far as they can handle theame relation What is stored in the database for a future reuses the new word its mapping in the ontology the two context-

rguments (in our case we would store the concepts ldquopeoplerdquond ldquoprojectrdquo) the name of the ontology and the user details

Other questions which have a similar context benefit fromhe same learned mapping For example if afterwards we

esmt

Fig 18 Users disambiguation example

sked ldquowho are the people working in buddyspacerdquo we wouldet an appropriate matching from ldquoworkingrdquo to ldquohas-research-emberrdquoIt is also important to notice that the Learning Mechanism

oes not have a question classification on its own but it reliesn the RSS classification Namely the question is firstly recog-ized and then worked out differently depending on the specificifferences

For example we can have a generic-question learning (ldquowhos involved in the AKT projectrdquo) where the first generic argu-

ent (ldquoWhowhichrdquo) needs more processing since it can beapped to both a person or an organization or an unknown-

elation learning (ldquois there any project about the semanticebrdquo) where the user selection and the context are mapped

o the apparent lack of a relation in the linguistic triple Alsoombinations of questions are taken into account since they cane decomposed into a series of basic queries

An interesting case is when the relation in the linguistic tripleelation is not found in the ontology and is then recognized as

concept In this case the learning mechanism performs anteration of the matching in order to capture the real form of theuestion and queries the database as it would for an unknownelation type of question (see Fig 19) The data that are stored inhe database during the learning of a concept-case are the sames they would have been for a normal unknown-relation type ofuestion plus an additional field that serves as a reminder ofhe fact that we are within this exception Therefore during the

atching the database is queried as it would be for an unknownelation type plus the constraint that is a concept In this way theersion space is reduced only to the cases we are interested in

4 User profiles and communities

Another important feature of the learning mechanism is itsupport for a community of users As said above the user detailsre maintained within the version space and can be consideredhen interpreting a query AquaLog allows the user to enterisher personal information and thus to log in and start a ses-ion where all the actions performed on the learned lexicon tablere also strictly connected to hisher profile (see Fig 20) For

xample during a specific user-session it is possible to deleteome previously recorded mappings an action that is not per-itted to the generic user This latter has the roughest access

o the learned material having no constraints on the user field

V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105 89

hanis

tl

aibapsubiacs

7

kvwtRtbr

Fig 19 Learning mec

he database query will return many more mappings and quiteikely also meanings that are not desired

Future work on the learning mechanism will investigate theugmentation of the user-profile In fact through a specific aux-liary ontology that describes a series of userrsquos profiles it woulde possible to infer connections between the type of mappingnd the type of user Namely it would be possible to correlate aarticular jargon to a set of users Moreover through a reasoningervice this correlation could become dynamic being contin-ally extended or diminished consistently with the relationsetween userrsquos choices and userrsquos information For example

f the LM detects that a number of registered users all char-cterized as PhD students keep employing the same jargon itould extend the same mappings to all the other registered PhDtudents

o(A

Fig 20 Architecture to suppo

m matching example

Evaluation

AquaLog allows users who have a question in mind andnow something about the domain to query the semantic markupiewed as a knowledge base The aim is to provide a systemhich does not require users to learn specialized vocabulary or

o know the structure of the knowledge base but as pointed out inef [14] though they have to have some idea of the contents of

he domain they may have some misconceptions Therefore toe realistic some process of familiarization at least is normallyequired

A full evaluation of AquaLog requires both an evaluationf its query answering ability and the overall user experienceSection 72) Moreover because one of our key aims is to makequaLog an interface for the semantic web the portability

rt a usersrsquo community

9 s and

a(

filto

bull

bull

bull

bull

bull

MattIo

7

gtqasLtc

ofcttasp

mt

co

7

bkquTcq

iqLsebitwst

bfttttoorrtt2ospqq

77iaconsidering that almost no linguistic restriction was imposed onthe questions

A closer analysis shows that 7 of the 69 queries 1014 of

0 V Lopez et al Web Semantics Science Service

cross ontologies will also have to be addressed formallySection 73)

We analyzed the failures and divided them into the followingve categories (note that a query may fail at several different

evels eg linguistic failure and service failure though concep-ual failure is only computed when there is not any other kindf failure)

Linguistic failure This occurs when the NLP component isunable to generate the intermediate representation (but thequestion can usually be reformulated and answered)Data model failure This occurs when the NL query is simplytoo complicated for the intermediate representationRSS failure This occurs when the Relation Similarity Serviceof AquaLog is unable to map an intermediate representationto the correct ontology-compliant logical expressionConceptual failure This occurs when the ontology does notcover the query eg lack of an appropriate ontology relationor term to map with or if the ontology is wrongly populatedin a way that hampers the mapping (eg instances that shouldbe classes)Service failure In the context of the semantic web we believethat these failures are less to do with shortcomings of theontology than with the lack of appropriate services definedover the ontology

The improvement achieved due to the use of the Learningechanism (LM) in reducing the number of user interactions is

lso evaluated in Section 72 First AquaLog is evaluated withhe LM deactivated For a second test the LM is activated athe beginning of the experiment in which the database is emptyn the third test a second iteration is performed over the resultsbtained from the second iteration

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.
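The intermediate form implied here, predicates over ontology classes, relations and instances, can be pictured with a minimal triple structure. The names below are illustrative only (the actual intermediate representation is defined earlier in the paper):

```python
from typing import NamedTuple, Optional

class QueryTriple(NamedTuple):
    """One relationship extracted from a NL query."""
    subject: str               # maps to a class or instance in the ontology
    relation: Optional[str]    # maps to an ontology relation (may be implicit)
    obj: str

# After vocabulary alignment, a question in the user's jargon becomes an
# ontology-compliant triple, e.g. (hypothetical labels):
aligned = QueryTriple("researcher", "have-interest", "semantic-web")
print(aligned.relation)  # have-interest
```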

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal.11 Therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no "quality control" was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system. Each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.
A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk

V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (where a failure to complete the Onto-Triple is due to an ontology conceptual failure):
- Total number of successful queries: 40 of 69 (58%); of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
- Total number of successful queries without conceptual failure: 33 of 69 (47.82%)
- Total number of queries with only conceptual failure: 7 of 69 (10.14%)

AquaLog failures (the same query can fail for more than one reason):
- Total number of failures: 29 of 69 (42%)
- RSS failure: 8 of 69 (11.59%)
- Service failure: 10 of 69 (14.49%)
- Data model failure: 0 of 69 (0%)
- Linguistic failure: 16 of 69 (23.18%)

AquaLog easily reformulated questions:
- Total number of queries easily reformulated: 19 of 29 (65.5% of the failures)
- With conceptual failure: 7 of 19 (36.84%)
- No conceptual failure, but not populated: 6 of 19 (31.57%)

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, as when tokens are not recognized as a verb by GATE, e.g., "present results", and therefore relations are annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by meaning; e.g., for "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, and in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab; people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following keywords that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures.


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM    | LM first iteration (from scratch) | LM second iteration
0 iterations with user       | 37.77% (17 of 45) | 40% (18 of 45)                    | 64.44% (29 of 45)
1 iteration with user        | 35.55% (16 of 45) | 35.55% (16 of 45)                 | 35.55% (16 of 45)
2 iterations with user       | 20% (9 of 45)     | 20% (9 of 45)                     | 0% (0 of 45)
3 iterations with user       | 6.66% (3 of 45)   | 4.44% (2 of 45)                   | 0% (0 of 45)
Total                        | (43 iterations)   | (40 iterations)                   | (16 iterations)



Three kinds of services have been identified: ranking, similarity/comparative, and proportions/percentages; these remain a line of work for future versions of the system.
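The purely label-based relation selection behind the RSS failures described above can be sketched roughly as follows. This is a minimal illustration, using Python's difflib as a stand-in for the paper's string metric algorithms; the real RSS also expands labels with WordNet synonyms before matching:

```python
import difflib

# Candidate relation labels found in the ontology (taken from the
# failure example in the text).
CANDIDATES = ["works-for", "have-interest", "employment-contract-type"]

def best_relation(query_relation, candidates, cutoff=0.4):
    """Pick the candidate whose label is syntactically closest to the
    user's wording; no meaning is taken into account."""
    scored = [(difflib.SequenceMatcher(None, query_relation, c).ratio(), c)
              for c in candidates]
    score, label = max(scored)
    return label if score >= cutoff else None

# Syntactic matching reproduces the reported failure mode: "contact" is
# closest to "employment-contract-type", which is not what the user meant.
print(best_relation("contact", CANDIDATES))  # employment-contract-type
```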

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".
To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.



Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
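A rough sketch of how such a learned lexicon might behave is given below. The structure is hypothetical (the actual mechanism and notion of context are defined in Section 6 of the paper): mappings are stored with the context in which they were learned, here simplified to the set of ontology classes involved, and an overlapping context licenses reuse for similar but not identical queries:

```python
# Learned mappings: (user wording, context) -> ontology relation.
learned = {}

def learn(wording, context, relation):
    learned[(wording, frozenset(context))] = relation

def recall(wording, context):
    """Try the exact context first; otherwise accept any learned mapping
    for the same wording whose context overlaps, so a similar query can
    be disambiguated without asking the user again."""
    exact = learned.get((wording, frozenset(context)))
    if exact:
        return exact
    for (w, ctx), rel in learned.items():
        if w == wording and ctx & frozenset(context):
            return rel
    return None

learn("works", {"person", "project"}, "works-on")
print(recall("works", {"person", "research-area"}))  # works-on
```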

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location, and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto




the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.
The questions used in this evaluation were based on the

Wine Society "Wine and Food, a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology
In this section we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog

13 http://www.thewinesociety.com


plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
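This answerability condition, a direct path of two relations or less between the mapped query terms, can be checked mechanically. A minimal sketch over toy triples, using instance names in the spirit of Table 3:

```python
from collections import deque

# A tiny directed labeled graph of (subject, relation, object) triples.
TRIPLES = [
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def neighbours(node):
    """Nodes reachable from `node` through any labeled relation
    (edges traversed in either direction for pathfinding)."""
    return [(r, o) for s, r, o in TRIPLES if s == node] + \
           [(r, s) for s, r, o in TRIPLES if o == node]

def short_path_exists(start, goal, max_relations=2):
    """AquaLog's answerability condition: a path of two relations or
    less between the mapped query terms."""
    frontier = deque([(start, 0)])
    seen = {start}
    while frontier:
        node, depth = frontier.popleft()
        if node == goal:
            return True
        if depth == max_relations:
            continue
        for _, nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

# "fowl" and "riesling" are linked through the mediating course instance:
print(short_path_exists("fowl", "riesling"))  # True
```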

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way, the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore, we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink Riesling)))


Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion
7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.
Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (where a failure to complete the Onto-Triple is due to an ontology conceptual failure):
- Total number of successful queries: 47 of 68 (69.11%)
- Total number of successful queries without conceptual failure: 12 of 68 (17.64%)
- Total number of queries with only conceptual failure: 35 of 68 (51.47%)

AquaLog failures (the same query can fail for more than one reason):
- Total number of failures: 21 of 68 (30.88%)
- RSS failure: 15 of 68 (22%)
- Service failure: 0 of 68 (0%)
- Data model failure: 0 of 68 (0%)
- Linguistic failure: 9 of 68 (13.23%)

AquaLog easily reformulated questions:
- Total number of queries easily reformulated: 14 of 21 (66.66% of the failures)
- With conceptual failure: 9 of 14 (64.28%)

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system, and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations, discussed in Section 8. This provides us with valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". Also, for instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult, because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world, as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent; compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.
There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, the genitive ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to implicitly denote entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple; questions containing the phrase “who was the author” are converted into “who wrote”. That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like “French researchers” a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is in the question “which researchers have publications in KMi related to social aspects?”, where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation “have publications” refers to “publication has author” in the ontology. Therefore, we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question “who collaborates with IBM?”, where there is not a “collaboration” relation between a person and an organization, but there is a relation between a “person” and a “grant”, and a “grant” with an “organization”. The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term “grant” (see also the “meal course” example in Section 7.3 using the wine ontology).
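The expansion needed for “who collaborates with IBM?” amounts to finding an indirect chain of relations in the ontology graph. A schematic breadth-first search over a toy ontology (the relation names and data structure are illustrative, not AquaLog's internals) shows the idea:

```python
from collections import deque

# Toy ontology: each concept maps to its outgoing (relation, target) pairs.
ONTOLOGY = {
    "person": [("has-grant", "grant")],
    "grant": [("funded-by", "organization")],
    "organization": [],
}

def find_path(start, goal, ontology):
    """Breadth-first search for a chain of relations linking two concepts."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, target in ontology.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, path + [(node, relation, target)]))
    return None  # no chain of relations bridges the two terms

# "who collaborates with IBM?": no direct relation, but a two-step chain exists.
path = find_path("person", "organization", ONTOLOGY)
# path == [("person", "has-grant", "grant"), ("grant", "funded-by", "organization")]
```

The returned chain corresponds to the extra triples that would have to be created dynamically.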

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, one of the terms in the query may not be represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question


“who is the director in KMi?”. In the ontology it may be represented as “Enrico Motta – has job title – KMi director”, where “KMi director” is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to “what is the job title of Enrico?”, so we would get “KMi-director” as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., “semantic web papers”, “research department”). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a “large company” may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
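A crude first pass at such a mechanism, assuming tokens already carry part-of-speech tags, might simply flag adjacent noun-noun or adjective-noun pairs with no intervening preposition. This is a sketch with our own names, not a description of AquaLog's implementation:

```python
PREPOSITIONS = {"of", "in", "on", "by", "for", "with", "at", "from"}

def compound_candidates(tagged_tokens):
    """Flag adjacent noun-noun or adjective-noun pairs not separated by a
    preposition; each pair may hide an implicit relation ("research department")."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if w1.lower() in PREPOSITIONS or w2.lower() in PREPOSITIONS:
            continue
        if t1 in ("NOUN", "ADJ") and t2 == "NOUN":
            pairs.append((w1, w2))
    return pairs

# "French researchers" yields one candidate; "papers of KMi" yields none,
# because the preposition makes the relation explicit.
tags = [("French", "ADJ"), ("researchers", "NOUN")]
assert compound_candidates(tags) == [("French", "researchers")]
```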

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example “list all applicants who live in California and Arizona”. In this case the word “and” can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for “and” to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences “which academic works with Peter who has interests in the semantic web?” and “which academics work with Peter who has interests in the semantic web?”. The former example is truly ambiguous: either “Peter” or the “academic” could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb “has” is in the third person singular; therefore, the second part, “has interests in the semantic web”, must be linked to “Peter” for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
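The person-and-number cue described above can be sketched as a small rule. The morphological test below (verbs ending in “s” treated as third person singular) is a deliberately crude stand-in for a real tagger, and the function and its names are ours, not AquaLog's:

```python
def attach_clause(subject, subject_is_plural, antecedent, clause_verb):
    """Decide whether a 'who ...' clause modifies the question subject or the
    nearer antecedent, using number agreement of the clause verb. A
    third-person-singular verb (crudely: ends in 's', e.g. 'has') can only
    attach to a singular head."""
    verb_is_singular = clause_verb.endswith("s") and not clause_verb.endswith("ss")
    if verb_is_singular and subject_is_plural:
        return antecedent  # "which academics work with Peter who has ..." -> Peter
    if not verb_is_singular and not subject_is_plural:
        return antecedent  # a plural clause verb cannot modify a singular subject
    return None            # genuinely ambiguous: ask the user

# Plural subject + singular clause verb: attachment is decidable.
assert attach_clause("academics", True, "Peter", "has") == "Peter"
# Singular subject + singular clause verb: truly ambiguous, as in the paper.
assert attach_clause("academic", False, "Peter", "has") is None
```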


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like “what are the most successful projects?” or “what are the five key projects in KMi?”. To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question “what are the cool projects for a PhD student?” is asked, the system would not be able to solve the linguistic ambiguity of the words “cool project”. The first time, some help from the user is needed, who may select “all projects which belong to a research area in which there are many publications”. A mapping is therefore created between “cool projects”, “research area” and “publications”, so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: “what are the cool projects for a professor?”. What he means by saying “cool projects” is “funding opportunity”. Therefore, the two mappings entail two different contexts, depending on the user’s community.
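One way to scope such learned mappings by community is a lookup table keyed by (community, phrase). This is our own minimal design sketch, not AquaLog's actual learning mechanism:

```python
# Jargon substitutions learned per user community (sketch, our names).
JARGON = {}

def learn(community, phrase, substitution):
    """Record a community-specific interpretation of a jargon phrase."""
    JARGON[(community, phrase)] = substitution

def expand(community, phrase):
    """Return the learned substitution for this community, if any."""
    return JARGON.get((community, phrase))

# The same phrase maps differently depending on the asking community.
learn("phd-students", "cool projects",
      "projects which belong to a research area with many publications")
learn("professors", "cool projects", "funding opportunities")

assert expand("professors", "cool projects") == "funding opportunities"
assert expand("visitors", "cool projects") is None  # unknown community: ask the user
```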

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about “coolness” in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog’s linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and “openness” is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories, generating a complete ontological translation for the user queries, and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to


what extent AquaLog’s novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, and (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user’s request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle “what is the capital of Italy?”, “print the capital of Italy”, and “Could you please tell me the capital of Italy?”. The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me …”, instead of giving a failure it could get the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.
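The pattern rule from Ref. [3], together with the triple-based fallback the authors envisage, can be caricatured as follows. The country table and function are toy illustrations, not part of either system:

```python
import re

# Shallow pattern rule from Ref. [3]: "capital" followed by a country name.
COUNTRIES = {"Italy", "France", "Spain"}
CAPITALS = {"Italy": "Rome", "France": "Paris", "Spain": "Madrid"}

def answer(question):
    match = re.search(r"capital\s+(?:of\s+)?(\w+)", question)
    if match and match.group(1) in COUNTRIES:
        return CAPITALS[match.group(1)]
    # Fallback sketch: instead of failing outright, extract the known terms
    # and form a triple with which to search the ontology for a path.
    terms = [w for w in re.findall(r"\w+", question)
             if w in COUNTRIES or w.lower() == "capital"]
    return ("triple", tuple(terms)) if terms else None

# The same rule handles several surface formulations.
assert answer("what is the capital of Italy?") == "Rome"
assert answer("Could you please tell me the capital of Italy?") == "Rome"
```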

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

between the user’s question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN’s PARLANCE, IBM’s LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user’s question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user’s query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like “who” is the same as “person”, are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user’s query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE because the word “neighborhood” is unknown. However, AquaLog would not necessarily know the term “neighborhood”, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “Identifying the semantic type of the entity sought by the question”; (2) “Determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, …); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week”, it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) low-level IE is often a necessary component in handling many types of questions; (iii) a robust natural language shallow parser provides a structural basis for handling questions; (iv) high-level domain independent IE is expected to bring about a breakthrough in QA.”

Point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.’s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by “what is”, “who”, “where”, “tell me”, etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question “what do penguins eat?” as food, because “it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding”. All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that “in the Semantic Web it is impossible to control what kind of data is available about any given object […]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label”. Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the

search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point: triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

“the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW […]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches.”

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
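The matching cascade just described, exact label first, then lexically related words, then user interaction, might be sketched like this. The labels and synonym sets are toy data, and the function is our own illustration, not AquaLog's actual similarity service:

```python
def map_term(term, labels, synonyms):
    """Map a user term to an ontology node: exact label match first, then
    lexically related words; multiple survivors mean the user is asked."""
    candidates = [node for node, label in labels.items()
                  if label.lower() == term.lower()]
    if not candidates:
        related = synonyms.get(term.lower(), set())
        candidates = [node for node, label in labels.items()
                      if label.lower() in related]
    if len(candidates) == 1:
        return candidates[0]
    return ("ask-user", candidates)  # ambiguity resolved interactively

LABELS = {"kmi:Person": "person", "kmi:Organization": "organization"}
SYNONYMS = {"who": {"person", "organization"}}

assert map_term("Person", LABELS, SYNONYMS) == "kmi:Person"
# "who" is lexically related to both classes, so the user must choose.
assert map_term("who", LABELS, SYNONYMS)[0] == "ask-user"
```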

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the


web, but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog’s relational data model (triple-based) is somehow similar to the approach adopted by START, called “object-property-value”. The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], “what languages are spoken in Guernsey?”: for START, the property is “languages” between the Object “Guernsey” and the Value “French”; for AquaLog, it will be translated into a relation “are spoken” between a term “language” and a location “Guernsey”.

The system described in Litkowski [36], called DIMAP, extracts “semantic relation triples” after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that “characterizes the entity’s role in the sentence”, and a governing word, which is “the word in the sentence that the discourse entity stood in relation to”. The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triple: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
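The wh-word-driven answer-type detection described above can be sketched roughly as follows. The rules are a simplification for illustration, not PiQASso's actual code, and the "what 〈noun〉" case is left open because it requires a word-sense classifier such as WNSense:

```python
# Toy answer-type detection from the wh-word, in the spirit of
# PiQASso's coarse-grained taxonomy. Simplified for illustration.

def answer_type(question):
    q = question.lower()
    if q.startswith("who"):
        return {"person", "organization"}   # who-questions combine two categories
    if q.startswith("when"):
        return {"time"}
    if q.startswith("where"):
        return {"location"}
    # "how" questions: the category comes from the adjective after "how"
    if q.startswith("how many") or q.startswith("how much"):
        return {"quantity"}
    if q.startswith("how long") or q.startswith("how old"):
        return {"time"}
    # "what <noun>" would need the semantic type of the noun,
    # determined by a word-sense classifier; unknown in this sketch.
    return {"unknown"}

print(answer_type("Who founded the Knowledge Media Institute?"))
```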

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6] the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario has many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
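The first step of such disambiguation, narrowing a user's term to candidate ontology relations by string similarity before falling back on the user, might be sketched as follows. Here the standard-library `difflib` stands in for the string metric algorithms mentioned in the abstract; the relation names and the cutoff are invented for illustration:

```python
# Rank ontology relations by string similarity to a user's term, as a
# stand-in for the string-metric matching step. Illustrative only.
import difflib

ONTOLOGY_RELATIONS = ["works-for", "has-research-interest", "has-project-member"]

def candidate_relations(term, cutoff=0.5):
    """Return ontology relations similar enough to the user's term, best first."""
    scored = [(difflib.SequenceMatcher(None, term, rel).ratio(), rel)
              for rel in ONTOLOGY_RELATIONS]
    return [rel for score, rel in sorted(scored, reverse=True) if score >= cutoff]

# "works" is close enough to "works-for" to be proposed without asking the user
print(candidate_relations("works"))
```

If no candidate survives the cutoff, a system in this style would then consult the user, as AquaLog does.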

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge
and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries in which multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also information depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
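The grouping of equivalent questions under one triple pattern can be illustrated with a toy normalization table; the table entries are invented for illustration and are not AquaLog's actual lexicon:

```python
# Two phrasings of the same question normalize to the same canonical
# triple and therefore fall into the same group. Table entries invented.

SYNONYMS = {
    "works in": "has-project-member",
    "involved in": "has-project-member",
    "who": "person",
    "researchers": "person",
}

def to_triple(subject, relation, obj):
    """Normalize a linguistic triple into a canonical, ontology-like triple."""
    norm = lambda term: SYNONYMS.get(term, term)
    return (norm(subject), norm(relation), norm(obj))

q1 = to_triple("who", "works in", "akt")
q2 = to_triple("researchers", "involved in", "akt")
print(q1 == q2)  # both phrasings yield the same canonical triple
```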

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area? | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space? | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉

What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

3-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


86 V Lopez et al Web Semantics Science Services and Agents on the World Wide Web 5 (2007) 72ndash105

appi

(

tvt

6

dmnoWgcfTttpqowl

iodl

6

dTtmu

ttbdvtcdv

bthat are consistent with the observed training examples In ourcase the training examples are given by the user-refinement (dis-ambiguation) of the query while the hypotheses are structuredin terms of the context parameters Moreover a version space is

Fig 16 Example of jargon m

required for the first triple 〈KMi academics work aktproject〉

3) Conditional link to a triple for instance in ldquoWhat is thehomepage of the researchers working on the semantic webrdquothe second triple 〈researchers working semantic web〉 mustbe resolved and the list of researchers obtained prior togenerating an answer for the first triple 〈 homepageresearchers〉

The solution is converted into English using simple templateechniques before presenting them to the user Future AquaLogersions will be integrated with a natural language generationool

Learning mechanism for relations

Since the universe of discourse we are working within isetermined by and limited to the particular ontology used nor-ally there will be a number of discrepancies between the

atural language questions prompted by the user and the setf terms recognized in the ontology External resources likeordNet generally help in making sense of unknown terms

iving a set of synonyms and semantically related words whichould be detected in the knowledge base However in quite aew cases the RSS fails in the production of a genuine Onto-riple because of user-specific jargon found in the linguistic

riple In such a case it becomes necessary to learn the newerms employed by the user and disambiguate them in order toroduce an adequate mapping of the classes of the ontology A

uite common and highly generic example in our departmentalntology is the relation works-for which users normally relateith a number of different expressions is working works col-

aborates is involved (see Fig 16) In all these cases the userwt

ng through user-interaction

s asked to disambiguate the relation (choosing from the set ofntology relations consistent with the questionrsquos arguments) andecide whether to learn a new mapping between hisher natural-anguage-universe and the reduced ontology-language-universe

1 Architecture

The learning mechanism (LM) in AquaLog consists of two different methods: the learning and the matching (see Fig. 17). The latter is called whenever the RSS cannot relate a linguistic triple to the ontology or the knowledge base, while the former is always called after the user manually disambiguates an unrecognized term (and this substitution gives a positive result).

When a new item is learned, it is recorded in a database together with the relation it refers to and a series of constraints that will determine its reuse within similar contexts. As will be explained below, the notion of context is crucial in order to deliver a feasible matching of the recorded words. In the current version the context is defined by the arguments of the question, the name of the ontology and the user information. This set of characteristics constitutes a representation of the context and defines a structured space of hypotheses analogous9 to that of a version space.

A version space is an inductive learning technique proposed by Mitchell [42] in order to represent the subset of all hypotheses, normally represented just by its lower and upper bounds (the maximally general and maximally specific hypothesis sets) instead of listing all consistent hypotheses. In the case of our learning mechanism these bounds are partially inferred through a generalization process (Section 6.3) that will in the future be applicable also to user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

9 Although making use of a concept borrowed from machine learning studies, we want to stress the fact that the learning mechanism is not a machine learning technique in the classical sense.

V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

Fig. 17. The learning mechanism architecture.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for matching results. Subsequently, these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, either because we are inside a generic user session or because we might want to have all the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see if it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (which scans the knowledge base for results); if, instead, the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (the GetContext method), while the former takes the two arguments and checks if they are subclasses of what has been stored in the database (the CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
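The opposite movements of the two methods can be sketched over a toy taxonomy. All class names, the SUPER dictionary and the function names below are illustrative assumptions that merely echo the GetContext/CheckContext methods named above; they are not AquaLog's implementation.

```python
# Toy taxonomy: child class -> parent class. All names are illustrative.
SUPER = {
    "phd-student": "researcher",
    "researcher": "people",
    "kmi": "organization-unit",
    "organization-unit": "organization",
}

def ancestors(cls):
    """All superclasses of cls, nearest first."""
    out = []
    while cls in SUPER:
        cls = SUPER[cls]
        out.append(cls)
    return out

def check_context(arg1, arg2, stored1, stored2):
    """CheckContext-style test: the question's arguments must be the stored
    context classes themselves, or subclasses of them."""
    return all(a == s or s in ancestors(a)
               for a, s in ((arg1, stored1), (arg2, stored2)))
```

Under these assumptions, a mapping learned with the generalized context 〈people, organization-unit〉 also fires for a question whose arguments are phd-student and kmi, which is exactly the downward check the matching method performs.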

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let us consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to solve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let us imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly enter "works-in-the-same-project".

The problem therefore is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology,




these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.
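The stored record can be pictured as follows. The field names, class names and ontology name are illustrative assumptions, not AquaLog's actual database schema.

```python
# Sketch of a learned-lexicon entry as described above. The new word is the
# key field used by the matching method; all names here are hypothetical.
from dataclasses import dataclass

@dataclass(frozen=True)
class LearnedMapping:
    word: str       # the new word entered by the user (key field)
    relation: str   # its mapping in the ontology
    arg1: str       # first context argument
    arg2: str       # second context argument
    ontology: str   # name of the ontology
    user: str       # user details

lexicon = {}  # keyed by the word, since the matching method looks it up first

def learn(m):
    lexicon.setdefault(m.word, []).append(m)

learn(LearnedMapping("collaborates", "has-affiliation-to-unit",
                     "person", "knowledge-media-institute",
                     "akt-reference-ontology", "user-1"))
```

During matching, the candidates under lexicon["collaborates"] would then be filtered by the context arguments and, optionally, by the user constraint.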

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument, by definition) and goes through all the connected superclasses of the other argument, checking whether they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract, but still specific enough to rule out the different cases.
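The backtracking generalization described above can be sketched as follows, under an assumed toy taxonomy and an assumed table of which classes each relation can handle; neither reflects AquaLog's real ontology interface.

```python
# Sketch of context generalization: climb the superclasses of an argument
# and keep the highest class that can still handle the relation. The
# taxonomy and the HANDLES table are illustrative.
SUPER = {
    "knowledge-media-institute": "research-institute",
    "research-institute": "organization-unit",
    "organization-unit": "organization",
}

# Classes that may fill the second argument slot of each relation.
HANDLES = {
    "has-affiliation-to-unit": {"research-institute", "organization-unit"},
}

def generalize(arg, relation):
    """Return the highest superclass of arg that still handles relation."""
    best = arg
    while arg in SUPER:
        arg = SUPER[arg]
        if arg in HANDLES.get(relation, set()):
            best = arg
        else:
            break  # climbing further would lose the relation
    return best
```

With these assumptions, generalize("knowledge-media-institute", "has-affiliation-to-unit") stops at "organization-unit": the concept the text says is stored, while the unrelated superclass "organization" is rejected.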

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. User disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is first recognized and then worked out differently depending on its specific features.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person and an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced to only the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information, and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon with a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it could extend the same mappings to all the other registered PhD students.

Fig. 19. Learning mechanism matching example.

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn a specialized vocabulary or to know the structure of the knowledge base; however, as pointed out in Ref. [14], although users have to have some idea of the contents of the domain, they may have some misconceptions, so, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability





across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g. linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g. lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g. instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved by the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether or not to use the LM; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore, the knowledge base is dynamically populated and constantly changing, and, as a result, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was to provide information about the nature of the possible extensions needed to the ontology and the linguistic components: i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g. today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion
7.2.1.1. Results. We collected in total 69 questions (see the results in Table 1), 40 of which were handled correctly by AquaLog, i.e. approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure.

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, when it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries (40 of 69): 58%; of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure (33 of 69): 47.82%
  Total number of queries with only conceptual failure (7 of 69): 10.14%

AquaLog failures
  Total number of failures (the same query can fail for more than one reason) (29 of 69): 42%
  RSS failure (8 of 69): 11.59%
  Service failure (10 of 69): 14.49%
  Data model failure (0 of 69): 0%
  Linguistic failure (16 of 69): 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated (19 of 29): 65.5% of the failures
  With conceptual failure (7 of 19): 36.84%
  No conceptual failure, but KB not populated (6 of 19): 31.57%

Bold values represent the final results (%).

Therefore, only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered, because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, only 33.33% (23 of 69) will produce a result for the user (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations of one ontology can be overcome by another ontology.

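As a sanity check, the figures quoted above and in Table 1 follow directly from the raw query counts; note that the percentages are truncated to two decimal places rather than rounded.

```python
# Cross-check of the Table 1 percentages from the raw query counts.
def pct(n, d):
    """Percentage truncated to two decimals, matching Table 1."""
    return int(10000 * n / d) / 100

TOTAL = 69
assert pct(40, TOTAL) == 57.97   # handled correctly ("approximately 58%")
assert pct(33, TOTAL) == 47.82   # successful without conceptual failure
assert pct(7, TOTAL) == 10.14    # conceptual failure only
assert pct(10, 33) == 30.3       # mapped correctly, but KB not populated
assert pct(23, TOTAL) == 33.33   # non-empty answers: 33 - 10 = 23
assert pct(16, TOTAL) == 23.18   # linguistic failures
assert pct(19, 29) == 65.51      # failures avoidable by reformulation
```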
7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem; in fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g. in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, e.g. tokens not recognized as a verb by GATE, as in "present results", and therefore relations annotated wrongly; or because of the use of parentheses in the middle of the query, special names like "dotkom", or numbers like a year, e.g. "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g. terms that are a combination of two classes, as in "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, via string metrics and WordNet synonyms, not by meaning; e.g. in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, and in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and an "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) failure to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share, same, other and different. No work has been done yet in relation to the service failures.


Table 2
Learning mechanism performance

Number of queries (total 45): Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user: 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user: 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user: 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user: 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total iterations: 43 | 40 | 16



Three kinds of services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the questions (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (a typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can, in the best case, improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g. if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries requiring no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves as the number of queries increases.
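The iteration totals at the bottom of Table 2 (43, 40 and 16) can be reproduced from the per-row query counts: each row's count is weighted by the number of user interactions that row represents.

```python
# Cross-check of the Table 2 iteration totals.
iterations = [0, 1, 2, 3]        # iterations with the user
without_lm = [17, 16, 9, 3]      # queries per row, LM deactivated
lm_first   = [18, 16, 9, 2]      # LM first iteration (from scratch)
lm_second  = [29, 16, 0, 0]      # LM second iteration

def total_iterations(counts):
    return sum(i * n for i, n in zip(iterations, counts))

print(total_iterations(without_lm),  # 43
      total_iterations(lm_first),    # 40
      total_iterations(lm_second))   # 16
```

Each column also sums to the full set of 45 queries, confirming that the rows partition the query set.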

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail, because there is no direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin could be used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name.

12 http://plainmoor.open.ac.uk:3000/webonto


7.3.2. Conclusions and discussion
7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does

V Lopez et al Web Semantics Science Service

A quick familiarization with the ontology allowed us, in the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintage-year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "meal-course", like "starter", "entree", "food", "snack" and "supper", or like "pudding" for the ontology class "dessert-course". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software-evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

akes about the format of semantic information it handles andow these balance portability against reasoning power

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open Semantic Web scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
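This "short path" requirement can be pictured as a bounded reachability test over the KB triples. The following Python sketch illustrates the idea with an invented mini-KB; AquaLog itself works through its ontology plugin API, not like this:

```python
# Sketch: is there a direct path of at most two relations between two query
# terms in a KB viewed as a directed labeled graph? The mini-KB below is
# invented for illustration; relations are traversed in either direction.
def short_path(triples, source, target, max_hops=2):
    """Return the list of traversed triples if target is reachable from
    source within max_hops relations, else None."""
    frontier = [(source, [])]
    for _ in range(max_hops):
        next_frontier = []
        for node, path in frontier:
            for s, rel, o in triples:
                for here, there in ((s, o), (o, s)):
                    if here == node:
                        step = path + [(s, rel, o)]
                        if there == target:
                            return step
                        next_frontier.append((there, step))
        frontier = next_frontier
    return None

kb = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "mariettazinfadel"),
]

# "pizza" and "mariettazinfadel" are linked through the meal course instance,
# a path of exactly two relations, so a question relating them is answerable.
print(short_path(kb, "pizza", "mariettazinfadel"))
```

Without instances such as new-course7 (Table 3), the only route between a wine and a food would go through class restrictions, which this style of light reasoning does not follow.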

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Meal courses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of meal courses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in exactly the same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with meal courses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((has-food pizza) (has-drink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink Riesling)))


Fig. 22. OWL example of the wine and food ontology.

The ontology does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of the questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure):
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%
AquaLog failures:
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22.05%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%
AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).
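The percentages in Table 4 follow directly from the raw counts over the 68 test questions; the figures appear to be truncated (not rounded) to two decimal places. A quick arithmetic check in Python:

```python
# Recompute the Table 4 percentages from the raw counts (n = 68 questions).
# The paper's figures look truncated, not rounded, to two decimal places.
def pct(count, total=68):
    return int(count / total * 10000) / 100

only_conceptual = pct(35)            # 51.47% -- information not in the ontology
no_conceptual = pct(12)              # 17.64% -- answered once the KB was populated
handled = pct(47)                    # 69.11% -- 35 + 12 = 47 of 68
failures = pct(21)                   # 30.88%
after_reformulation = pct(47 + 14)   # 89.70% -- 14 of the failures easily rephrased
reformulable = int(14 / 21 * 10000) / 100  # 66.66% of the 21 failures

print(only_conceptual, no_conceptual, handled, failures, after_reformulation)
```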


7.3.2.2. Analysis of AquaLog failures

The identified AquaLog limitations are explained in Section 8; however, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question for which the RSS can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal course mealcourse〉, 〈mealcourse has-food nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse has-drink wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is madefromfruit dessert〉 and 〈dessert pie〉; it fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?", where AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, the lack of appropriate reasoning services, the lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, the genitive ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore, we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology this may be represented as "Enrico Motta – has-job-title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
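One simple (and certainly incomplete) strategy for such compounds is to prefer the longest match against the ontology vocabulary before splitting tokens into separate terms. A Python sketch, with an invented vocabulary; this is an illustration of the idea, not AquaLog's mechanism:

```python
# Sketch: greedy longest-match segmentation of a token stream against a
# known ontology vocabulary, so "smoked salmon" stays one term instead of
# being split into a verb and a noun. The vocabulary is invented.
vocab = {"smoked salmon", "semantic web papers", "research department", "brie"}

def segment(tokens, vocab, max_len=3):
    terms, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in vocab:
                terms.append(candidate)
                i += n
                break
    return terms

print(segment("what goes with smoked salmon".split(), vocab))
# -> ['what', 'goes', 'with', 'smoked salmon']
```

This only resolves compounds the ontology already names; the ambiguous adjective-noun cases ("large company") would still need the kind of configured meaning declarations described above.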

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
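The agreement heuristic just described (which, as noted, AquaLog does not implement) can be sketched in a few lines. The candidate names and their singular/plural tags are assumed to come from earlier linguistic analysis:

```python
# Sketch of the verb-agreement heuristic for clause attachment (not
# implemented in AquaLog): attach a "who has/have ..." clause to the only
# candidate whose grammatical number matches the clause verb; if more than
# one candidate matches, the question is genuinely ambiguous.
def attach(candidates, clause_verb):
    """candidates: list of (name, number) pairs from the main clause;
    clause_verb: 'has' (singular) or 'have' (plural)."""
    number = "singular" if clause_verb == "has" else "plural"
    matches = [name for name, num in candidates if num == number]
    return matches[0] if len(matches) == 1 else None  # None -> ask the user

# "which academics work with Peter who has interests in the semantic web?"
print(attach([("academics", "plural"), ("Peter", "singular")], "has"))   # Peter
# "which academic works with Peter who has interests ...?" -- truly ambiguous
print(attach([("academic", "singular"), ("Peter", "singular")], "has"))  # None
```

As the text notes, the trade-off is that this assumes well-formed input; a fallback to user interaction is still needed for ungrammatical questions.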


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
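A minimal sketch of such community-scoped jargon mappings follows; all keys and ontology terms here are invented for illustration, and the real Learning Mechanism would also record richer context than a flat dictionary:

```python
# Sketch: jargon substitutions learned per user community, so that
# "cool projects" can map to different ontology terms for PhD students
# and for professors. Community names and terms are invented.
learned = {}

def learn(community, phrase, ontology_terms):
    learned[(community, phrase)] = ontology_terms

def substitute(community, phrase):
    return learned.get((community, phrase))  # None -> ask the user once

learn("phd-students", "cool projects", ("project", "research-area", "publication"))
learn("professors", "cool projects", ("project", "funding-opportunity"))

print(substitute("professors", "cool projects"))
# -> ('project', 'funding-opportunity')
```

The same phrase resolves differently depending on the asker's community, which is exactly the behaviour the substitution service needs.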

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors. This may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.

This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms for handling ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built having a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", then, instead of giving a failure, it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
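Such a fallback could look roughly like the following sketch: strip recognized question scaffolding, keep the remaining content terms, and form a tentative triple whose relation is left for the ontology path search to fill in. The patterns are illustrative, not AquaLog code:

```python
# Sketch of a pattern-based fallback: strip common question scaffolding,
# keep the remaining content terms, and form a tentative triple with an
# unknown relation for a later ontology path search. Patterns are invented.
import re

SCAFFOLDING = re.compile(
    r"^(could you please tell me|what is|print)\s+(the\s+)?", re.IGNORECASE)

def fallback_triple(question):
    core = SCAFFOLDING.sub("", question.rstrip("?").strip())
    terms = core.split(" of ")
    if len(terms) == 2:
        return (terms[0].strip(), None, terms[1].strip())  # relation unknown
    return None

print(fallback_triple("Could you please tell me the capital of Italy?"))
# -> ('capital', None, 'Italy')
```

The same rule covers "what is the capital of Italy?" and "print the capital of Italy", mirroring how one shallow pattern handled all three phrasings in the NLIDB example above.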

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Symantec's Q&A, NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases, this information is all AquaLog needs to interpret the query.
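The behaviour just described for the "neighborhoods of Chicago" example — falling back on the relations the ontology defines for a class when the relation word itself is unknown — can be sketched as follows (a toy ontology; all names here are illustrative, not AquaLog's actual API):

```python
# Toy ontology: class -> {relation: range class}. Purely illustrative names.
ONTOLOGY = {
    "city": {"has-district": "district", "located-in": "country"},
    "project": {"addresses-generic-area-of-interest": "research-area"},
}

def candidate_relations(cls, ontology):
    """When a relation word is unknown, every relation the ontology defines
    for the class is a candidate; a later disambiguation step (semantics in
    the ontology, or user feedback) picks among them."""
    return sorted(ontology.get(cls, {}))

# "what are some of the neighborhoods of Chicago": "neighborhood" is unknown,
# but Chicago is a city, so only relations defined for cities are plausible.
print(candidate_relations("city", ONTOLOGY))
# ['has-district', 'located-in']
```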

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA). TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
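The grouping of different surface forms into a shared triple category could be sketched roughly as follows (the category names and patterns are our own illustration, not AquaLog's actual classifier):

```python
import re

# Surface openings that all introduce the same basic generic triple
# <generic-term, relation, instance>; names are illustrative only.
GENERIC_OPENINGS = ("who", "what", "which", "where", "tell me", "give me", "list")

def triple_category(question):
    q = question.lower().strip().rstrip("?")
    if q.startswith(GENERIC_OPENINGS):
        # Expected answer: a list of elements of the ground term's type.
        return "basic-generic"
    if re.match(r"^(is|are|does|has|have)\b", q):
        # Expected answer: an assertion (yes/no).
        return "affirmative-negative"
    return "unclassified"

# Different phrasings collapse into the same triple category:
for q in ("Who works in akt?",
          "Tell me the people working in akt",
          "Which researchers work in akt?"):
    assert triple_category(q) == "basic-generic"
print(triple_category("Is Liliana a phd student?"))  # affirmative-negative
```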

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the web but retrieve documents, not answers (AskJeeves relies on human editors to match question templates with authoritative sites), and systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START focuses on questions about geography and the MIT infolab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey": for START the property is "languages", between the Object "Guernsey" and the Value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".
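As a toy illustration of this relation-centred reading (not AquaLog's GATE-based front end, which does genuine shallow parsing), a naive pattern can pull out such a linguistic triple:

```python
import re

def linguistic_triple(question):
    """Naive pattern for 'what X are/is VERBed in Y' questions. AquaLog's
    real front end uses GATE shallow parsing; this is only an illustration
    of the <term, relation, term> shape of its linguistic triples."""
    m = re.match(r"what (\w+) (are|is) (\w+) in ([\w\s]+)\??$", question.lower())
    if not m:
        return None
    term, aux, verb, location = m.groups()
    return (term, f"{aux} {verb}", location)

print(linguistic_triple("What languages are spoken in Guernsey?"))
# ('languages', 'are spoken', 'guernsey')
```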

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what <noun>" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what <verb>" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)
Who are the researchers in the semantic web research area | <person/organization, researchers, semantic web research area> | <researcher, has-research-interest, semantic-web-area>
What are the phd students working for buddy space | <phd students, working, buddy space> | <phd-student, has-project-member/has-project-leader, buddyspace>
Name the planet stories that are related to akt | <planet stories, related to, akt> | <kmi-planet-news-item, mentions-project, akt>

Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | <which is, job title, peter scott> | <which is, has-job-title, peter-scott>
What are the projects of Vanessa | <which is, projects, vanessa> | <project, has-project-member/has-project-leader, vanessa>

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web | <projects, semantic web> | <project, addresses-generic-area-of-interest, semantic-web-area>
Where is the Knowledge Media Institute | <location, knowledge media institute> | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics | <who/what is, academics> | <who/what is, academic-staff-member>
What is an ontology | <who/what is, ontology> | <who/what is, ontologies>

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project | <liliana, phd student, dip project> | <liliana-cabral (phd-student), has-project-member, dip>
Has Martin Dzbor any interest in ontologies | <martin dzbor, has any interest, ontologies> | <martin-dzbor, has-research-interest, ontologies>

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project | <person/organization, works, semantic web, dotkom project> | <person, has-research-interest, semantic-web-area> <person, has-project-member/has-project-leader, dot-kom>
Tell me all the planet news written by researchers in akt | <planet news, written, researchers, akt> | <kmi-planet-news-item, has-author, researcher> <researcher, has-project-member/has-project-leader, akt>

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc | <projects, sponsored, information extraction, epsrc> | <project, addresses-generic-area-of-interest, information-extraction> <project, involves-organization, epsrc>

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt | <publication, magpie, akt> | <publication, mentions-project/has-key-publication/has-publication, magpie> <publication, mentions-project, akt>

Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium | <which is, contact details, kmi academics, compendium> | <which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member> <kmi-academic-staff-member, has-project-member, compendium>

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt | <person/organization, has interest, ontologies> <person/organization, member, akt> | <person, has-research-interest, ontologies> <research-staff-member, has-project-member, akt>

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom | <academics, work, akt> <academics, work, dotcom> | <academic-staff-member, has-project-member, akt> <academic-staff-member, has-project-member, dot-kom>

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc | <kmi academics, work, akt project> <which is, sponsored, epsrc> | <kmi-academic-staff-member, has-project-member, akt> <project, involves-organization, epsrc>
Which KMi academics working in the akt project are sponsored by epsrc | <kmi academics, working, akt project> <kmi academics, sponsored, epsrc> | <kmi-academic-staff-member, has-project-member, akt> <kmi-academic-staff-member, has-affiliated-person, epsrc>
Give me the publications from phd students working in hypermedia | <which is, publications, phd students> <which is, working, hypermedia> | <publication, mentions-person/has-author, phd student> <phd-student, has-research-interest, hypermedia>

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia | <researchers, work, compendium> <researchers, has-interest, hypermedia> | <researcher, has-project-member, compendium> <researcher, has-research-interest, hypermedia>


Appendix A (Continued)

wh-patterns
NL query: What is the homepage of Peter who has interest in semantic web
  Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

NL query: Which are the projects of academics that are related to the semantic web
  Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
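The two-level structure of the appendix examples (a linguistic triple extracted from the question, mapped to one or more ontology-compliant triples) can be sketched as a small data structure. This is an illustrative model only, not AquaLog's actual implementation (which is Java-based); the class and field names are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Triple:
    """A generic (subject, relation, object) triple; the relation may be
    unknown, as in the 'unknown relation' question patterns above."""
    subject: str
    relation: Optional[str]
    obj: str

    def __str__(self) -> str:
        rel = self.relation if self.relation else "?"
        return f"<{self.subject}, {rel}, {self.obj}>"

# The "wh-combination (or)" example: two linguistic triples that the RSS
# maps onto two ontology-compliant triples.
linguistic = [Triple("academics", "work", "akt"),
              Triple("academics", "work", "dotcom")]
ontology = [Triple("academic-staff-member", "has-project-member", "akt"),
            Triple("academic-staff-member", "has-project-member", "dot-kom")]

print(" | ".join(map(str, linguistic)))
print(" | ".join(map(str, ontology)))
```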

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005 (to appear).
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004. http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).


[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


Fig. 17. The learning mechanism architecture.

…normally represented just by its lower and upper bounds (maximally general and maximally specific hypothesis sets), instead of listing all consistent hypotheses. In the case of our learning mechanism, these bounds are partially inferred through a generalization process (Section 6.3) that will in future be applicable also to the user-related and community-derived knowledge (Section 6.4). The creation of specific algorithms for combining the lexical and the community dimensions will increase the precision of the context definition and provide additional personalization features.

Thus, when a question with a similar context is prompted, if the RSS cannot disambiguate the relation name, the database is scanned for some matching results. Subsequently, these results are context-proved in order to check their consistency with the stored version spaces. By tightening and loosening the constraints of the version space, the learning mechanism is able to determine when to propose a substitution and when not to. For example, the user constraint is a feature that is often bypassed, because we are inside a generic user session or because we might want to have the results of all the users from a single database query.

Before the matching method we are always in a situation where the onto-triple is incomplete: the relation is unknown, or it is a concept. If the new word is found in the database, the context is checked to see whether it is consistent with what has been recorded previously. If this gives a positive result, we can have a valid onto-triple substitution that triggers the inference engine (this latter scans the knowledge base for results); if instead the matching fails, a user disambiguation is needed in order to complete the onto-triple. In this case, before letting the inference engine work out the results, the context is drawn from the particular question entered, and it is learned together with the relation and the other information in the version space.

Of course, the matching method's movement through the ontology is opposite to the learning method's. The latter, starting from the arguments, tries to go up until it reaches the highest valid classes possible (the GetContext method), while the former takes the two arguments and checks whether they are subclasses of what has been stored in the database (the CheckContext method). It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification.
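The CheckContext step just described can be sketched as follows. This is a hypothetical toy fragment, not AquaLog's implementation: the flat parent map and the class names are illustrative assumptions.

```python
# Toy single-inheritance hierarchy: child -> parent (names are assumptions).
PARENT = {
    "kmi-academic-staff-member": "academic-staff-member",
    "academic-staff-member": "people",
    "knowledge-media-institute": "organization-unit",
}

def is_subclass_of(cls: str, ancestor: str) -> bool:
    """Walk up the hierarchy from cls, looking for ancestor."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False

def check_context(arg1: str, arg2: str, stored_ctx: tuple) -> bool:
    """CheckContext: both triple arguments must fall under the two
    context classes recorded in the version space."""
    c1, c2 = stored_ctx
    return is_subclass_of(arg1, c1) and is_subclass_of(arg2, c2)

stored = ("people", "organization-unit")  # context learned for "collaborate"
print(check_context("kmi-academic-staff-member",
                    "knowledge-media-institute", stored))  # True
```

If the check succeeds, the stored substitution can be proposed; if it fails, the user is asked to disambiguate and a new context is learned.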

6.2. Context definition

As said above, the notion of context is fundamental in order to deliver a feasible substitution service. In fact, two people could use the same jargon but mean different things.

For example, let us consider the question "Who collaborates with the knowledge media institute?" (Fig. 16) and assume that the RSS is not able to resolve the linguistic ambiguity of the word "collaborate". The first time, some help is needed from the user, who selects "has-affiliation-to-unit" from a list of possible relations in the ontology. A mapping is therefore created between "collaborate" and "has-affiliation-to-unit", so that the next time the learning mechanism is called it will be able to recognize this specific user jargon.

Let us now imagine a user who asks the system the same question, "who collaborates with the knowledge media institute?", but is referring to other research labs or academic units involved with the knowledge media institute. In fact, if asked to choose from the list of possible ontology relations, he/she would possibly select "works-in-the-same-project".

The problem, therefore, is to maintain the two separate mappings while still providing some kind of generalization. This is achieved through the definition of the question context, as determined by its coordinates in the ontology. In fact, since the referring (and pluggable) ontology is our universe of discourse, the context must be found within this universe. In particular, since we are dealing with triples, and in the triple what we learn is usually the relation (that is, the middle item), the context is delimited by the two arguments of the triple. In the ontology, these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context arguments, the name of the ontology and the user details.
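A rough sketch of the record just described might look like the following. The field names are assumptions for illustration; the paper does not give AquaLog's actual storage schema.

```python
# In-memory stand-in for the learned lexicon table described above.
learned_lexicon = []

def learn_mapping(word, ontology_relation, context_args, ontology, user):
    """Store a user-jargon mapping together with its question context."""
    learned_lexicon.append({
        "word": word,                  # key used during the matching method
        "mapping": ontology_relation,  # ontology relation the word maps to
        "context": context_args,       # the two arguments of the triple
        "ontology": ontology,          # name of the (pluggable) ontology
        "user": user,                  # user details for profile-scoped access
    })

learn_mapping("collaborates", "has-affiliation-to-unit",
              ("person", "knowledge-media-institute"), "AKT", "user-1")
print(learned_lexicon[0]["mapping"])
```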

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even though the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument by definition) and goes through all the connected superclasses of the other argument, while checking whether they can handle that same relation with the given type. Thus, only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by 〈people〉 and 〈organization-unit〉, while in the second case the context will be 〈organization〉 and 〈organization-unit〉. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Fig. 18. Users disambiguation example.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is first recognized, and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Combinations of questions are also taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.
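The backtracking generalization described above might be sketched as follows, under the simplifying assumptions of a single-inheritance hierarchy and an explicit table of which classes can handle a relation (both toy data; the real check would query the ontology):

```python
# Toy hierarchy: child -> parent (illustrative names, not the AKT ontology).
PARENT = {"kmi-academic-staff-member": "academic-staff-member",
          "academic-staff-member": "researcher",
          "researcher": "people",
          "people": "thing"}

# Classes on which each relation is defined (assumed data standing in for
# the ontology's "can this class handle this relation?" check).
RELATION_DOMAIN = {"has-affiliation-to-unit":
                   {"people", "researcher", "academic-staff-member",
                    "kmi-academic-staff-member"}}

def get_context(arg: str, relation: str) -> str:
    """GetContext: climb the superclasses of `arg` and keep the highest
    class that can still handle `relation`."""
    domain = RELATION_DOMAIN.get(relation, set())
    highest = arg
    cls = arg
    while cls is not None and cls in domain:
        highest = cls          # still valid at this level; keep climbing
        cls = PARENT.get(cls)  # stop once a class cannot handle the relation
    return highest

print(get_context("kmi-academic-staff-member", "has-affiliation-to-unit"))
# -> "people": "thing" no longer handles the relation, so "people" is kept
```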

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information, and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field, the database query will return many more mappings, and quite likely also meanings that are not desired.

Fig. 19. Learning mechanism matching example.

Fig. 20. Architecture to support a users' community.

Future work on the learning mechanism will investigate the augmentation of the user profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.
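A minimal sketch of the user-scoped access just described: a logged-in user sees only their own mappings, while the generic session simply omits the constraint on the user field and gets every stored meaning. The data and function names are illustrative assumptions.

```python
# Two users who use the same jargon ("collaborates") with different meanings.
mappings = [
    {"word": "collaborates", "mapping": "has-affiliation-to-unit", "user": "alice"},
    {"word": "collaborates", "mapping": "works-in-the-same-project", "user": "bob"},
]

def lookup(word, user=None):
    """user=None models the generic session: no constraint on the user field,
    so all recorded meanings are returned."""
    return [m["mapping"] for m in mappings
            if m["word"] == word and (user is None or m["user"] == user)]

print(lookup("collaborates", "alice"))  # only alice's mapping
print(lookup("collaborates"))           # generic user: both meanings returned
```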

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions, and therefore, to be realistic, at least some process of familiarization is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability

across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g. a linguistic failure and a service failure, though a conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g. for lack of an appropriate ontology relation or term to map to, or because the ontology is populated in a way that hampers the mapping (e.g. instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved by the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because in that case the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for it. The user can decide whether to use the LM or not, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application built using AquaLog with the AKT ontology10 and the KMi knowledge base satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore the knowledge base is dynamically populated and constantly changing, and as a result of this a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities; however, it is quite difficult to devise a sublanguage which is sufficiently expressive, therefore we also evaluate whether a question can be easily reformulated so as to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore we asked them not to ask questions which required temporal reasoning (e.g. today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e. approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions. A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure; therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk

Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries: (40 of 69) 58% — of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure: (33 of 69) 47.82%
  Total number of queries with only conceptual failure: (7 of 69) 10.14%

AquaLog failures (the same query can fail for more than one reason)
  Total number of failures: (29 of 69) 42%
  RSS failure: (8 of 69) 11.59%
  Service failure: (10 of 69) 14.49%
  Data model failure: (0 of 69) 0%
  Linguistic failure: (16 of 69) 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated: (19 of 29) 65.5% of failures
  With conceptual failure: (7 of 19) 36.84%
  No conceptual failure but not populated: (6 of 19) 31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, such as tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".


• Data model failures: intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by their meaning; e.g., for "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉; or, in "how can I contact Vanessa?", due to the string algorithms "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and an "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) failure to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures: several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total number of iterations | (43 iterations) | (40 iterations) | (16 iterations)

services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain lines of work for future versions of the system.
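The purely syntactic relation selection behind the RSS failures above can be sketched as follows. This is an illustrative reconstruction, not AquaLog's actual code; the candidate relation labels are the ones mentioned in the "contact" example.

```python
from difflib import SequenceMatcher

# Candidate relation labels from the ontology (the ones mentioned above).
CANDIDATE_RELATIONS = ["works-for", "have-interest", "employment-contract-type"]

def best_relation(linguistic_relation, candidates, synonyms=()):
    """Pick the candidate label most similar to the linguistic relation
    (or one of its WordNet-style synonyms), ignoring meaning entirely."""
    terms = [linguistic_relation, *synonyms]
    def score(candidate):
        return max(SequenceMatcher(None, term, candidate).ratio() for term in terms)
    return max(candidates, key=score)

# "contact" is closest, string-wise, to "employment-contract-type", even
# though that relation has nothing to do with contacting a person.
print(best_relation("contact", CANDIDATE_RELATIONS))  # employment-contract-type
```

Because the match is label-based, the semantically wrong relation wins, which is exactly the failure mode reported above.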

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure) or avoiding unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can, in the best of cases, improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.



Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small. Logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
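The context-based lookup described above can be illustrated with a minimal sketch. The class names and the `LearningMechanism` interface are hypothetical simplifications of the component described in Section 6:

```python
# Hypothetical sketch of the learning mechanism: mappings are keyed by the
# context (ontology classes of the query arguments), not the literal query.
class LearningMechanism:
    def __init__(self):
        self._learned = {}  # (arg1_class, relation_word, arg2_class) -> relation

    def learn(self, arg1_class, word, arg2_class, ontology_relation):
        self._learned[(arg1_class, word, arg2_class)] = ontology_relation

    def recall(self, arg1_class, word, arg2_class):
        # Exact context hit first; otherwise accept a similar (not identical)
        # context: same relation word, one matching argument class.
        hit = self._learned.get((arg1_class, word, arg2_class))
        if hit is not None:
            return hit
        for (c1, w, c2), rel in self._learned.items():
            if w == word and (c1 == arg1_class or c2 == arg2_class):
                return rel
        return None

lm = LearningMechanism()
lm.learn("person", "works", "project", "has-project-member")  # one user interaction
print(lm.recall("person", "works", "project"))   # has-project-member, no interaction
print(lm.recall("student", "works", "project"))  # a similar context is also resolved
```

The second lookup shows how a query never seen before can still be disambiguated, because only its context has to match.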

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin could be used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintage-year" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack" or "supper", or with "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all the frequent synonyms.

The questions used in this evaluation were based on the Wine Society's "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

12 http://plainmoor.open.ac.uk:3000/webonto

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require a lot of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com
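A GetData-style accessor of the kind described in [22] can be sketched in a few lines; the graph content here is illustrative and not drawn from TAP itself.

```python
# A GetData-style accessor in the spirit of the TAP interface: given a
# resource and a property label, return its values, with no query language
# and no prior knowledge of the ontology structure (illustrative data).
GRAPH = {  # directed labeled graph: resource -> {property: [values]}
    "john-domingue": {"works-for": ["kmi"], "has-interest": ["semantic-web"]},
    "kmi": {"part-of": ["the-open-university"]},
}

def getdata(resource, prop):
    return GRAPH.get(resource, {}).get(prop, [])

print(getdata("john-domingue", "has-interest"))  # ['semantic-web']
```

The design choice is exactly the one argued for in the quotation: a property lookup is cheap to serve on the open web, whereas an expressive query language is not.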

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (with triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
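The "two relations or less" answerability condition can be sketched as a bounded reachability test. This is an illustrative simplification that treats relations as undirected for reachability; the instance names anticipate the Table 3 examples below.

```python
from collections import deque

# Sketch of the answerability condition, assuming the KB is a set of
# (subject, relation, object) triples; relations are treated as undirected
# for reachability, which is a simplification.
def path_within_two(triples, source, target):
    """True if target is reachable from source via at most two relations."""
    edges = {}
    for s, _, o in triples:
        edges.setdefault(s, set()).add(o)
        edges.setdefault(o, set()).add(s)
    frontier, seen = deque([(source, 0)]), {source}
    while frontier:
        node, depth = frontier.popleft()
        if node == target:
            return True
        if depth < 2:
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False

KB = [("new-course7", "has-food", "pizza"),
      ("new-course7", "has-drink", "marietta-zinfadel")]
print(path_within_two(KB, "pizza", "marietta-zinfadel"))  # True: a two-triple path
```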

We illustrate these limitations with an example using the wine ontology. We show how the requirements for AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way, the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted to the Semantic Web. Therefore, we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities, such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instance definitions in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink riesling)))
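Rendered as triples, the instances above make the two-triple path concrete. This sketch (with simplified identifiers, not AquaLog's actual query machinery) shows the join needed to answer a food-to-wine question:

```python
# The OCML instances above, rendered as (subject, relation, object) triples
# with simplified identifiers.
TRIPLES = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

def wines_for(food):
    """Follow the two-triple path: course has-food food, course has-drink wine."""
    courses = {s for s, r, o in TRIPLES if r == "has-food" and o == food}
    return [o for s, r, o in TRIPLES if r == "has-drink" and s in courses]

print(wines_for("pizza"))  # ['marietta-zinfadel']
```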




Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, it does not present complete coverage in terms of instances and terminology; therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are avoided by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (a failure to complete the Onto-Triple is due to an ontology conceptual failure):
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%

AquaLog failures:
  Total number of failures (the same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be avoided by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course". For instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert", because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
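The proposed reclassification step can be sketched as follows. The schema dictionary is a hypothetical, much simplified stand-in for the ontology inspection the RSS would actually perform:

```python
# Sketch of the proposed RSS extension: when no direct relation links the
# two query terms, look for a mediating concept in the schema and split the
# failing triple in two.
SCHEMA = {  # (domain concept, range concept) -> relation label
    ("mealcourse", "food"): "has-food",
    ("mealcourse", "wine"): "has-drink",
}

def bridge(term1, term2):
    """Return two triples joining term1 and term2 through a mediator, if any."""
    for (dom1, rng1), rel1 in SCHEMA.items():
        for (dom2, rng2), rel2 in SCHEMA.items():
            if dom1 == dom2 and {rng1, rng2} == {term1, term2}:
                return [(dom1, rel1, rng1), (dom1, rel2, rng2)]
    return None

# "wine" and "food" share no direct relation, but both attach to the
# mediating concept "mealcourse", yielding two triples instead of one.
print(bridge("wine", "food"))
```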

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations; "what" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that carry little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability. There are some linguistic forms that it would be helpful

for AquaLog to handle, but which we have not yet tackled. These include other languages, the genitive ('s), and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".



8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query

at once, independently of how the query was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore, we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3, using the wine ontology).
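The bridging needed for this example can be sketched as a two-hop join; the instance data below is invented for illustration.

```python
# Sketch of the "who collaborates with IBM?" example: no direct
# person-organization relation exists, but person-grant and
# grant-organization triples do (hypothetical instance data).
KB = [
    ("enrico", "has-grant", "grant-x"),
    ("grant-x", "funded-by", "ibm"),
    ("vanessa", "has-grant", "grant-y"),
    ("grant-y", "funded-by", "eu"),
]

def bridged(first_rel, second_rel, target):
    """Subjects linked to target through an intermediate entity (the grant)."""
    intermediates = {s for s, r, o in KB if r == second_rel and o == target}
    return sorted(s for s, r, o in KB if r == first_rel and o in intermediates)

print(bridged("has-grant", "funded-by", "ibm"))  # ['enrico']
```

The point is that the answer only becomes reachable once the query is expanded with the bridging term, which is exactly what the current RSS cannot yet do automatically.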

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is represented not as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology this may be represented as 〈Enrico Motta, has-job-title, "KMi director"〉, where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so that we get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds: for example, a "large company" may be a company with a large volume of sales, or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
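One possible shape of such a mechanism is sketched below. The modifier table is hypothetical and, as the text notes, in general the meaning of each compound may have to be declared at configuration time:

```python
# Hypothetical sketch: a term that combines two ontology entities without a
# preposition is split into an implicit triple instead of being matched as
# one term. The class and modifier tables are invented for illustration.
CLASSES = {"researcher", "project", "paper"}
MODIFIERS = {"french": ("has-nationality", "france"),
             "akt": ("member-of-project", "akt")}

def split_compound(term):
    words = term.lower().split()
    if len(words) == 2 and words[0] in MODIFIERS and words[1].rstrip("s") in CLASSES:
        relation, value = MODIFIERS[words[0]]
        return (words[1].rstrip("s"), relation, value)  # the implicit triple
    return None

print(split_compound("French researchers"))  # ('researcher', 'has-nationality', 'france')
```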

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is

the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with peter who has interests in the semantic web?" and "which academics work with peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that the user's interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and

umber of the verb assume that all sentences are well-formedur experience with the sample of questions we gathered

or the evaluation study was that this is not necessarily thease
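The agreement heuristic discussed above (and not implemented in AquaLog) could be sketched roughly as follows, assuming a toy parse that already tags grammatical number:

```python
# Hypothetical sketch: use the person/number of the verb in a relative
# clause ("who has ..." vs. "who have ...") to decide whether the clause
# attaches to the head noun ("academics") or to the named entity ("Peter").
def attach_relative_clause(head_noun_number: str, entity_number: str,
                           clause_verb_number: str) -> str:
    """Return which antecedent ('head' or 'entity') the clause modifies,
    or 'ambiguous' when number agreement cannot decide."""
    candidates = []
    if clause_verb_number == head_noun_number:
        candidates.append("head")
    if clause_verb_number == entity_number:
        candidates.append("entity")
    return candidates[0] if len(candidates) == 1 else "ambiguous"

# "which academics work with Peter who has interests in the semantic web"
print(attach_relative_clause("plural", "singular", "singular"))    # entity
# "which academic works with Peter who has interests in the semantic web"
print(attach_relative_clause("singular", "singular", "singular"))  # ambiguous
```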



8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology-independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn typical services, like those mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help is needed from the user, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon with another meaning. Imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
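A minimal sketch of such community-dependent jargon mappings, with hypothetical phrases and ontology terms:

```python
# Hypothetical sketch of the learning mechanism's context-sensitive store:
# the same jargon ("cool projects") maps to different ontology terms
# depending on the community of the asking user.
JARGON = {
    ("cool projects", "phd-student"): ["project", "research-area", "publication"],
    ("cool projects", "professor"): ["project", "funding-opportunity"],
}

def resolve_jargon(phrase: str, community: str):
    """Return the learned ontology terms for a phrase, or None to trigger
    a clarification dialogue with the user."""
    return JARGON.get((phrase, community))

print(resolve_jargon("cool projects", "professor"))  # ['project', 'funding-opportunity']
```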

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms for handling ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation of the user queries and by identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "could you please tell me", instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
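The fallback idea sketched above (take the recognized terms, form a candidate triple, and search the ontology for a connecting path) might look roughly as follows; the toy ontology graph and relation names are assumptions:

```python
from collections import deque

# Hypothetical sketch of the proposed fallback: when no question structure
# is recognized, keep the terms that matched the ontology and search the
# ontology graph, breadth-first, for a path connecting them.
ONTOLOGY = {  # node -> list of (relation, node) edges; toy data
    "Italy": [("has-capital", "Rome")],
    "Rome": [("capital-of", "Italy")],
    "capital": [("range-of", "has-capital")],
}

def find_path(start: str, goal: str):
    """Breadth-first search over ontology edges; returns the relation path
    from start to goal, or None if the terms are not connected."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in ONTOLOGY.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None

print(find_path("Italy", "Rome"))  # [('Italy', 'has-capital', 'Rome')]
```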

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being equivalent to "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology or, otherwise, by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "identifying the semantic type of the entity sought by the question"; (2) "determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.
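A highly simplified, illustrative version of this wh-word based question typing (the categories below are assumptions, not LASSO's actual taxonomy):

```python
# Hypothetical sketch of question typing in the style described for LASSO:
# map the interrogative word to an expected answer type.
ANSWER_TYPE = {"who": "person", "where": "location", "when": "date",
               "how many": "number"}

def classify(question: str):
    """Return (question type, expected answer type) for a question."""
    q = question.lower().rstrip("?")
    # check multi-word interrogatives before single-word ones
    for wh in ("how many", "who", "where", "when", "what"):
        if q.startswith(wh):
            return wh, ANSWER_TYPE.get(wh, "unknown")
    return None, "unknown"

print(classify("who wrote the report?"))    # ('who', 'person')
print(classify("how many projects exist?")) # ('how many', 'number')
```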

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
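The grouping of differently phrased questions under one triple category can be sketched as a toy normalization step (the patterns are illustrative, not AquaLog's actual grammar):

```python
import re

# Hypothetical sketch: different surface forms ("who is X", "tell me X",
# "what is X") normalize to the same generic-triple category, so the same
# processing applies regardless of how the query is phrased.
PREFIXES = [r"^who (is|are)\b", r"^what (is|are)\b", r"^where (is|are)\b",
            r"^tell me( about)?\b", r"^list\b"]

def triple_category(question: str) -> str:
    q = question.lower().rstrip("?").strip()
    for pattern in PREFIXES:
        if re.search(pattern, q):
            return "generic-triple"  # <ground term, relation, instance>
    return "unclassified"

print(triple_category("Who is the director of KMi?"))  # generic-triple
print(triple_category("Tell me the director of KMi"))  # generic-triple
```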

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in Ref. [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem (mapping the triples to the ontology) is solved not only by looking at the labels, but also by looking at the lexically related words and at the ontology itself (the context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
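The vicinity-collection strategy attributed to TAP above can be sketched over a toy set of RDF-like triples (the data and the radius are assumptions):

```python
# Hypothetical sketch of TAP-style triple collection: starting from the
# node chosen as the denotation of the search term, gather all triples
# within a fixed graph radius, on the assumption that graph proximity
# reflects mutual relevance.
TRIPLES = [  # toy RDF-like data
    ("Guernsey", "rdf:type", "Island"),
    ("Guernsey", "official-language", "French"),
    ("French", "rdf:type", "Language"),
    ("Island", "rdfs:subClassOf", "Place"),
]

def vicinity(node: str, radius: int):
    """Collect triples reachable within `radius` expansion steps of node."""
    frontier, collected = {node}, set()
    for _ in range(radius):
        reached = set()
        for s, p, o in TRIPLES:
            if s in frontier or o in frontier:
                collected.add((s, p, o))
                reached.update({s, o})
        frontier = reached
    return collected

print(sorted(vicinity("Guernsey", 1)))  # the two triples mentioning Guernsey
```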

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the


web, but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, whereas the goal of systems such as START [31], REXTOR [32] and AnswerBus [54] is to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START, the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described by Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit, while DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the


category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose: it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
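The disambiguation policy just described, domain knowledge first, user dialogue only as a fallback, might be sketched like this. The candidate-filtering rule and all names are assumptions made for this sketch, not AquaLog's implementation.

```python
# Sketch of the disambiguation policy: try to resolve an ambiguous term
# with domain knowledge first; only if ambiguity survives, detect it and
# consult the user before going ahead. Illustrative names throughout.

def disambiguate(term, candidates, expected_type, ask_user):
    """Return a single ontology mapping for `term`.

    candidates: list of (ontology_term, type) pairs found for the term.
    expected_type: a type constraint derived from the rest of the triple
                   (a made-up stand-in for real domain knowledge).
    ask_user: callback invoked only when domain knowledge cannot decide.
    """
    # 1. Domain knowledge: keep candidates compatible with the type
    #    expected by the other triple arguments.
    compatible = [c for c, t in candidates if t == expected_type]
    if len(compatible) == 1:
        return compatible[0]
    # 2. Ambiguity not solved: consult the user between the readings.
    return ask_user(term, compatible or [c for c, _ in candidates])

choice = disambiguate(
    "akt",
    [("akt-project", "project"), ("akt-book", "publication")],
    expected_type="project",
    ask_user=lambda term, options: options[0],  # stand-in for a dialogue
)
assert choice == "akt-project"
```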

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge


(Appendix A example: "Name the planet stories that are related to akt": linguistic triple ⟨planet stories, related to, akt⟩; ontology triple ⟨kmi-planet-news-item, mentions-project, akt⟩.)


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base covering a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format; open domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the expected answer (a list of instances, an assertion, etc.) but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
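The grouping of differently phrased questions onto equivalent triples can be illustrated with a minimal sketch. The regex patterns and the triple shape below are invented for illustration; AquaLog actually uses GATE and JAPE grammars, not these hand-written rules.

```python
# Toy illustration: two differently phrased questions reduce to the same
# query triple, so a triple-based classification can treat them as one.
# Patterns and triple forms are made up for this sketch.
import re

PATTERNS = [
    # "who works in X" -> generic-term query about persons/organizations
    (re.compile(r"who works in (?P<what>.+)", re.I),
     lambda m: ("person/organization", "works", m.group("what"))),
    # "which are the researchers involved in the X project"
    (re.compile(
        r"which are the researchers involved in the (?P<what>.+) project",
        re.I),
     lambda m: ("person/organization", "works", m.group("what"))),
]

def to_triple(question: str):
    """Reduce a question to its query triple, or None if unrecognized."""
    q = question.rstrip("?").strip()
    for pattern, build in PATTERNS:
        m = pattern.fullmatch(q)
        if m:
            return build(m)
    return None

q1 = to_triple("Who works in akt?")
q2 = to_triple("Which are the researchers involved in the akt project?")
# both phrasings land in the same equivalence class
assert q1 == q2 == ("person/organization", "works", "akt")
```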

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input, and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples(a)

Wh-generic term:
- "Who are the researchers in the semantic web research area?": linguistic triple ⟨person/organization, researchers, semantic web research area⟩; ontology triple ⟨researcher, has-research-interest, semantic-web-area⟩
- "What are the phd students working for buddy space?": linguistic triple ⟨phd students, working, buddy space⟩; ontology triple ⟨phd-student, has-project-member/has-project-leader, buddyspace⟩




Appendix A (Continued)

Wh-unknown term:
- "Show me the job title of Peter Scott": linguistic triple ⟨which is, job title, peter scott⟩; ontology triple ⟨which is, has-job-title, peter-scott⟩
- "What are the projects of Vanessa?": linguistic triple ⟨which is, projects, vanessa⟩; ontology triple ⟨project, has-project-member/has-project-leader, vanessa⟩

Wh-unknown relation:
- "Are there any projects about semantic web?": linguistic triple ⟨projects, semantic web⟩; ontology triple ⟨project, addresses-generic-area-of-interest, semantic-web-area⟩
- "Where is the Knowledge Media Institute?": linguistic triple ⟨location, knowledge media institute⟩; no relations found in the ontology

Description:
- "Who are the academics?": linguistic triple ⟨who/what is, academics⟩; ontology triple ⟨who/what is, academic-staff-member⟩
- "What is an ontology?": linguistic triple ⟨who/what is, ontology⟩; ontology triple ⟨who/what is, ontologies⟩

Affirmative/negative:
- "Is Liliana a phd student in the dip project?": linguistic triple ⟨liliana, phd student, dip project⟩; ontology triple ⟨liliana-cabral (phd-student), has-project-member, dip⟩
- "Has Martin Dzbor any interest in ontologies?": linguistic triple ⟨martin dzbor, has any interest, ontologies⟩; ontology triple ⟨martin-dzbor, has-research-interest, ontologies⟩

Wh-3 terms:
- "Does anybody work in semantic web on the dotkom project?": linguistic triple ⟨person/organization, works, semantic web, dotkom project⟩; ontology triples ⟨person, has-research-interest, semantic-web-area⟩ ⟨person, has-project-member/has-project-leader, dot-kom⟩
- "Tell me all the planet news written by researchers in akt": linguistic triple ⟨planet news, written, researchers, akt⟩; ontology triples ⟨kmi-planet-news-item, has-author, researcher⟩ ⟨researcher, has-project-member/has-project-leader, akt⟩

Wh-3 terms (first clause):
- "Which projects on information extraction are sponsored by epsrc?": linguistic triple ⟨projects, sponsored, information extraction, epsrc⟩; ontology triples ⟨project, addresses-generic-area-of-interest, information-extraction⟩ ⟨project, involves-organization, epsrc⟩

Wh-3 terms (unknown relation):
- "Is there any publication about magpie in akt?": linguistic triple ⟨publication, magpie, akt⟩; ontology triples ⟨publication, mentions-project/has-key-publication/has-publication, magpie⟩ ⟨publication, mentions-project, akt⟩


Wh-unknown term (with clause):
- "What are the contact details from the KMi academics in compendium?": linguistic triple ⟨which is, contact details, kmi academics, compendium⟩; ontology triples ⟨which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member⟩ ⟨kmi-academic-staff-member, has-project-member, compendium⟩

Wh-combination (and):
- "Does anyone has interest in ontologies and is a member of akt?": linguistic triples ⟨person/organization, has interest, ontologies⟩ ⟨person/organization, member, akt⟩; ontology triples ⟨person, has-research-interest, ontologies⟩ ⟨research-staff-member, has-project-member, akt⟩

Wh-combination (or):
- "Which academics work in akt or in dotcom?": linguistic triples ⟨academics, work, akt⟩ ⟨academics, work, dotcom⟩; ontology triples ⟨academic-staff-member, has-project-member, akt⟩ ⟨academic-staff-member, has-project-member, dot-kom⟩

Wh-combination conditioned:
- "Which KMi academics work in the akt project sponsored by epsrc?": linguistic triples ⟨kmi academics, work, akt project⟩ ⟨which is, sponsored, epsrc⟩; ontology triples ⟨kmi-academic-staff-member, has-project-member, akt⟩ ⟨project, involves-organization, epsrc⟩
- "Which KMi academics working in the akt project are sponsored by epsrc?": linguistic triples ⟨kmi academics, working, akt project⟩ ⟨kmi academics, sponsored, epsrc⟩; ontology triples ⟨kmi-academic-staff-member, has-project-member, akt⟩ ⟨kmi-academic-staff-member, has-affiliated-person, epsrc⟩
- "Give me the publications from phd students working in hypermedia": linguistic triples ⟨which is, publications, phd students⟩ ⟨which is, working, hypermedia⟩; ontology triples ⟨publication, mentions-person/has-author, phd student⟩ ⟨phd-student, has-research-interest, hypermedia⟩

Wh-generic with wh-clause:
- "What researchers who work in compendium have interest in hypermedia?": linguistic triples ⟨researchers, work, compendium⟩ ⟨researchers, has-interest, hypermedia⟩; ontology triples ⟨researcher, has-project-member, compendium⟩ ⟨researcher, has-research-interest, hypermedia⟩




Appendix A (Continued)

Wh-patterns:
- "What is the homepage of Peter who has interest in semantic web?": linguistic triples ⟨which is, homepage, peter⟩ ⟨person/organization, has interest, semantic web⟩; ontology triples ⟨which is, has-web-address, peter-scott⟩ ⟨person, has-research-interest, semantic-web-area⟩
- "Which are the projects of academics that are related to the semantic web?": linguistic triples ⟨which is, projects, academics⟩ ⟨which is, related, semantic web⟩; ontology triples ⟨project, has-project-member, academic-staff-member⟩ ⟨academic-staff-member, has-research-interest, semantic-web-area⟩

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



these correspond to two classes or two instances connected by the relation.

Therefore, in the question "who collaborates with the knowledge media institute?", the context of the mapping from "collaborates" to "has-affiliation-to-unit" is given by the two arguments "person" (in fact, in the ontology "who" is always translated into "person" or "organization") and "knowledge media institute". What is stored in the database for future reuse is the new word (which is also the key field used to access the lexicon during the matching method), its mapping in the ontology, the two context-arguments, the name of the ontology and the user details.

6.3. Context generalization

This kind of recorded context is quite specific and does not let other questions benefit from the same learned mapping. For example, if afterwards we asked "who collaborates with the Edinburgh department of informatics?", we would not get an appropriate matching, even if the mapping made sense again in this case.

In order to generalize these results, the strategy adopted is to record the most generic classes in the ontology which correspond to the two triple's arguments and, at the same time, can handle the same relation. In our case we would store the concepts "people" and "organization-unit". This is achieved through a backtracking algorithm in the Learning Mechanism that takes the relation, identifies its type (the type already corresponds to the highest possible class of one argument, by definition) and goes through all the connected superclasses of the other argument, while checking if they can handle that same relation with the given type. Thus only the highest suitable classes of an ontology's branch are kept, and all the questions similar to the ones we have seen will fall within the same set, since their arguments are subclasses or instances of the same concepts.

If we go back to the first example presented ("who collaborates with the knowledge media institute?"), we can see that the difference in meaning between "collaborate" → "has-affiliation-to-unit" and "collaborate" → "works-in-the-same-project" is preserved, because the two mappings entail two different contexts. Namely, in the first case the context is given by ⟨people⟩ and ⟨organization-unit⟩, while in the second case the context will be ⟨organization⟩ and ⟨organization-unit⟩. Any other matching could not mistake the two, since what is learned is abstract but still specific enough to rule out the different cases.

A similar example can be found in the question "who are the researchers working in akt?" (see Fig. 18): the context of the mapping from "working" to "has-research-member" is given by the highest ontology classes which correspond to the two arguments, "researcher" and "akt", as far as they can handle the same relation. What is stored in the database for future reuse is the new word, its mapping in the ontology, the two context-arguments (in our case we would store the concepts "people" and "project"), the name of the ontology and the user details.

Other questions which have a similar context benefit from the same learned mapping. For example, if afterwards we asked "who are the people working in buddyspace?", we would get an appropriate matching from "working" to "has-research-member".

Fig. 18. Users' disambiguation example.

It is also important to notice that the Learning Mechanism does not have a question classification of its own, but relies on the RSS classification. Namely, the question is first recognized and then worked out differently depending on the specific differences.

For example, we can have a generic-question learning ("who is involved in the AKT project?"), where the first generic argument ("Who/which") needs more processing, since it can be mapped to both a person or an organization; or an unknown-relation learning ("is there any project about the semantic web?"), where the user selection and the context are mapped to the apparent lack of a relation in the linguistic triple. Also, combinations of questions are taken into account, since they can be decomposed into a series of basic queries.

An interesting case is when the relation in the linguistic triple is not found in the ontology and is then recognized as a concept. In this case the learning mechanism performs an iteration of the matching in order to capture the real form of the question, and queries the database as it would for an unknown-relation type of question (see Fig. 19). The data stored in the database during the learning of a concept-case are the same as they would have been for a normal unknown-relation type of question, plus an additional field that serves as a reminder of the fact that we are within this exception. Therefore, during the matching, the database is queried as it would be for an unknown-relation type, plus the constraint that it is a concept. In this way the version space is reduced only to the cases we are interested in.

6.4. User profiles and communities

Another important feature of the learning mechanism is its support for a community of users. As said above, the user details are maintained within the version space and can be considered when interpreting a query. AquaLog allows the user to enter his/her personal information and thus to log in and start a session where all the actions performed on the learned lexicon table are also strictly connected to his/her profile (see Fig. 20). For example, during a specific user-session it is possible to delete some previously recorded mappings, an action that is not permitted to the generic user. The latter has the roughest access to the learned material: having no constraints on the user field,

Fig. 19. Learning mechanism matching example.

the database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user; namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service, this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.
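The difference between a logged-in user's session and the generic user's unconstrained access might look like this in a sketch; the record layout is a toy stand-in for the learned lexicon table, not AquaLog's database.

```python
# Sketch of profile-aware lookup: a logged-in user sees only mappings
# tied to their own profile, while the generic user matches with no
# constraint on the user field and therefore gets every stored meaning.
# Toy record layout, invented for this illustration.
lexicon = [
    {"word": "works", "mapping": "has-project-member",
     "user": "phd-student-1"},
    {"word": "works", "mapping": "has-affiliation-to-unit",
     "user": "admin-1"},
]

def lookup(word, user=None):
    """user=None models the generic user: no constraint on the user field."""
    return [record["mapping"] for record in lexicon
            if record["word"] == word
            and (user is None or record["user"] == user)]

# a session user gets only their own jargon's mapping
assert lookup("works", user="phd-student-1") == ["has-project-member"]
# the generic user gets many more mappings, some likely undesired
assert lookup("works") == ["has-project-member", "has-affiliation-to-unit"]
```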

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind, and know something about the domain, to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base; but, as pointed out in Ref. [14], though users have to have some idea of the contents of the domain, they may have some misconceptions, and therefore, to be realistic, some process of familiarization at least is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).



[...] considering that almost no linguistic restriction was imposed on the questions. A closer analysis shows that 7 of the 69 queries, 10.14% of [...]


We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only counted when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., there is no appropriate ontology relation or term to map to, or the ontology is populated wrongly in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.
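For illustration only, the bookkeeping behind these categories can be sketched in a few lines of Python. This is not AquaLog code; the labels and the `summarize` helper are our own. Each query carries a set of failure labels, an empty set means it was answered, and a query with only a conceptual failure still counts as a correctly created Onto-Triple, since the missing knowledge is the ontology's fault.

```python
from collections import Counter

def summarize(failures_per_query):
    """Tally per-query failure labels and return percentage statistics."""
    total = len(failures_per_query)
    # A query may carry several failure labels at once.
    by_label = Counter(l for labels in failures_per_query for l in labels)
    # Handled: no failure at all, or only a conceptual failure.
    handled = sum(1 for f in failures_per_query if f <= {"conceptual"})
    failed = total - handled
    pct = lambda n: round(100.0 * n / total, 2)
    return {
        "handled_pct": pct(handled),
        "failed_pct": pct(failed),
        "per_label_pct": {l: pct(n) for l, n in by_label.items()},
    }

# Toy run with four hypothetical queries.
stats = summarize([set(), {"conceptual"}, {"linguistic"}, {"rss", "service"}])
print(stats["handled_pct"])  # 50.0
```

The subset test `f <= {"conceptual"}` is what encodes the convention above: a conceptual failure alone does not count against AquaLog.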

The improvement achieved by using the Learning Mechanism (LM) to reduce the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, with an empty database. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. The NL query is formalized into predicates that correspond to classes and relations in the ontology; further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology; a valid answer is thus one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Note that a conceptual failure is not considered an AquaLog failure, because in that case the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure; however, the performance of AquaLog was also evaluated for this case. The user can decide whether or not to use the LM, and that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11; therefore, the knowledge base is dynamically populated and constantly changing, and, as a result, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities; however, it is quite difficult to devise a sublanguage which is sufficiently expressive, therefore we also evaluate whether a question can be easily reformulated so as to be understood by AquaLog. A second aim of the experiment was to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information, and therefore asked them not to pose questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure; therefore, only 47.82% of the total (33 of 69 queries) reached the ontology mapping stage.

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries: (40 of 69) 58%
    No conceptual failure, but no answer because the ontology is not populated with data: (10 of 33) 30.30%
  Total number of successful queries without conceptual failure: (33 of 69) 47.82%
  Total number of queries with only conceptual failure: (7 of 69) 10.14%

AquaLog failures
  Total number of failures (the same query can fail for more than one reason): (29 of 69) 42%
  RSS failure: (8 of 69) 11.59%
  Service failure: (10 of 69) 14.49%
  Data model failure: (0 of 69) 0%
  Linguistic failure: (16 of 69) 23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated: (19 of 29) 65.5% of failures
    With conceptual failure: (7 of 19) 36.84%

Bold values represent the final results (%).


Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, only 33.33% of the queries (23 of 69) will produce a result (a non-empty answer) for the user. These limitations on coverage will not be obvious to the user and, as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations of one ontology can be overcome by another.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8; here, however, we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, as when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, e.g., tokens not recognized as a verb by GATE, as in "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers such as a year, e.g., "in 1995".


• Data model failure. Intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, such as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, using string metrics and WordNet synonyms, not by meaning; e.g., for "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, and for "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) failure to recognize concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, such as the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following keywords that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures.


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM      LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77% (17 of 45)   40% (18 of 45)                      64.44% (29 of 45)
1 iteration with user          35.55% (16 of 45)   35.55% (16 of 45)                   35.55% (16 of 45)
2 iterations with user         20% (9 of 45)       20% (9 of 45)                       0% (0 of 45)
3 iterations with user         6.66% (3 of 45)     4.44% (2 of 45)                     0% (0 of 45)
Total user iterations          (43 iterations)     (40 iterations)                     (16 iterations)



Three kinds of services have been identified: ranking, similarity/comparative, and proportions/percentages; these remain a line of work for future versions of the system.
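The syntactic-matching pitfall behind the RSS failures above is easy to reproduce. The sketch below uses Python's difflib as a stand-in for the string-metric package AquaLog actually employs; the candidate relation labels are taken from the "contact Vanessa" example:

```python
from difflib import SequenceMatcher

def best_relation(query_relation, candidate_relations):
    """Pick the candidate whose label is most similar, character-wise,
    to the relation word in the query (purely syntactic matching)."""
    score = lambda r: SequenceMatcher(None, query_relation, r).ratio()
    return max(candidate_relations, key=score)

candidates = ["employment-contract-type", "have-interest", "works-for"]
# "contact" is closest, character-wise, to "employment-contract-type",
# even though it is semantically the wrong relation for the question.
print(best_relation("contact", candidates))  # employment-contract-type
```

This is exactly why matching by label similarity alone, without meaning, can select a plausible-looking but wrong ontology relation.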

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions a question can easily be reformulated by the user to avoid these failures, so that it can be recognized by AquaLog (linguistic failure), avoid the use of nominal compounds (typical RSS failure), or avoid unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can easily be avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions by allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can improve, in the best case, from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peter"s present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need two or three iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries requiring no iteration to 40%). Thanks to the use of the notion of context to find similar, but not identical, learned queries (explained in Section 6), the LM can help to disambiguate a query even the first time the query is presented to the system. These numbers may seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small. Logically, the performance of the system for the first iteration of the LM improves as the number of queries increases.
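As a rough illustration of how such context-based reuse removes user iterations, the sketch below stores learned mappings keyed by the classes of the triple's arguments, so a lexical relation learned for one pair of instances generalizes to a similar, not identical, query. This is only our reading of the context idea of Section 6, not AquaLog's actual data structures; `learn` and `recall` are hypothetical helpers:

```python
# Learned mappings, keyed by (relation word, class of arg 1, class of arg 2).
learned = {}

def learn(word, arg1_class, arg2_class, ontology_relation):
    """Record the user's disambiguation choice for this context."""
    learned[(word, arg1_class, arg2_class)] = ontology_relation

def recall(word, arg1_class, arg2_class):
    """Reuse a mapping for any query with the same word and argument classes."""
    return learned.get((word, arg1_class, arg2_class))

# One user iteration teaches the mapping...
learn("works", "person", "project", "has-project-member")
# ...and a different person/project pair needs no iteration at all.
print(recall("works", "person", "project"))  # has-project-member
```

Keying on classes rather than instances is what lets a single disambiguation serve a whole community using the same jargon.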

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is no direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, and no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food, a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

12 http://plainmoor.open.ac.uk:3000/webonto

7.3.1. AquaLog assumptions over the ontology
In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require a lot of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The ldquolight-weightrdquo semantic information imposes a fewequirements on the ontology for AquaLog to get an answer Thentology must be structured as a directed labeled graph (taxon-my and relationships) and the requirement to get an answer ishat the ontology must be populated (triples) in such a way thathere is a short (two relations or less) direct path between theuery terms when mapped to the ontology

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog for obtaining answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instance definitions in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink riesling)))



Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, it does not present complete coverage in terms of instances and terminology; therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error; following this, we can consider such queries as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of the questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are avoided by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%

AquaLog failures
  Total number of failures (the same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
    With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8; here, however, we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?", are not classified. These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred; all questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations will have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can easily be reformulated by the user into a question for which the RSS can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components: we did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, which is strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" to an instance, whereas "seafood" is a class in the food ontology. This case should be dealt with in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of question being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations; "what" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales with the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to answer "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., “she”, “they”) and possessive determiners (e.g., “his”, “theirs”) are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., “is there any [...] who/that/which [...]”.


96 V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase “who was the author” were converted into “who wrote”. That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The number of triples is fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like “French researchers” a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is in the question “which researchers have publications in KMi related to social aspects?”, where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation “have publications” refers to “publication has author” in the ontology. Therefore, we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.
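The expansion above, where one linguistic relation turns out to denote an intermediate ontology concept, can be sketched as a simple rewrite over the query triples. This is an invented illustration of the idea, not the RSS implementation; the mapping table and function names are assumptions.

```python
# Illustrative sketch: when a linguistic relation such as "have publications"
# denotes a mediating concept in the ontology ("publication has author"),
# the two-triple reading is expanded into three ontology-compliant triples.

MEDIATED = {"have publications": "publications"}  # relation -> mediating term

def expand(linguistic_triples):
    out = []
    for subj, rel, obj in linguistic_triples:
        if rel in MEDIATED:
            mid = MEDIATED[rel]
            out.append((subj, "?", mid))   # e.g. <researchers, ?, publications>
            out.append((mid, "?", obj))    # e.g. <publications, ?, KMi>
        else:
            out.append((subj, rel, obj))
    return out

query = [("researchers", "have publications", "KMi"),
         ("publications", "related", "social aspects")]
print(expand(query))  # three triples instead of two
```

The relations left as "?" would then be resolved against the ontology by the similarity services.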

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question “who collaborates with IBM?”, where there is not a “collaboration” relation between a person and an organization, but there is a relation between a “person” and a “grant”, and a “grant” with an “organization”. The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term “grant” (see also the “meal course” example in Section 7.3, using the wine ontology).
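The query expansion described here amounts to finding an indirect path between two concepts in the ontology's relation graph. The following is a minimal sketch of that idea under invented names (it is not AquaLog's implementation), using breadth-first search over a toy relation set:

```python
from collections import deque

# Sketch: find an indirect path between two concepts by breadth-first
# search over the ontology's relation graph (toy relations invented).

RELATIONS = [("person", "works-on", "grant"),
             ("grant", "awarded-by", "organization"),
             ("person", "member-of", "department")]

def find_path(start, goal):
    graph = {}
    for s, rel, o in RELATIONS:            # treat relations as undirected edges
        graph.setdefault(s, []).append((rel, o))
        graph.setdefault(o, []).append((rel, s))
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

print(find_path("person", "organization"))  # ['person', 'grant', 'organization']
```

For “who collaborates with IBM?” the search would surface “grant” as the bridging term, turning one unanswerable triple into two answerable ones.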

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question


“who is the director in KMi?”. In the ontology it may be represented as “Enrico Motta – has job title – KMi director”, where “KMi director” is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to “what is the job title of Enrico?”, so we would get “KMi-director” as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., “semantic web papers”, “research department”). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a “large company” may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example “list all applicants who live in California and Arizona”. In this case the word “and” can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for “and” to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences “which academic works with peter who has interests in the semantic web” and “which academics work with peter who has interests in the semantic web”. The former example is truly ambiguous: either “Peter” or the “academic” could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb “has” is in the third person singular; therefore the second part, “has interests in the semantic web”, must be linked to “Peter” for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
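The agreement heuristic just described (which, as noted, AquaLog does not implement) can be sketched in a few lines. The word lists and function name are invented for the example:

```python
# Sketch of the verb-agreement heuristic for relative-clause attachment:
# if the main subject is plural but the clause verb is third person
# singular, the clause must attach to the singular proper noun.

THIRD_SINGULAR = {"has", "works", "is"}
PLURAL_SUBJECTS = {"academics", "researchers"}

def attach_relative_clause(subject, clause_verb, proper_noun):
    """Return the word the 'who ...' clause most plausibly modifies,
    or None when the sentence stays genuinely ambiguous."""
    if subject in PLURAL_SUBJECTS and clause_verb in THIRD_SINGULAR:
        return proper_noun          # only the singular noun agrees
    return None                     # ambiguous: ask the user

print(attach_relative_clause("academics", "has", "Peter"))  # Peter
print(attach_relative_clause("academic", "has", "Peter"))   # None (ambiguous)
```

As the text observes, such a heuristic only pays off when questions are well-formed; with ungrammatical input, falling back to user clarification is the safer behaviour.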


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like “what are the most successful projects?” or “what are the five key projects in KMi?”. To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question “what are the cool projects for a PhD student?” is asked, the system would not be able to solve the linguistic ambiguity of the words “cool project”. The first time, some help from the user is needed, who may select “all projects which belong to a research area in which there are many publications”. A mapping is therefore created between “cool projects”, “research area” and “publications”, so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: “what are the cool projects for a professor?”. What he means by saying “cool projects” is “funding opportunity”. Therefore, the two mappings entail two different contexts, depending on the users’ community.
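The community-sensitive mapping described in the last two paragraphs can be sketched as a lexicon keyed jointly by community and phrase. The class and its API are invented for illustration; AquaLog's actual learning mechanism stores richer context.

```python
# Sketch: a community-sensitive jargon lexicon, so that the same phrase
# maps to different ontology vocabulary per community (API invented).

class JargonLexicon:
    def __init__(self):
        self._maps = {}

    def learn(self, community, phrase, ontology_terms):
        self._maps[(community, phrase)] = ontology_terms

    def lookup(self, community, phrase):
        return self._maps.get((community, phrase))

lex = JargonLexicon()
lex.learn("phd-students", "cool projects", ["research area", "publications"])
lex.learn("professors", "cool projects", ["funding opportunity"])

print(lex.lookup("phd-students", "cool projects"))  # ['research area', 'publications']
print(lex.lookup("professors", "cool projects"))    # ['funding opportunity']
```

Keying on the community keeps the two readings of “cool projects” from overwriting each other.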

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about “coolness” in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog’s linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and “openness” is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KB/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers and, as pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to


what extent AquaLog’s novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user’s request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle “what is the capital of Italy?”, “print the capital of Italy”, “Could you please tell me the capital of Italy?”. The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me”, then instead of giving a failure it could get the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN’s PARLANCE, IBM’s LANGUAGE ACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user’s question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.
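The path-based flavour of semantic similarity that FAQ Finder draws from WordNet can be illustrated on a toy is-a hierarchy. The miniature taxonomy and the scoring function below are invented for the example (real WordNet scores run over its full hypernym graph):

```python
# Toy illustration of WordNet-style path similarity over a hand-coded
# is-a hierarchy (taxonomy invented for the example).

HYPERNYM = {"dog": "mammal", "cat": "mammal", "mammal": "animal",
            "trout": "fish", "fish": "animal", "animal": None}

def ancestors(term):
    chain = [term]
    while HYPERNYM.get(term):
        term = HYPERNYM[term]
        chain.append(term)
    return chain

def path_similarity(a, b):
    """1 / (1 + number of is-a edges on the path through the lowest
    common ancestor), in the style of WordNet path scores."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    for i, node in enumerate(anc_a):
        if node in anc_b:
            return 1.0 / (1 + i + anc_b.index(node))
    return 0.0

print(path_similarity("dog", "cat"))    # 0.333... (via 'mammal')
print(path_similarity("dog", "trout"))  # 0.2 (via 'animal')
```

A score of zero, i.e., no connecting path at all, is exactly the "words not found in the KB" failure mode discussed above.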

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user’s question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user’s query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, like “who” being the same as “person”, are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account, so if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user’s query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE because the word “neighborhood” is unknown. However, AquaLog would not necessarily know the term “neighborhood”, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “Identifying the semantic type of the entity sought by the question”; (2) “Determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week” it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA.”

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.’s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by “what is”, “who”, “where”, “tell me”, etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question “what do penguins eat?” as food, because “it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding”. All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that “in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label”. Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

“the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches.”

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the


web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from

text.

START focuses on questions about geography and the MIT InfoLab. AquaLog’s relational data model (triple-based) is somewhat similar to the approach adopted by START, called “object-property-value”. The difference is that, instead of properties, we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], “what languages are spoken in Guernsey?”: for START, the property is “languages” between the Object “Guernsey” and the value “French”; for AquaLog, it will be translated into a relation “are spoken” between a term “language” and a location “Guernsey”.

The system described by Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triple: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple); the category of the triple is the driving force which determines the type of processing.
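A minimal sketch of what "the category of the triple being the driving force" could look like; the category names and triple shapes below are illustrative, not AquaLog's internals.

```python
# Hypothetical dispatch keyed on triple shape rather than on the wh-word,
# mirroring the contrast drawn above between DIMAP and AquaLog.
def categorize(triples):
    if len(triples) == 1:
        _subject, relation, _obj = triples[0]
        if relation is None:
            return "wh-unknown-relation"   # e.g. <projects, ?, semantic web>
        return "wh-generic"
    return "wh-combination"                # two or more triples: combined processing

print(categorize([("projects", None, "semantic web")]))
print(categorize([("academics", "work", "akt"), ("academics", "work", "dotcom")]))
```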

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. The approach and scenario therefore have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage: its query classification is guided by the equivalent semantic representations or triples; the mapping process converts the elements of the triple into entry-points to the ontology and the KB; and there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all the ontologies available on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain: they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB; for a given concept, the applicable question types are those attached either to the concept itself or to any of its superclasses. A difference between this system and the others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
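The disambiguation policy (domain knowledge first, user consultation only as a fallback) can be sketched as follows; the type filter below is a toy stand-in for AquaLog's actual similarity services, and all names are our own.

```python
# Sketch of the disambiguation policy described above: solve with domain
# knowledge where possible, otherwise detect the ambiguity and consult
# the user before going ahead.
def disambiguate(candidates, expected_type, ask_user):
    compatible = [c for c in candidates if c["type"] == expected_type]
    if len(compatible) == 1:
        return compatible[0]                   # solved by domain knowledge
    return ask_user(compatible or candidates)  # ambiguity detected, user consulted

candidates = [{"name": "akt", "type": "project"}, {"name": "akt", "type": "event"}]
print(disambiguate(candidates, "project", ask_user=lambda cs: cs[0])["type"])
```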

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open domain systems, based on the expected answer (person, location, etc.) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).
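The grouping of equivalent questions can be illustrated with two toy patterns; the regular expressions and the normalized triple below are our own simplification, not AquaLog's linguistic component.

```python
# Two surface questions collapsing to one normalized triple, as in the
# "Who works in akt?" example above (patterns are invented for illustration).
import re

def linguistic_triple(question):
    q = question.lower().rstrip("?")
    m = re.match(r"who works in (\w+)", q)
    if m:
        return ("person/organization", "works", m.group(1))
    m = re.match(r"which are the (\w+) involved in the (\w+) project", q)
    if m:
        return ("person/organization", "works", m.group(2))  # normalized form
    return None

print(linguistic_triple("Who works in akt?"))
print(linguistic_triple("Which are the researchers involved in the akt project?"))
```

Both calls yield the same triple, which is the sense in which the two questions belong to one group.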

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are therefore evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
Q: Who are the researchers in the semantic web research area?
Linguistic triple: 〈person/organization, researchers, semantic web research area〉
Ontology triple (a): 〈researcher, has-research-interest, semantic-web-area〉

Q: What are the phd students working for buddy space?
Linguistic triple: 〈phd students, working, buddy space〉
Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Q: Name the planet stories that are related to akt.
Linguistic triple: 〈planet stories, related to, akt〉
Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term
Q: Show me the job title of Peter Scott.
Linguistic triple: 〈which is, job title, peter scott〉
Ontology triple: 〈which is, has-job-title, peter-scott〉

Q: What are the projects of Vanessa?
Linguistic triple: 〈which is, projects, vanessa〉
Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
Q: Are there any projects about semantic web?
Linguistic triple: 〈projects, semantic web〉
Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Q: Where is the Knowledge Media Institute?
Linguistic triple: 〈location, knowledge media institute〉
Ontology triple: no relations found in the ontology

Description
Q: Who are the academics?
Linguistic triple: 〈who/what is, academics〉
Ontology triple: 〈who/what is, academic-staff-member〉

Q: What is an ontology?
Linguistic triple: 〈who/what is, ontology〉
Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
Q: Is Liliana a phd student in the dip project?
Linguistic triple: 〈liliana, phd student, dip project〉
Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

Q: Has Martin Dzbor any interest in ontologies?
Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
Q: Does anybody work in semantic web on the dotkom project?
Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Q: Tell me all the planet news written by researchers in akt.
Linguistic triple: 〈planet news, written, researchers, akt〉
Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
Q: Which projects on information extraction are sponsored by epsrc?
Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
Q: Is there any publication about magpie in akt?
Linguistic triple: 〈publication, magpie, akt〉
Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
Q: What are the contact details from the KMi academics in compendium?
Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
Q: Does anyone has interest in ontologies and is a member of akt?
Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
Q: Which academics work in akt or in dotcom?
Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
Q: Which KMi academics work in the akt project sponsored by epsrc?
Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Q: Which KMi academics working in the akt project are sponsored by epsrc?
Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Q: Give me the publications from phd students working in hypermedia.
Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
Ontology triples: 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
Q: What researchers who work in compendium have interest in hypermedia?
Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
Q: What is the homepage of Peter who has interest in semantic web?
Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Q: Which are the projects of academics that are related to the semantic web?
Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, second ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

• AquaLog: An ontology-driven question answering system for organizational semantic intranets
  • Introduction
  • AquaLog in action: illustrative example
  • AquaLog architecture
    • AquaLog triple approach
    • Linguistic component: from questions to query triples
      • Classification of questions
      • Linguistic procedure for basic queries
      • Linguistic procedure for basic queries with clauses (three-terms queries)
      • Linguistic procedure for combination of queries
      • The use of JAPE grammars in AquaLog: illustrative example
    • Relation similarity service: from triples to answers
      • RSS procedure for basic queries
      • RSS procedure for basic queries with clauses (three-term queries)
      • RSS procedure for combination of queries
      • Class similarity service
      • Answer engine
    • Learning mechanism for relations
      • Architecture
      • Context definition
      • Context generalization
      • User profiles and communities
  • Evaluation
    • Correctness of an answer
    • Query answering ability
      • Conclusions and discussion
        • Results
        • Analysis of AquaLog failures
        • Reformulation of questions
        • Performance
    • Portability across ontologies
      • AquaLog assumptions over the ontology
      • Conclusions and discussion
        • Results
        • Analysis of AquaLog failures
        • Similarity failures and reformulation of questions
  • Discussion: limitations and future lines
    • Question types
    • Question complexity
    • Linguistic ambiguities
    • Reasoning mechanisms and services
    • New research lines
  • Related work
    • Closed-domain natural language interfaces
    • Open-domain QA systems
    • Open-domain QA systems using triple representations
    • Ontologies in question answering
    • Conclusions of existing approaches
  • Conclusions
  • Acknowledgements
  • Examples of NL queries and equivalent triples
  • References

hanis

tl

aibapsubiacs

7

kvwtRtbr

Fig 19 Learning mec

The database query will return many more mappings, and quite likely also meanings that are not desired.

Future work on the learning mechanism will investigate the augmentation of the user-profile. In fact, through a specific auxiliary ontology that describes a series of user profiles, it would be possible to infer connections between the type of mapping and the type of user. Namely, it would be possible to correlate a particular jargon to a set of users. Moreover, through a reasoning service this correlation could become dynamic, being continually extended or diminished consistently with the relations between users' choices and users' information. For example, if the LM detects that a number of registered users, all characterized as PhD students, keep employing the same jargon, it would extend the same mappings to all the other registered PhD students.
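The profile-based extension described above can be sketched as follows. This is an illustrative sketch only, not AquaLog's actual code; the class and method names are hypothetical, and the threshold of two agreeing users is an assumption made for the example.

```python
# Hypothetical sketch: propagating a learned jargon mapping to all users
# who share a profile (e.g. "phd-student"), as described in the text.
from collections import defaultdict

class LearningMechanism:
    def __init__(self):
        self.user_mappings = {}            # (user, term) -> ontology relation
        self.profiles = defaultdict(set)   # profile name -> set of users

    def register(self, user, profile):
        self.profiles[profile].add(user)

    def learn(self, user, term, relation, profile):
        self.user_mappings[(user, term)] = relation
        # If several users with the same profile agree on this mapping,
        # extend it to every other user of that profile (assumed threshold: 2).
        agreeing = [u for u in self.profiles[profile]
                    if self.user_mappings.get((u, term)) == relation]
        if len(agreeing) >= 2:
            for u in self.profiles[profile]:
                self.user_mappings.setdefault((u, term), relation)

    def lookup(self, user, term):
        return self.user_mappings.get((user, term))

lm = LearningMechanism()
for u in ("ann", "bob", "carol"):
    lm.register(u, "phd-student")
lm.learn("ann", "works on", "has-research-interest", "phd-student")
lm.learn("bob", "works on", "has-research-interest", "phd-student")
# carol never used the term, but inherits the mapping from her profile
print(lm.lookup("carol", "works on"))  # -> has-research-interest
```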

Fig. 20. Architecture to support a users' community.

7. Evaluation

AquaLog allows users who have a question in mind and know something about the domain to query the semantic markup viewed as a knowledge base. The aim is to provide a system which does not require users to learn specialized vocabulary or to know the structure of the knowledge base but, as pointed out in Ref. [14], though they have to have some idea of the contents of the domain, they may have some misconceptions. Therefore, to be realistic, some process of familiarization, at least, is normally required.

A full evaluation of AquaLog requires both an evaluation of its query answering ability and of the overall user experience (Section 7.2). Moreover, because one of our key aims is to make AquaLog an interface for the semantic web, the portability across ontologies will also have to be addressed formally (Section 7.3).



We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g. linguistic failure and service failure, though conceptual failure is only computed when there is not any other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g. lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g. instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.

The improvement achieved through the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.
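The counting rule above (a conceptual failure is only counted when no other kind of failure occurred) can be made concrete with a small sketch. This is illustrative bookkeeping only, with hypothetical names; it is not AquaLog's implementation.

```python
# Sketch of the failure bookkeeping described above: a query may fail at
# several levels, but a conceptual failure is only counted when no other
# kind of failure occurred for that query.
from enum import Enum, auto

class Failure(Enum):
    LINGUISTIC = auto()   # NLP component cannot build the triples
    DATA_MODEL = auto()   # query too complex for the triple representation
    RSS = auto()          # triples cannot be mapped to the ontology
    SERVICE = auto()      # a needed service (ranking, comparison...) is missing
    CONCEPTUAL = auto()   # the ontology simply does not cover the query

def count_failures(per_query_failures):
    counts = {f: 0 for f in Failure}
    for failures in per_query_failures:
        others = {f for f in failures if f is not Failure.CONCEPTUAL}
        for f in others:
            counts[f] += 1
        # conceptual failure counted only in the absence of any other failure
        if Failure.CONCEPTUAL in failures and not others:
            counts[Failure.CONCEPTUAL] += 1
    return counts

queries = [
    {Failure.LINGUISTIC, Failure.SERVICE},  # two failures, both counted
    {Failure.CONCEPTUAL},                   # counted as conceptual
    {Failure.CONCEPTUAL, Failure.RSS},      # only the RSS failure counts
]
c = count_failures(queries)
print(c[Failure.CONCEPTUAL], c[Failure.RSS])  # -> 1 1
```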

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure. However, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application, built using AquaLog with the AKT ontology10 and the KMi knowledge base, satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal11. Therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e. we not only wanted to assess the current coverage of AquaLog but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g. today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e. approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions.

A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
Total number of successful queries: 40 of 69 (58%); of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
Total number of successful queries without conceptual failure: 33 of 69 (47.82%)
Total number of queries with only conceptual failure: 7 of 69 (10.14%)

AquaLog failures (the same query can fail for more than one reason)
Total number of failures: 29 of 69 (42%)
RSS failure: 8 of 69 (11.59%)
Service failure: 10 of 69 (14.49%)
Data model failure: 0 of 69 (0%)
Linguistic failure: 16 of 69 (23.18%)

AquaLog easily reformulated questions
Total number of queries easily reformulated: 19 of 29 (65.5% of failures)
With conceptual failure: 7 of 19 (36.84%)
No conceptual failure but ontology not populated: 6 of 19 (31.57%)

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g. in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g. "present results", and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, or special names like "dotkom", or numbers like a year, e.g. "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g. terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning, e.g. in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", where due to the string algorithms "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab; people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of services have been identified: ranking, similarity/comparative, and proportions/percentages, which remain a future line of work for future versions of the system.
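The second RSS failure mode above (selection by surface similarity rather than meaning) can be illustrated with a small sketch. This is not AquaLog's code: AquaLog combines several string metrics with WordNet synonyms, while here `difflib` stands in as a generic string metric, and the candidate relation labels are taken from the failure example in the text.

```python
# Sketch: picking the ontology relation whose label is syntactically
# closest to the query term, using difflib as a stand-in string metric.
from difflib import SequenceMatcher

def pick_relation(query_term, candidate_relations):
    # choose the relation label that looks most like the query term
    def score(rel):
        return SequenceMatcher(None, query_term, rel).ratio()
    return max(candidate_relations, key=score)

# "how can I contact Vanessa?": surface matching prefers the wrong relation
candidates = ["employment-contract-type", "has-email-address", "works-for"]
print(pick_relation("contact", candidates))  # -> employment-contract-type
```

Because "contact" shares the substrings "cont" and "act" with "contract", the purely syntactic score favours "employment-contract-type", reproducing the error described above.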


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total iterations | 43 | 40 | 16


7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (a typical RSS failure), or avoiding unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g. if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (45 in total) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto


the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all the frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com
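The GetData-style access described above amounts to simple (resource, property) lookups over a directed labeled graph. The sketch below is our own approximation of the idea, not the TAP API or AquaLog's plugin interface, and the data is illustrative.

```python
# Sketch of GetData-style access: query a directed labeled graph by
# (resource, property) pairs, with no prior knowledge of the overall
# ontology structure. Data and identifiers are illustrative only.
triples = [
    ("john-domingue", "works-for", "kmi"),
    ("john-domingue", "has-interest", "semantic-web"),
    ("kmi", "part-of", "open-university"),
]

def get_data(resource, prop):
    """Return the values of `prop` for `resource` (GetData-like lookup)."""
    return [o for s, p, o in triples if s == resource and p == prop]

print(get_data("john-domingue", "has-interest"))  # -> ['semantic-web']
```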

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g. see Table 3 for example instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))
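The instances in Table 3 create exactly the kind of short path AquaLog needs: two triples linking a food to a wine through a mealcourse individual. A minimal sketch of the two-hop lookup (identifiers follow Table 3; the traversal code itself is ours, not AquaLog's):

```python
# Sketch: the mealcourse instances from Table 3 as triples, and the
# two-hop lookup they enable (food -> mealcourse -> drink).
triples = [
    ("new-course7", "hasfood", "pizza"),
    ("new-course7", "hasdrink", "mariettazinfadel"),
    ("new-course8", "hasfood", "fowl"),
    ("new-course8", "hasdrink", "Riesling"),
]

def drinks_for(food):
    # hop 1: mealcourse instances that contain the food
    courses = {s for s, p, o in triples if p == "hasfood" and o == food}
    # hop 2: the drink attached to each such mealcourse
    return [o for s, p, o in triples if p == "hasdrink" and s in courses]

# "which wine is recommended with fowl?"
print(drinks_for("fowl"))  # -> ['Riesling']
```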


Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does not present complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
Total number of successful queries: 47 of 68 (69.11%)
Total number of successful queries without conceptual failure: 12 of 68 (17.64%)
Total number of queries with only conceptual failure: 35 of 68 (51.47%)

AquaLog failures (the same query can fail for more than one reason)
Total number of failures: 21 of 68 (30.88%)
RSS failure: 15 of 68 (22%)
Service failure: 0 of 68 (0%)
Data model failure: 0 of 68 (0%)
Linguistic failure: 9 of 68 (13.23%)

AquaLog easily reformulated questions
Total number of queries easily reformulated: 14 of 21 (66.66% of failures)
With conceptual failure: 9 of 14 (64.28%)

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g. not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g. "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred here. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS

can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, as in 〈meal course mealcourse〉 and 〈mealcourse has-food nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g. "dessert course"): 〈mealcourse has-drink wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which-is madefromfruit dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g. "wine regions"; or, for instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.
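The reclassification step proposed above (splitting a failed direct triple into two triples routed through a mediating concept) can be sketched as follows. The tiny schema is a hand-made fragment inspired by the wine ontology, not its real structure or AquaLog's code.

```python
# Sketch: when no direct relation links two query terms, fall back to a
# two-triple path through a known mediating concept ("mealcourse" here).
# The `direct` schema fragment is illustrative, not the real ontology.
direct = {
    ("mealcourse", "wine"): "has-drink",
    ("mealcourse", "pork"): "has-food",
}

def map_triple(term1, term2, mediators=("mealcourse",)):
    """Map <term1, ?, term2>; split through a mediator if needed."""
    if (term1, term2) in direct:
        return [(term1, direct[(term1, term2)], term2)]
    for m in mediators:  # try each known mediating concept
        if (m, term1) in direct and (m, term2) in direct:
            return [(m, direct[(m, term1)], term1),
                    (m, direct[(m, term2)], term2)]
    return None  # no mapping found

# "which wines should be served with a meal based on pork?"
print(map_triple("wine", "pork"))
# -> [('mealcourse', 'has-drink', 'wine'), ('mealcourse', 'has-food', 'pork')]
```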

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.

1 Question types

The first limitation of AquaLog related to the type of ques-ions being handled We can distinguish different kind ofuestions affirmativenegative type of questions ldquowhrdquo ques-ions (who what how many ) and commands (Name all )ll of these are treated as questions in AquaLog As pointed out

n Ref [24] there is evidence that some kinds of questions arearder than others when querying free text For example ldquowhyrdquond ldquohowrdquo questions tend to be difficult because they requirenderstanding of causality or instrumental relations ldquoWhatrdquouestions are hard because they provide little constraint on thenswer type By contrast AquaLog can handle ldquowhatrdquo questionsasily as the possible answer types are constrained by the typesf the possible relationships in the ontology

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, the genitive ('s) and negative words such as "not" and "never" (or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are ⟨researchers, have publications, KMi⟩ and ⟨which is/are, related, social aspects⟩. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: ⟨researchers, publications⟩, ⟨publications, KMi⟩ and ⟨publications, related, social aspects⟩, so three triples instead of two are needed.
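The triple decomposition just described can be made concrete with a small sketch; the `QueryTriple` type and the relation labels below are illustrative, not AquaLog's actual internal data structures:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryTriple:
    subject: str
    relation: Optional[str]   # None when the relation is only implicit
    obj: str

# Linguistic triples for:
# "which researchers have publications in KMi related to social aspects?"
linguistic = [
    QueryTriple("researchers", "have publications", "KMi"),
    QueryTriple("which is/are", "related", "social aspects"),
]

# After the RSS maps "have publications" onto the ontology relation
# "publication has author", three relationships must be represented.
ontology_compliant = [
    QueryTriple("researchers", "has author", "publications"),
    QueryTriple("publications", "in", "KMi"),
    QueryTriple("publications", "related", "social aspects"),
]
```

Representing the extra mediating relationship is exactly what turns the two linguistic triples into three ontology-compliant ones.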

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the path in the ontology graph can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
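The required query expansion amounts to a path search over the ontology's relation graph. A minimal breadth-first sketch, using a toy graph with invented relation names:

```python
from collections import deque

# Toy relation graph: concept -> {related concept: relation name}
graph = {
    "person": {"grant": "holds"},
    "grant": {"organization": "funded-by", "person": "held-by"},
    "organization": {"grant": "funds"},
}

def find_path(start: str, goal: str) -> list[str]:
    """Breadth-first search for a chain of relations linking two concepts."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for nxt, rel in graph.get(node, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel]))
    return []

# "who collaborates with IBM?": no direct relation, but a path exists
# through the mediating concept "grant".
print(find_path("person", "organization"))  # -> ['holds', 'funded-by']
```

A real system would also have to rank competing paths and ask the user to confirm the interpretation, but the core operation is this graph traversal.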

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered, for example when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
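A naive version of such a mechanism might first check whether the whole term matches a single ontology label and, failing that, split it into known sub-terms; the label set and the crude plural stripping below are purely illustrative:

```python
# Toy vocabulary; a real system would consult class, instance and
# relation labels in the ontology.
labels = {"researcher", "project", "french", "department"}

def split_compound(term: str) -> list[str]:
    """Return the term as-is if it is a known label; otherwise split it
    into known sub-terms standing in an implicit relationship."""
    if term.lower() in labels:
        return [term]
    words = term.lower().split()
    # Very naive plural stripping; real systems would use a lemmatizer.
    parts = [w for w in words if w in labels or w.rstrip("s") in labels]
    return parts if len(parts) > 1 else [term]

print(split_compound("French researchers"))  # -> ['french', 'researchers']
```

The split result would then trigger the dynamic creation of an extra triple for the implicit relationship between the sub-terms.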

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].
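For one particular ontology, such a ranking service might reduce to sorting instances by a configurable criterion; the project names and citation counts below are invented for illustration:

```python
# Invented data: in a real setting these would be instances retrieved
# from the ontology, and "citations" an ontology-dependent criterion.
projects = [
    {"name": "project-a", "citations": 120},
    {"name": "project-b", "citations": 45},
    {"name": "project-c", "citations": 300},
]

def top_projects(items: list, criterion: str = "citations", n: int = 2) -> list:
    """Rank items by the chosen criterion and keep the best n."""
    ranked = sorted(items, key=lambda p: p[criterion], reverse=True)
    return [p["name"] for p in ranked[:n]]

print(top_projects(projects))  # -> ['project-c', 'project-a']
```

Making `criterion` a parameter is one way of keeping the generic ranking machinery ontology independent while the choice of criterion stays domain specific.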

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool projects". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
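A community-scoped substitution service of this kind could be keyed on both the community and the jargon phrase; all names and mappings below are hypothetical:

```python
# (community, phrase) -> ontology-level interpretation of the jargon
jargon: dict = {}

def learn(community: str, phrase: str, mapping: tuple) -> None:
    """Store a mapping acquired with the user's help."""
    jargon[(community, phrase)] = mapping

def interpret(community: str, phrase: str):
    """Return the stored mapping, or None (meaning: ask the user)."""
    return jargon.get((community, phrase))

learn("phd-students", "cool projects",
      ("project", "belongs-to", "research-area-with-many-publications"))
learn("professors", "cool projects",
      ("project", "is-a", "funding-opportunity"))

# The same phrase resolves differently depending on the community.
print(interpret("phd-students", "cool projects"))
print(interpret("professors", "cool projects"))
```

An unknown (community, phrase) pair returns `None`, which is the trigger for falling back to user interaction, as in the "first time" scenario above.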

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple, autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and its RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies, so indexing may be needed to perform searches at run time. Secondly, user terminology must be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation of the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
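Deciding whether two instances co-refer could start from the same kind of string metrics AquaLog already applies to terms; the sketch below uses Python's standard library matcher, and the threshold is an arbitrary choice:

```python
from difflib import SequenceMatcher

def may_corefer(label_a: str, label_b: str, threshold: float = 0.85) -> bool:
    """Naive test that two instance labels from different sources may
    denote the same individual, based on string similarity alone."""
    a = label_a.lower().strip()
    b = label_b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(may_corefer("Enrico Motta", "enrico motta"))   # -> True
print(may_corefer("Enrico Motta", "Victoria Uren"))  # -> False
```

String similarity alone obviously over- and under-matches; a fuller solution would also compare the instances' classes and relation values before merging them.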

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", then instead of reporting a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
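A domain-independent analogue of the pattern rule could simply discard the politeness frame and keep the content words, which would then be turned into a triple and fed to an ontology path search; the stopword list below is ad hoc:

```python
import re

# Ad hoc frame/stop words; a real system would use its linguistic
# component's lexicon rather than a hard-coded list.
STOPWORDS = {"could", "you", "please", "tell", "me", "the", "of",
             "what", "is", "print", "a", "an"}

def fallback_terms(question: str) -> list[str]:
    """Keep only the content words of an unrecognized question."""
    words = re.findall(r"[a-z]+", question.lower())
    return [w for w in words if w not in STOPWORDS]

print(fallback_terms("Could you please tell me the capital of Italy?"))
# -> ['capital', 'italy']
```

The two surviving terms would then play the subject and object roles of a triple whose relation is left for the ontology path search to supply.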

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; and (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. Its semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
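The fallback described for the "neighborhood" example can be approximated by enumerating the relations the ontology defines for the class of the known instance, and letting the similarity services (or the user) choose among them; the schema fragment below is invented:

```python
# Invented schema fragment: relations defined for each class.
relations_for = {
    "city": ["has-population", "located-in", "has-district"],
    "person": ["works-for", "has-interest"],
}

def candidate_relations(unknown_word: str, instance_class: str) -> list[str]:
    """When a relation word is absent from the lexicon, fall back to all
    relations defined for the known instance's class; unknown_word is
    kept only so a smarter version could rank candidates against it."""
    return relations_for.get(instance_class, [])

# "what are some of the neighborhoods of Chicago?" -- Chicago is a city,
# so "neighborhood" is matched against the relations defined for cities.
print(candidate_relations("neighborhood", "city"))
```

The point is that the ontology structure itself supplies a small candidate set even when the word is unknown, which is precisely what a lexicon-only system like PRECISE lacks.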

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; it is therefore not surprising that most current work on QA, which has been rekindled largely by the Text REtrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as noted by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; and (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); and (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords of the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer); therefore it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; and (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE recognition is necessary but not sufficient for answering questions, because it by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, the rules at times keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed with "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results in the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see whether the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotation, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem (mapping between ontology and triples) is solved not only by looking at the labels, but also by looking at lexically related words and at the ontology itself (the context of the triple terms and relationships); as a last resort, ambiguity is resolved by asking the user.
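TAP's two steps (anchoring a term to a node via rdfs:label, then collecting the triples around the selected node) can be sketched in a few lines. The in-memory graph and identifiers below are invented for illustration; this is not TAP's actual code:

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
graph = {
    ("ex:Guernsey", "rdf:type", "ex:Location"),
    ("ex:Guernsey", "rdfs:label", "Guernsey"),
    ("ex:Guernsey", "ex:officialLanguage", "ex:French"),
    ("ex:French", "rdfs:label", "French"),
}

def denotations(graph, term):
    """First step: map a search term to candidate nodes via rdfs:label."""
    return [s for (s, p, o) in graph if p == "rdfs:label" and o == term]

def vicinity(graph, node):
    """Second step: collect the triples surrounding a selected node."""
    return [t for t in graph if t[0] == node or t[2] == node]

nodes = denotations(graph, "Guernsey")  # may be empty, a single match, or several
neighbourhood = vicinity(graph, nodes[0])
```

The lookup may of course return several candidate nodes, which is exactly the ambiguous case where TAP falls back on popularity or user choice.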

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, whereas the goal of systems such as START [31], REXTOR [32] and AnswerBus [54] is to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey": for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog it is translated into a relation "are spoken" between a term "languages" and a location "Guernsey".
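The difference between the two readings can be made concrete with a minimal sketch (plain Python; the field names are ours, not START's or AquaLog's internals):

```python
from dataclasses import dataclass

@dataclass
class Triple:
    subject: str
    link: str   # a property (START) or a linguistic relation (AquaLog)
    value: str

# START's "object-property-value" reading of the question:
start_view = Triple(subject="Guernsey", link="languages", value="French")

# AquaLog's linguistic triple: the relation comes from the question itself and
# is only later mapped to an ontology relation by the similarity services.
aqualog_view = Triple(subject="languages", link="are spoken", value="Guernsey")
```

In START the link slot already names a stored property; in AquaLog it still carries the user's wording, which the Relation Similarity Service must resolve against the ontology.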

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for the relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple); the category of the triple is the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
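This style of answer typing from the wh-word amounts to a small rule table; the sketch below uses illustrative rules only and is not PiQASso's implementation:

```python
def answer_type(question: str) -> str:
    """Pick a coarse answer category from the wh-word and, for 'how'
    questions, from the adjective that follows it."""
    q = question.lower()
    if q.startswith("who"):
        return "person/organization"
    if q.startswith("when"):
        return "time"
    if q.startswith("where"):
        return "location"
    if q.startswith("how many") or q.startswith("how much"):
        return "quantity"
    if q.startswith("how long") or q.startswith("how old"):
        return "time"
    return "unknown"  # e.g. "what <noun>" needs the noun's semantic type

answer_type("How many projects are there?")  # 'quantity'
```

The "unknown" branch is where the harder cases land, exactly the point at which PiQASso resorts to WNSense and WordNet categories.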

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations, or triples. The mapping process converts the elements of the triple into entry-points to the ontology and the KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations"

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
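The inheritance of question types from superclasses described in Ref. [12] can be sketched as a walk up a taxonomy (the toy taxonomy and type names below are invented for illustration):

```python
# Toy taxonomy: concept -> parent concept (None at the root),
# plus the question types attached to each concept.
parent = {"phd-student": "researcher", "researcher": "person", "person": None}
attached = {
    "person": {"who-is"},
    "researcher": {"what-does-X-research"},
    "phd-student": {"who-supervises-X"},
}

def applicable_question_types(concept):
    """Collect question types attached to the concept or any of its superclasses."""
    types = set()
    while concept is not None:
        types |= attached.get(concept, set())
        concept = parent.get(concept)
    return types
```

So a phd-student inherits the question types of researcher and person in addition to its own.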

5 Conclusions of existing approaches

Since the development of the first ontological QA systemUNAR (a syntax-based system where the parsed question isirectly mapped to a database expression by the use of rules3]) there have been improvements in the availability of lexi-al knowledge bases such as WordNet and shallow modularnd robust NLP systems like GATE Furthermore AquaLog isased on the vision of a Web populated by ontologically taggedocuments Many closed domain NL interfaces are very richn NL understanding and can handle questions that are moreomplex than the ones handled by the current version of Aqua-og However AquaLog has a very light and extendable NL

nterface that allows it to produce triples after only shallowarsing The main difference between the systems is related toortability Later NLDBI systems use intermediate representa-ions therefore although the front end is portable (as Copestake14] states ldquothe grammar is to a large extent general purposet can be used for other domains and for other applicationsrdquo)he back end is dependent on the database so normally longeronfiguration times are required AquaLog in contrast withlosed domain systems is completely portable Moreover thequaLog disambiguation techniques are sufficiently general toe applied in different domains In AquaLog disambiguations regarded as part of the translation process if the ambigu-ty is not solved by domain knowledge then the ambiguity isetected and the user is consulted before going ahead (to choose

etween alternative reading on terms relations or modifierttachment)

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of the semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types organized by the kind of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
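This grouping of equivalent questions can be illustrated with a toy normalization (ours, not AquaLog's classifier), in which both phrasings reduce to the same triple pattern:

```python
def to_triple(question):
    """Reduce a question to a (subject, relation, object) pattern."""
    q = question.lower().rstrip("?")
    if q.startswith("who works in "):
        return ("person/organization", "works", q.removeprefix("who works in "))
    prefix = "which are the researchers involved in the "
    if q.startswith(prefix) and q.endswith(" project"):
        return ("person/organization", "works",
                q.removeprefix(prefix).removesuffix(" project"))
    return None

# Both phrasings yield the same pattern, so they are processed identically:
same = to_triple("Who works in akt?") == to_triple(
    "Which are the researchers involved in the akt project?")
```

Once the two questions share a triple pattern, the same resolution strategy applies to both.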

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (Dot.Kom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. Dot.Kom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area | 〈person/organization researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉


Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉
Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉


Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, to appear in: The 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The answer bus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

  • AquaLog An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action illustrative example
    • AquaLog architecture
      • AquaLog triple approach
        • Linguistic component from questions to query triples
          • Classification of questions
            • Linguistic procedure for basic queries
            • Linguistic procedure for basic queries with clauses (three-terms queries)
            • Linguistic procedure for combination of queries
              • The use of JAPE grammars in AquaLog illustrative example
                • Relation similarity service from triples to answers
                  • RSS procedure for basic queries
• RSS procedure for basic queries with clauses (three-term queries)
• RSS procedure for combination of queries
• Class similarity service
• Answer engine
• Learning mechanism for relations
  • Architecture
  • Context definition
  • Context generalization
  • User profiles and communities
• Evaluation
  • Correctness of an answer
  • Query answering ability
    • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Reformulation of questions
      • Performance
  • Portability across ontologies
    • AquaLog assumptions over the ontology
    • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Similarity failures and reformulation of questions
• Discussion, limitations and future lines
  • Question types
  • Question complexity
  • Linguistic ambiguities
  • Reasoning mechanisms and services
  • New research lines
• Related work
  • Closed-domain natural language interfaces
  • Open-domain QA systems
  • Open-domain QA systems using triple representations
  • Ontologies in question answering
  • Conclusions of existing approaches
• Conclusions
• Acknowledgements
• Examples of NL queries and equivalent triples
• References


Portability across ontologies will also have to be addressed formally (Section 7.3).

We analyzed the failures and divided them into the following five categories (note that a query may fail at several different levels, e.g., linguistic failure and service failure, though conceptual failure is only computed when there is no other kind of failure):

• Linguistic failure. This occurs when the NLP component is unable to generate the intermediate representation (but the question can usually be reformulated and answered).
• Data model failure. This occurs when the NL query is simply too complicated for the intermediate representation.
• RSS failure. This occurs when the Relation Similarity Service of AquaLog is unable to map an intermediate representation to the correct ontology-compliant logical expression.
• Conceptual failure. This occurs when the ontology does not cover the query, e.g., lack of an appropriate ontology relation or term to map with, or if the ontology is wrongly populated in a way that hampers the mapping (e.g., instances that should be classes).
• Service failure. In the context of the semantic web, we believe that these failures are less to do with shortcomings of the ontology than with the lack of appropriate services defined over the ontology.
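The taxonomy above, including the rule that a conceptual failure is only counted when no other failure occurred, can be sketched as a small classifier. This is our own illustrative sketch, not AquaLog code; the enum names are ours:

```python
from enum import Enum, auto

class Failure(Enum):
    LINGUISTIC = auto()   # NLP component cannot build the intermediate representation
    DATA_MODEL = auto()   # query too complex for the triple-based representation
    RSS = auto()          # Relation Similarity Service cannot map to the ontology
    SERVICE = auto()      # query needs a service (e.g. ranking) not defined over the ontology
    CONCEPTUAL = auto()   # the ontology simply does not cover the query

def classify(observed: set) -> set:
    """Conceptual failure is only recorded when no other failure occurred."""
    others = observed - {Failure.CONCEPTUAL}
    return others if others else observed

# A query that fails both linguistically and at the service level
# is counted under both of those categories, never as conceptual.
print(classify({Failure.LINGUISTIC, Failure.SERVICE}))
```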

The improvement achieved due to the use of the Learning Mechanism (LM) in reducing the number of user interactions is also evaluated in Section 7.2. First, AquaLog is evaluated with the LM deactivated. For a second test, the LM is activated at the beginning of the experiment, in which the database is empty. In the third test, a second iteration is performed over the results obtained from the second test.

7.1. Correctness of an answer

Users query the information using terms from their own jargon. This NL query is formalized into predicates that correspond to classes and relations in the ontology. Further, variables in a query correspond to instances in that ontology. Therefore, an answer is judged as correct with respect to a query over a single specific ontology. In order for an answer to be correct, AquaLog has to align the vocabularies of both the asking query and the answering ontology. Therefore, a valid answer is the one considered correct in the ontology world.

AquaLog fails to give an answer if the knowledge is in the ontology but it cannot find it (linguistic failure, data model failure, RSS failure or service failure). Please note that a conceptual failure is not considered an AquaLog failure, because the ontology does not cover the information needed to map all the terms or relations in the user query. Moreover, a correct answer corresponding to a complete ontology-compliant representation may give no results at all, because the ontology is not populated.

AquaLog may need the user to help disambiguate a possible mapping for the query. This is not considered a failure. However, the performance of AquaLog was also evaluated for it. The user can decide to use the LM or not; that decision has a direct impact on the performance, as the results will show.

7.2. Query answering ability

The aim is to assess to what extent the AquaLog application built using AquaLog with the AKT ontology10 and the KMi knowledge base satisfied user expectations about the range of questions the system should be able to answer. The ontology used for the evaluation is also part of the KMi semantic portal.11 Therefore, the knowledge base is dynamically populated and constantly changing and, as a result of this, a valid answer to a query may be different at different moments.

This version of AquaLog does not try to handle a NL query if the query is not understood and classified into a type. It is quite easy to extend the linguistic component to increase AquaLog's linguistic abilities. However, it is quite difficult to devise a sublanguage which is sufficiently expressive; therefore, we also evaluate whether a question can be easily reformulated to be understood by AquaLog. A second aim of the experiment was also to provide information about the nature of the possible extensions needed to the ontology and the linguistic components, i.e., we not only wanted to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

Thus, we asked 10 members of KMi, none of whom had been involved in the AquaLog project, to generate questions for the system. Because one of the aims of the experiment was to measure the linguistic coverage of AquaLog with respect to user needs, we did not provide them with much information about AquaLog's linguistic ability. However, we did tell them something about the conceptual coverage of the ontology, pointing out that its aim was to model the key elements of a research lab, such as publications, technologies, projects, research areas, people, etc. We also pointed out that the current KB is limited in its handling of temporal information; therefore, we asked them not to ask questions which required temporal reasoning (e.g., today, last month, between 2004 and 2005, last year, etc.). Because no 'quality control' was carried out on the questions, it was admissible for these to contain some spelling mistakes and even grammatical errors. Also, we pointed out that AquaLog is not a conversational system: each query is resolved on its own, with no references to previous queries.

7.2.1. Conclusions and discussion

7.2.1.1. Results. We collected in total 69 questions (see results in Table 1), 40 of which were handled correctly by AquaLog, i.e., approximately 58% of the total. This was a pretty good result, considering that almost no linguistic restriction was imposed on the questions. A closer analysis shows that 7 of the 69 queries (10.14% of the total) present a conceptual failure. Therefore, only 47.82% of

10 http://kmi.open.ac.uk/projects/akt/ref-onto
11 http://semanticweb.kmi.open.ac.uk


Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries                             (40 of 69)  58% — of these, 10 of 33 (30.30%) have no answer because the ontology is not populated with data
  Total number of successful queries without conceptual failure  (33 of 69)  47.82%
  Total number of queries with only conceptual failure           (7 of 69)   10.14%

AquaLog failures (the same query can fail for more than one reason)
  Total number of failures  (29 of 69)  42%
  RSS failure               (8 of 69)   11.59%
  Service failure           (10 of 69)  14.49%
  Data model failure        (0 of 69)   0%
  Linguistic failure        (16 of 69)  23.18%

AquaLog easily reformulated questions
  Total number of queries easily reformulated   (19 of 29)  65.51% of failures
  With conceptual failure                       (7 of 19)   36.84%
  With no conceptual failure but not populated  (6 of 19)   31.57%

Bold values represent the final results (%).

the total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and how it is populated, for the user only 33.33% (23 of 69) will have a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5), we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another ontology.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, like when a multiple relation is used, e.g., in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, like tokens not recognized as a verb by GATE, e.g., "present results", and therefore relations annotated wrongly, or because of the use of parenthesis in the middle of the query, or special names like "dotkom", or numbers like a year, e.g., "in 1995".
• Data model failure. Intriguingly, this type of failure never occurred, and our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.
• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g., terms that are a combination of two classes, as "akt researchers"); (2) when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels, through the use of string metrics and WordNet synonyms, not by the meaning; e.g., in "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, or in "how can I contact Vanessa?", due to the string algorithms, "contact" is mapped to the relation "employment-contract-type" between a "research-staff-member" and "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) not recognizing concepts in the ontology when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".
• Service failures. Several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab: people could be ranked according to citation impact, formal status in the department, etc. Therefore, we defined this additional category, which accounted for 14.49% of failed questions in the total number of questions (10 of 69). In this evaluation we identified the following keywords that are not understood by this AquaLog version: main, most, proportion, most new, same as, share same, other and different. No work has been done yet in relation to the service failures. Three kinds of


Table 2
Learning mechanism performance

Number of queries (total 45)   Without the LM     LM first iteration (from scratch)   LM second iteration
0 iterations with user         37.77 (17 of 45)   40 (18 of 45)                       64.44 (29 of 45)
1 iteration with user          35.55 (16 of 45)   35.55 (16 of 45)                    35.55 (16 of 45)
2 iterations with user         20 (9 of 45)       20 (9 of 45)                        0 (0 of 45)
3 iterations with user         6.66 (3 of 45)     4.44 (2 of 45)                      0 (0 of 45)
Total                          (43 iterations)    (40 iterations)                     (16 iterations)

services have been identified (ranking, similarity/comparative, and proportions/percentages), which remain a future line of work for future versions of the system.
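The purely lexical mis-mapping described above for the RSS (e.g. "contact" being matched to "employment-contract-type") can be illustrated with a toy string-metric ranking. This is our own sketch, not AquaLog's code: it uses Python's edit-distance-based `SequenceMatcher` ratio in place of the string metrics AquaLog actually combines with WordNet:

```python
from difflib import SequenceMatcher

# Candidate relation labels taken from the examples in the text
candidates = ["works-for", "have-interest", "employment-contract-type"]

def best_match(query_relation: str, relations: list) -> str:
    """Pick the relation whose label is lexically closest to the query term."""
    return max(relations, key=lambda r: SequenceMatcher(None, query_relation, r).ratio())

# "contact" is lexically closest to "employment-contract-type", even though
# semantically the user means something like an email address or phone number.
print(best_match("contact", candidates))
```

This is exactly why matching "by the label, not by the meaning" produces the failures reported above.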

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive, yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% (the total number of failures, including the RSS failures, is 42%). On many occasions, questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure), or avoiding unnecessary functional words like different, main, most of (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog by easily reformulating them; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of the cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%), even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and among all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the first iteration of the LM improves if there is an increment in the number of queries.
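As a rough illustration of this idea (our own sketch with hypothetical names; AquaLog's actual learning mechanism is the one described in Section 6), a learned relation mapping can be stored together with the context in which it was made, so that a new query with a similar, though not identical, context can reuse it:

```python
# Hypothetical sketch of a context-aware mapping memory: learned relation
# mappings are keyed by the classes of the triple's arguments, so a similar
# (e.g. subclass) context can reuse an earlier disambiguation choice.
learned = {}

def remember(arg1_class, linguistic_rel, arg2_class, ontology_rel):
    learned[(arg1_class, linguistic_rel, arg2_class)] = ontology_rel

def recall(arg1_class, linguistic_rel, arg2_class, superclass_of):
    """Exact context match first, then generalize via the class hierarchy."""
    exact = learned.get((arg1_class, linguistic_rel, arg2_class))
    if exact:
        return exact
    for (c1, rel, c2), onto_rel in learned.items():
        if rel == linguistic_rel and superclass_of(c1, arg1_class) and superclass_of(c2, arg2_class):
            return onto_rel
    return None

# Toy hierarchy: a researcher is a kind of person
def superclass_of(sup, sub):
    return sup == sub or (sup == "person" and sub == "researcher")

remember("person", "works", "project", "has-project-member")
# A new query about a researcher reuses the mapping learned for a person.
print(recall("researcher", "works", "project", superclass_of))
```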

7.3. Portability across ontologies

In order to evaluate portability, we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto




the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food, a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose in the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).
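The configuration steps described above (server name and location, ontology name, wh-word associations, extra lexicon synonyms) might be captured in a structure like the following. This is a purely hypothetical sketch; the paper does not show AquaLog's actual configuration file format, so every key name here is illustrative only:

```python
# Hypothetical sketch of the settings AquaLog's configuration files capture
# when porting to a new ontology (key names are ours, not AquaLog's).
config = {
    "ontology_server": "WebOnto",
    "server_location": "http://plainmoor.open.ac.uk:3000/webonto",  # from footnote 12
    "ontology_name": "wine-and-food",
    # associations used to resolve wh-words against ontology concepts
    "where_concept": "region",
    "when_concept": "vintageyear",
    # manually added lexicon entries augmenting WordNet synonyms
    "synonyms": {
        "mealcourse": ["starter", "entree", "food", "snack", "supper"],
        "dessertcourse": ["pudding"],
    },
}

# e.g. a "where" question is grounded in the concept associated with it
print(config["where_concept"])
```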

7.3.1. AquaLog assumptions over the ontology

In this section, we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario, we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.
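This "short path" requirement can be phrased as a simple reachability test over the ontology graph. The sketch below is our own illustration, not AquaLog code; the graph is just the generic directed-labeled-graph shape the text describes, with concept names from the wine ontology examples:

```python
from collections import deque

def reachable_within(graph: dict, start: str, goal: str, max_hops: int = 2) -> bool:
    """True if goal can be reached from start in at most max_hops relations."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, hops = queue.popleft()
        if node == goal:
            return True
        if hops < max_hops:
            for nxt in graph.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return False

# Toy graph: wine and pork are linked only via the mediating "mealcourse"
graph = {"mealcourse": ["wine", "pork"], "meal": ["mealcourse"]}
print(reachable_within(graph, "mealcourse", "wine"))   # one relation
print(reachable_within(graph, "meal", "pork"))         # two relations
```

Terms separated by more than two relations fail the test, which is exactly the situation where AquaLog cannot find an answer even though the knowledge is in the ontology.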

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way, the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore, we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink Riesling)))


Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, it does not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47 + 17.64 = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries                             (47 of 68)  69.11%
  Total number of successful queries without conceptual failure  (12 of 68)  17.64%
  Total number of queries with only conceptual failure           (35 of 68)  51.47%

AquaLog failures (the same query can fail for more than one reason)
  Total number of failures  (21 of 68)  30.88%
  RSS failure               (15 of 68)  22%
  Service failure           (0 of 68)   0%
  Data model failure        (0 of 68)   0%
  Linguistic failure        (9 of 68)   13.23%

AquaLog easily reformulated questions
  Total number of queries easily reformulated  (14 of 21)  66.66% of failures
  With conceptual failure                      (9 of 14)   64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions", or, for instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world, as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations in the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words such as "not" and "never" (or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e. "is there any [...] who/that/which [...]?".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.
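The dynamic triple expansion described above can be sketched as follows. This is an illustrative toy rather than AquaLog's actual RSS code: the mediator table and the relation names "has author" and "in" are invented for the example.

```python
# Toy sketch of dynamic triple expansion: a linguistic relation such as
# "have publications" has no direct ontology match, so the mediating term
# ("publications") is made explicit and the triple is split in two.
# The mediator table and relation names are invented for illustration.

def expand_query_triples(linguistic_triples):
    # linguistic relation -> (mediating term, relation to subject, relation to object)
    mediators = {"have publications": ("publications", "has author", "in")}
    ontology_triples = []
    for subj, rel, obj in linguistic_triples:
        if rel in mediators:
            mediator, rel_subj, rel_obj = mediators[rel]
            ontology_triples.append((mediator, rel_subj, subj))  # publication has author X
            ontology_triples.append((mediator, rel_obj, obj))    # publication in Y
        else:
            ontology_triples.append((subj, rel, obj))
    return ontology_triples

query = [("researchers", "have publications", "KMi"),
         ("publications", "related", "social aspects")]
print(expand_query_triples(query))  # three ontology triples instead of two
```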

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant" and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "meal course" example in Section 7.3 using the wine ontology).
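The kind of query expansion needed for the "who collaborates with IBM?" case can be approximated by a path search over the ontology's class graph. The sketch below is hypothetical (the class and relation names "hasGrant" and "fundedBy" are invented), not AquaLog's implementation:

```python
# Breadth-first search over a toy class graph to bridge two query terms with
# no direct relation: person -- hasGrant --> grant -- fundedBy --> organization.
from collections import deque

def find_path(graph, start, goal):
    """Return the list of relations linking start to goal, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, path + [relation]))
    return None

# Invented fragment of an ontology's class graph.
ontology = {
    "person": [("hasGrant", "grant")],
    "grant": [("fundedBy", "organization")],
}
print(find_path(ontology, "person", "organization"))  # → ['hasGrant', 'fundedBy']
```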

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered, for example when one of the terms in the query is not represented as an instance, relation or concept in the ontology but as a value of a relation (string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e. "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].
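A first-pass heuristic for spotting candidate nominal compounds might simply look for adjacent content words with no preposition between them. The sketch below is our own illustration of that idea, with an invented stop-list, not an implemented AquaLog component:

```python
# Toy heuristic: flag adjacent token pairs not separated by a preposition as
# candidate nominal compounds (e.g. "research department"). The preposition
# list is a small invented sample.

PREPOSITIONS = {"of", "in", "with", "for", "by", "at", "on"}

def candidate_compounds(tokens):
    """Return adjacent token pairs where neither token is a preposition."""
    pairs = []
    for left, right in zip(tokens, tokens[1:]):
        if left not in PREPOSITIONS and right not in PREPOSITIONS:
            pairs.append((left, right))
    return pairs

print(candidate_compounds(["research", "department"]))  # → [('research', 'department')]
print(candidate_compounds(["papers", "in", "KMi"]))     # → []
```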

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both readings must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
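The agreement check described above can be sketched as follows. This is a hypothetical simplification of our own: only the question's subject and the nearer singular noun are considered as attachment candidates.

```python
# Sketch of clause attachment via verb number agreement: the relative
# clause's verb must agree in number with its antecedent, so "has" (singular)
# cannot attach to the plural "academics" and must attach to "Peter".

def attach_clause(subject, subject_is_plural, nearer_singular_noun, verb_is_singular):
    """Return the unique agreeing antecedent, or None if attachment is ambiguous."""
    candidates = []
    if verb_is_singular:
        candidates.append(nearer_singular_noun)   # a singular noun always agrees
        if not subject_is_plural:
            candidates.append(subject)            # a singular subject agrees too
    elif subject_is_plural:
        candidates.append(subject)                # only a plural subject agrees
    return candidates[0] if len(candidates) == 1 else None

print(attach_clause("academics", True, "Peter", True))   # → Peter
print(attach_clause("academic", False, "Peter", True))   # → None (truly ambiguous)
```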


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
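A community-scoped jargon table of the kind described above might look like the sketch below. The data model is our invention for illustration; AquaLog's Learning Mechanism additionally stores the context of the triple.

```python
# Sketch (invented data model) of community-scoped jargon mappings: the same
# phrase maps to different ontology terms depending on the user's community.

jargon = {}  # (community, phrase) -> list of ontology terms

def learn(community, phrase, ontology_terms):
    jargon[(community, phrase)] = ontology_terms

def interpret(community, phrase):
    """Return the learned mapping for this community, or None (ask the user)."""
    return jargon.get((community, phrase))

learn("phd-students", "cool projects", ["research area", "publications"])
learn("professors", "cool projects", ["funding opportunity"])

print(interpret("phd-students", "cool projects"))  # → ['research area', 'publications']
print(interpret("professors", "cool projects"))    # → ['funding opportunity']
print(interpret("visitors", "cool projects"))      # → None: fall back to user feedback
```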

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e. allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web. This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These various issues are: resource selection, heterogeneous vocabularies and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 A detailed overview of the state of the art for these systems can be found in Ref. [3].

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the process of mapping onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account, so if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer); therefore it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain-independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g. a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology-triples mapping) not only by looking at the labels but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".
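The contrast can be shown as data. The dictionaries below are only an illustration of the two representational choices for the Guernsey example; the field names are ours, not START's or AquaLog's internal format:

```python
# START: object-property-value representation.
start_triple = {
    "object": "Guernsey",
    "property": "languages",
    "value": "French",
}

# AquaLog: a relation holding between a term and a location.
aqualog_triple = {
    "term": "language",
    "relation": "are spoken",
    "location": "Guernsey",
}
```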

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during its life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for a relationship between terms even if the relationship is not explicit; DIMAP has a triple per discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.
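The two triple flavours compared above can be sketched as data structures. The field names are illustrative only, not the systems' internal formats; note how the DIMAP triple is driven by the discourse entity, while the AquaLog triple is driven by its query category:

```python
from dataclasses import dataclass

@dataclass
class DimapTriple:
    discourse_entity: str   # e.g. SUBJ, OBJ, TIME, NUM, ADJMOD
    relation: str           # characterizes the entity's role in the sentence
    governing_word: str     # the word the discourse entity stands in relation to

@dataclass
class AquaLogTriple:
    category: str           # drives the type of processing
    subject: str
    relation: str           # may be implicit in the question
    obj: str

# As in DIMAP's question analysis, one triple per question carries an
# unbound variable (here "?") identifying the question type.
q = DimapTriple(discourse_entity="SUBJ", relation="?", governing_word="spoken")
a = AquaLogTriple(category="wh-generic", subject="language",
                  relation="are spoken", obj="Guernsey")
```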

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
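The concept/superclass lookup described above can be sketched directly: applicable question types are those attached to the concept itself or to any of its superclasses. The tiny taxonomy and question types below are made up for illustration:

```python
# Toy taxonomy and question-type attachments (hypothetical names).
SUPERCLASS = {"phd-student": "researcher", "researcher": "person"}
QUESTION_TYPES = {
    "person": ["where does X work?"],
    "researcher": ["what does X research?"],
}

def applicable_question_types(concept):
    """Walk up the superclass chain, collecting attached question types."""
    types = []
    while concept is not None:
        types.extend(QUESTION_TYPES.get(concept, []))
        concept = SUPERCLASS.get(concept)
    return types

qt = applicable_question_types("phd-student")
```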

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".
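The homepage query above requires joining two pieces of information. The sketch below runs the two corresponding ontology triples over a toy KB; the predicate and instance names are invented for illustration, not taken from the KMi ontology:

```python
# Toy knowledge base of (subject, relation, object) triples.
kb = [
    ("vanessa", "has-research-interest", "semantic-web-area"),
    ("enrico", "has-research-interest", "ontologies"),
    ("vanessa", "has-web-address", "http://example.org/~vanessa"),
]

def subjects(kb, relation, obj):
    return [s for s, r, o in kb if r == relation and o == obj]

def objects(kb, subject, relation):
    return [o for s, r, o in kb if s == subject and r == relation]

# Triple 1: who has an interest in the Semantic Web?
researchers = subjects(kb, "has-research-interest", "semantic-web-area")
# Triple 2: what are their homepages? (joined on the researchers found above)
homepages = [h for r in researchers for h in objects(kb, r, "has-web-address")]
```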

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

All ontology triples below use the KMi ontology.

Wh-generic term
  "Who are the researchers in the semantic web research area?"
    Linguistic triple: 〈person/organization, researchers, semantic web research area〉
    Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
  "What are the phd students working for buddy space?"
    Linguistic triple: 〈phd students, working, buddy space〉
    Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉
  "Name the planet stories that are related to akt"
    Linguistic triple: 〈planet stories, related to, akt〉
    Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term
  "Show me the job title of Peter Scott"
    Linguistic triple: 〈which is, job title, peter scott〉
    Ontology triple: 〈which is, has-job-title, peter-scott〉
  "What are the projects of Vanessa?"
    Linguistic triple: 〈which is, projects, vanessa〉
    Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
  "Are there any projects about semantic web?"
    Linguistic triple: 〈projects, semantic web〉
    Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
  "Where is the Knowledge Media Institute?"
    Linguistic triple: 〈location, knowledge media institute〉
    Ontology triple: no relations found in the ontology

Description
  "Who are the academics?"
    Linguistic triple: 〈who/what is, academics〉
    Ontology triple: 〈who/what is, academic-staff-member〉
  "What is an ontology?"
    Linguistic triple: 〈who/what is, ontology〉
    Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
  "Is Liliana a phd student in the dip project?"
    Linguistic triple: 〈liliana, phd student, dip project〉
    Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
  "Has Martin Dzbor any interest in ontologies?"
    Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
    Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
  "Does anybody work in semantic web on the dotkom project?"
    Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
    Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member/has-project-leader, dot-kom〉
  "Tell me all the planet news written by researchers in akt"
    Linguistic triple: 〈planet news, written, researchers, akt〉
    Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
  "Which projects on information extraction are sponsored by epsrc?"
    Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
    Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
  "Is there any publication about magpie in akt?"
    Linguistic triple: 〈publication, magpie, akt〉
    Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
  "What are the contact details from the KMi academics in compendium?"
    Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
    Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
  "Does anyone has interest in ontologies and is a member of akt?"
    Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
    Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
  "Which academics work in akt or in dotcom?"
    Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
    Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
  "Which KMi academics work in the akt project sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉
  "Which KMi academics working in the akt project are sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
  "Give me the publications from phd students working in hypermedia"
    Linguistic triples: 〈which is, publications, phd students〉, 〈which is, working, hypermedia〉
    Ontology triples: 〈publication, mentions-person/has-author, phd-student〉, 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
  "What researchers who work in compendium have interest in hypermedia?"
    Linguistic triples: 〈researchers, work, compendium〉, 〈researchers, has-interest, hypermedia〉
    Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
  "What is the homepage of Peter who has interest in semantic web?"
    Linguistic triples: 〈which is, homepage, peter〉, 〈person/organization, has interest, semantic web〉
    Ontology triples: 〈which is, has-web-address, peter-scott〉, 〈person, has-research-interest, semantic-web-area〉
  "Which are the projects of academics that are related to the semantic web?"
    Linguistic triples: 〈which is, projects, academics〉, 〈which is, related, semantic web〉
    Ontology triples: 〈project, has-project-member, academic-staff-member〉, 〈academic-staff-member, has-research-interest, semantic-web-area〉

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC), NIST Special Publication, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

  • AquaLog An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action illustrative example
    • AquaLog architecture
      • AquaLog triple approach
        • Linguistic component from questions to query triples
          • Classification of questions
            • Linguistic procedure for basic queries
            • Linguistic procedure for basic queries with clauses (three-terms queries)
            • Linguistic procedure for combination of queries
              • The use of JAPE grammars in AquaLog illustrative example
                • Relation similarity service from triples to answers
                  • RSS procedure for basic queries
                  • RSS procedure for basic queries with clauses (three-term queries)
                  • RSS procedure for combination of queries
                  • Class similarity service
                  • Answer engine
• Learning mechanism for relations
  • Architecture
  • Context definition
  • Context generalization
  • User profiles and communities
• Evaluation
  • Correctness of an answer
  • Query answering ability
    • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Reformulation of questions
      • Performance
  • Portability across ontologies
    • AquaLog assumptions over the ontology
    • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Similarity failures and reformulation of questions
• Discussion, limitations and future lines
  • Question types
  • Question complexity
  • Linguistic ambiguities
  • Reasoning mechanisms and services
  • New research lines
• Related work
  • Closed-domain natural language interfaces
  • Open-domain QA systems
  • Open-domain QA systems using triple representations
  • Ontologies in question answering
  • Conclusions of existing approaches
• Conclusions
• Acknowledgements
• Examples of NL queries and equivalent triples
• References

V. Lopez et al. / Web Semantics: Science, Services and Agents on the World Wide Web 5 (2007) 72–105

Table 1
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, when it fails to complete the Onto-Triple, it is due to an ontology conceptual failure):
• Total number of successful queries: (40 of 69) 58% (of these, 10 of 33, i.e. 30.30%, have no answer because the ontology is not populated with data)
• Total number of successful queries without conceptual failure: (33 of 69) 47.82%
• Total number of queries with only conceptual failure: (7 of 69) 10.14%

AquaLog failures (the same query can fail for more than one reason):
• Total number of failures: (29 of 69) 42%
• RSS failure: (8 of 69) 11.59%
• Service failure: (10 of 69) 14.49%
• Data model failure: (0 of 69) 0%
• Linguistic failure: (16 of 69) 23.18%

AquaLog easily reformulated questions:
• Total number of queries easily reformulated: (19 of 29) 65.5% of failures
• With conceptual failure: (7 of 19) 36.84%
• No conceptual failure but not populated: (6 of 19) 31.57%

Bold values represent the final results (%).

The total (33 of 69 queries) reached the ontology mapping stage. Furthermore, 30.30% of the correctly ontology-mapped queries (10 of the 33 queries) cannot be answered because the ontology does not contain the required data (it is not populated).

Therefore, although AquaLog handles 58% of the queries, due to the lack of conceptual coverage in the ontology and the way it is populated, for the user only 33.33% (23 of 69) will produce a result (a non-empty answer). These limitations on coverage will not be obvious to the user; as a result, independently of whether a particular set of queries is answered or not, the system becomes unusable. On the other hand, these limitations are easily overcome by extending and populating the ontology. We believe that the quality of ontologies will improve thanks to the Semantic Web scenario. Moreover, as said before, AquaLog is currently integrated with the KMi semantic web portal, which automatically populates the ontology with information found in departmental databases and web sites. Furthermore, for the next generation of our QA system (explained in Section 8.5) we are aiming to use more than one ontology, so the limitations in one ontology can be overcome by another.

7.2.1.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 23.18% of the total number of questions (16 of 69). This was by far the most common problem. In fact, we expected this, considering the few linguistic restrictions imposed on the evaluation compared to the complexity of natural language. Errors were due to: (1) questions not recognized or classified, e.g. when a multiple relation is used, as in "what are the challenges and objectives [...]"; (2) questions annotated wrongly, e.g. tokens not recognized as a verb by GATE ("present results") and therefore relations annotated wrongly, or because of the use of parentheses in the middle of the query, special names like "dotkom", or numbers such as a year, e.g. "in 1995".

• Data model failures: intriguingly, this type of failure never occurred. Our intuition is that this was the case not only because the relational model is actually a pretty good way to model queries, but also because the ontology-driven nature of the exercise ensured that people only formulated queries that could in theory (if not in practice) be answered by reasoning about triples in the departmental KB.

• RSS failures accounted for 11.59% of the total number of questions (8 of 69). The RSS fails to correctly map an ontology-compliant query mainly due to: (1) the use of nominal compounds (e.g. terms that are a combination of two classes, such as "akt researchers"); (2) the fact that, when a set of candidate relations is obtained through the study of the ontology, the relation is selected through syntactic matching between labels (using string metrics and WordNet synonyms), not by meaning; e.g. for "what is John Domingue working on?" we map into the triple 〈organization, works-for, john-domingue〉 instead of 〈research-area, have-interest, john-domingue〉, and for "how can I contact Vanessa?" the string algorithms map "contact" to the relation "employment-contract-type" between a "research-staff-member" and an "employment-contract-type"; (3) queries of type "how long" are not implemented in this version; (4) concepts in the ontology are not recognized when they are accompanied by a verb as part of the linguistic relation, like the class "publication" in the linguistic relation "have publications".

• Service failures: several queries essentially asked for services to be defined over the ontologies. For instance, one query asked about "the top researchers", which requires a mechanism for ranking researchers in the lab (people could be ranked according to citation impact, formal status in the department, etc.). Therefore we defined this additional category, which accounted for 14.49% of the total number of questions (10 of 69). In this evaluation we identified the following key words that are not understood by this AquaLog version: "main", "most", "proportion", "most new", "same as", "share same", "other" and "different". No work has been done yet in relation to the service failures. Three kinds of services have been identified: ranking, similarity/comparative, and proportions/percentages; these remain a future line of work for future versions of the system.


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM | LM first iteration (from scratch) | LM second iteration
0 iterations with user | 37.77% (17 of 45) | 40% (18 of 45) | 64.44% (29 of 45)
1 iteration with user | 35.55% (16 of 45) | 35.55% (16 of 45) | 35.55% (16 of 45)
2 iterations with user | 20% (9 of 45) | 20% (9 of 45) | 0% (0 of 45)
3 iterations with user | 6.66% (3 of 45) | 4.44% (2 of 45) | 0% (0 of 45)
Total iterations | 43 | 40 | 16


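The purely syntactic label matching described in point (2) of the RSS failures is easy to reproduce with any off-the-shelf string metric. The sketch below uses Python's difflib as a stand-in for the string metric algorithms AquaLog actually employs; the candidate label "has-email" is invented for contrast, while the other two come from the KMi example above:

```python
from difflib import SequenceMatcher

def best_relation(query_term, candidate_relations):
    """Pick the candidate relation whose label is most similar to the
    user's wording, purely by surface string similarity."""
    return max(candidate_relations,
               key=lambda rel: SequenceMatcher(None, query_term, rel).ratio())

# Two relation labels from the KMi example above, plus an invented one.
candidates = ["works-for", "has-email", "employment-contract-type"]

# "contact" shares most of its characters with "contract", so the
# syntactically closest label wins even though it is semantically wrong.
print(best_relation("contact", candidates))  # employment-contract-type
```

Because "contact" overlaps heavily with "contract", the semantically wrong label wins, which is exactly the failure mode observed for "how can I contact Vanessa?".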

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and service failures together account for 37.67% of the failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure) or avoiding unnecessary functional words like "different", "main" or "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) will be correctly handled by AquaLog after being easily reformulated; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude: if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM we can, in the best case, improve from 37.77% of queries answered in a fully automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%) even when the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g. if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.


Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries requiring no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance; however, we should take into account that the set of queries (45 in total) is also quite small: logically, the performance of the system for the first iteration of the LM improves as the number of queries increases.

7.3. Portability across ontologies

In order to evaluate portability we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "mealcourse". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore it does not have much information instantiated and, as a result, it does not present a complete coverage for many of our queries, and no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correctly translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server12, so the AquaLog WebOnto plugin was used. To configure AquaLog with the new ontology it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us, in

12 http://plainmoor.open.ac.uk:3000/webonto




the first instance, to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" questions with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack" and "supper", or "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams13, as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology: taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

13 http://www.thewinesociety.com



The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement for getting an answer is that the ontology must be populated (with triples) in such a way that there is a short (two relations or less) direct path between the query terms when they are mapped to the ontology.
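As an illustration only (not AquaLog's actual API), the GetData-style access pattern and the "two relations or less" answerability condition can be sketched over a toy directed labeled graph. The triples are borrowed from the instance examples in Table 3, and the function names are invented:

```python
from collections import defaultdict

# A KB as a directed labeled graph: subject -> [(relation, object), ...]
kb = defaultdict(list)
for s, r, o in [("new-course7", "has-food", "pizza"),
                ("new-course7", "has-drink", "marietta-zinfadel")]:
    kb[s].append((r, o))

def get_data(resource, relation):
    """GetData-style accessor: the values of one property of one resource,
    requiring no knowledge of the overall ontology structure."""
    return [o for r, o in kb[resource] if r == relation]

def connected_within_two(a, b):
    """The answerability condition stated above: a directed path of at
    most two relations between the mapped query terms."""
    one_hop = {o for _, o in kb[a]}
    return b in one_hop or any(b in {o for _, o in kb[x]} for x in one_hop)

print(get_data("new-course7", "has-drink"))         # ['marietta-zinfadel']
print(connected_within_two("new-course7", "pizza"))  # True
```

connected_within_two encodes the condition the text states: a query is only answerable when the mapped terms are linked by a direct path of at most two relations.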

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog for obtaining answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e. structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g. see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink riesling)))


Fig. 22. OWL example of the wine and food ontology.

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected 68 questions in total. As the wine ontology is an example for the OWL language, it does not present a complete coverage in terms of instances and terminology. Therefore 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, when it fails to complete the Onto-Triple, it is due to an ontology conceptual failure):
• Total number of successful queries: (47 of 68) 69.11%
• Total number of successful queries without conceptual failure: (12 of 68) 17.64%
• Total number of queries with only conceptual failure: (35 of 68) 51.47%

AquaLog failures (the same query can fail for more than one reason):
• Total number of failures: (21 of 68) 30.88%
• RSS failure: (15 of 68) 22%
• Service failure: (0 of 68) 0%
• Data model failure: (0 of 68) 0%
• Linguistic failure: (9 of 68) 13.23%

AquaLog easily reformulated questions:
• Total number of queries easily reformulated: (14 of 21) 66.66% of failures
• With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g. not recognizing "asparagus" as a noun, or misinterpreting "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that queries formed with "like" or "such as" are not classified, e.g. "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.

• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.

• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".

• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS



can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, which is strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, as in 〈meal, course, mealcourse〉 and 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g. "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" have to be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which-is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse"; moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
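The triple reclassification envisaged for the next AquaLog generation can be sketched as follows. This is a simplification under stated assumptions: the schema fragment below is a hand-made approximation of the wine ontology structure discussed above, not the real ontology.

```python
# subject class -> {relation: object class}; a hand-made fragment
# approximating the wine/food ontology structure discussed above.
schema = {
    "mealcourse": {"has-food": "food", "has-drink": "wine"},
    "meal": {"course": "mealcourse"},
}

def expand_triple(subj, obj):
    """If subj and obj are not directly related, look for a mediating
    concept m with subj->m and m->obj, and return two triples."""
    for rel, mid in schema.get(subj, {}).items():
        if mid == obj:                       # a direct relation exists
            return [(subj, rel, obj)]
        for rel2, obj2 in schema.get(mid, {}).items():
            if obj2 == obj:                  # path via a mediating concept
                return [(subj, rel, mid), (mid, rel2, obj)]
    return []

# "which wines should be served with a meal ...": no direct meal->wine
# relation, so the failing triple is rewritten through "mealcourse".
print(expand_triple("meal", "wine"))
# [('meal', 'course', 'mealcourse'), ('mealcourse', 'has-drink', 'wine')]
```

For the "meal"/"wine" case the direct mapping fails, so a single Query-Triple is rewritten as two Onto-Triples through the mediating concept "mealcourse", which is exactly the reformulation the user currently has to perform by hand.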

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g. "wine regions". Another instance is the basic generic query "what should we drink with a starter of seafood?": AquaLog tries to map the term "seafood" onto an instance, whereas "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh-" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require an understanding of causality or instrumental relations, while "what" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to answer "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe are worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g. "she", "they") and possessive determiners (e.g. "his", "theirs") are used to implicitly denote entities mentioned in an extended discourse. However, some kinds of noun phrases which reach beyond the same sentence are understood, i.e. "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query

at once, independently of how the query was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.
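A minimal sketch of that dynamic triple creation, assuming a toy vocabulary and a deliberately naive test for whether both parts of a term map onto the ontology (the relation name "implicit-relation" is a placeholder, not an AquaLog identifier):

```python
# Toy vocabulary: classes/instances the RSS could map terms onto.
ontology_terms = {"researcher", "france", "french", "project"}

def split_compound(term):
    """If a single query term is really a noun-noun (or adjective-noun)
    compound whose parts both map onto the ontology, create an extra
    triple with an as-yet-unnamed implicit relation between them."""
    words = term.lower().split()
    if len(words) == 2 and all(
            w in ontology_terms or w.rstrip("s") in ontology_terms
            for w in words):
        modifier, head = words
        return [(head, "implicit-relation", modifier)]
    return []

print(split_compound("French researchers"))
# [('researchers', 'implicit-relation', 'french')]
```

The point of the sketch is only that the number of triples is not fixed in advance: detecting the compound during semantic analysis adds a triple that the initial linguistic analysis could not anticipate.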

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations with non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is not a "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
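The missing capability amounts to a graph search: when no atomic relation links the two query terms, the ontology graph could be searched for a short chain of relations between them. A hedged sketch, with the person/grant/organization edges and their relation names ("has-grant", "funded-by") invented to match the example:

```python
from collections import deque

# Invented schema edges for the "who collaborates with IBM?" example.
edges = [("person", "has-grant", "grant"),
         ("grant", "funded-by", "organization")]

def find_path(start, goal, max_hops=3):
    """Breadth-first search for a chain of relations linking two concepts."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        if len(path) < max_hops:
            for s, rel, o in edges:
                if s == node and o not in seen:
                    seen.add(o)
                    queue.append((o, path + [(s, rel, o)]))
    return None

print(find_path("person", "organization"))
# [('person', 'has-grant', 'grant'), ('grant', 'funded-by', 'organization')]
```

The returned chain corresponds to expanding the query with the intermediate term "grant", exactly the expansion the text says AquaLog cannot yet perform automatically.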

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered: for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so that we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase. The configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
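A minimal sketch of the person/number heuristic discussed above (again, this is not implemented in AquaLog; the function and its coarse verb list are illustrative assumptions):

```python
# Hypothetical attachment heuristic: if the number of the head noun and
# the number of the relative-clause verb disagree, the clause cannot
# modify the head noun and must attach to the other antecedent.

def attach_relative_clause(subject_number, clause_verb):
    """Decide where a 'who <verb> ...' clause attaches.

    subject_number: 'sg' or 'pl' for the head noun ('academic(s)')
    clause_verb:    the verb of the relative clause, e.g. 'has'/'have'
    """
    clause_number = "sg" if clause_verb in ("has", "is", "works") else "pl"
    if subject_number != clause_number:
        # Numbers disagree: the clause must attach to the proper name.
        return "proper-name"
    return "ambiguous"  # same number: both attachments are grammatical

# "which academics work with Peter who has interests in the semantic web?"
print(attach_relative_clause("pl", "has"))   # attaches to Peter
# "which academic works with Peter who has interests ...?"
print(attach_relative_clause("sg", "has"))   # truly ambiguous
```

The sketch also makes the trade-off visible: the heuristic only helps when the input sentence is well-formed, which, as noted above, cannot be assumed.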


8.4. Reasoning mechanisms and services

A future AquaLog version should include ranking services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology-independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to resolve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts depending on the users' community.
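The community-sensitive jargon store could be as simple as the following sketch suggests (a hypothetical structure keyed on community and jargon, not AquaLog's actual learning mechanism):

```python
# Hypothetical jargon store: the same jargon maps to different ontology
# terms depending on the community of the user who taught it.
jargon_mappings = {}

def learn(community, jargon, ontology_terms):
    """Record a user-supplied interpretation of a jargon phrase."""
    jargon_mappings[(community, jargon)] = ontology_terms

def lookup(community, jargon):
    """Retrieve the interpretation learnt for this community, if any."""
    return jargon_mappings.get((community, jargon))

learn("phd-students", "cool projects", ["research-area", "publications"])
learn("professors", "cool projects", ["funding-opportunity"])

print(lookup("phd-students", "cool projects"))  # ['research-area', 'publications']
print(lookup("professors", "cool projects"))    # ['funding-opportunity']
```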

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 A detailed overview of the state of the art for these systems can be found in Ref. [3].

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
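Such a fallback could look roughly like the following sketch, where the known-term list and the triple format with an unknown relation marker are illustrative assumptions:

```python
# Hypothetical fallback: when the question structure is not recognized,
# keep only the terms known to the ontology and form a bare triple whose
# relation is left for the ontology/RSS to fill in later.
KNOWN_TERMS = {"capital", "italy", "country", "city"}

def fallback_triple(question):
    words = [w.strip("?.,").lower() for w in question.split()]
    terms = [w for w in words if w in KNOWN_TERMS]
    if len(terms) >= 2:
        # '?' marks the unknown relation: the similarity services would
        # then search the ontology for a path between the two terms.
        return (terms[0], "?", terms[1])
    return None

print(fallback_triple("Could you please tell me the capital of Italy?"))
# -> ('capital', '?', 'italy')
```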

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services) from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
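The first, lexical step of such vocabulary matching can be sketched with a string metric; here `difflib` stands in for the string-metric algorithms AquaLog uses, and the label set, cutoff and function name are assumptions for the example:

```python
import difflib

# Toy ontology vocabulary; a real system would index the KB's labels.
ONTOLOGY_LABELS = ["person", "organization", "publication", "research-area"]

def best_label(term, cutoff=0.6):
    """Match a user term against ontology labels with a string metric.

    Returns the closest label above the cutoff, or None, in which case
    a system would fall back to lexically related words (WordNet) or,
    as a last resort, user clarification.
    """
    matches = difflib.get_close_matches(term.lower(), ONTOLOGY_LABELS,
                                        n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(best_label("organisation"))   # spelling variant still matches
print(best_label("neighbourhood"))  # no match: fall back to other services
```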

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level, domain-independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed with "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
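The grouping of surface forms into one triple category can be sketched as follows; the prefix list, category labels and function are illustrative assumptions, not AquaLog's classifier:

```python
# Hypothetical classifier: strip the asking point so that different
# surface forms collapse into the same generic triple category.

def triple_category(question):
    q = question.lower().rstrip("?")
    for prefix in ("what is", "who is", "tell me", "where is"):
        if q.startswith(prefix):
            q = q[len(prefix):].strip()
            break
    # A single ground term yields a basic generic triple (?, ?, term).
    return ("generic", q)

# Different surface forms, same triple category:
print(triple_category("who is the director of KMi?"))
print(triple_category("tell me the director of KMi"))
```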

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in Ref. [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."
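The first TAP step described above (mapping a search term to candidate denotations by label, then ranking by popularity) might be sketched as follows, with toy data rather than the TAP system itself:

```python
# Toy label index: several nodes can share the same rdfs:label.
LABELS = {
    "node:guernsey-island": "Guernsey",
    "node:guernsey-cow":    "Guernsey",
    "node:jersey-island":   "Jersey",
}
# Hypothetical popularity scores (e.g. frequency in a text corpus).
POPULARITY = {"node:guernsey-island": 0.9, "node:guernsey-cow": 0.4}

def denotations(term):
    """Return candidate nodes for a term, most popular first."""
    candidates = [n for n, label in LABELS.items()
                  if label.lower() == term.lower()]
    if len(candidates) <= 1:
        return candidates  # no candidate denotations, or a single match
    # Multiple matches: rank by popularity; a real system might instead
    # use the user profile, the search context, or ask the user.
    return sorted(candidates, key=lambda n: -POPULARITY.get(n, 0))

print(denotations("Guernsey")[0])  # the more popular denotation wins
```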

In AquaLog the vocabulary problem (ontology-triples mapping) is solved not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is resolved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, unlike systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text.

START focuses on questions about geography and the MIT infolab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage: the query classification is guided by the equivalent semantic representations or triples, the mapping process converts the elements of the triple into entry-points to the ontology and KB, and there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

since each node has its own version of the domain ontol-gy the task of passing a question from node to node may beeduced to a mapping task between (similar) conceptual repre-entationsrdquo

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it also groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
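The idea of grouping questions by the triples they generate rather than by surface form can be shown with a toy sketch. The normalisation table below is invented for this example; AquaLog derives its triples through the linguistic component, not a lookup.

```python
# Minimal illustration of triple-based query classification: two questions
# with different wording land in the same class because their triples share
# the same generic focus term and anchor term. Mapping table is invented.

GENERIC = {"who": "person/organization", "where": "location"}

def query_signature(triple):
    subj, rel, obj = triple
    return (GENERIC.get(subj, subj), obj)  # same focus + anchor -> same class

t1 = ("who", "works", "akt")                      # "Who works in akt?"
t2 = ("person/organization", "involved", "akt")   # "Which are the researchers
                                                  #  involved in the akt project?"
print(query_signature(t1) == query_signature(t2))  # → True
```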

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area? | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space? | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉



Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉

What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
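The step from the linguistic triples to the ontology triples above depends on finding ontology entry-points for user terms. The following is an illustrative sketch only, using a single stdlib string metric; AquaLog itself combines several string metric algorithms, WordNet and its relation similarity service, and the relation list and threshold here are invented.

```python
# Hypothetical sketch of matching a linguistic-triple element to ontology
# entry-points with a string metric (AquaLog combines several metrics plus
# WordNet; this uses only difflib's ratio as a stand-in).
from difflib import SequenceMatcher

ONTOLOGY_RELATIONS = ["has-research-interest", "has-project-member",
                      "has-web-address"]

def best_match(term, candidates, threshold=0.4):
    scored = [(SequenceMatcher(None, term, c).ratio(), c) for c in candidates]
    score, best = max(scored)
    return best if score >= threshold else None  # None -> consult user/WordNet

# 〈researchers, interest, semantic web research area〉: map the relation term.
print(best_match("interest", ONTOLOGY_RELATIONS))  # → has-research-interest
```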

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.

[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.

[18] J. Van Eijck, Discourse Representation Theory, to appear in: The 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science, vol. 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).


[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The answer bus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


Table 2
Learning mechanism performance

Number of queries (total 45) | Without the LM (%) | LM first iteration (from scratch) (%) | LM second iteration (%)
0 iterations with user | 37.77 (17 of 45) | 40 (18 of 45) | 64.44 (29 of 45)
1 iteration with user | 35.55 (16 of 45) | 35.55 (16 of 45) | 35.55 (16 of 45)
2 iterations with user | 20 (9 of 45) | 20 (9 of 45) | 0 (0 of 45)
3 iterations with user | 6.66 (3 of 45) | 4.44 (2 of 45) | 0 (0 of 45)
Total | (43 iterations) | (40 iterations) | (16 iterations)



services have been identified: ranking, similarity/comparative, and for proportions/percentages, which remains a line of work for future versions of the system.

7.2.1.3. Reformulation of questions. As indicated in Ref. [14], it is difficult to devise a sublanguage which is sufficiently expressive yet avoids ambiguity and seems reasonably natural. The linguistic and services failures together account for 37.67% of failures (the total number of failures, including the RSS failures, is 42%). On many occasions questions can be easily reformulated by the user to avoid these failures, so that the question can be recognized by AquaLog (linguistic failure), avoiding the use of nominal compounds (typical RSS failure) or avoiding unnecessary functional words like "different", "main", "most of" (service failure). In our evaluation, 27.53% of the total questions (19 of 69) would be correctly handled by AquaLog after easy reformulation; that means 65.51% of the failures (19 of 29) can be easily avoided. An example of a query that AquaLog is not able to handle or easily reformulate is "which main areas are corporately funded?".

To conclude, if we relax the evaluation restrictions, allowing reformulation, then 85.5% of our queries can be handled by AquaLog (independently of the occurrence of a conceptual failure).

7.2.1.4. Performance. As the results show (see Table 2 and Fig. 21), using the LM, in the best of cases we can improve from 37.77% of queries that can be answered in an automatic way to 64.44%. Queries that need one iteration with the user are quite frequent (35.55%) even if the LM is used. This is because in this version of AquaLog the LM is only used for relations, not for terms (e.g., if the user asks for the term "Peter", the RSS will require him to select between "Peter Scott", "Peter Whalley" and among all the other "Peters" present in the KB).

Fig. 21. Learning mechanism performance.



Furthermore, with the use of the LM, the number of queries that need 2 or 3 iterations is dramatically reduced.

Note that the LM improves the performance of the system even for the first iteration (from 37.77% of queries that require no iteration to 40%). Thanks to the use of the notion of context to find similar but not identical learned queries (explained in Section 6), the LM can help to disambiguate a query even if it is the first time the query is presented to the system. These numbers can seem to the user like a small improvement in performance. However, we should take into account that the set of queries (total 45) is also quite small; logically, the performance of the system for the 1st iteration of the LM improves if there is an increment in the number of queries.

7.3. Portability across ontologies

In order to evaluate portability we interfaced AquaLog to the Wine Ontology [50], an ontology used to illustrate the specification of the OWL W3C recommendation. The experiment confirmed the assumption that AquaLog is ontology portable, as we did not notice any hitch in the behaviour of this configuration compared to the others built previously. However, this ontology highlighted some AquaLog limitations to be addressed. For instance, a question like "which wines are recommended with cakes?" will fail because there is not a direct relation between wines and desserts; there is a mediating concept called "meal-course". However, the knowledge is in the ontology, and the question can be addressed if reformulated as "what wines are recommended for dessert courses based on cakes?".

The wine ontology is only an example for the OWL language; therefore, it does not have much information instantiated and, as a result, it does not present complete coverage for many of our queries, or no answer can be found for most of the questions. However, it is a good test case for the evaluation of the Linguistic and Similarity Components, which are responsible for creating the correct translated ontology-compliant triple (from which an answer can be inferred in a relatively easy way). More importantly, it makes evident what the AquaLog assumptions over the ontology are, as explained in the next subsection.

The "wine and food ontology" was translated into OCML and stored in the WebOnto server,12 so that the AquaLog WebOnto plugin was used. For the configuration of AquaLog with the new ontology, it was enough to modify the configuration files, specifying the ontology server name and location and the ontology name. A quick familiarization with the ontology allowed us in

12 http://plainmoor.open.ac.uk:3000/webonto

s and

tthtwtIsldquoomc

WWeutptmtiiclg

7

mh

(otsrdpTcfllSqtcpb[do

gql

roottq

oatfwotcs

ttWedao

fomrAsbeSoFposoba

732 Conclusions and discussion7321 Results We collected in total 68 questions As the wineontology is an example for the OWL language the ontology does

V Lopez et al Web Semantics Science Service

he first instance to define the AquaLog ontology assumptionshat are presented in the next subsection In second instanceaving a look at the few concepts in the ontology allowed uso configure the file in which ldquowhererdquo questions are associatedith the concept ldquoregionrdquo and ldquowhenrdquo with the concept ldquovin-

ageyearrdquo (there is no appropriate association for ldquowhordquo queries)n third instance the lexicon can be manually augmented withynonyms for the ontology class ldquomealcourserdquo like ldquostarterrdquoentreerdquo ldquofoodrdquo ldquosnackrdquo ldquosupperrdquo or like ldquopuddingrdquo for thentology class ldquodessertcourserdquo Though this last point it is notandatory it improves the recall of the system especially in

cases where WordNet does not provide all the frequent synonyms. The questions used in this evaluation were based on the

Wine Society "Wine and Food: a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep

enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach, in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine the key assumptions that AquaLog makes about the format of the semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter-weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data.

13 http://www.thewinesociety.com


Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.
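This style of GetData-like property access can be illustrated with a minimal sketch; the triple-list representation and the function and entity names below are our own illustrative assumptions, not the actual TAP or AquaLog API.

```python
# Minimal sketch of a GetData-style interface: look up the values of a
# property for a resource in a directed labeled graph, without any
# knowledge of the underlying ontology structure. Names are illustrative.

def get_data(kb, resource, prop):
    """Return all values v such that (resource, prop, v) is in the KB."""
    return [o for (s, p, o) in kb if s == resource and p == prop]

# A toy KB as (subject, property, object) triples.
kb = [
    ("kmi", "located-in", "milton-keynes"),
    ("kmi", "part-of", "open-university"),
    ("enrico-motta", "works-in", "kmi"),
]
```

For example, `get_data(kb, "kmi", "part-of")` returns `["open-university"]` without the caller knowing anything about the ontology schema.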

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (with triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped onto the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements for AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant but is only portable to ontologies that use restrictions in exactly the same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this in the evaluation, we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, which allow AquaLog to answer questions about matching foods to wines.

Table 3
Example of instances definition in OCML

(def-instance new-course7 othertomatobasedfoodcourse
  ((hasfood pizza) (hasdrink mariettazinfadel)))

(def-instance new-course8 mealcourse
  ((hasfood fowl) (hasdrink riesling)))
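The two-triple paths these instances create can be sketched as follows; the data mirrors Table 3, but the function and its traversal are a simplified illustration, not AquaLog's internal mechanism.

```python
# Sketch of how instances like those in Table 3 create a direct two-triple
# path between query terms: food -> mealcourse instance -> wine. The data
# mirrors Table 3; the traversal is illustrative, not AquaLog's internals.

triples = [
    ("new-course7", "hasfood", "pizza"),
    ("new-course7", "hasdrink", "mariettazinfadel"),
    ("new-course8", "hasfood", "fowl"),
    ("new-course8", "hasdrink", "riesling"),
]

def wines_for_food(food):
    """Follow hasfood back to mealcourse instances, then hasdrink out."""
    courses = {s for (s, p, o) in triples if p == "hasfood" and o == food}
    return sorted(o for (s, p, o) in triples
                  if p == "hasdrink" and s in courses)

# "Which wine should be served with pizza?" resolves through new-course7.
```

Without the mealcourse instances there would be no path of two relations or less between "pizza" and any wine, which is exactly why the ontology had to be populated for the evaluation.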



Fig. 22. OWL example of the wine and food ontology.

As the wine ontology is only an example for the OWL language, it does not present complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is no other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of the questions without conceptual failure (though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47% + 17.64% = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure)
  Total number of successful queries: (47 of 68) 69.11
  Total number of successful queries without conceptual failure: (12 of 68) 17.64
  Total number of queries with only conceptual failure: (35 of 68) 51.47

AquaLog failures
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88
  RSS failure: (15 of 68) 22.05
  Service failure: (0 of 68) 0
  Data model failure: (0 of 68) 0
  Linguistic failure: (9 of 68) 13.23

AquaLog easily reformulated questions
  Total number of queries easily reformulated: (14 of 21) 66.66 of failures
  With conceptual failure: (9 of 14) 64.28

Bold values represent the final results (%).
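As a sanity check, the figures in Table 4 follow directly from the raw counts over the 68 questions; the percentages appear to be truncated, not rounded, to two decimal places, as this small sketch verifies.

```python
# Recompute the Table 4 figures from the raw counts over the 68 questions.
# The percentages in the table are truncated, not rounded, to two decimals.

def pct(n, total=68):
    """n/total as a percentage, truncated to two decimal places."""
    return (10000 * n // total) / 100

successful = pct(47)        # valid Onto-Triple: 69.11
no_conceptual = pct(12)     # success without conceptual failure: 17.64
conceptual_only = pct(35)   # only conceptual failure: 51.47
# 51.47 + 17.64 = 69.11, the overall handling rate quoted in the text.
```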


7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that AquaLog does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred: all questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22.05% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurred, because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the possible extensions needed in the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, which is strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies to the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉; it fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple, it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
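The reclassification the RSS would need can be sketched as follows: when no direct relation links two query terms, a failing triple is expanded into two triples routed through a mediating concept. The schema triples and function below are a simplified illustration, not the actual RSS implementation.

```python
# Illustrative sketch: if no direct relation links two query terms, bridge
# them through a mediating concept (e.g. "mealcourse" between "meal" and
# foods or wines). Not the actual RSS code.

# Ontology schema as (domain, relation, range) triples.
schema = [
    ("meal", "course", "mealcourse"),
    ("mealcourse", "has-food", "food"),
    ("mealcourse", "has-drink", "wine"),
]

def bridge(term1, term2):
    """Return a direct relation, or a two-triple path via a mediator."""
    for (d, r, g) in schema:                 # try a direct relation first
        if d == term1 and g == term2:
            return [(term1, r, term2)]
    for (d1, r1, mid) in schema:             # otherwise expand the triple
        if d1 == term1:
            for (d2, r2, g2) in schema:
                if d2 == mid and g2 == term2:
                    return [(term1, r1, mid), (mid, r2, term2)]
    return None

# "wine served with a meal": no direct relation exists, so the failing
# triple is expanded into two triples through "mealcourse".
```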

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, a lack of appropriate reasoning services or of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require an understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain-dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain-independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology-dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, genitives ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is in the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; therefore three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
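The "collaborates with IBM" case can be sketched as a two-hop search through an intermediate entity; the data, relation names and function below are illustrative assumptions, not AquaLog code.

```python
# Illustrative sketch of bridging two query terms through an intermediate
# entity: there is no direct "collaborates" relation between a person and
# an organization, but person -> grant -> organization paths exist.

facts = [
    ("john", "has-grant", "grant-1"),
    ("grant-1", "funded-by", "ibm"),
    ("mary", "has-grant", "grant-2"),
    ("grant-2", "funded-by", "acme"),
]

def connected_via_one_hop(target):
    """Subjects linked to target through exactly one intermediate entity."""
    people = []
    for (s1, p1, mid) in facts:
        for (s2, p2, o2) in facts:
            if s2 == mid and o2 == target:
                people.append(s1)
    return people

# "who collaborates with IBM?" is only answerable by expanding the query
# with the intermediate "grant" entity.
```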

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered: for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (a string). Consider the question


"who is the director in KMi?". In the ontology this may be represented as 〈Enrico Motta, has-job-title, KMi director〉, where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so that we get "KMi-director" as an answer.
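The string-value limitation can be made concrete with a small sketch: a forward property lookup works once the question is reformulated, whereas the original question would require an inverted index over all literal values, shown here only as a hypothetical workaround, not an AquaLog feature.

```python
# Sketch of the string-value problem: "KMi director" exists only as a
# literal value of has-job-title, not as a queryable ontology entity.
# The inverted index below is a hypothetical workaround; AquaLog cannot
# scan every string value at query time.

kb = [("enrico-motta", "has-job-title", "KMi director")]

def job_title(person):
    """Forward lookup: answers "what is the job title of Enrico?"."""
    return [o for (s, p, o) in kb if s == person and p == "has-job-title"]

# A reverse lookup ("who is the director in KMi?") would need an index
# over every literal value, built in advance.
literal_index = {}
for (s, p, o) in kb:
    literal_index.setdefault(o, []).append((s, p))
```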

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of "and" denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is

the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
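The agreement heuristic discussed above, which is not implemented in AquaLog, can be sketched as follows; the crude plural test and verb list are illustrative assumptions, and real sentences would need a parser.

```python
# Sketch of the (unimplemented) agreement heuristic: a relative clause
# whose verb is third person singular ("has") cannot modify a plural head
# ("academics"), so it must attach to the other candidate ("Peter").
# The plural test and verb list are crude illustrative assumptions.

THIRD_PERSON_SINGULAR = {"has", "is", "works"}

def attach_clause(head_noun, other_candidate, clause_verb):
    """Return the noun a "who <verb> ..." clause modifies, or None if the
    attachment stays ambiguous and user interaction is needed."""
    head_is_plural = head_noun.endswith("s")          # crude morphology
    verb_is_singular = clause_verb in THIRD_PERSON_SINGULAR
    if head_is_plural and verb_is_singular:
        return other_candidate
    return None

# "which academics work with Peter who has interests ..." -> "Peter"
# "which academic works with Peter who has interests ..." -> None
```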


8.4. Reasoning mechanisms and services

A future AquaLog version should include ranking services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology-independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by "cool projects" is "funding opportunities". Therefore the two mappings entail two different contexts, depending on the users' community.

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology must be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types and (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases were tedious and required a long time. AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", then instead of giving a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
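A pattern-matching rule of the kind Androutsopoulos describes, together with the term-extraction fallback, can be sketched in a few lines; the rule, data and names are illustrative only, not AquaLog code.

```python
import re

# Sketch of a shallow pattern-matching rule ("... capital of <country>")
# and of the fallback idea: when the pattern applies, answer directly;
# otherwise a real system would extract the content terms and hand them
# to the ontology as a bare triple. Data and names are illustrative.

CAPITALS = {"italy": "rome", "france": "paris"}

def answer(question):
    m = re.search(r"capital of (\w+)", question.lower())
    if m and m.group(1) in CAPITALS:
        return CAPITALS[m.group(1)]
    return None  # fall back to term extraction + ontology path search

# The same rule handles all three phrasings:
# "what is the capital of Italy?", "print the capital of Italy",
# "Could you please tell me the capital of Italy?"
```

The shallowness is visible in the sketch itself: anything that does not match the one pattern falls through, which is why such rules only make sense as a last-resort fallback.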

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.
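The idea behind WordNet-style semantic similarity can be shown in miniature: the closer two terms sit in an is-a hierarchy, the higher the score. The tiny taxonomy below is invented for the example; WordNet plays this role in FAQ Finder.

```python
# Path-based semantic similarity over a toy is-a hierarchy
# (illustrative only; real systems use WordNet's synset graph).

IS_A = {  # child -> parent
    "capital": "city", "city": "location", "country": "location",
    "location": "entity", "food": "entity",
}

def ancestors(term):
    """The term followed by its chain of is-a ancestors."""
    chain = [term]
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain

def path_similarity(a, b):
    """1 / (1 + number of is-a edges between a and b); 0 if unconnected."""
    ca, cb = ancestors(a), ancestors(b)
    common = [n for n in ca if n in cb]
    if not common:
        return 0.0
    lca = common[0]   # lowest common ancestor
    return 1.0 / (1 + ca.index(lca) + cb.index(lca))

print(path_similarity("capital", "country"))  # 0.25 (via location)
print(path_similarity("capital", "food"))     # 0.2 (only via entity)
```

The limitation noted above is visible here too: a term absent from the hierarchy scores zero against everything, which is exactly the unknown-vocabulary problem.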

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.
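The kind of wh-word configuration mentioned above can be illustrated with a small XML fragment. The file format below is invented for the example; it is not AquaLog's actual schema.

```python
# Hypothetical XML configuration mapping wh-words to ontology concepts,
# parsed with the standard library. Format and names are illustrative.

import xml.etree.ElementTree as ET

CONFIG = """
<lexicon>
  <mapping word="who" concept="person"/>
  <mapping word="who" concept="organization"/>
  <mapping word="where" concept="location"/>
</lexicon>
"""

def load_mappings(xml_text):
    """Return a dict mapping each wh-word to its candidate concepts."""
    mappings = {}
    for m in ET.fromstring(xml_text):
        mappings.setdefault(m.get("word"), []).append(m.get("concept"))
    return mappings

print(load_mappings(CONFIG)["who"])  # ['person', 'organization']
```

Note that "who" deliberately maps to two concepts: keeping both candidates is what later allows the relation similarity service to disambiguate, or to ask the user.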

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
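The contrast can be sketched under invented names: first try a string-metric match of the query word against the ontology's relation labels (AquaLog combines string metrics with WordNet for this); if that fails, fall back to the relations defined for the class of the known term, as in the "neighborhoods of Chicago" example.

```python
# Unknown-word handling sketch: string-metric matching, then a
# class-scoped fallback. The toy ontology below is illustrative.

from difflib import get_close_matches

# relation label -> class it is defined on
RELATIONS = {"has-neighborhood": "city", "has-population": "city",
             "has-capital": "country"}

def map_relation(word: str, subject_class: str):
    """Return a matched relation label, or all candidates for the class."""
    stems = [label.replace("has-", "") for label in RELATIONS]
    hit = get_close_matches(word, stems, n=1, cutoff=0.8)
    if hit:
        return "has-" + hit[0]
    # fallback: every relation the ontology defines for this class
    return [label for label, cls in RELATIONS.items()
            if cls == subject_class]

print(map_relation("neighborhoods", "city"))  # 'has-neighborhood'
print(map_relation("boroughs", "city"))       # all relations of 'city'
```

A lexicon-only system in the style of PRECISE would stop at the failed lookup; the fallback list is what lets the system keep interpreting, possibly with user feedback.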

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain-independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake or river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
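The triple-centred classification described above can be sketched as follows. The normalization rules are invented for illustration; AquaLog's actual linguistic component is far richer.

```python
# Sketch: differently phrased questions reduce to the same
# <generic term, relation, ground term> pattern, hence one category.

from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    generic_term: str   # type of the expected answer (the focus)
    relation: str
    ground_term: str    # known instance the question is anchored on

def to_triple(question: str) -> Triple:
    q = question.lower().strip("?")
    # both phrasings below anchor on 'akt' and ask for related people
    if "akt" in q and ("works" in q or "researchers" in q):
        return Triple("person", "works-in", "akt")
    raise ValueError("unsupported question")

t1 = to_triple("Who works in akt?")
t2 = to_triple("Which are the researchers involved in the akt project?")
print(t1 == t2)  # True: same triple category, different surface forms
```

The comparison by value is the point: once the surface form is gone, equivalent questions collapse into one triple category, and only that category drives the subsequent processing.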

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here, there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point: triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
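The label-based lookup and vicinity collection performed by TAP can be sketched over a toy triple store. The mini-graph below is invented; a real implementation would query an RDF store.

```python
# Sketch of label lookup plus vicinity collection (TAP-style).

GRAPH = [  # (subject, relation, object)
    ("guernsey", "rdf:type", "location"),
    ("guernsey", "rdfs:label", "Guernsey"),
    ("guernsey", "spoken-language", "french"),
]

def nodes_with_label(term):
    """Map a search term to graph nodes via their rdfs:label values."""
    return [s for s, r, o in GRAPH
            if r == "rdfs:label" and o.lower() == term.lower()]

def vicinity(node):
    """Triples in which the node occurs as subject or object."""
    return [t for t in GRAPH if node in (t[0], t[2])]

for node in nodes_with_label("Guernsey"):
    for t in vicinity(node):
        print(t)
```

When `nodes_with_label` returns several candidates, TAP's disambiguation step (popularity, user profile, or asking the user) would pick one before the vicinity is collected.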

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites. There are also systems, such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text.

START focuses on questions about geography and the MIT infolab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?", for START the property is "languages" between the Object "Guernsey" and the Value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit, while DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases, additional information is needed; for example, with "how" questions the category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter: during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".
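The run-time combination of multiple pieces of information can be illustrated with a small sketch: the homepages query above needs two triples joined at run time. The KB contents and property names below are invented, not taken from the actual AKT ontology.

```python
# Sketch of multi-triple composition: find researchers interested in
# the semantic web (triple 1), then their homepages (triple 2).

KB = [
    ("vanessa", "has-research-interest", "semantic-web-area"),
    ("vanessa", "has-web-address", "http://example.org/vanessa"),
    ("enrico", "has-research-interest", "ontologies"),
]

def subjects(relation, obj):
    return [s for s, r, o in KB if r == relation and o == obj]

def objects(subject, relation):
    return [o for s, r, o in KB if s == subject and r == relation]

homepages = [hp
             for person in subjects("has-research-interest",
                                    "semantic-web-area")
             for hp in objects(person, "has-web-address")]
print(homepages)  # ['http://example.org/vanessa']
```

The join over `person` is the step a document-retrieval system cannot perform: the answer is inferred from two separate statements rather than located in any single passage.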

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
  "Who are the researchers in the semantic web research area?"
    Linguistic triple: 〈person/organization (researchers), semantic web research area〉
    Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
  "What are the phd students working for buddy space?"
    Linguistic triple: 〈phd students, working, buddyspace〉
    Ontology triple: 〈phd-student, has-project-member / has-project-leader, buddyspace〉

Wh-unknown term
  "Show me the job title of Peter Scott"
    Linguistic triple: 〈which is, job title, peter scott〉
    Ontology triple: 〈which is, has-job-title, peter-scott〉
  "What are the projects of Vanessa?"
    Linguistic triple: 〈which is, projects, vanessa〉
    Ontology triple: 〈project, has-project-member / has-project-leader, vanessa〉

Wh-unknown relation
  "Are there any projects about semantic web?"
    Linguistic triple: 〈projects, semantic web〉
    Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
  "Where is the Knowledge Media Institute?"
    Linguistic triple: 〈location, knowledge media institute〉
    Ontology triple: no relations found in the ontology

Description
  "Who are the academics?"
    Linguistic triple: 〈who/what is, academics〉
    Ontology triple: 〈who/what is, academic-staff-member〉
  "What is an ontology?"
    Linguistic triple: 〈who/what is, ontology〉
    Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
  "Is Liliana a phd student in the dip project?"
    Linguistic triple: 〈liliana, phd student, dip project〉
    Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
  "Has Martin Dzbor any interest in ontologies?"
    Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
    Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
  "Does anybody work in semantic web on the dotkom project?"
    Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
    Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member / has-project-leader, dot-kom〉
  "Tell me all the planet news written by researchers in akt"
    Linguistic triple: 〈planet news, written, researchers, akt〉
    Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member / has-project-leader, akt〉
  "Name the planet stories that are related to akt"
    Linguistic triple: 〈planet stories, related to, akt〉
    Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-3 terms (first clause)
  "Which projects on information extraction are sponsored by eprsc?"
    Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
    Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
  "Is there any publication about magpie in akt?"
    Linguistic triple: 〈publication, magpie, akt〉
    Ontology triples: 〈publication, mentions-project / has-key-publication / has-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
  "What are the contact details from the KMi academics in compendium?"
    Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
    Ontology triples: 〈which is, has-web-address / has-email-address / has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
  "Does anyone has interest in ontologies and is a member of akt?"
    Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
    Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉
  "Which researchers working in compendium have interest in hypermedia?"
    Linguistic triples: 〈researchers, working, compendium〉, 〈researchers, has-interest, hypermedia〉
    Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-combination (or)
  "Which academics work in akt or in dotcom?"
    Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
    Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
  "Which KMi academics work in the akt project sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉
  "Which KMi academics working in the akt project are sponsored by epsrc?"
    Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
  "Give me the publications from phd students working in hypermedia"

〈which ispublications phdstudents〉 〈which isworking hypermedia〉

publicationmentions-personhas-author phd student〉〈phd-studenthas-research-interesthypermedia〉

h-generic withwh-clause

Linguistic triple Ontology triple

hat researcherswho work in

〈researchers workcompendium〉

〈researcherhas-project-member

hypermedia〉

1 s and

A

2

W

W

R

[

[

[

[

[

[[

[

[

[[

[

[

[

[

[

[[

[[

[

[

[

[

[

[

[

[

[

04 V Lopez et al Web Semantics Science Service

ppendix A (Continued )

-patterns Linguistic triple Ontology triple

hat is the homepageof Peter who hasinterest in semanticweb

〈which is homepagepeter〉〈personorganizationhas interest semanticweb〉

〈which ishas-web-addresspeter-scott〉 〈personhas-research-interestsemantic-web-area〉

hich are the projectsof academics thatare related to thesemantic web

〈which is projectsacademics〉 〈which isrelated semantic web〉

〈projecthas-project-memberacademic-staff-member〉〈academic-staff-memberhas-research-interestsemantic-web-area〉

a Using the KMi ontology
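The mappings tabulated above can be approximated in a few lines of code. The sketch below is illustrative only: the class names, the hand-written synonym table and the use of `difflib` stand in for AquaLog's actual GATE, WordNet and string-metric machinery [13,21]. It maps a linguistic triple onto candidate ontology triples by normalising the query terms through a lexicon and ranking candidate relations by string similarity.

```python
from difflib import SequenceMatcher

# Toy ontology relations, indexed by (subject class, object class),
# mirroring two rows of Appendix A. Illustrative data, not AquaLog's.
ONTOLOGY_RELATIONS = {
    ("researcher", "semantic-web-area"): ["has-research-interest"],
    ("phd-student", "project"): ["has-project-member", "has-project-leader"],
}

# Stand-in for the WordNet-backed lexicon: query terms -> ontology classes.
SYNONYMS = {
    "researchers": "researcher",
    "phd students": "phd-student",
    "semantic web research area": "semantic-web-area",
    "buddy space": "project",  # buddyspace is an instance of project
}

def similarity(a: str, b: str) -> float:
    """Crude string metric; AquaLog combines several such metrics [13]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def map_triple(subject: str, relation: str, obj: str):
    """Map one linguistic triple onto ranked candidate ontology triples."""
    s = SYNONYMS.get(subject, subject)
    o = SYNONYMS.get(obj, obj)
    candidates = ONTOLOGY_RELATIONS.get((s, o), [])
    # Rank candidate relations by string similarity to the query relation.
    ranked = sorted(candidates, key=lambda r: similarity(relation, r), reverse=True)
    return [(s, r, o) for r in ranked]

print(map_triple("phd students", "working", "buddy space"))
```

When, as in the "buddy space" row, two ontology relations are plausible, AquaLog keeps both candidates and lets the Relation Similarity Service (or the user) disambiguate.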

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd edition, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004. http://www.w3.org/TR/owl-features
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

…the first instance to define the AquaLog ontology assumptions that are presented in the next subsection. In the second instance, having a look at the few concepts in the ontology allowed us to configure the file in which "where" questions are associated with the concept "region" and "when" with the concept "vintageyear" (there is no appropriate association for "who" queries). In the third instance, the lexicon can be manually augmented with synonyms for the ontology class "mealcourse", like "starter", "entree", "food", "snack", "supper", or like "pudding" for the ontology class "dessertcourse". Though this last point is not mandatory, it improves the recall of the system, especially in cases where WordNet does not provide all frequent synonyms.

The questions used in this evaluation were based on the Wine Society "Wine and Food, a Guideline" by Marcel Orford-Williams,13 as our own wine and food knowledge was not deep enough to make varied questions. They were formulated by a user familiar with AquaLog (the third author), in contrast with the previous evaluation, in which questions were drawn from a pool of users outside the AquaLog project. The reason for this is the different purpose of the evaluation: while in the first experiment one of the aims was the satisfaction of the user with respect to the linguistic coverage, in this second experiment we were interested in a software evaluation approach in which the tester is quite familiar with the system and can formulate a subset of complex questions intended to test the performance beyond the linguistic limitations (which can be extended by augmenting the grammars).

7.3.1. AquaLog assumptions over the ontology

In this section we examine key assumptions that AquaLog makes about the format of semantic information it handles, and how these balance portability against reasoning power.

In AquaLog we use triples as the intermediate representation (obtained from the NL query). These triples are mapped onto the ontology by a "light" reasoning about the ontology taxonomy, types, relationships and instances. In a portable ontology-based system for the open SW scenario we cannot expect complex reasoning over very expressive ontologies, because this requires detailed knowledge of the ontology structure. A similar philosophy was applied to the design of the query interface used in the TAP framework for semantic searches [22]. This query interface, called GetData, was created to provide a lighter weight interface (in contrast to very expressive query languages that require lots of computational resources to process, such as the query language SeRQL, developed for the Sesame RDF platform, and SPARQL for OWL). Guha et al. argue that "The lighter weight query language could be used for querying on the open uncontrolled Web in contexts where the site might not have much control over who is issuing the queries, whereas a more complete query language interface is targeted at the comparatively better behaved and more predictable area behind the firewall" [22]. GetData provides a simple interface to data presented as directed labeled graphs, which does not assume prior knowledge of the ontological structure of the data. Similarly, AquaLog plugins provide an API (or ontology interface) that allows us to query and access the values of the ontology properties in the local or remote servers.

The "light-weight" semantic information imposes a few requirements on the ontology for AquaLog to get an answer. The ontology must be structured as a directed labeled graph (taxonomy and relationships), and the requirement to get an answer is that the ontology must be populated (triples) in such a way that there is a short (two relations or less) direct path between the query terms when mapped to the ontology.

We illustrate these limitations with an example using the wine ontology. We show how the requirements of AquaLog to obtain answers to questions about matching wines and foods compare to those of a purpose-built agent programmed by an expert with full knowledge of the ontology structure. In the domain ontology, wines are represented as individuals. Mealcourses are pairings of foods with drinks; their definition ends with restrictions on the properties of any drink that might be associated with such a course (Fig. 22). There are no instances of mealcourses linking specific wines with food in the ontology.

A system that has been built specifically for this ontology is the Wine Agent [26]. The Wine Agent interface offers a selection of types of course and individual foods as a mock-up menu. When the user has chosen a meal, the Wine Agent lists the properties of a suitable wine in terms of its colour, sugar (sweet or dry), body and flavor. The list of properties is used to generate an explanation of the inference process and to search for a list of wines that match the desired characteristics.

In this way the Wine Agent can answer questions matching food and wine by exploiting domain knowledge about the role of restrictions, which are a feature peculiar to that ontology. The method is elegant, but is only portable to ontologies that use restrictions in the exact same way as the wine ontology does. In AquaLog, on the other hand, we were aiming to build a portable system targeted at the Semantic Web. Therefore we chose to build an interface for querying ontology taxonomy, types, properties and instances, i.e., structures which are almost universal in Semantic Web ontologies. AquaLog cannot reason with ontology peculiarities such as the restrictions in the wine ontology. For AquaLog to get an answer, the ontology would have to be populated with mealcourses that do specify individual wines. In order to demonstrate this, in the evaluation we introduced a new set of instances of mealcourse (e.g., see Table 3 for an example of instances). These provide direct paths, two triples in length, between the query terms, that allow AquaLog to answer questions about matching foods to wines.

13 http://www.thewinesociety.com

Table 3
Example of instances definition in OCML

(def-instance new-course7 other-tomato-based-food-course
  ((has-food pizza) (has-drink marietta-zinfadel)))

(def-instance new-course8 mealcourse
  ((has-food fowl) (has-drink riesling)))

7.3.2. Conclusions and discussion

7.3.2.1. Results. We collected in total 68 questions. As the wine ontology is an example for the OWL language, the ontology does
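The "short direct path (two relations or less)" requirement described above can be illustrated with a toy graph built from the Table 3 instances. The sketch below is a hypothetical approximation, not AquaLog's implementation: a breadth-first search bounded at two relations, which links "pizza" to a wine only because the mediating instance new-course7 connects them.

```python
from collections import defaultdict

# Directed labeled graph built from the Table 3 instances (illustrative names).
edges = [
    ("new-course7", "has-food", "pizza"),
    ("new-course7", "has-drink", "marietta-zinfadel"),
    ("new-course8", "has-food", "fowl"),
    ("new-course8", "has-drink", "riesling"),
]

graph = defaultdict(list)
for s, rel, o in edges:
    graph[s].append((rel, o))
    graph[o].append((rel, s))  # relations can be traversed in both directions

def short_path(term_a, term_b, max_relations=2):
    """Return a path of at most `max_relations` edges linking two query
    terms (AquaLog's short-direct-path requirement), or None."""
    frontier = [(term_a, [])]
    for _ in range(max_relations):
        next_frontier = []
        for node, path in frontier:
            for rel, neighbour in graph[node]:
                new_path = path + [(node, rel, neighbour)]
                if neighbour == term_b:
                    return new_path
                next_frontier.append((neighbour, new_path))
        frontier = next_frontier
    return None

# "Which wine goes with pizza?" succeeds only because new-course7 mediates:
print(short_path("pizza", "marietta-zinfadel"))
```

Without the populated instances of Table 3 the graph contains no such two-edge path, which is exactly why AquaLog could not answer the food-matching questions over the original, restriction-based wine ontology.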

not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error. Following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore we can conclude that AquaLog was able to handle 51.47 + 17.64 = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, it is due to an ontology conceptual failure):
  Total number of successful queries: (47 of 68) 69.11
  Total number of successful queries without conceptual failure: (12 of 68) 17.64
  Total number of queries with only conceptual failure: (35 of 68) 51.47

AquaLog failures:
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88
  RSS failure: (15 of 68) 22
  Service failure: (0 of 68) 0
  Data model failure: (0 of 68) 0
  Linguistic failure: (9 of 68) 13.23

AquaLog easily reformulated questions:
  Total number of queries easily reformulated: (14 of 21) 66.66 of failures
  With conceptual failure: (9 of 14) 64.28

Bold values represent the final results (%).

Fig. 22. OWL example of the wine and food ontology.

7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause for failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.2.3. Similarity failures and reformulation of questions. All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS
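The Table 4 figures can be rechecked from the raw counts. Note that the reported percentages appear to be truncated rather than rounded to two decimals (47/68 = 69.117...%, reported as 69.11):

```python
# Recomputing the Table 4 percentages from the raw counts (68 questions).
total = 68
conceptual_only = 35   # valid triples, but the answer is absent from the ontology
no_conceptual = 12     # answered once the ontology was populated
failures = 21          # a query can fail for more than one reason
reformulated = 14      # failures fixed by an easy reformulation

handled = conceptual_only + no_conceptual                # 47 of 68
print(round(100 * handled / total, 2))                   # paper reports 69.11 (truncated)
print(round(100 * reformulated / failures, 2))           # paper reports 66.66 (truncated)
print(round(100 * (handled + reformulated) / total, 2))  # paper reports 89.70 (truncated)
```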

can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation; "mealcourse" is the mediating concept between "meal" and all kinds of food: 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork" should be reformulated to "which wines should be served with a mealcourse based on pork" to be answered. The same applies for the concepts "dessert" and "dessert course"; for instance, if we formulate the query "what can I serve with a dessert of pie", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for "dessert" because it is taking the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert", because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie". As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". For instance, in the basic generic query "what should we drink with a starter of seafood", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.

8. Discussion, limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.

8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle, but which we have not yet tackled. These include other languages, the genitive ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit; DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.

Another example is in the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.
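The expansion from two linguistic triples to three ontology-compliant ones can be written out as plain tuples; the relation names below are illustrative stand-ins, not the actual vocabulary of the KMi ontology:

```python
# Linguistic Query-Triples for
# "which researchers have publications in KMi related to social aspects?"
linguistic_triples = [
    ("researchers", "have publications", "KMi"),
    ("which is/are", "related", "social aspects"),
]

# After the RSS maps "have publications" onto the ontology's
# "publication - has author" relation, a third triple appears
# (relation names are illustrative only):
ontology_triples = [
    ("publications", "has-author", "researchers"),
    ("publications", "in", "KMi"),
    ("publications", "related-to", "social aspects"),
]

print(len(ontology_triples))  # 3
```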

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations with non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and a "grant" with an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "meal course" example in Section 7.3 using the wine ontology).
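The missing capability amounts to a path search over the ontology graph. A breadth-first sketch over a toy graph for the IBM example; class and relation names are invented for illustration:

```python
from collections import deque

# Toy ontology graph: nodes are classes, edges are (relation, target).
# The relation names are invented, not taken from a real KB.
graph = {
    "person":       [("holds", "grant")],
    "grant":        [("funded-by", "organization")],
    "organization": [],
}

def find_path(graph, start, goal):
    """Breadth-first search for a chain of relations linking two
    classes: the query expansion AquaLog cannot yet perform."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

# "who collaborates with IBM?" -> bridge person and organization via grant
print(find_path(graph, "person", "organization"))
```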

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
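The person-and-number heuristic can be sketched as a tiny attachment rule; the crude morphology test and the labels below are simplifications for illustration, not AquaLog code:

```python
def attach_clause(subject_number: str, verb: str) -> str:
    """Toy version of the agreement heuristic: decide where a relative
    clause 'who <verb> ...' attaches, or report 'ambiguous'."""
    # Crude third-person-singular test ("has", "works", but not "discuss").
    third_singular = verb.endswith("s") and not verb.endswith("ss")
    if subject_number == "plural" and third_singular:
        # "which academics work with Peter who has interests ...":
        # a plural subject cannot govern a singular verb, so the
        # clause must attach to the singular antecedent ("Peter").
        return "antecedent"
    # "which academic works with Peter who has ...": both candidates
    # agree with the verb, so agreement alone cannot decide.
    return "ambiguous"

print(attach_clause("plural", "has"))    # antecedent
print(attach_clause("singular", "has"))  # ambiguous
```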


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].
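One way to keep such criteria as ontology independent as possible is to separate a generic ranking routine from pluggable, ontology-specific scoring functions. A sketch with invented field names, not an AquaLog component:

```python
# Ontology-dependent data: what a "project" looks like here is invented.
projects = [
    {"name": "A", "citations": 120, "funding": 50_000},
    {"name": "B", "citations": 40,  "funding": 900_000},
    {"name": "C", "citations": 300, "funding": 10_000},
]

def rank(items, criterion, top=5):
    """Generic, domain-independent part: order items by a score."""
    return sorted(items, key=criterion, reverse=True)[:top]

# Ontology-dependent part: what counts as "successful" in this domain.
by_citations = lambda p: p["citations"]
by_funding = lambda p: p["funding"]

print([p["name"] for p in rank(projects, by_citations, top=2)])  # ['C', 'A']
```

The design choice mirrors the question raised above: only the criterion functions need to change when the ontology changes.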

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore the two mappings entail two different contexts, depending on the users' community.
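A community-scoped mapping table is one way to sketch this; the keys and expansions below are illustrative, not the learning mechanism's actual store:

```python
# Learned jargon mappings, scoped by community: the same phrase
# resolves to different ontology terms for different user groups.
jargon = {
    ("phd-students", "cool projects"): ("research area", "publications"),
    ("professors",   "cool projects"): ("funding opportunity",),
}

def resolve(community: str, phrase: str):
    """Look up a learned mapping for this community, if any;
    None means the user must be asked for help (first occurrence)."""
    return jargon.get((community, phrase))

print(resolve("phd-students", "cool projects"))  # ('research area', 'publications')
print(resolve("professors", "cool projects"))    # ('funding opportunity',)
```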

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
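Androutsopoulos's pattern-matching rule, and its shallowness, can be reproduced in a few lines; the country table and regular expression below are toy illustrations, not a real system:

```python
import re

# The rule from [3]: a request containing "capital" followed by a
# country name prints that country's capital, whatever the phrasing.
countries = {"Italy": "Rome", "France": "Paris"}
pattern = re.compile(r"capital\s+of\s+(\w+)", re.IGNORECASE)

def fallback(question: str):
    """Shallow fallback: extract the two terms and answer directly,
    ignoring wrappers such as 'Could you please tell me ...'."""
    m = pattern.search(question)
    if m and m.group(1) in countries:
        return countries[m.group(1)]
    return None

for q in ["what is the capital of Italy?",
          "Could you please tell me the capital of Italy?"]:
    print(fallback(q))  # Rome, both times
```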

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.
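The semantic-similarity idea, and its failure mode on unknown words, can be illustrated with a hand-coded mini taxonomy standing in for WordNet; the hierarchy and scoring formula are illustrative only:

```python
# Toy stand-in for WordNet: each word maps to its hypernym.
hypernyms = {
    "dog": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal",
    "mammal": "animal", "animal": None,
}

def ancestors(word):
    """Chain of hypernyms from the word up to the root."""
    chain, w = [], word
    while w is not None:
        chain.append(w)
        w = hypernyms.get(w)
    return chain

def similarity(a, b):
    """1 / (1 + hops to the lowest common ancestor); 0.0 when either
    word is unknown - the coverage problem noted above."""
    if a not in hypernyms or b not in hypernyms:
        return 0.0
    ca, cb = ancestors(a), ancestors(b)
    for i, w in enumerate(ca):
        if w in cb:
            return 1.0 / (1 + i + cb.index(w))
    return 0.0

print(similarity("dog", "cat"))    # common ancestor "mammal"
print(similarity("dog", "quark"))  # unknown word -> 0.0
```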

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open domain systems in regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast with systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text.

START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the Object "Guernsey" and the Value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "languages" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per question, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triple: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories; for example, in questions made with who, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the


category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
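The wh-word rules described above can be sketched as a small lookup table. This is a simplification for illustration, not PiQASso's implementation; unmapped cases would fall through to lexical lookup with a tool such as WNSense.

```python
# Coarse answer typing from wh-words, PiQASso-style (illustrative only).
WH_RULES = {
    "who": {"person", "organization"},
    "when": {"time"},
    "where": {"location"},
}
# "how" questions take their type from the following adjective.
HOW_ADJ_RULES = {
    "many": {"quantity"}, "much": {"quantity"},
    "long": {"time"}, "old": {"time"},
}

def answer_type(question: str):
    words = question.lower().rstrip("?").split()
    if not words:
        return set()
    wh = words[0]
    if wh in WH_RULES:
        return WH_RULES[wh]
    if wh == "how" and len(words) > 1:
        return HOW_ADJ_RULES.get(words[1], set())
    # "what <noun>" etc. would need a word-sense lookup instead.
    return set()

print(answer_type("How many triples per sentence?"))  # {'quantity'}
print(answer_type("Who wrote the report?"))  # person/organization
```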

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".
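The idea of mapping triple elements to entry-points in the ontology can be sketched with a generic string-similarity matcher. The sketch below uses Python's difflib as a stand-in for the string metric algorithms a system like AquaLog actually employs [13]; the 0.6 threshold and function name are arbitrary illustrative choices.

```python
from difflib import SequenceMatcher

def best_entry_point(term, ontology_terms, threshold=0.6):
    """Map a triple element to its closest ontology term, or None.

    difflib stands in here for dedicated string metrics; a real
    system would also consult WordNet and learned community jargon.
    """
    def score(candidate):
        return SequenceMatcher(None, term.lower(), candidate.lower()).ratio()

    best = max(ontology_terms, key=score)
    return best if score(best) >= threshold else None

terms = ["researcher", "academic-staff-member", "kmi-planet-news-item"]
print(best_entry_point("researchers", terms))  # researcher
print(best_entry_point("banana", terms))       # None
```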

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB; for a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
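The disambiguation policy just described (domain knowledge first, user consultation only when ambiguity remains) can be sketched as follows; the function and callback names are ours, purely for illustration.

```python
def resolve(term, candidates, domain_filter, ask_user):
    """Return a single reading for `term` (illustrative sketch).

    candidates:    possible ontology readings of the term
    domain_filter: predicate applying domain knowledge
    ask_user:      callback invoked only when ambiguity remains
    """
    surviving = [c for c in candidates if domain_filter(c)]
    if len(surviving) == 1:
        return surviving[0]           # solved by domain knowledge
    if not surviving:
        surviving = candidates        # nothing ruled out; keep all
    return ask_user(term, surviving)  # ambiguity detected: consult user

# "works" could map to two relations; the domain rules out neither,
# so the user is asked to choose before going ahead.
choice = resolve("works",
                 ["has-project-member", "has-project-leader"],
                 domain_filter=lambda c: True,
                 ask_user=lambda term, options: options[0])
print(choice)  # has-project-member
```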

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base, in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".
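The homepage query above shows why combining triples matters: the answer requires joining two statements about the same subject. A toy sketch of such a join (the KB contents and relation names are invented for illustration, loosely following the KMi ontology vocabulary):

```python
# Toy triple store: <subject, relation, object> statements.
kb = [
    ("enrico", "has-research-interest", "semantic-web-area"),
    ("enrico", "has-web-address", "http://example.org/enrico"),
    ("jane", "has-research-interest", "databases"),
    ("jane", "has-web-address", "http://example.org/jane"),
]

def objects(subject, relation):
    return {o for s, r, o in kb if s == subject and r == relation}

def subjects(relation, obj):
    return {s for s, r, o in kb if r == relation and o == obj}

# Join: first find who has the interest, then fetch their homepages.
researchers = subjects("has-research-interest", "semantic-web-area")
homepages = {page for r in researchers
             for page in objects(r, "has-web-address")}
print(homepages)  # {'http://example.org/enrico'}
```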

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be presented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together. At run time, AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)
Who are the researchers in the semantic web research area? | 〈person/organization: researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space? | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉
Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉




Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple
What is the homepage of Peter who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL – an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases – an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie – towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004. http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).


[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.


[49] V. Tablan, D. Maynard, K. Bontcheva, GATE – A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



Fig. 22. OWL example of the wine and food ontology.

not present a complete coverage in terms of instances and terminology. Therefore, 51.47% of the queries fail because the information is not in the ontology. As previously, conceptual failures are only counted when there is not any other error; following this, we can consider them as correctly handled by AquaLog, though from the point of view of a user the system will not get an answer. AquaLog resolved 17.64% of questions (without conceptual failure, though the ontology had to be populated by us to get an answer). Therefore, we can conclude that AquaLog was able to handle 51.47 + 17.64 = 69.11% of the total number of questions. If we allow the user to reformulate the questions, the performance of AquaLog (independently of conceptual failures) improves to 89.70% of the total questions (66.66% of the failures are skipped by easily reformulating the query). All these results are presented in Table 4.

Due to the lack of conceptual coverage and the few relationships between concepts in this ontology, this is not an appropriate experiment to show the performance of the Learning Mechanism, as very little interactivity from the user is needed.

Table 4
AquaLog evaluation

AquaLog success in creating a valid Onto-Triple (or, if it fails to complete the Onto-Triple, this is due to an ontology conceptual failure)
  Total number of successful queries: (47 of 68) 69.11%
  Total number of successful queries without conceptual failure: (12 of 68) 17.64%
  Total number of queries with only conceptual failure: (35 of 68) 51.47%
AquaLog failures
  Total number of failures (same query can fail for more than one reason): (21 of 68) 30.88%
  RSS failure: (15 of 68) 22%
  Service failure: (0 of 68) 0%
  Data model failure: (0 of 68) 0%
  Linguistic failure: (9 of 68) 13.23%
AquaLog easily reformulated questions
  Total number of queries easily reformulated: (14 of 21) 66.66% of failures
  With conceptual failure: (9 of 14) 64.28%

Bold values represent the final results (%).
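The percentages in Table 4 are simple ratios over the 68 test queries (or over the 21 failures for the reformulation rows); note that the table truncates to two decimal places rather than rounding (47/68 ≈ 69.117 is reported as 69.11). A quick check:

```python
import math

def pct(part, whole):
    """Truncate to two decimals, matching the convention in Table 4."""
    return math.floor(10000 * part / whole) / 100

print(pct(47, 68))  # 69.11 (successful queries)
print(pct(35, 68))  # 51.47 (only conceptual failure)
print(pct(14, 21))  # 66.66 (easily reformulated failures)
```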

7.3.2.2. Analysis of AquaLog failures. The identified AquaLog limitations are explained in Section 8. However, here we present the common failures we encountered during our evaluation, following our classification:

• Linguistic failures accounted for 13.23% (9 of 68). All the linguistic failures were due to only two reasons. The first reason is tokens that were not correctly annotated by GATE, e.g., not recognizing "asparagus" as a noun, or misunderstanding "smoked" as a verb/relation instead of as part of a name in "smoked salmon"; similar situations occurred with terms like "grilled swordfish" or "cured ham". The second cause of failure is that it does not classify queries formed with "like" or "such as", e.g., "what goes well with a cheese like brie?". These kinds of linguistic failures are easily overcome by extending the set of JAPE grammars and QA abilities.
• Data model failures accounted for 0% (0 of 68). As in the previous evaluation, this type of failure never occurred. All questions can be easily formulated as AquaLog triples.
• RSS failures accounted for 22% (15 of 68), this being the most common problem. The structure of this ontology, with only a few concepts but strongly based on the use of mediating concepts and few direct relationships, highlighted the main RSS limitations explained in Section 8. These interesting limitations would have to be addressed in the next AquaLog generation. Interestingly, all these RSS failures can be skipped by reformulating the query; they are further explained in the next section, "Similarity failures and reformulation of questions".
• Service failures accounted for 0% (0 of 68). This type of failure never occurs. This was the case because the user who generated the questions was familiar enough with the system and did not ask questions that require ranking or comparative services.

7.3.3. Similarity failures and reformulation of questions

All the questions that fail due only to RSS shortcomings can be easily reformulated by the user into a question in which the RSS


can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us with valuable information about the nature of the extensions needed to the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept “meal” is related to “mealcourse” through an “ad hoc” relation: “mealcourse” is the mediating concept between “meal” and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept “wine” is related to food only through the concept “mealcourse” and its subclasses (e.g., “dessert course”): 〈mealcourse, has-drink, wine〉. Therefore, queries like “which wines should be served with a meal based on pork?” should be reformulated to “which wines should be served with a mealcourse based on pork?” to be answered.

The same applies to the concepts “dessert” and “dessert course”. For instance, if we formulate the query “what can I serve with a dessert of pie?”, AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉: it fails to find the correct relation for “dessert” because it takes the direct relation “madefromfruit” instead of going through the mediating concept “mealcourse”. Moreover, for the second triple it fails to find an “ad hoc” relation between “apple pie” and “dessert” because “pie” is a subclass of “dessert”. The question is correctly answered when reformulated to “what should I serve with a dessert course of apple pie?” As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.
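The RSS extension proposed above, inserting a mediating concept when no direct relation exists, can be sketched as a one-step expansion over a toy relation graph. This is an illustration only, not AquaLog's actual RSS code; the concept and relation names follow the wine-and-food example:

```python
# Toy ontology: (subject, relation, object) edges, after the wine/food example.
EDGES = [
    ("meal", "course", "mealcourse"),
    ("mealcourse", "has-food", "food"),
    ("mealcourse", "has-drink", "wine"),
]

def direct_relations(subj, obj):
    """Relations linking subj and obj directly, in either direction."""
    return [r for s, r, o in EDGES if {s, o} == {subj, obj}]

def expand_triple(subj, obj):
    """If no direct relation exists, try one mediating concept and return
    two triples instead of one (the dynamic triple creation suggested above)."""
    rels = direct_relations(subj, obj)
    if rels:
        return [(subj, rels[0], obj)]
    for s, r1, mid in EDGES:
        if s == subj:
            for r2 in direct_relations(mid, obj):
                return [(subj, r1, mid), (mid, r2, obj)]
    return []
```

For “meal” and “wine”, which have no direct relation, `expand_triple` returns the two triples that go through the mediating concept “mealcourse”.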

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., “wine regions”; or, for instance, in the basic generic query “what should we drink with a starter of seafood?”, AquaLog tries to map the term “seafood” into an instance, whereas “seafood” is a class in the food ontology. This case should be handled in the next AquaLog version.

8. Discussion: limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services, lack of sufficient mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the types of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, “wh-” questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, “why” and “how” questions tend to be difficult because they require understanding of causality or instrumental relations. “What” questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle “what” questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that contain little time-dependent information, and it has limited capability to reason about temporal issues. Although simple questions formed with “how long” or “when”, like “when did the akt project start?”, can be handled, it cannot cope with expressions like “in the last year”. This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with “how much”. This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] (“each”, “all”, “some”). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask “what” and “how” questions, but the implementation of temporal structures and causality (“why” and “how” questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, the genitive (’s), and negative words (such as “not” and “never”, or implicit negatives such as “only”, “except” and “other than”). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like “Name 30 different people ...”.

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., “she”, “they”) and possessive determiners (e.g., “his”, “theirs”) are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., “is there any [...] who/that/which [...]?”.


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple per discourse entity (one for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase “who was the author” were converted into “who wrote”. That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how the query was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like “French researchers” a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.

Another example is the question “which researchers have publications in KMi related to social aspects?”, where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation “have publications” refers to “publication has author” in the ontology. We therefore need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉, so three triples instead of two are needed.
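The re-bracketing above, from two linguistic triples to three ontology-compliant triples, can be sketched with a minimal Query-Triple structure. This is a simplified stand-in, not AquaLog's internal representation, and the relation name "in" is an assumption for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueryTriple:
    """Minimal stand-in for a Query-Triple: a binary relation between terms."""
    subject: str
    relation: str
    object: str

# Linguistic triples for "which researchers have publications in KMi
# related to social aspects?" (as in the example above):
linguistic = [
    QueryTriple("researchers", "have publications", "KMi"),
    QueryTriple("which is/are", "related", "social aspects"),
]

# After the RSS maps "have publications" to "publication has author",
# three ontology-compliant triples are needed instead of two:
ontology = [
    QueryTriple("publications", "has author", "researchers"),
    QueryTriple("publications", "in", "KMi"),        # relation name assumed
    QueryTriple("publications", "related", "social aspects"),
]
```

The RSS step adds one triple because the single linguistic relation “have publications” corresponds to a pair of ontology relationships.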

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is no direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question “who collaborates with IBM?”, where there is no “collaboration” relation between a person and an organization, but there is a relation between a “person” and a “grant”, and between a “grant” and an “organization”. The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term “grant” (see also the “mealcourse” example in Section 7.3 using the wine ontology).

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question


“who is the director in KMi?” In the ontology it may be represented as “Enrico Motta – has job title – KMi director”, where “KMi director” is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to “what is the job title of Enrico?”, so we would get “KMi-director” as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., “semantic web papers”, “research department”). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a “large company” may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example “list all applicants who live in California and Arizona”. In this case the word “and” can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility of “and” denoting a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences “which academic works with peter who has interests in the semantic web” and “which academics work with peter who has interests in the semantic web”. The former example is truly ambiguous: either “Peter” or the “academic” could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb “has” is in the third person singular; therefore the second part, “has interests in the semantic web”, must be linked to “Peter” for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
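The number-agreement heuristic described above, which AquaLog does not implement, can be sketched as a small attachment rule. The function and its inputs are hypothetical; in a real system the number tags would come from a POS tagger such as the one in GATE:

```python
def attach_relative_clause(candidates, clause_verb_is_singular):
    """Decide which antecedent a 'who ...' clause modifies, using number
    agreement: a third-person-singular verb must attach to a singular noun.
    `candidates` is a list of (noun, is_singular) pairs from the main clause."""
    matching = [noun for noun, singular in candidates
                if singular == clause_verb_is_singular]
    if len(matching) == 1:
        return matching[0]   # agreement resolves the attachment
    return None              # still ambiguous: fall back to asking the user

# "which academics work with peter who has interests in ...":
# 'has' is singular and 'academics' is plural, so the clause attaches to Peter.
print(attach_relative_clause([("peter", True), ("academics", False)], True))
```

With two singular candidates (“which academic works with peter who has ...”), agreement cannot decide and the function returns `None`, matching the paper's observation that the singular variant is truly ambiguous.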


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like “what are the most successful projects?” or “what are the five key projects in KMi?”. To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question “what are the cool projects for a PhD student?” is asked, the system would not be able to solve the linguistic ambiguity of the words “cool project”. The first time, some help from the user is needed, who may select “all projects which belong to a research area in which there are many publications”. A mapping is therefore created between “cool projects”, “research area” and “publications”, so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: “what are the cool projects for a professor?” What he means by saying “cool projects” is “funding opportunity”. Therefore, the two mappings entail two different contexts depending on the users’ community.
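The community-sensitive jargon mapping described above can be sketched as a lexicon keyed by (community, phrase). This is a hypothetical illustration of the idea, not AquaLog's Learning Mechanism; the community names and ontology terms are invented:

```python
class JargonLexicon:
    """Toy store of learned jargon interpretations, keyed by the user's
    community so the same phrase can mean different things to different users."""

    def __init__(self):
        self._mappings = {}

    def learn(self, community, phrase, ontology_terms):
        """Record a user-approved interpretation of a jargon phrase."""
        self._mappings[(community, phrase)] = ontology_terms

    def lookup(self, community, phrase):
        """Return the learned interpretation, or None to trigger user help."""
        return self._mappings.get((community, phrase))

lex = JargonLexicon()
lex.learn("phd-student", "cool projects",
          ["project", "research area", "publications"])
lex.learn("professor", "cool projects", ["project", "funding opportunity"])
```

A lookup for an unseen (community, phrase) pair returns `None`, which is the cue to fall back to asking the user, as in the “first time” scenario above.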

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about “coolness” in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog’s linguistic component and RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and “openness” is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies, so indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
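The resource-selection step mentioned above, choosing which of many ontologies might answer a query, suggests an inverted index from terms to ontologies built offline. A minimal sketch with invented ontology names and vocabularies:

```python
from collections import defaultdict

# Invented example data: which ontologies mention which terms.
ONTOLOGY_TERMS = {
    "akt-portal": ["researcher", "project", "publication"],
    "wine-food": ["wine", "mealcourse", "dessert"],
}

# Build the inverted index once, offline, so selection is fast at run time.
index = defaultdict(set)
for onto, terms in ONTOLOGY_TERMS.items():
    for term in terms:
        index[term].add(onto)

def candidate_ontologies(query_terms):
    """Ontologies that mention at least one term of the query."""
    return set().union(*(index.get(t, set()) for t in query_terms))
```

A query mixing “wine” and “researcher” would select both ontologies, which is exactly the case where answers must be composed from multiple sources.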

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and on its explicit semantics, to disambiguate the meaning of the questions and provide answers. As pointed out in the first AquaLog conference publication [38], it provides a new ‘twist’ on the old issues of NLIDB. To see to


what extent AquaLog’s novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog’s portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user’s request contains the word “capital” followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle “what is the capital of Italy?”, “print the capital of Italy” and “Could you please tell me the capital of Italy?”. The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure “Could you please tell me”, then instead of giving a failure it could take the terms “capital” and “Italy”, create a triple with them and, using the semantics offered by the ontology, look for a path between them.
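The fallback strategy just suggested (keep the recognized terms and search for an ontology path between them) might look like the following breadth-first search. The mini term graph is invented for illustration; a real implementation would walk the ontology's relation graph:

```python
from collections import deque

# Invented mini-ontology as an undirected term graph.
GRAPH = {
    "capital": {"country"},
    "country": {"capital", "Italy"},
    "Italy": {"country"},
}

def fallback_path(term_a, term_b):
    """Breadth-first search for any path linking the two recognized terms."""
    queue = deque([[term_a]])
    seen = {term_a}
    while queue:
        path = queue.popleft()
        if path[-1] == term_b:
            return path
        for nxt in GRAPH.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection: give up rather than fail silently

# "Could you please tell me the capital of Italy?" -> terms: capital, Italy
print(fallback_path("capital", "Italy"))
```

For the unparsed question, the path through “country” would let the system propose an answer even though no question pattern matched.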

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN’s PARLANCE, IBM’s LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.


between the user’s question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user’s question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user’s query (NL front end) to the representation of an ontology-compliant triple (through similarity services) from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires “separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system”. AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like “who” being the same as “person”, are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user’s query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question “what are some of the neighborhoods of Chicago?” cannot be handled by PRECISE because the word “neighborhood” is unknown. AquaLog would not necessarily know the term “neighborhood” either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
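AquaLog's actual similarity services combine string metrics, WordNet and the ontology structure; as a rough stand-in for the string-metric part, the standard library's `difflib` can illustrate mapping a user's term onto a nearby ontology vocabulary entry. The vocabulary and relation names below are invented:

```python
from difflib import get_close_matches

# Invented ontology vocabulary of relations defined for cities; AquaLog
# actually uses string metrics plus WordNet, so this is only a rough sketch.
VOCABULARY = ["neighborhood-of", "located-in", "has-population", "mayor-of"]

def match_term(user_term, cutoff=0.6):
    """Best fuzzy match of a user's term against the ontology vocabulary,
    or None if nothing is close enough (the 'unknown word' case)."""
    hits = get_close_matches(user_term, VOCABULARY, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(match_term("neighborhoods"))
```

Where PRECISE would reject the question outright, a fuzzy match like this lets the system hypothesize a candidate relation and then validate it against the ontology structure.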

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit ‘dormant’; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) “Identifying the semantic type of the entity sought by the question”; (2) “Determining additional constraints on the answer entity”. Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the “main information required by the interrogation” (very useful for “what” questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is “day of the week”, it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

“(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA.”

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like “the nearest NE to the queried key words” [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, “who” can be related to a “person” or to an “organization”).

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions

according to their answer target. For example, in Wu et al.'s ILQUA system [53], these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the

generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
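The grouping of surface forms into one triple category can be sketched as follows. This is illustrative Python, not AquaLog's actual implementation; the prefix list and the regular expression are assumptions made for the example.

```python
import re

# Illustrative sketch (not AquaLog's code): different surface forms of a
# question reduce to the same generic triple category, so "who ..." and
# "which are ..." variants are processed identically downstream.
def to_linguistic_triple(question: str):
    q = question.lower().rstrip("?")
    # All of these prefixes lead to the same triple category.
    for prefix in ("who are", "who", "which are", "what are", "tell me", "list"):
        if q.startswith(prefix):
            q = q[len(prefix):].strip()
            break
    # Very naive split into <term, relation-tail> around a preposition.
    m = re.match(r"(?:the )?(.+?) (?:in|for|of) (?:the )?(.+)", q)
    if m:
        return ("generic", m.group(1), m.group(2))
    return ("generic", q, None)

# Two formulations of the same question fall into the same category.
t1 = to_linguistic_triple("Who works in akt?")
t2 = to_linguistic_triple("Which are the researchers involved in the akt project?")
assert t1[0] == t2[0] == "generic"
```

The point of the sketch is only that classification keys on the triple shape, not on the wh-word used.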

The best results of the TREC-9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the

search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."
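The denotation-and-vicinity procedure described above can be sketched over a toy triple store. The data and function names below are invented for illustration; they are not TAP's actual API.

```python
# Illustrative sketch of the TAP-style denotation search described above:
# map a search term to SW nodes via their labels, then collect the triples
# in the vicinity of the chosen node. All identifiers here are invented.
TRIPLES = [
    ("kmi:AKT", "rdfs:label", "akt"),
    ("kmi:AKT", "rdf:type", "kmi:Project"),
    ("kmi:AKT", "kmi:involves-organization", "kmi:EPSRC"),
]

def denotations(term):
    """Candidate nodes whose rdfs:label matches the search term."""
    return [s for (s, p, o) in TRIPLES if p == "rdfs:label" and o == term]

def vicinity(node):
    """Triples in which the node takes part (one-hop neighbourhood)."""
    return [t for t in TRIPLES if node in (t[0], t[2])]

nodes = denotations("akt")   # may be empty, a single match, or multiple matches
assert nodes == ["kmi:AKT"]
assert len(vicinity(nodes[0])) == 3
```

In the multiple-match case, TAP's disambiguation heuristics (popularity, user profile, search context) would pick among the returned candidates.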

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.
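A minimal sketch of this cascade (labels first, then lexically related words, then the user); the dictionaries and names below are invented for illustration and do not reflect AquaLog's internals.

```python
# Hedged sketch of the cascade above (all names invented): try exact
# ontology labels first, then lexically related words, and only ask the
# user as a last resort.
def map_term(term, labels, synonyms, ask_user):
    term = term.lower()
    if term in labels:                     # 1. direct label match
        return labels[term]
    for alt in synonyms.get(term, []):     # 2. WordNet-style lexical variants
        if alt in labels:
            return labels[alt]
    return ask_user(term)                  # 3. last resort: consult the user

labels = {"academic-staff-member": "kmi:AcademicStaffMember"}
synonyms = {"academics": ["academic-staff-member"]}
assert map_term("academics", labels, synonyms, lambda t: None) == "kmi:AcademicStaffMember"
```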

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the


web, but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites. There are also systems, such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31],

"what languages are spoken in Guernsey?", for START the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".
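The contrast between the two representations of the Guernsey example can be written out as plain tuples (illustrative only):

```python
# "what languages are spoken in Guernsey?" in the two data models.
start_repr = ("Guernsey", "languages", "French")       # START: object-property-value
aqualog_repr = ("language", "are spoken", "Guernsey")  # AquaLog: term-relation-term

# START keys on a property of a known object; AquaLog keys on a relation
# between a generic term (the expected answer type) and a known term.
assert start_repr[1] == "languages"
assert aqualog_repr[1] == "are spoken"
```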

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood

in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for a discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions

made with who, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases, additional information is needed. For example, with "how" questions the


category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
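The rule ordering described for PiQASso can be approximated as follows. This is a simplified sketch: the WNSense lookup is replaced by a stub dictionary, and only the categories named above are covered.

```python
# Rough sketch of PiQASso-style answer-type rules (simplified; WNSense is
# replaced by a stub dictionary of noun -> semantic type).
NOUN_TYPES = {"city": "location", "president": "person"}  # stand-in for WNSense

def answer_type(question: str):
    q = question.lower()
    if q.startswith("who"):
        return {"person", "organization"}   # combined categories
    if q.startswith("when"):
        return {"time"}
    if q.startswith("where"):
        return {"location"}
    if q.startswith("how many") or q.startswith("how much"):
        return {"quantity"}                 # from the adjective after "how"
    if q.startswith("how long") or q.startswith("how old"):
        return {"time"}
    words = q.split()
    if words and words[0] == "what" and len(words) > 1 and words[1] in NOUN_TYPES:
        return {NOUN_TYPES[words[1]]}       # type of "what <noun>" = type of noun
    return {"any"}                          # e.g. definition questions

assert answer_type("Who wrote Hamlet?") == {"person", "organization"}
assert answer_type("How many moons does Mars have?") == {"quantity"}
```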

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is

interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of the semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic

representations or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic

component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to


different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the

importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
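The superclass lookup described in Ref. [12] can be sketched as follows; the class hierarchy and question-type names are invented for illustration.

```python
# Sketch of the lookup in [12]: the question types applicable to a concept
# are those attached to it or to any of its superclasses. The hierarchy
# and question types below are invented examples.
SUPER = {"phd-student": "person", "person": None}
QTYPES = {"person": {"who-is"}, "phd-student": {"what-thesis"}}

def applicable_question_types(concept):
    types = set()
    while concept is not None:              # walk up the superclass chain
        types |= QTYPES.get(concept, set())
        concept = SUPER.get(concept)
    return types

assert applicable_question_types("phd-student") == {"what-thesis", "who-is"}
```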

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL

interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose

between alternative readings of terms, relations or modifier attachments).
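The disambiguation policy just described (domain knowledge first, user consultation only as a fallback) can be sketched as:

```python
# Sketch of the disambiguation policy above: filter candidate readings
# with domain knowledge; if more than one (or none) survives, consult
# the user. Function names are illustrative, not AquaLog's API.
def disambiguate(candidates, domain_filter, ask_user):
    viable = [c for c in candidates if domain_filter(c)]
    if len(viable) == 1:
        return viable[0]          # solved by domain knowledge alone
    if not viable:
        viable = candidates       # nothing ruled in: offer all readings
    return ask_user(viable)       # ambiguity detected: consult the user

picked = disambiguate(["person", "organization"],
                      lambda c: True,          # domain knowledge cannot decide
                      lambda opts: opts[0])    # the user picks the first reading
assert picked == "person"
```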

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge


[Stray table row from Appendix A: "Name the planet stories that are related to akt"; linguistic triple 〈planet stories, related to, akt〉; ontology triple 〈kmi-planet-news-item, mentions-project, akt〉.]


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the

correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems

do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack

the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.); depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved, it also groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does

not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution, in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries, where multiple pieces of information need to be combined together

at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from, and combine, knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input, and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area?
〈person/organization researchers, semantic web research area〉
〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space?
〈phd students, working, buddy space〉
〈phd-student, has-project-member/has-project-leader, buddyspace〉




Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott
〈which is, job title, peter scott〉
〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa?
〈which is, projects, vanessa〉
〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web?
〈projects, semantic web〉
〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute?
〈location, knowledge media institute〉
No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics?
〈who/what is, academics〉
〈who/what is, academic-staff-member〉

What is an ontology?
〈who/what is, ontology〉
〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project?
〈liliana, phd student, dip project〉
〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies?
〈martin dzbor, has any interest, ontologies〉
〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project?
〈person/organization, works, semantic web, dotkom project〉
〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt
〈planet news, written, researchers, akt〉
〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc?
〈projects, sponsored, information extraction, epsrc〉
〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt?
〈publication, magpie, akt〉
〈publication, mentions-project/has-key-publication/has-publication, akt〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium?
〈which is, contact details, kmi academics, compendium〉
〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt?
〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom?
〈academics, work, akt〉 〈academics, work, dotcom〉
〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc?
〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc?
〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia
〈which is, publications, phd students〉 〈which is, working, hypermedia〉
〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia?
〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web?
〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Which are the projects of academics that are related to the semantic web?
〈which is, projects, academics〉 〈which is, related, semantic web〉
〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PiQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599-607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.

[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, to appear in: The 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.

[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).

s and

[[

[

[

[[

[

[

[

[[

V Lopez et al Web Semantics Science Service

42] TM Mitchell Machine Learning McGraw-Hill New York 199743] D Moldovan S Harabagiu M Pasca R Mihalcea R Goodrum R Girju

V Rus LASSO a tool for surfing the answer net in Proceedings of theText Retrieval Conference (TREC-8) November 1999

45] M Pasca S Harabagiu The informative role of Wordnet in open-domainquestion answering in 2nd Meeting of the North American Chapter of theAssociation for Computational Linguistics (NAACL) 2001

46] AM Popescu O Etzioni HA Kautz Towards a theory of natural lan-guage interfaces to databases in Proceedings of the 2003 InternationalConference on Intelligent User Interfaces Miami FL USA January

12ndash15 2003 pp 149ndash157

47] RDF httpwwww3orgRDF48] K Srihari W Li X Li Information extraction supported question-

answering in T Strzalkowski S Harabagiu (Eds) Advances inOpen-Domain Question Answering Kluwer Academic Publishers 2004

[

Agents on the World Wide Web 5 (2007) 72ndash105 105

49] V Tablan D Maynard K Bontcheva GATEmdashA Concise User GuideUniversity of Sheffield UK httpgateacuk

50] W3C OWL Web Ontology Language Guide httpwwww3orgTR2003CR-owl-guide-0030818

51] R Waldinger DE Appelt et al Deductive question answering frommultiple resources in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

52] WebOnto project httpplainmooropenacukwebonto53] M Wu X Zheng M Duan T Liu T Strzalkowski Question answering

by pattern matching web-proofing semantic form proofing NIST Special

Publication in The 12th Text Retrieval Conference (TREC) 2003 pp255ndash500

54] Z Zheng The answer bus question answering system in Proceedings ofthe Human Language Technology Conference (HLT2002) San Diego CAMarch 24ndash27 2002



can generate a correct ontology mapping. That is, 66.66% of the total erroneous questions can be reformulated, allowing this AquaLog version to correctly handle 89.70% of the queries.

However, this ontology highlighted the main AquaLog limitations discussed in Section 8. This provides us valuable information about the nature of possible extensions needed on the RSS components. We did not only want to assess the current coverage of AquaLog, but also to get some data about the complexity of the possible changes required to generate the next version of the system.

The main underlying reason for most of the RSS failures is the structure of the wine and food ontology, strongly based on using mediating concepts instead of direct relationships. For instance, the concept "meal" is related to "mealcourse" through an "ad hoc" relation: "mealcourse" is the mediating concept between "meal" and all kinds of food, 〈meal, course, mealcourse〉, 〈mealcourse, has-food, nonredmeat〉. Moreover, the concept "wine" is related to food only through the concept "mealcourse" and its subclasses (e.g., "dessert course"): 〈mealcourse, has-drink, wine〉. Therefore, queries like "which wines should be served with a meal based on pork?" should be reformulated to "which wines should be served with a mealcourse based on pork?" to be answered. The same applies for the concepts "dessert" and "dessert course": for instance, if we formulate the query "what can I serve with a dessert of pie?", AquaLog wrongly generates the ontology triples 〈which is, madefromfruit, dessert〉 and 〈dessert, pie〉. It fails to find the correct relation for "dessert" because it takes the direct relation "madefromfruit" instead of going through the mediating concept "mealcourse". Moreover, for the second triple it fails to find an "ad hoc" relation between "apple pie" and "dessert" because "pie" is a subclass of "dessert". The question is correctly answered when reformulated to "what should I serve with a dessert course of apple pie?" As explained in Section 8, for the next AquaLog generation the RSS must be able to reclassify the triples and dynamically generate a new triple to map indirect relationships (through a mediating concept) if needed.

Other minor RSS failures are due to nominal compounds (as in the previous evaluation), e.g., "wine regions". Also, for instance, in the basic generic query "what should we drink with a starter of seafood?", AquaLog tries to map the term "seafood" into an instance; however, "seafood" is a class in the food ontology. This case should be contemplated in the next AquaLog version.
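The reclassification step described above, replacing a failed direct triple with two triples through a mediating concept, can be sketched in a few lines. This is a hypothetical illustration, not AquaLog code: the mini-ontology fragment, function names and relation labels are our own simplifying assumptions.

```python
# Hypothetical sketch of expanding a failed query triple through a
# mediating concept ("mealcourse"), as the next-generation RSS would
# need to do. The mini-ontology below is illustrative, not the real KB.

ONTOLOGY = {
    # (subject_class, relation, object_class)
    ("meal", "course", "mealcourse"),
    ("mealcourse", "has-food", "pork"),
    ("mealcourse", "has-drink", "wine"),
}

def direct_relations(subj, obj):
    """All relations directly linking subj to obj in the ontology."""
    return [r for (s, r, o) in ONTOLOGY if s == subj and o == obj]

def expand_via_mediator(subj, obj):
    """If no direct relation links subj and obj, look for a mediating
    concept m such that subj-m and m-obj are both related, and return
    the two bridging triples instead of the original one."""
    direct = direct_relations(subj, obj)
    if direct:
        return [(subj, direct[0], obj)]
    for (s1, r1, m) in ONTOLOGY:
        if s1 != subj:
            continue
        for (s2, r2, o2) in ONTOLOGY:
            if s2 == m and o2 == obj:
                return [(subj, r1, m), (m, r2, obj)]
    return None

# "which wines should be served with a meal based on pork?" fails as a
# direct <meal, ?, wine> triple but succeeds through "mealcourse":
print(expand_via_mediator("meal", "wine"))
# -> [('meal', 'course', 'mealcourse'), ('mealcourse', 'has-drink', 'wine')]
```

With such a step in place, the user would not need to reformulate "meal" to "mealcourse" by hand.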

8. Discussion: limitations and future lines

We examined the nature of the AquaLog question answering task itself and identified some major problems that we need to address when creating a NL interface to ontologies. Limitations are due to linguistic and semantic coverage, lack of appropriate reasoning services or lack of enough mapping mechanisms to interpret a query, and the constraints imposed by ontology structures.


8.1. Question types

The first limitation of AquaLog relates to the type of questions being handled. We can distinguish different kinds of questions: affirmative/negative questions, "wh" questions (who, what, how many, ...) and commands (Name all ...). All of these are treated as questions in AquaLog. As pointed out in Ref. [24], there is evidence that some kinds of questions are harder than others when querying free text. For example, "why" and "how" questions tend to be difficult because they require understanding of causality or instrumental relations. "What" questions are hard because they provide little constraint on the answer type. By contrast, AquaLog can handle "what" questions easily, as the possible answer types are constrained by the types of the possible relationships in the ontology.

AquaLog only supports questions referring to the present state of the world as represented by an ontology. It has been designed to interface ontologies that facilitate little time-dependent information and has limited capability to reason about temporal issues. Although simple questions formed with "how long" or "when", like "when did the akt project start?", can be handled, it cannot cope with expressions like "in the last year". This is because the structure of temporal data is domain dependent: compare geological time scales to the periods of time in a knowledge base about research projects. There is a serious research challenge in determining how to handle temporal data in a way which would be portable across ontologies.

The current version also does not understand queries formed with "how much". This is mainly because the Linguistic Component does not yet fully exploit quantifier scoping [18] ("each", "all", "some"). Unlike temporal data, our intuition is that some basic aspects of quantification are domain independent, and we are working on improving this facility.

In short, the limitations on the types of question that can be asked of AquaLog are imposed in large part by limitations on the kinds of generic query methods that are portable across ontologies. The use of ontologies extends its ability to ask "what" and "how" questions, but the implementation of temporal structures and causality ("why" and "how" questions) is ontology dependent, so these features could not be built into AquaLog without sacrificing portability.

There are some linguistic forms that it would be helpful for AquaLog to handle but which we have not yet tackled. These include other languages, the genitive ('s) and negative words (such as "not" and "never", or implicit negatives such as "only", "except" and "other than"). AquaLog should also include a count feature in the triples, indicating the number of instances to be retrieved, to answer queries like "Name 30 different people ...".

There are also some linguistic forms that we do not believe it is worth concentrating on. Since AquaLog is used for stand-alone questions, it does not need to handle the linguistic phenomenon called anaphora, in which pronouns (e.g., "she", "they") and possessive determiners (e.g., "his", "theirs") are used to denote implicitly entities mentioned in an extended discourse. However, some kinds of noun phrases which occur beyond the same sentence are understood, i.e., "is there any [...] who/that/which [...]".


8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage DIMAP triples are consolidated, so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how this was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects, during the semantic analysis phase, that it is in fact a compound noun, as there is an implicit relationship between French and researchers. The compound noun problem is discussed further below.
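The dynamic-triple idea for compound nouns can be illustrated as follows. This is a hypothetical sketch: the class and instance names, the naive singularization, and the "?implicit-relation" placeholder are our own illustrative assumptions, not the actual AquaLog implementation.

```python
# Hypothetical sketch: when a term such as "French researchers" turns
# out to be a modifier + head-noun compound, create a new triple with
# an as-yet-unnamed relation between the two parts.

KNOWN_CLASSES = {"researcher", "country", "project"}
KNOWN_INSTANCES = {"french": "country"}  # "French" maps to a country

def split_compound(term):
    """Return (modifier, head) if the term is a recognizable compound,
    i.e. its head noun is a known class and its modifier is known."""
    words = term.lower().split()
    if len(words) != 2:
        return None
    modifier, head = words
    head = head.rstrip("s")  # naive singularization, sketch only
    if head in KNOWN_CLASSES and modifier in KNOWN_INSTANCES:
        return modifier, head
    return None

def triples_for_term(term):
    compound = split_compound(term)
    if compound is None:
        return [(term, None, None)]
    modifier, head = compound
    # implicit, not-yet-named relation between the two parts:
    return [(head, "?implicit-relation", modifier)]

print(triples_for_term("French researchers"))
# -> [('researcher', '?implicit-relation', 'french')]
```

The placeholder relation would then be resolved by the RSS against the ontology (e.g., to a "nationality" or "works-in" relation), exactly as for any other underspecified triple.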

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms, 〈researchers, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉: three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations to non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "mealcourse" example in Section 7.3 using the wine ontology).
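The bridging problem just described is essentially a path search over the ontology graph. A minimal sketch, assuming a toy class-level edge set (the edge labels and classes are invented for illustration, not taken from any real KB):

```python
# Hypothetical sketch: when no direct relation links two query terms,
# search the ontology graph for a connecting path, expanding the query
# with intermediate terms such as "grant".
from collections import deque

EDGES = {
    ("person", "holds", "grant"),
    ("grant", "funded-by", "organization"),
    ("organization", "located-in", "city"),
}

def find_path(start, goal):
    """Breadth-first search over (class, relation, class) edges,
    returning the list of triples bridging start and goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for (s, r, o) in EDGES:
            if s == node and o not in seen:
                seen.add(o)
                queue.append((o, path + [(s, r, o)]))
    return None

# "who collaborates with IBM?": no direct person-organization relation,
# but a two-triple path through "grant" exists:
print(find_path("person", "organization"))
# -> [('person', 'holds', 'grant'), ('grant', 'funded-by', 'organization')]
```

Breadth-first search returns the shortest bridge first, which matches the intuition that the most direct available connection is the most plausible interpretation of the query.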

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as a value of a relation (string). Consider the question


"who is the director in KMi?" In the ontology it may be represented as "Enrico Motta – has job title – KMi director", where "KMi director" is a String. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated to "what is the job title of Enrico?", so we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (i.e., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences, whereas methods such as the person and number of the verb assume that all sentences are well-formed. Our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
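The agreement heuristic just discussed can be made concrete in a few lines. This is a hypothetical sketch under strong simplifying assumptions (the crude third-person-singular test and the fixed set of antecedents are illustrative only), not a description of AquaLog's parser:

```python
# Hypothetical sketch of the agreement heuristic: use the person/number
# of the clause verb to decide whether "who has interests..." attaches
# to the plural subject or to the singular proper noun.

def attach_clause(subject, subject_is_plural, clause_verb, proper_noun):
    """Return the term the relative clause most plausibly modifies.
    A third-person-singular verb ("has", "works") cannot agree with a
    plural subject, so the clause must attach to the proper noun."""
    third_singular = clause_verb.endswith("s") and not clause_verb.endswith("ss")
    if subject_is_plural and third_singular:
        return proper_noun      # only the singular antecedent agrees
    return "ambiguous"          # both antecedents agree; ask the user

# "which academics work with Peter who has interests in the semantic web?"
print(attach_clause("academics", True, "has", "Peter"))   # -> Peter
# "which academic works with Peter who has interests in the semantic web?"
print(attach_clause("academic", False, "has", "Peter"))   # -> ambiguous
```

As noted above, such a heuristic trades robustness for precision: it resolves the plural case automatically, but only if the input sentence is in fact well-formed.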


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions, the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge: for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services, mentioned above, that people require when posing queries to a semantic web. Those services are not supported by domain relations: they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?" What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts depending on the users' community.
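The community-sensitive jargon store described above amounts to keying learned mappings by community as well as phrase. A minimal sketch, with illustrative class, community and interpretation names that are our own assumptions rather than AquaLog's internal structures:

```python
# Hypothetical sketch of a community-sensitive jargon store: the same
# phrase ("cool projects") maps to different ontology-level
# interpretations depending on the user's community.

class JargonLearner:
    def __init__(self):
        # (community, phrase) -> ontology-level interpretation
        self.mappings = {}

    def learn(self, community, phrase, interpretation):
        self.mappings[(community, phrase)] = interpretation

    def interpret(self, community, phrase):
        # fall back to asking the user when no mapping has been learned
        return self.mappings.get((community, phrase), "ask-user")

learner = JargonLearner()
# A PhD student explains "cool projects" once; the mapping is stored:
learner.learn("phd-students", "cool projects",
              "projects in research areas with many publications")
# A professor uses the same jargon with a different meaning:
learner.learn("professors", "cool projects", "funding opportunities")

print(learner.interpret("phd-students", "cool projects"))
print(learner.interpret("professors", "cool projects"))
print(learner.interpret("visitors", "cool projects"))  # -> ask-user
```

Keying on the community rather than the individual user lets one user's clarification benefit the whole group, while keeping the professor's and the student's readings of the same phrase separate.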

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.


This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous, open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple, autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and its RSS algorithms for handling ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption of the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
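Recognizing that two instances from different sources may denote the same individual is typically attacked with the kind of string metric algorithms AquaLog already uses for vocabulary mapping. A minimal sketch, assuming a simple normalized similarity ratio and an invented threshold (a real system would combine several metrics and evidence beyond the label):

```python
# Hypothetical sketch: deciding whether two instance labels drawn from
# different ontologies may denote the same individual, using a simple
# similarity ratio over normalized labels. Threshold is illustrative.
from difflib import SequenceMatcher

def same_individual(label_a, label_b, threshold=0.8):
    """Normalize labels and compare with a similarity ratio."""
    a = label_a.lower().replace("-", " ").replace("_", " ").strip()
    b = label_b.lower().replace("-", " ").replace("_", " ").strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold

print(same_individual("Enrico Motta", "enrico-motta"))   # -> True
print(same_individual("Enrico Motta", "Victoria Uren"))  # -> False
```

Label similarity alone over-merges (distinct people share names) and under-merges (the same person appears under initials), which is why this remains a research problem for PowerAqua rather than a solved sub-task.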

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology and its explicit semantics to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates the QA process and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains, (2) dealing with unknown vocabulary, (3) solving ambiguity, (4) supported question types, (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or they were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name; the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", then instead of reporting a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
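The fallback just described can be sketched as follows. This is a hypothetical illustration, not AquaLog code: the politeness patterns, term list and "?relation" placeholder are our own assumptions, and relation resolution is left to the ontology path search.

```python
# Hypothetical sketch of the fallback: strip a politeness wrapper the
# parser does not recognize, keep the content terms the ontology knows
# about, and form a bare triple whose relation is left unspecified.
import re

POLITENESS = re.compile(
    r"^(could you( please)? tell me|please tell me|i would like to know)\s+",
    re.IGNORECASE)

def fallback_triple(question, known_terms):
    """Remove the unrecognized wrapper, then pick out terms that the
    ontology knows about and pair them in an underspecified triple."""
    stripped = POLITENESS.sub("", question.strip().rstrip("?"))
    found = [t for t in known_terms if t in stripped.lower()]
    if len(found) == 2:
        return (found[0], "?relation", found[1])
    return None

terms = ["capital", "italy"]
print(fallback_triple("Could you please tell me the capital of Italy?", terms))
# -> ('capital', '?relation', 'italy')
```

Unlike the early pattern-matching NLIDBs, the patterns here only discard the wrapper; the interpretation of the remaining terms is still delegated to the ontology, which keeps the mechanism domain independent.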

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A Symantec, NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD and MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end: the front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the process of mapping onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained automatically through a learning mechanism. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in a XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front end to SQL databases. Its semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
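This fallback on ontology structure can be illustrated with a small sketch. The toy ontology and function below are invented for the example and are not AquaLog's implementation; the point is only that, given a known class such as City, the candidate interpretations of an unknown relation word are the relations defined for that class or its superclasses.

```python
# Sketch: when a query term such as "neighborhood" has no lexical match,
# fall back on the structure of the ontology: collect every relation whose
# domain covers the known class (or one of its superclasses), and let those
# candidates drive disambiguation. Toy data; not AquaLog's real API.

ONTOLOGY_RELATIONS = {
    # relation name: (domain class, range class)
    "located-in": ("city", "country"),
    "has-population": ("city", "integer"),
    "has-district": ("city", "district"),
}

SUPERCLASSES = {"capital": ["city"], "city": []}

def candidate_relations(known_class: str) -> list[str]:
    """Relations defined for the class or any of its superclasses."""
    classes = {known_class, *SUPERCLASSES.get(known_class, [])}
    return [rel for rel, (domain, _) in ONTOLOGY_RELATIONS.items()
            if domain in classes]

# "what are some of the neighborhoods of Chicago?": "neighborhood" is
# unknown, but Chicago is a city, so the candidates are the relations
# defined for cities; "has-district" could then be offered to the user.
print(candidate_relations("city"))
```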

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
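The idea that differently phrased questions reduce to the same triple category can be sketched as follows. The regular-expression patterns are a deliberate simplification invented for this illustration; AquaLog's linguistic component is built on GATE rather than on ad hoc patterns.

```python
# Sketch: queries phrased differently can map to a triple of the same
# category. The patterns below are invented for illustration only.
import re
from typing import NamedTuple

class Triple(NamedTuple):
    generic_term: str   # ground element / expected answer type
    relation: str
    argument: str

def to_linguistic_triple(question: str) -> Triple:
    q = question.lower().rstrip("?")
    # "who <verb> in <X>"
    m = re.match(r"who (\w+) in (.+)", q)
    if m:
        return Triple("person/organization", m.group(1), m.group(2))
    # "which are the <class> <verb> in the <X>"
    m = re.match(r"which are the (\w+) (\w+) in the (.+)", q)
    if m:
        return Triple(m.group(1), m.group(2), m.group(3))
    raise ValueError("unsupported question pattern")

# Both questions reduce to a triple of the same category:
print(to_linguistic_triple("Who works in akt?"))
print(to_linguistic_triple("Which are the researchers involved in the akt project?"))
```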

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class of object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."
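The "triples in the vicinity" step that TAP performs can be pictured as a bounded breadth-first walk over the graph. The following sketch, with an invented toy graph, is our illustration of the idea rather than TAP's actual code:

```python
# Sketch: collect all triples within k hops of a selected denotation node,
# in the spirit of TAP's "triples in the vicinity" step. Toy graph; the
# distance bound and data are invented for the example.
from collections import deque

TRIPLES = [
    ("YoYoMa", "plays", "Cello"),
    ("YoYoMa", "born-in", "Paris"),
    ("Paris", "capital-of", "France"),
    ("France", "member-of", "EU"),
]

def vicinity(start: str, max_hops: int) -> set[tuple[str, str, str]]:
    """Breadth-first collection of triples whose subject or object is
    reachable from `start` in at most `max_hops` edges."""
    seen, collected = {start}, set()
    frontier = deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue
        for s, p, o in TRIPLES:
            if node in (s, o):
                collected.add((s, p, o))
                other = o if node == s else s
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, dist + 1))
    return collected

print(len(vicinity("YoYoMa", 1)))  # the two triples directly touching the node
```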

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); as a last resort, ambiguity is solved by asking the user.

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites; systems such as START [31], REXTOR [32] and AnswerBus [54] aim instead to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START, the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with who/where, the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the


category is found from the adjective following "how" ("how many" or "how much") for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
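The shortcut rule quoted above can be sketched over a toy dependency structure (the encoding of the parse tree as a child-to-parent map and the relevance set are invented for this illustration):

```python
# Sketch of the PiQASso shortcut rule quoted above: if two relevant nodes
# are connected only through an irrelevant one, add a direct link between
# them. Graph encoding and the relevance set are invented for the example.
DEPENDS_ON = {"1969": "in", "in": "moon", "moon": "walked"}
RELEVANT = {"1969", "moon", "walked"}

def add_shortcuts(edges: dict[str, str], relevant: set[str]) -> dict[str, str]:
    out = dict(edges)
    for child, parent in edges.items():
        grandparent = edges.get(parent)
        if (child in relevant and grandparent in relevant
                and parent not in relevant):
            out[child] = grandparent   # short-circuit the irrelevant node
    return out

# "1969" depends on "in", which depends on "moon": the irrelevant "in"
# is short-circuited, linking "1969" directly to "moon".
print(add_shortcuts(DEPENDS_ON, RELEVANT)["1969"])
```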

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario has many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain: they can have overlapping domains or may refer to


different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves:

"since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge




and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved; it groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (Dot.Kom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. Dot.Kom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area? | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉

What are the phd students working for buddy space? | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉

s and



Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉

What are the projects of Vanessa? | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web? | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉

Where is the Knowledge Media Institute? | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics? | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉

What is an ontology? | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project? | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉

Has Martin Dzbor any interest in ontologies? | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project? | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc? | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt? | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium? | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt? | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom? | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc? | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

Which KMi academics working in the akt project are sponsored by epsrc? | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia? | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter, who has interest in semantic web? | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

Which are the projects of academics that are related to the semantic web? | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

eferences

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data in answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



8.2. Question complexity

Question complexity also influences the performance of QA. The system described in Ref. [36], DIMAP, has on average 3.3 triples per question. However, in Ref. [36] triples are formed in a different way: AquaLog has a triple for each relationship between terms, even if the relation is not explicit, whereas DIMAP has a triple for each discourse entity (for each unbound variable), in which a semantic relation is not strictly required. At a later stage, DIMAP triples are consolidated so that contiguous entities are made into a single triple: questions containing the phrase "who was the author" were converted into "who wrote". That pre-processing is done by the AquaLog linguistic component, so that it can obtain the minimum required set of Query-Triples to represent a NL query at once, independently of how it was formulated.

On the other hand, AquaLog can only deal with questions which generate a maximum of two triples. Most of the questions we encountered in the evaluation study were not complex, but we have identified a few cases which require more than two triples. The triples are fixed for each query category, but we found examples in which the number of triples is not obvious at the first stage and may change to adapt to the semantics of the ontology (dynamic creation of triples by the RSS). For instance, when dealing with compound nouns, for a term like "French researchers" a new triple must be created when the RSS detects during the semantic analysis phase that it is in fact a compound noun, as there is an implicit relationship between "French" and "researchers". The compound noun problem is discussed further below.
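The paraphrase pre-processing mentioned above, where a formulation such as "who was the author" collapses to "who wrote" before triples are built, can be sketched as a small set of rewrite rules. The rules and function names below are illustrative assumptions, not AquaLog's actual implementation:

```python
import re

# Hypothetical normalization rules: each maps a verbose phrasing to the
# compact relation a triple-based pre-processor would keep, so that
# different formulations of a question yield the same minimal triple set.
REWRITES = [
    (re.compile(r"\bwho (?:was|is) the author of\b", re.I), "who wrote"),
    (re.compile(r"\bwho (?:was|is) the director of\b", re.I), "who directs"),
]

def normalize(question: str) -> str:
    """Collapse known paraphrases into a canonical relation phrase."""
    for pattern, replacement in REWRITES:
        question = pattern.sub(replacement, question)
    return question
```

With these toy rules, "Who is the author of Hamlet?" normalizes to "who wrote Hamlet?", which would then produce the same linguistic triple as the shorter formulation.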

Another example is the question "which researchers have publications in KMi related to social aspects?", where the linguistic triples obtained are 〈researchers, have publications, KMi〉 and 〈which is/are, related, social aspects〉. However, the relation "have publications" refers to "publication has author" in the ontology. Therefore we need to represent three relationships between the terms: 〈researcher, publications〉, 〈publications, KMi〉 and 〈publications, related, social aspects〉; that is, three triples instead of two are needed.

Along the same lines, we have observed cases where terms need to be bridged by multiple triples of unknown type. AquaLog cannot at the moment associate linguistic relations with non-atomic semantic relations. In other words, it cannot infer an answer if there is not a direct relation in the ontology between the two terms implied in the relationship, even if the answer is in the ontology. For example, imagine the question "who collaborates with IBM?", where there is no "collaboration" relation between a person and an organization, but there is a relation between a "person" and a "grant", and between a "grant" and an "organization". The answer is not a direct relation, and the graph in the ontology can only be found by expanding the query with the additional term "grant" (see also the "meal course" example in Section 7.3 using the wine ontology).
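The expansion just described, bridging "person" and "organization" through the intermediate term "grant", amounts to a path search over the ontology's relation graph. A minimal sketch over an invented toy graph (not the KMi ontology itself):

```python
from collections import deque

# Toy relation graph: each class maps to (relation, target-class) edges.
# The person -> grant -> organization chain mirrors the "who collaborates
# with IBM?" example; names are illustrative assumptions.
RELATIONS = {
    "person": [("has-grant", "grant")],
    "grant": [("funded-by", "organization")],
    "organization": [],
}

def bridge(src, dst):
    """Breadth-first search for a chain of relations linking two classes."""
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, path = queue.popleft()
        if node == dst:
            return path
        for rel, nxt in RELATIONS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no chain of relations found
```

Here `bridge("person", "organization")` recovers the two-step chain through "grant", which is exactly the extra triple the query would need.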

With the complexity of questions, as with their type, the ontology design determines the questions which may be successfully answered; for example, when one of the terms in the query is not represented as an instance, relation or concept in the ontology, but as the value of a relation (a string). Consider the question "who is the director in KMi?". In the ontology it may be represented as "Enrico Motta - has job title - KMi director", where "KMi director" is a string. AquaLog is not able to find it, as it is not possible to check in real time all the string values for each instance in the ontology. The information is only accessible if the question is reformulated as "what is the job title of Enrico?", in which case we would get "KMi-director" as an answer.

AquaLog has not yet solved the nominal compound problem identified in Ref. [3]. In English, nouns are often modified by other preceding nouns (e.g., "semantic web papers", "research department"). A mechanism must be implemented to identify whether a term is in fact a combination of terms which are not separated by a preposition. Similar difficulties arise in the case of adjective-noun compounds. For example, a "large company" may be a company with a large volume of sales or a company with many employees [3]. Some systems require the meaning of each possible noun-noun or adjective-noun compound to be declared during the configuration phase: the configurer is asked to define the meaning of each possible compound in terms of concepts of the underlying domain [3].

8.3. Linguistic ambiguities

The current version of AquaLog has restricted facilities for detecting and dealing with questions which are ambiguous.

The role of logical connectives is not fully exploited. These have been recognized as a potential source of ambiguity in question answering. Androutsopoulos et al. [3] present the example "list all applicants who live in California and Arizona". In this case the word "and" can be used to denote either disjunction or conjunction, introducing ambiguities which are difficult to resolve. (In fact, the possibility for "and" to denote a conjunction here should be ruled out, since an applicant cannot normally live in more than one state, but this requires domain knowledge.) In the next AquaLog version, either a heuristic rule must be implemented to select disjunction or conjunction, or an answer for both must be given by the system.

An additional linguistic feature not exploited by the RSS is the use of the person and number of the verb to resolve the attachment of subordinate clauses. For example, consider the difference between the sentences "which academic works with Peter who has interests in the semantic web?" and "which academics work with Peter who has interests in the semantic web?". The former example is truly ambiguous: either "Peter" or the "academic" could have an interest in the semantic web. However, the latter can be solved if we take into account that the verb "has" is in the third person singular; therefore the second part, "has interests in the semantic web", must be linked to "Peter" for the sentence to be syntactically correct. This approach is not implemented in AquaLog. The obvious disadvantage for AquaLog is that user interaction will be required in both examples; the advantage is that some robustness is obtained over syntactically incorrect sentences. Whereas methods such as the person and number of the verb assume that all sentences are well-formed, our experience with the sample of questions we gathered for the evaluation study was that this is not necessarily the case.
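The person/number heuristic discussed above could be sketched roughly as follows. The representation of candidate antecedents is an assumption (a real system would obtain them, with their grammatical number, from the parser), and the plural test is deliberately crude:

```python
# Sketch of the person/number heuristic for relative-clause attachment:
# a third person singular verb ("has") attaches the clause to the singular
# candidate; a plural verb ("have") attaches it to the plural one.
def attach_clause(candidates, clause_verb):
    """candidates: list of (noun, is_plural) pairs in surface order.
    Returns the chosen antecedent, or None when both candidates agree
    with the verb's number (the truly ambiguous case)."""
    verb_is_plural = not clause_verb.endswith("s")  # crude English heuristic
    matches = [noun for noun, plural in candidates if plural == verb_is_plural]
    return matches[0] if len(matches) == 1 else None
```

For "which academics work with Peter who has ..." the singular "has" uniquely picks "Peter"; for "which academic works with Peter who has ..." both candidates are singular and the sketch correctly reports the ambiguity.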


8.4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore the first problem identified is how to make these criteria as ontology-independent as possible, and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to solve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let us imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by saying "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
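One way to picture such community-dependent jargon is a mapping keyed by both community and phrase, so the same phrase resolves differently per community. The store and names below are illustrative, not the Learning Mechanism's actual design:

```python
# Community-scoped jargon store: (community, phrase) -> interpretation.
# Both the communities and the interpretations below are invented examples
# mirroring the "cool projects" discussion in the text.
jargon = {}

def learn(community, phrase, interpretation):
    """Record the interpretation a user from this community selected."""
    jargon[(community, phrase)] = interpretation

def resolve(community, phrase):
    """Return the learned interpretation, or None (fall back to the user)."""
    return jargon.get((community, phrase))

learn("phd-students", "cool projects",
      "projects in research areas with many publications")
learn("professors", "cool projects", "funding opportunities")
```

The key design point is that the phrase alone is not the lookup key: without the community component, the professor's later question would silently inherit the PhD student's mapping.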

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, which may help the user reformulate the question about "coolness" in a way the system can support.

8.5. New research lines

While the current version of AquaLog is portable from one domain to another (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.

This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple and autonomous information sources. Hence, to perform QA we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and the RSS algorithms to handle ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are: resource selection, heterogeneous vocabularies, and combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. In the SW we envision a scenario where the user has to interact with several large ontologies; therefore indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation for the user queries and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
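That last requirement, deciding whether two instance labels from different sources denote the same individual, is often approached with string metrics of the kind AquaLog already applies to terms. A minimal sketch using a standard-library similarity ratio (a real system would combine several metrics with contextual evidence, and the threshold here is an arbitrary assumption):

```python
from difflib import SequenceMatcher

def same_individual(label_a, label_b, threshold=0.85):
    """Guess whether two instance labels from different ontologies refer
    to the same individual, via a normalized string-similarity ratio."""
    # Normalize common labelling conventions before comparing.
    a = label_a.lower().replace("-", " ").replace("_", " ").strip()
    b = label_b.lower().replace("-", " ").replace("_", " ").strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Under this sketch, "enrico-motta" and "Enrico Motta" normalize to the same string and match, while unrelated labels fall well below the threshold.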

9. Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query the semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to what extent AquaLog's novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
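Such a fallback, stripping the unrecognized question down to content terms from which a tentative triple can be formed, might look like the sketch below. The stop-word list and tokenization are deliberately simplistic and purely illustrative:

```python
import re

# Tiny illustrative stop-word list; a real fallback would use a fuller
# list plus the ontology's vocabulary to decide which terms matter.
STOP = {"could", "you", "please", "tell", "me", "the", "of", "what", "is",
        "a", "an", "print"}

def fallback_terms(question):
    """Extract candidate content terms when no question structure is
    recognized; the ontology would then be searched for a path between them."""
    tokens = re.findall(r"[A-Za-z]+", question.lower())
    return [t for t in tokens if t not in STOP]
```

Applied to "Could you please tell me the capital of Italy?", this leaves just "capital" and "italy", the pair between which an ontology path would then be sought.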

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A Symantec, NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services) from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.
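The core of this intermediate-representation idea can be caricatured as a single mapping step from a linguistic triple to an ontology-compliant one. The lexicon below is invented purely for illustration; in AquaLog this mapping is performed by the similarity services, not a static table:

```python
# Hypothetical mapping from relation words found in linguistic triples to
# ontology relation names; a real system would use string metrics, WordNet
# and the ontology structure rather than a fixed dictionary.
LEXICON = {"homepage": "has-web-address", "works": "has-project-member"}

def to_ontology_triple(linguistic_triple):
    """Map a (subject, relation, object) linguistic triple onto the
    ontology vocabulary, leaving unknown relations untouched."""
    subj, rel, obj = linguistic_triple
    return (subj, LEXICON.get(rel, rel), obj)
```

For example, the linguistic triple ("which is", "homepage", "peter") becomes ("which is", "has-web-address", "peter"), matching the Appendix A rows above.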

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) one for mapping NL expressions into formal representations; (2) one for transforming these representations into statements of a database, making a separation between the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account: if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "identifying the semantic type of the entity sought by the question" and (2) "determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, as well as syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can automatically find: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA). TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) low-level IE is often a necessary component in handling many types of questions; (iii) a robust natural language shallow parser provides a structural basis for handling questions; (iv) high-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated, individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
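The grouping idea can be illustrated with a toy normalizer that strips the leading wh-phrase so that different surface forms reduce to the same key. The regex and names are invented for illustration; AquaLog derives its categories from the triples produced by its linguistic component, not from a rule like this.

```python
import re

# Leading wh-phrases that do not change the underlying triple category.
WH_PREFIX = re.compile(r"^(tell me|give me|list|who|what|which)\b\s*(is|are)?\s*")

def triple_category_key(question: str) -> str:
    """Strip leading wh-phrases so equivalent phrasings share one key."""
    q = question.lower().rstrip("?").strip()
    while True:
        stripped = WH_PREFIX.sub("", q, count=1)
        if stripped == q:
            return q
        q = stripped

# Two phrasings of the same question land in the same bucket:
print(triple_category_key("Who works in akt?"))         # works in akt
print(triple_category_key("Tell me who works in akt"))  # works in akt
```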

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. FALCON also gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web, it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that, at a minimum, certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched for by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus), the user profile, the search context, or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (the context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.
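The TAP-style strategy quoted above can be sketched in a few lines: pick a start node for the search term, then gather triples within a fixed graph distance, on the intuition that proximity reflects relevance. The toy graph and names below are invented for illustration; TAP itself works over a large harvested SW graph.

```python
from collections import deque

# A toy RDF-ish graph; triples and names are invented for illustration.
TRIPLES = [
    ("guernsey", "spoken-language", "french"),
    ("guernsey", "part-of", "channel-islands"),
    ("channel-islands", "located-in", "english-channel"),
    ("french", "language-family", "romance"),
]

def neighborhood(start, radius=1):
    """Collect triples whose subject lies within `radius` hops of `start`."""
    seen = {start}
    frontier = deque([(start, 0)])
    collected = []
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the chosen radius
        for s, p, o in TRIPLES:
            if s == node:
                collected.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append((o, dist + 1))
    return collected

print(neighborhood("guernsey", radius=1))
```

Widening the radius trades precision for recall, which is exactly the sensitivity to representational choices that the quotation warns about.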

9.3. Open-domain QA systems using triple representations

Other NL search engines exist, such as AskJeeves [4] and EasyAsk [20], which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites. There are also systems, such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT infolab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages", between the object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database, in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was performed for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may itself be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triple: AquaLog has a triple for each relationship between terms, even if the relationship is not explicit, while DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple); the category of the triple is the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from the wh-word ("who", "when", "where"). In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and they see the content of a question largely in terms of concepts and relations from the ontology. The approach and scenario therefore have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and its mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain: they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by the ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...], allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer; question types are associated with concepts in the KB, and for a given concept the applicable question types are those attached either to the concept itself or to any of its superclasses. A difference between this system and the others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system in which the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions more complex than those handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then it is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes that the knowledge the system uses to answer the question is in a structured knowledge base, in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format; open-domain systems based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.); depending also on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved, it groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution, in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared, dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are therefore evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)
Who are the researchers in the semantic web research area | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space | 〈phd students, working, buddy space〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉
Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉
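The table rows above all hinge on mapping the terms of a linguistic triple onto the ontology vocabulary. A hedged sketch of the string-level part of that mapping follows: each query term is compared against the vocabulary and the closest label is proposed. `difflib` stands in for the string metrics that AquaLog actually combines with WordNet and ontology context; the vocabulary list is a small invented excerpt.

```python
from difflib import SequenceMatcher

# Toy excerpt of an ontology vocabulary (labels invented for illustration).
VOCABULARY = ["researcher", "has-research-interest", "semantic-web-area",
              "phd-student", "has-project-member"]

def best_match(term: str, vocabulary=VOCABULARY, threshold=0.5):
    """Return the vocabulary entry most similar to `term`, or None if
    nothing clears the similarity threshold."""
    scored = [(SequenceMatcher(None, term.lower(), v).ratio(), v)
              for v in vocabulary]
    score, match = max(scored)
    return match if score >= threshold else None

linguistic = ("researchers", "interest", "semantic web research area")
print([best_match(t) for t in linguistic])
```

On this toy data the linguistic triple maps to (researcher, has-research-interest, semantic-web-area), mirroring the first table row; in AquaLog, candidates that fail the threshold trigger the other similarity services or a user dialogue.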

Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple
Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple
Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple
Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple
Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple
Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple
Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple
Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, akt〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause) | Linguistic triple | Ontology triple
What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple
Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple
Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple
Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple
What researchers who work in compendium have interest in hypermedia | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple
What is the homepage of Peter who has interest in semantic web | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL – an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases – an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question data, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Pengang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd edition, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie – towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon – boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.

32] B Katz J Lin REXTOR a system for generating relations from nat-ural language in Proceedings of the ACL-2000 Workshop of NaturalLanguage Processing and Information Retrieval (NLP(IR)) 2000

33] B Katz J Lin Selectively using relations to improve precision in ques-tion answering in Proceedings of the EACL-2003 Workshop on NaturalLanguage Processing for Question Answering 2003

34] D Klein CD Manning Fast Exact Inference with a Factored Model forNatural Language Parsing Adv Neural Inform Process Syst 15 (2002)

35] Y Lei M Sabou V Lopez J Zhu V Uren E Motta An infrastructurefor acquiring high quality semantic metadata in Proceedings of the 3rdEuropean Semantic Web Conference Montenegro 2006

36] KC Litkowski Syntactic clues and lexical resources in question-answering in EM Voorhees DK Harman (Eds) InformationTechnology The Ninth TextREtrieval Conferenence (TREC-9) NIST Spe-cial Publication 500-249 National Institute of Standards and TechnologyGaithersburg MD 2001 pp 157ndash166

37] V Lopez E Motta V Uren PowerAqua fishing the semantic web inProceedings of the 3rd European Semantic Web Conference MonteNegro2006

38] V Lopez E Motta Ontology driven question answering in AquaLog inProceedings of the 9th International Conference on Applications of NaturalLanguage to Information Systems Manchester England 2004

39] P Martin DE Appelt BJ Grosz FCN Pereira TEAM an experimentaltransportable naturalndashlanguage interface IEEE Database Eng Bull 8 (3)(1985) 10ndash22

40] D Mc Guinness F van Harmelen OWL Web Ontology LanguageOverview W3C Recommendation 10 2004 httpwwww3orgTRowl-features

41] D Mc Guinness Question answering on the semantic web IEEE IntellSyst 19 (1) (2004)

s and

[[

[

[

[[

[

[

[

[[

V Lopez et al Web Semantics Science Service

42] TM Mitchell Machine Learning McGraw-Hill New York 199743] D Moldovan S Harabagiu M Pasca R Mihalcea R Goodrum R Girju

V Rus LASSO a tool for surfing the answer net in Proceedings of theText Retrieval Conference (TREC-8) November 1999

45] M Pasca S Harabagiu The informative role of Wordnet in open-domainquestion answering in 2nd Meeting of the North American Chapter of theAssociation for Computational Linguistics (NAACL) 2001

46] AM Popescu O Etzioni HA Kautz Towards a theory of natural lan-guage interfaces to databases in Proceedings of the 2003 InternationalConference on Intelligent User Interfaces Miami FL USA January

12ndash15 2003 pp 149ndash157

47] RDF httpwwww3orgRDF48] K Srihari W Li X Li Information extraction supported question-

answering in T Strzalkowski S Harabagiu (Eds) Advances inOpen-Domain Question Answering Kluwer Academic Publishers 2004

[

Agents on the World Wide Web 5 (2007) 72ndash105 105

49] V Tablan D Maynard K Bontcheva GATEmdashA Concise User GuideUniversity of Sheffield UK httpgateacuk

50] W3C OWL Web Ontology Language Guide httpwwww3orgTR2003CR-owl-guide-0030818

51] R Waldinger DE Appelt et al Deductive question answering frommultiple resources in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

52] WebOnto project httpplainmooropenacukwebonto53] M Wu X Zheng M Duan T Liu T Strzalkowski Question answering

by pattern matching web-proofing semantic form proofing NIST Special

Publication in The 12th Text Retrieval Conference (TREC) 2003 pp255ndash500

54] Z Zheng The answer bus question answering system in Proceedings ofthe Human Language Technology Conference (HLT2002) San Diego CAMarch 24ndash27 2002



4. Reasoning mechanisms and services

A future AquaLog version should include Ranking Services that will solve questions like "what are the most successful projects?" or "what are the five key projects in KMi?". To answer these questions the system must be able to carry out reasoning based on the information present in the ontology. These services must be able to manipulate both general and ontology-dependent knowledge; for instance, the criteria which make a project successful could be the number of citations or the amount of funding. Therefore, the first problem identified is how we can make these criteria as ontology independent as possible and available to any application that may need them. An example of deductive components with an axiomatic application domain theory to infer an answer can be found in Ref. [51].

Alternatively, the learning mechanism can be extended to learn the typical services mentioned above that people require when posing queries to a semantic web. Those services are not supported by domain relations; they are essentially meta-level services. These meta-relations can be defined by providing a mechanism that allows the user to augment the system by manual association of meta-relations to domain relations.

For example, if the question "what are the cool projects for a PhD student?" is asked, the system would not be able to resolve the linguistic ambiguity of the words "cool project". The first time, some help from the user is needed, who may select "all projects which belong to a research area in which there are many publications". A mapping is therefore created between "cool projects", "research area" and "publications", so that the next time the Learning Mechanism is called it will be able to recognize this specific user jargon.

However, the notion of communities is fundamental in order to deliver a feasible substitution service. In fact, another person could use the same jargon but with another meaning. Let's imagine a professor who asks the system a similar question: "what are the cool projects for a professor?". What he means by "cool projects" is "funding opportunity". Therefore, the two mappings entail two different contexts, depending on the users' community.
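Such a community-scoped substitution service can be sketched as a small jargon lexicon keyed by community; the class, method names and stored tuples below are hypothetical, and AquaLog's actual learning mechanism records a richer triple context:

```python
# Sketch of a community-scoped jargon lexicon (hypothetical names; the
# real AquaLog learning mechanism stores richer triple contexts).

class JargonLexicon:
    def __init__(self):
        # (community, jargon phrase) -> ontology-level interpretation
        self._mappings = {}

    def learn(self, community, phrase, interpretation):
        """Record a user-supplied interpretation of a jargon phrase."""
        self._mappings[(community, phrase.lower())] = interpretation

    def lookup(self, community, phrase):
        """Return a learned interpretation, or None if the phrase is unknown."""
        return self._mappings.get((community, phrase.lower()))

lexicon = JargonLexicon()
# A PhD student explains "cool projects" in terms of ontology relations.
lexicon.learn("phd-students", "cool projects",
              ("project", "belongs-to research area with many", "publications"))
# A professor uses the same jargon with a different meaning.
lexicon.learn("professors", "cool projects",
              ("project", "offers", "funding opportunity"))

assert lexicon.lookup("phd-students", "Cool Projects")[2] == "publications"
assert lexicon.lookup("professors", "cool projects")[2] == "funding opportunity"
```

The community key is what keeps the two meanings of "cool projects" from colliding: the same phrase resolves to different ontology relations for different user groups.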

We also need fallback options to provide some answer when more intelligent mechanisms fail. For example, when a question is not classified by the Linguistic Component, the system could look for an ontology path between the terms recognized in the sentence. This might tell the user about projects that are currently associated with professors, and may help the user reformulate the question about "coolness" in a way the system can support.
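The fallback idea can be sketched as a breadth-first search over the ontology viewed as a graph of triples; the toy KB and relation names below are invented for illustration:

```python
from collections import deque

# Sketch of the fallback: when a question cannot be classified, search the
# ontology (seen as a graph of relations) for a path linking the recognized
# terms. The tiny KB below is invented for illustration.
KB = [
    ("project-x", "has-member", "phd-student-1"),
    ("project-x", "led-by", "professor-1"),
    ("professor-1", "works-in", "kmi"),
]

def find_path(kb, start, goal):
    """Breadth-first search over an undirected view of the triple graph."""
    graph = {}
    for s, rel, o in kb:
        graph.setdefault(s, []).append((rel, o))
        graph.setdefault(o, []).append((rel, s))
    queue, seen = deque([(start, [start])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [rel, nxt]))
    return None  # no connection found in the ontology

assert find_path(KB, "phd-student-1", "professor-1") == [
    "phd-student-1", "has-member", "project-x", "led-by", "professor-1"]
```

The returned path interleaves entities and relations, which is enough to show the user how the recognized terms are connected in the ontology.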

5. New research lines

While the current version of AquaLog is portable from one domain to the other (the system being agnostic to the domain of the underlying ontology), the scope of the system is limited by the amount of knowledge encoded in the ontology. One way to overcome this limitation is to extend AquaLog in the direction of open question answering, i.e., allowing the system to combine knowledge from the wide range of ontologies autonomously created and maintained on the semantic web.

This new re-implementation of AquaLog, currently under development, is called PowerAqua [37]. In a heterogeneous open-domain scenario it is not possible to determine in advance which ontologies will be relevant to a particular query. Moreover, it is often the case that queries can only be solved by composing information derived from multiple, autonomous information sources. Hence, to perform QA, we need a system which is able to locate and aggregate information in real time, without making any assumptions about the ontological structure of the relevant information.

AquaLog bridges between the terminology used by the user and the concepts used in the underlying ontology. PowerAqua will function in the same way as AquaLog does, with the essential difference that the terms of the question will need to be dynamically mapped to several ontologies. AquaLog's linguistic component and the RSS algorithms for handling ambiguity within one ontology, in particular regarding modifier attachment, will be fully reused. Moreover, in both AquaLog and PowerAqua there is no pre-determined assumption about the ontological structure required for complex reasoning; therefore, the AquaLog evaluation and analysis presented here highlighted many of the issues to take into account when building an ontology-based QA system.

However, building a multi-ontology QA system, in which portability is no longer enough and "openness" is required, brings up new challenges in comparison with AquaLog that have to be solved by PowerAqua in order to interpret a query by means of different ontologies. These issues are resource selection, heterogeneous vocabularies, and the combination of knowledge from different sources. Firstly, we must address the automatic selection of the potentially relevant KBs/ontologies to answer a query. On the SW we envision a scenario where the user has to interact with several large ontologies; therefore, indexing may be needed to perform searches at run time. Secondly, user terminology may have to be translated into semantically sound triples containing terminology distributed across ontologies; in any strategy that focuses on information content, the most critical problem is that of different vocabularies being used to describe similar information across domains. Finally, queries may have to be answered not by a single knowledge source but by consulting multiple sources, and therefore by combining the relevant information from different repositories to generate a complete ontological translation of the user's query and identifying common objects. Among other things, this requires the ability to recognize whether two instances coming from different sources may actually refer to the same individual.
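The last requirement, recognizing whether two instances from different sources refer to the same individual, might be approximated with a string-metric test over their labels; the labels and threshold below are illustrative, not PowerAqua's actual method:

```python
import difflib

# Sketch of an instance co-reference heuristic: normalize the two labels
# and compare them with a string similarity ratio (invented labels and
# threshold; a real system would combine several string metrics and
# ontological evidence).

def normalize(label):
    return " ".join(label.lower().replace(".", " ").split())

def same_individual(label_a, label_b, threshold=0.85):
    """Heuristic co-reference test based on a string similarity ratio."""
    a, b = normalize(label_a), normalize(label_b)
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

assert same_individual("Enrico Motta", "enrico  motta")
assert not same_individual("Enrico Motta", "Vanessa Lopez")
```

A label-only test like this is deliberately naive: in practice the surrounding triples (affiliations, co-authors, types) would be needed to confirm or reject a match.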

Related work

QA systems attempt to allow users to ask a query in NL and receive a concise answer, possibly with validating context [24]. AquaLog is an ontology-based NL QA system to query semantic mark-up. The novelty of the system with respect to traditional QA systems is that it relies on the knowledge encoded in the underlying ontology, and its explicit semantics, to disambiguate the meaning of the questions and provide answers; as pointed out in the first AquaLog conference publication [38], it provides a new 'twist' on the old issues of NLIDB. To see to


what extent AquaLog's novel approach facilitates QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries of databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 A detailed overview of the state of the art for these systems can be found in Ref. [3].

Most of the early NLIDB systems were built with a particular database in mind; thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time. AquaLog's portability costs, instead, are almost zero. Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name, so the same rule will handle "what is the capital of Italy?", "print the capital of Italy" and "Could you please tell me the capital of Italy?". The shallowness of the pattern matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", then instead of reporting a failure it could take the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
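A rough sketch of such a shallow pattern-matching rule, assuming a toy table of capitals (the regular expression and data are invented, not Androutsopoulos's actual rule):

```python
import re

# Sketch of the shallow pattern-matching rule described in [3]: a request
# mentioning "capital ... of <country>" prints that country's capital.
# CAPITALS is an invented toy table.
CAPITALS = {"italy": "Rome", "france": "Paris"}

RULE = re.compile(r"capital\s+(?:\w+\s+)*?of\s+(\w+)", re.IGNORECASE)

def answer(request):
    match = RULE.search(request)
    if match:
        return CAPITALS.get(match.group(1).lower())
    return None  # rule does not fire; the real systems would simply fail

# The same rule handles several phrasings of the question.
assert answer("what is the capital of Italy?") == "Rome"
assert answer("Could you please tell me the capital of France?") == "Paris"
```

The brittleness discussed above is visible here: anything outside the single surface pattern, however reasonable, yields no answer at all.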

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.


between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting such a system to a new application only involves altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process: from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation between the linguistic process and the process of mapping onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" being the same as "person", are specified in an XML file. In TEAM the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account, so if the linguistic phase is not able to disambiguate, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well-defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value


pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, as in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. AquaLog would not necessarily know the term "neighborhood" either, but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.

2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; it is therefore not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; and (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords of the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer); LASSO therefore implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Point (iv) refers to the extraction of multiple relationships between entities and of general event information like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships needed to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE recognition is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date and region, or subcategories like lake and river. In AquaLog the triple contains information not only about the expected answer or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed with "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25] independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results in the TREC9 competition [16] were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the


question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. FALCON also gives a cached answer if a similar question has already been asked before: a similarity measure is calculated to see whether the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22] search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match, or multiple matches). A term is searched for by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases TAP chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point, and triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify, for each class of object of interest, the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog the vocabulary problem is solved (ontologyriples mapping) not only by looking at the labels but also byooking at the lexically related words and the ontology (con-ext of the triple terms and relationships) and in the last resortmbiguity is solved by asking the user

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast to systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START the property is "languages" between the Object "Guernsey" and the Value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "languages" and a location "Guernsey".
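The contrast can be made concrete with a toy rendering of the same question in both shapes. The values come from the example above; the dictionary/tuple layout is our own illustration, not either system's actual API.

```python
# The same question as a START-style object-property-value triple and
# as an AquaLog-style term-relation-term query triple (illustrative).

question = "what languages are spoken in Guernsey"

start_triple = {"object": "Guernsey", "property": "languages", "value": "French"}
aqualog_triple = ("languages", "are spoken", "Guernsey")  # (term, relation, term)

# In AquaLog the middle slot holds the linguistic relation itself, which
# is later matched against relation names in the target ontology.
term, relation, value = aqualog_triple
```

The point of the AquaLog shape is that "are spoken" survives as a relation to be matched against the ontology, rather than being collapsed into a fixed property name.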

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence" and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was applied to each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triple: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though this is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs and any adjective or modifier noun) are determined for each question type. The system categorized questions into six types: time, location, who, what, size and number questions. In AquaLog the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories. For example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how" ("how many" or "how much" for quantity, "how long" or "how old" for time, etc.). The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6] the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and the mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain: they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the applicable question types are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
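The escalation described here (domain knowledge first, the user only as a last resort) can be sketched as follows. All names and data are invented; AquaLog's actual similarity services are far richer than a single filter function.

```python
# Sketch of disambiguation as a fallback chain: keep the readings that
# survive a domain-knowledge filter, and only consult the user when
# more than one (or none) survives. Illustrative data only.

def resolve(term, candidates, domain_filter, ask_user):
    """Return one reading of `term` chosen from `candidates`."""
    surviving = [c for c in candidates if domain_filter(c)]
    if len(surviving) == 1:
        return surviving[0]            # ambiguity solved by domain knowledge
    return ask_user(term, surviving or candidates)

readings = ["akt-project", "akt-ontology"]
choice = resolve("akt", readings,
                 domain_filter=lambda c: c.endswith("-project"),
                 ask_user=lambda t, cs: cs[0])
print(choice)  # resolved without consulting the user
```

The key design point mirrored here is that the user-interaction branch is reached only when the knowledge-based branch genuinely fails to produce a unique reading.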

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge


Name the planet stories that are related to akt | 〈planet stories, related to, akt〉 | 〈kmi-planet-news-item, mentions-project, akt〉


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.), but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups together different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".
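The grouping of syntactically different questions under one triple can be illustrated with a hand-made lookup standing in for the similarity services. The vocabulary below is invented for the sketch and is not AquaLog's actual mapping.

```python
# Two different phrasings yield different linguistic triples, but once
# terms and relations are mapped onto ontology vocabulary (here via a
# toy lookup table), both collapse to the same ontology triple.

linguistic = [
    ("person/organization", "works", "akt"),
    ("researchers", "involved", "akt project"),
]

ONTOLOGY_MAP = {  # invented stand-in for the relation similarity service
    "person/organization": "researcher", "researchers": "researcher",
    "works": "works-for-project", "involved": "works-for-project",
    "akt": "akt", "akt project": "akt",
}

ontology_triples = {tuple(ONTOLOGY_MAP[x] for x in t) for t in linguistic}
print(ontology_triples)  # both phrasings collapse to one ontology triple
```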

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (in mapping vocabulary or in modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term | Linguistic triple | Ontology triple (a)

Who are the researchers in the semantic web research area | 〈person/organization, researchers, semantic web research area〉 | 〈researcher, has-research-interest, semantic-web-area〉
What are the phd students working for buddy space | 〈phd students, working, buddyspace〉 | 〈phd-student, has-project-member/has-project-leader, buddyspace〉




Appendix A (Continued)

Wh-unknown term | Linguistic triple | Ontology triple

Show me the job title of Peter Scott | 〈which is, job title, peter scott〉 | 〈which is, has-job-title, peter-scott〉
What are the projects of Vanessa | 〈which is, projects, vanessa〉 | 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation | Linguistic triple | Ontology triple

Are there any projects about semantic web | 〈projects, semantic web〉 | 〈project, addresses-generic-area-of-interest, semantic-web-area〉
Where is the Knowledge Media Institute | 〈location, knowledge media institute〉 | No relations found in the ontology

Description | Linguistic triple | Ontology triple

Who are the academics | 〈who/what is, academics〉 | 〈who/what is, academic-staff-member〉
What is an ontology | 〈who/what is, ontology〉 | 〈who/what is, ontologies〉

Affirmative/negative | Linguistic triple | Ontology triple

Is Liliana a phd student in the dip project | 〈liliana, phd student, dip project〉 | 〈liliana-cabral (phd-student), has-project-member, dip〉
Has Martin Dzbor any interest in ontologies | 〈martin dzbor, has any interest, ontologies〉 | 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms | Linguistic triple | Ontology triple

Does anybody work in semantic web on the dotkom project | 〈person/organization, works, semantic web, dotkom project〉 | 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
Tell me all the planet news written by researchers in akt | 〈planet news, written, researchers, akt〉 | 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause) | Linguistic triple | Ontology triple

Which projects on information extraction are sponsored by epsrc | 〈projects, sponsored, information extraction, epsrc〉 | 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation) | Linguistic triple | Ontology triple

Is there any publication about magpie in akt | 〈publication, magpie, akt〉 | 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉


Wh-unknown term (with clause) | Linguistic triple | Ontology triple

What are the contact details from the KMi academics in compendium | 〈which is, contact details, kmi academics, compendium〉 | 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and) | Linguistic triple | Ontology triple

Does anyone has interest in ontologies and is a member of akt | 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉 | 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or) | Linguistic triple | Ontology triple

Which academics work in akt or in dotcom | 〈academics, work, akt〉 〈academics, work, dotcom〉 | 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned | Linguistic triple | Ontology triple

Which KMi academics work in the akt project sponsored by epsrc | 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
Which KMi academics working in the akt project are sponsored by epsrc | 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉 | 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
Give me the publications from phd students working in hypermedia | 〈which is, publications, phd students〉 〈which is, working, hypermedia〉 | 〈publication, mentions-person/has-author, phd student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause | Linguistic triple | Ontology triple

What researchers who work in compendium have interest in hypermedia | 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉 | 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉



Appendix A (Continued)

Wh-patterns | Linguistic triple | Ontology triple

What is the homepage of Peter who has interest in semantic web | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
Which are the projects of academics that are related to the semantic web | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
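The appendix rows show linguistic relations being mapped onto ontology relation names (e.g. "job title" onto has-job-title). A rough sketch of one such matching step, using the standard-library difflib as a stand-in for the string-metric algorithms the paper mentions; the relation inventory below is illustrative.

```python
# Sketch of mapping a linguistic-triple relation onto an ontology
# relation name by string similarity. difflib.SequenceMatcher stands in
# for the string metrics used by the real system; names are invented.

from difflib import SequenceMatcher

ONTOLOGY_RELATIONS = ["has-research-interest", "has-project-member",
                      "has-web-address", "has-job-title"]

def best_relation(linguistic_relation):
    """Ontology relation whose name is most similar to the input string."""
    return max(ONTOLOGY_RELATIONS,
               key=lambda r: SequenceMatcher(None, linguistic_relation, r).ratio())

print(best_relation("job title"))
print(best_relation("interest"))
```

In AquaLog itself this string matching is only one signal, combined with WordNet synonyms and the ontological context of the other triple terms.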

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC), NIST Special Publication, 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

  • AquaLog An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action illustrative example
    • AquaLog architecture
      • AquaLog triple approach
        • Linguistic component from questions to query triples
          • Classification of questions
            • Linguistic procedure for basic queries
            • Linguistic procedure for basic queries with clauses (three-terms queries)
            • Linguistic procedure for combination of queries
              • The use of JAPE grammars in AquaLog illustrative example
                • Relation similarity service from triples to answers
                  • RSS procedure for basic queries
                  • RSS procedure for basic queries with clauses (three-term queries)
                  • RSS procedure for combination of queries
                  • Class similarity service
                  • Answer engine
• Learning mechanism for relations
  • Architecture
  • Context definition
  • Context generalization
  • User profiles and communities
• Evaluation
  • Correctness of an answer
  • Query answering ability
  • Conclusions and discussion
    • Results
    • Analysis of AquaLog failures
    • Reformulation of questions
    • Performance
• Portability across ontologies
  • AquaLog assumptions over the ontology
  • Conclusions and discussion
    • Results
    • Analysis of AquaLog failures
    • Similarity failures and reformulation of questions
• Discussion, limitations and future lines
  • Question types
  • Question complexity
  • Linguistic ambiguities
  • Reasoning mechanisms and services
  • New research lines
• Related work
  • Closed-domain natural language interfaces
  • Open-domain QA systems
  • Open-domain QA systems using triple representations
  • Ontologies in question answering
  • Conclusions of existing approaches
• Conclusions
• Acknowledgements
• Examples of NL queries and equivalent triples
• References
To assess to what extent AquaLog's novel approach facilitates the QA processing and presents advantages with respect to NL interfaces to databases and open-domain QA, we focus on the following: (1) portability across domains; (2) dealing with unknown vocabulary; (3) solving ambiguity; (4) supported question types; (5) supported question complexity.

9.1. Closed-domain natural language interfaces

This scenario is of course very similar to asking natural language queries to databases (NLIDB), which has long been an area of research in the artificial intelligence and database communities, even if in the past decade it has somewhat gone out of fashion [3]. However, it is our view that the SW provides a new and potentially important context in which the results from this area can be applied. The use of natural language to access relational databases can be traced back to the late sixties and early seventies [3].14 In Ref. [3] a detailed overview of the state of the art for these systems can be found.

Most of the early NLIDB systems were built with a particular database in mind, thus they could not be easily modified to be used with different databases, or were difficult to port to different application domains (different grammars, hard-wired knowledge or mapping rules had to be developed). Configuration phases are tedious and require a long time; AquaLog portability costs, instead, are almost zero.

Some of the early NLIDBs relied on pattern-matching techniques. In the example described by Androutsopoulos in Ref. [3], a rule says that if a user's request contains the word "capital" followed by a country name, the system should print the capital which corresponds to the country name; the same rule will thus handle "what is the capital of Italy?", "print the capital of Italy", "Could you please tell me the capital of Italy?". The shallowness of the pattern-matching would often lead to bad failures. However, a domain-independent analog of this approach may be used in the next AquaLog version as a fallback mechanism whenever AquaLog is not able to understand a query. For example, if it does not recognize the structure "Could you please tell me", instead of giving a failure it could get the terms "capital" and "Italy", create a triple with them and, using the semantics offered by the ontology, look for a path between them.
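The contrast can be illustrated with a small sketch (the facts and relation names here are invented): the pattern rule only answers the phrasings it anticipates, while the fallback keeps the terms the KB recognizes and searches for a relation connecting them.

```python
import re

# Toy KB of (subject, relation, object) triples, standing in for the ontology.
KB = [
    ("italy", "has-capital", "rome"),
    ("france", "has-capital", "paris"),
]

# Shallow NLIDB-style rule: "capital ... of <country>".
CAPITAL_RULE = re.compile(r"capital .*?of (\w+)", re.IGNORECASE)

def pattern_rule(question):
    """Answer only if the anticipated surface pattern is present."""
    m = CAPITAL_RULE.search(question)
    if not m:
        return None
    country = m.group(1).lower()
    for s, r, o in KB:
        if s == country and r == "has-capital":
            return o
    return None

def fallback(question):
    """Fallback sketched above: keep recognized terms, pair them,
    and look for a relation in the KB that connects them."""
    words = {w for w in re.findall(r"\w+", question.lower()) if len(w) > 2}
    for s, r, o in KB:
        # one recognized term names the subject, another hints at the relation
        if s in words and any(w in r for w in words):
            return o
    return None
```

Unlike the pattern rule, the fallback is indifferent to unrecognized framing such as "Could you please tell me".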

Other approaches are based on statistical or semantic similarity. For example, FAQ Finder [9] is a natural language QA system that uses files of FAQs as its knowledge base; it also uses WordNet to improve its ability to match questions to answers, using two metrics: statistical similarity and semantic similarity. However, the statistical approach is unlikely to be useful to us, because it is generally accepted that only long documents with large quantities of data have enough words for statistical comparisons to be considered meaningful [9], which is not the case when using ontologies and KBs instead of text. Semantic similarity scores rely on finding connections through WordNet between the user's question and the answer. The main problem here is the inability to cope with words that apparently are not found in the KB.

14 Examples of such systems are LUNAR, RENDEZVOUS, LADDER, CHAT-80, TEAM, ASK, JANUS, INTELLECT, BBN's PARLANCE, IBM's LANGUAGEACCESS, Q&A (Symantec), NATURAL LANGUAGE, DATATALKER, LOQUI, ENGLISH WIZARD, MASQUE.
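The flavour of such a path-based semantic similarity score can be shown over a hand-made is-a fragment (the hierarchy and the scoring below are illustrative, not WordNet's actual implementation):

```python
# Hand-made fragment of a WordNet-like is-a hierarchy (illustrative only).
HYPERNYM = {
    "dog": "canine", "canine": "carnivore", "carnivore": "mammal",
    "cat": "feline", "feline": "carnivore", "mammal": "animal",
}

def ancestors(term):
    """The is-a chain from a term up to the hierarchy root."""
    chain = [term]
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain

def path_similarity(a, b):
    """1 / (1 + edges on the shortest is-a path through a common ancestor);
    0.0 when no connection exists, which is exactly the failure mode the
    text mentions for terms absent from the lexical resource."""
    ca, cb = ancestors(a), ancestors(b)
    common = [t for t in ca if t in cb]
    if not common:
        return 0.0
    lca = common[0]  # lowest common ancestor
    dist = ca.index(lca) + cb.index(lca)
    return 1.0 / (1.0 + dist)
```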

The next generation of NLIDBs used an intermediate representation language, which expressed the meaning of the user's question in terms of high-level concepts, independent of the database structure [3]. For instance, the approach of the NL system for databases based on formal semantics presented in Ref. [17] made a clear separation between the NL front end, which has a very high degree of portability, and the back end. The front end provides a mapping between sentences of English and expressions of a formal semantic theory, and the back end maps these into expressions which are meaningful with respect to the domain in question. Adapting a developed system to a new application will only involve altering the domain-specific back end. The main difference between AquaLog and the latest generation of NLIDB systems [17] is that AquaLog uses an intermediate representation through the entire process, from the representation of the user's query (NL front end) to the representation of an ontology-compliant triple (through similarity services), from which an answer can be directly inferred. It takes advantage of the use of ontologies and generic resources in a way that makes the entire process highly portable.

TEAM [39] is an experimental transportable NLIDB developed in the 1980s. The TEAM QA system, like AquaLog, consists of two major components: (1) for mapping NL expressions into formal representations; (2) for transforming these representations into statements of a database, making a separation of the linguistic process and the mapping process onto the KB. To improve portability, TEAM requires "separation of domain-dependent knowledge (to be acquired for each new database) from the domain-independent parts of the system". AquaLog presents an elegant solution in which the domain-dependent knowledge is obtained through a learning mechanism in an automatic way. Also, all the domain-dependent configuration parameters, like "who" is the same as "person", are specified in an XML file. In TEAM, the logical form constructed constitutes an unambiguous representation of the English query. In AquaLog, NL ambiguity is taken into account, so if the linguistic phase is not able to disambiguate it, the ambiguity goes into the next phase, the relation similarity service, which is an interactive service that tries to disambiguate using the semantics in the ontology, or otherwise by asking for user feedback.

MASQUE/SQL [2] is a portable NL front-end to SQL databases. The semi-automatic configuration procedure uses a built-in domain editor, which helps the user to describe the entity types to which the database refers, using an is-a hierarchy, and then to declare the words expected to appear in the NL questions and to define their meaning in terms of a logic predicate that is linked to a database table/view. In contrast with MASQUE/SQL, AquaLog uses the ontology to describe the entities, with no need for an intensive configuration procedure.

More recent work in the area can be found in Ref. [46]. PRECISE [46] maps questions to the corresponding SQL query by identifying classes of questions that are easy to understand in a well defined sense: the paper defines a formal notion of semantically tractable questions. Questions are sets of attribute/value pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago?" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases, this information is all AquaLog needs to interpret the query.
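A minimal sketch of the behaviour just described, with an invented mini-ontology: the relation word in the query is unknown, so the system falls back to the relations the ontology defines for the class of the recognized term.

```python
# Invented mini-ontology: classes, their relations, and some facts.
CLASS_OF = {"chicago": "city"}
RELATIONS = {  # class -> relations defined for it in the ontology
    "city": ["has-district", "has-mayor", "has-population"],
}
FACTS = {("chicago", "has-district"): ["lincoln park", "hyde park"]}

def answer(term, relation_word):
    """'relation_word' (e.g. "neighborhood") matches no relation label;
    instead of failing, try the relations the ontology defines for the
    class of 'term'. A real system would rank candidates with string
    metrics / WordNet or ask the user; here we simply probe each one."""
    cls = CLASS_OF.get(term)
    if cls is None:
        return None  # nothing is known about the term at all
    for rel in RELATIONS.get(cls, []):
        if (term, rel) in FACTS:
            return rel, FACTS[(term, rel)]
    return None
```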

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore, it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week", it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).
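A LASSO-style mapping from wh-word to expected answer type can be sketched in a few lines (the type inventory here is illustrative); note that "who" is deliberately left ambiguous between person and organization, a case discussed below.

```python
# Illustrative wh-word -> expected answer type table (not LASSO's actual one).
ANSWER_TYPE = {
    "who": ["person", "organization"],   # ambiguous on purpose
    "where": ["location"],
    "when": ["date"],
    "how many": ["quantity"],
    "how much": ["quantity"],
}

def classify(question):
    q = question.lower()
    # try longer forms first so "how many" wins over a bare "how"
    for key in sorted(ANSWER_TYPE, key=len, reverse=True):
        if q.startswith(key):
            return ANSWER_TYPE[key]
    # "what" questions say nothing about the answer type:
    # the question focus would be needed here
    return ["unknown"]
```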

Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here, point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the main two differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not complete in answering questions, because NE by nature only extracts isolated individual entities from text; therefore, methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms plus their morphological variants for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53], these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc., may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.
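The closing idea, that equivalent phrasings collapse into one basic generic triple whose answer is the list of ground-type elements occurring in the relationship, can be sketched as follows (KB, types and names are invented):

```python
# Invented KB: instance types and relationship facts.
TYPE_OF = {"vanessa": "person", "enrico": "person", "kmi": "organization"}
FACTS = [("vanessa", "works-in", "akt"), ("enrico", "works-in", "akt"),
         ("kmi", "hosts", "akt")]

def answer_generic_triple(ground_type, relation, instance):
    """Evaluate a basic generic triple <ground term, relation, instance>:
    return every element of the ground term's type that occurs in the
    relationship with the instance. Both "Who works in akt?" and "Which
    are the researchers involved in the akt project?" would reduce to
    the same triple ("person", "works-in", "akt") and be evaluated once."""
    return sorted(s for s, r, o in FACTS
                  if r == relation and o == instance
                  and TYPE_OF.get(s) == ground_type)
```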

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat?" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset (eating, feeding)". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems in regards to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e., rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases, it chooses based on the popularity of the term (frequency of occurrence in a text corpus), the user profile, the search context, or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point; triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.
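The two TAP steps described above (mapping a term to nodes via their labels, then collecting the triples in the vicinity of the chosen node) can be sketched over an invented mini-graph:

```python
from collections import deque

# Invented mini-graph standing in for SW data: triples with rdfs:label-style
# labels, as in the TAP lookup described above.
TRIPLES = [
    ("node:guernsey", "rdf:type", "node:island"),
    ("node:guernsey", "rdfs:label", "Guernsey"),
    ("node:guernsey", "speaks", "node:french"),
    ("node:french", "rdfs:label", "French"),
]

def denotations(term):
    """Step 1: candidate nodes whose label matches the search term
    (may yield zero, one or several candidates)."""
    return [s for s, p, o in TRIPLES
            if p == "rdfs:label" and o.lower() == term.lower()]

def vicinity(node, hops=1):
    """Step 2: breadth-first collection of the triples within a few
    hops of the selected denotation."""
    seen, frontier, collected = {node}, deque([(node, 0)]), []
    while frontier:
        n, d = frontier.popleft()
        if d >= hops:
            continue
        for s, p, o in TRIPLES:
            if s == n:
                collected.append((s, p, o))
                if o not in seen:
                    seen.add(o)
                    frontier.append((o, d + 1))
    return collected
```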

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites. There are also systems, such as START [31], REXTOR [32] and AnswerBus [54], whose goal is to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somehow similar to the approach adopted by START, called "object-property-value". The difference is that, instead of properties, we are looking for relations between terms or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey?": for START, the property is "languages" between the object "Guernsey" and the value "French"; for AquaLog, it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during its life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples: AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for each discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples: key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy", consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases, additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis, the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one a link is added between them".
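The quoted short-circuiting rule can be written down directly. This is a sketch over bare head-dependent pairs, not PiQASso's actual code:

```python
def short_circuit(edges, relevant):
    """Whenever two relevant nodes are linked through an irrelevant one,
    add a direct link between them.

    edges: set of (head, dependent) pairs from a flattened parse tree.
    relevant: set of nodes considered relevant."""
    added = set()
    for a, mid in edges:
        if a in relevant and mid not in relevant:
            for m2, b in edges:
                if m2 == mid and b in relevant:
                    added.add((a, b))
    return edges | added
```

On the example from the text, "1969" depends on "in" and "in" depends on "moon", so a direct ("moon", "1969") link is added past the irrelevant "in".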

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations or triples, the mapping process converts the elements of the triple into entry-points to the ontology and KB, and there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB; for a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue, in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems, like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed-domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose, it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed-domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings on terms, relations or modifier attachment).

AquaLog is complementary to open-domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open-domain systems, based on the expected answer (person, location, ...) or on hierarchies of question types based on the types of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples they generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups different questions that can be presented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project, and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term

  Who are the researchers in the semantic web research area?
    Linguistic triple: 〈person/organization, researchers, semantic web research area〉
    Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉

  What are the phd students working for buddy space?
    Linguistic triple: 〈phd students, working, buddy space〉
    Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term

  Show me the job title of Peter Scott
    Linguistic triple: 〈which is, job title, peter scott〉
    Ontology triple: 〈which is, has-job-title, peter-scott〉

  What are the projects of Vanessa?
    Linguistic triple: 〈which is, projects, vanessa〉
    Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation

  Are there any projects about semantic web?
    Linguistic triple: 〈projects, semantic web〉
    Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  Where is the Knowledge Media Institute?
    Linguistic triple: 〈location, knowledge media institute〉
    Ontology triple: no relations found in the ontology

Description

  Who are the academics?
    Linguistic triple: 〈who/what is, academics〉
    Ontology triple: 〈who/what is, academic-staff-member〉

  What is an ontology?
    Linguistic triple: 〈who/what is, ontology〉
    Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative

  Is Liliana a phd student in the dip project?
    Linguistic triple: 〈liliana, phd student, dip project〉
    Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  Has Martin Dzbor any interest in ontologies?
    Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
    Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms

  Does anybody work in semantic web on the dotkom project?
    Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
    Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member/has-project-leader, dot-kom〉

  Tell me all the planet news written by researchers in akt
    Linguistic triple: 〈planet news, written, researchers, akt〉
    Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)

  Which projects on information extraction are sponsored by epsrc?
    Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
    Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)

  Is there any publication about magpie in akt?
    Linguistic triple: 〈publication, magpie, akt〉
    Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)

  What are the contact details from the KMi academics in compendium?
    Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
    Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)

  Does anyone has interest in ontologies and is a member of akt?
    Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
    Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)

  Which academics work in akt or in dotcom?
    Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
    Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned

  Which KMi academics work in the akt project sponsored by epsrc?
    Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉

  Which KMi academics working in the akt project are sponsored by epsrc?
    Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
    Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  Give me the publications from phd students working in hypermedia
    Linguistic triples: 〈which is, publications, phd students〉, 〈which is, working, hypermedia〉
    Ontology triples: 〈publication, mentions-person/has-author, phd-student〉, 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause

  What researchers who work in compendium have interest in hypermedia?
    Linguistic triples: 〈researchers, work, compendium〉, 〈researchers, has-interest, hypermedia〉
    Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-patterns

  What is the homepage of Peter who has interest in semantic web?
    Linguistic triples: 〈which is, homepage, peter〉, 〈person/organization, has interest, semantic web〉
    Ontology triples: 〈which is, has-web-address, peter-scott〉, 〈person, has-research-interest, semantic-web-area〉

  Which are the projects of academics that are related to the semantic web?
    Linguistic triples: 〈which is, projects, academics〉, 〈which is, related, semantic web〉
    Ontology triples: 〈project, has-project-member, academic-staff-member〉, 〈academic-staff-member, has-research-interest, semantic-web-area〉

All ontology triples use the KMi ontology.
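In code terms, each row above pairs two instances of the same simple data structure. The following minimal Python sketch (not part of AquaLog itself; the values are taken from the first row of the table) illustrates the triple representation:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Triple:
    """A <subject, relation, object> triple; None marks an unresolved slot."""
    subject: str
    relation: Optional[str]
    obj: str


# Linguistic triple produced from the question
# "Who are the researchers in the semantic web research area?"
linguistic = Triple("person/organization", "researchers",
                    "semantic web research area")

# Ontology-compliant triple after the similarity services map each slot
# onto the KMi ontology vocabulary (first row of the table above).
ontological = Triple("researcher", "has-research-interest",
                     "semantic-web-area")

print(linguistic)
print(ontological)
```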

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.

[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.

[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.

[4] AskJeeves. http://www.ask.co.uk.

[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.

[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.

[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).

[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.

[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.

[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.

[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.

[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.

[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.

[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.

[16] M. De Boni, TREC 9 QA track overview.

[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.

[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.

[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.

[20] EasyAsk. http://www.easyask.com.

[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.

[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.

[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.

[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).

[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.

[26] E. Hsu, D. McGuinness, Wine Agent: semantic web testbed application, in: Workshop on Description Logics, 2003.

[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).

[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.

[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet.

[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.

[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.

[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.

[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.

[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).

[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.

[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.

[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.

[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.

[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.

[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004. http://www.w3.org/TR/owl-features.

[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).

[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.

[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.

[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.

[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.

[47] RDF. http://www.w3.org/RDF.

[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.

[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk.

[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818.

[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.

[52] WebOnto project. http://plainmoor.open.ac.uk/webonto.

[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.

[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.



pairs, and a relation token corresponds to either an attribute token or a value token. Each attribute in the database is associated with a wh-value (what, where, etc.). In PRECISE, like in AquaLog, a lexicon is used to find synonyms. However, in PRECISE the problem of finding a mapping from the tokenization to the database requires that all tokens must be distinct; questions with unknown words are not semantically tractable and cannot be handled. In other words, PRECISE will not answer a question that contains words absent from its lexicon. In contrast with PRECISE, AquaLog employs similarity services to interpret the user's query by means of the vocabulary in the ontology. As a consequence, AquaLog is able to reason about the ontology structure in order to make sense of unknown relations or classes which appear not to have any match in the KB or ontology. Using the example suggested in Ref. [46], the question "what are some of the neighborhoods of Chicago" cannot be handled by PRECISE because the word "neighborhood" is unknown. However, AquaLog would not necessarily know the term "neighborhood", but it might know that it must look for the value of a relation defined for cities. In many cases this information is all AquaLog needs to interpret the query.
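This fallback behaviour on the "neighborhoods of Chicago" example can be sketched as follows. The ontology fragment, instance table, and function names below are invented for illustration; they are not AquaLog's actual data structures:

```python
# Toy ontology: each class lists the relations defined for it
# (invented fragment, for illustration only).
ontology = {
    "city": ["has-neighborhood", "has-population", "located-in"],
    "person": ["has-research-interest", "works-for"],
}

# Instances known to the KB, mapped to their class.
instances = {"chicago": "city"}


def candidate_relations(instance: str, unknown_term: str) -> list:
    """For a query term absent from any lexicon (e.g. 'neighborhoods'),
    fall back to the relations defined for the class of the instance,
    rather than failing outright as a fixed-lexicon system would."""
    cls = instances.get(instance.lower())
    return ontology.get(cls, [])


# "what are some of the neighborhoods of Chicago?" -- 'neighborhood' is
# unknown, but the class 'city' constrains the candidate relations.
print(candidate_relations("Chicago", "neighborhoods"))
```

The point is only that class membership narrows the search space; in AquaLog the choice among candidates is then made by its similarity services or by asking the user.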

9.2. Open-domain QA systems

We have already pointed out that research in NLIDB is currently a bit 'dormant'; therefore it is not surprising that most current work on QA, which has been rekindled largely by the Text Retrieval Conference15 (see examples below), is somewhat different in nature from AquaLog. However, there are linguistic problems common to most kinds of NL understanding systems.

Question answering applications for text typically involve two steps, as cited by Hirschman [24]: (1) "Identifying the semantic type of the entity sought by the question"; (2) "Determining additional constraints on the answer entity". Constraints can include, for example, keywords (that may be expanded using synonyms or morphological variants) to be used in matching candidate answers, and syntactic or semantic relations between a candidate answer entity and other entities in the question. Various systems have therefore built hierarchies of question types based on the types of answers sought [43,48,25,53].

For instance, in LASSO [43] a question type hierarchy was constructed from the analysis of the TREC-8 training data. Given a question, it can find automatically: (a) the type of question (what, why, who, how, where); (b) the type of answer (person, location, ...); (c) the question focus, defined as the "main information required by the interrogation" (very useful for "what" questions, which say nothing about the information asked for by the question). Furthermore, it identifies the keywords from the question. Occasionally, some words of the question do not occur in the answer (for example, if the focus is "day of the week" it is very unlikely to appear in the answer). Therefore, it implements heuristics similar to the ones used by named entity recognizer systems for locating the possible answers.

15 Sponsored by the American National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA), TREC introduced an open-domain QA track in 1999 (TREC-8).


Named entity recognition and information extraction (IE) are powerful tools in question answering. One study showed that over 80% of questions asked for a named entity as a response [48]. In that work, Srihari and Li argue that:

"(i) IE can provide solid support for QA; (ii) Low-level IE is often a necessary component in handling many types of questions; (iii) A robust natural language shallow parser provides a structural basis for handling questions; (iv) High-level domain independent IE is expected to bring about a breakthrough in QA."

Here point (iv) refers to the extraction of multiple relationships between entities and general event information, like WHO did WHAT. AquaLog also subscribes to point (iii); however, the two main differences between open-domain systems and ours are: (1) it is not necessary to build hierarchies or heuristics to recognize named entities, as all the semantic information needed is in the ontology; (2) AquaLog has already implemented mechanisms to extract and exploit the relationships to understand a query. Nevertheless, the goal of the main similarity service in AquaLog, the RSS, is to map the relationships in the linguistic triple into an ontology-compliant triple. As described in Ref. [48], NE is necessary but not sufficient for answering questions, because NE by nature only extracts isolated individual entities from text; therefore methods like "the nearest NE to the queried key words" [48] are used.

Both AquaLog and open-domain systems attempt to find synonyms, plus their morphological variants, for the terms or keywords. Also, in both cases, at times the rules keep ambiguity unresolved and produce non-deterministic output for the asking point (for instance, "who" can be related to a "person" or to an "organization").
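A toy illustration of this kind of matching, combining a synonym lookup with a string-similarity score. Here Python's difflib stands in for the string metrics of Ref. [13] and the hand-written synonym table stands in for WordNet; the relation names and the 0.5 threshold are illustrative assumptions, not AquaLog's actual configuration:

```python
import difflib
from typing import Optional

# Stand-in synonym table (AquaLog uses WordNet); illustrative only.
synonyms = {
    "interest": {"has-research-interest"},
    "works": {"collaborates", "is-involved-in"},
}

ontology_relations = ["has-research-interest", "has-project-member", "has-author"]


def best_match(term: str, candidates: list) -> Optional[str]:
    """Pick the candidate relation closest to the query term, expanding
    the term with its synonyms and scoring by string similarity."""
    expanded = {term} | synonyms.get(term, set())
    scored = [
        (max(difflib.SequenceMatcher(None, t, c).ratio() for t in expanded), c)
        for c in candidates
    ]
    score, cand = max(scored)
    return cand if score > 0.5 else None  # threshold is an arbitrary choice


print(best_match("interest", ontology_relations))  # has-research-interest
```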

As in open-domain systems, AquaLog also automatically classifies the question beforehand. The main difference is that AquaLog classifies the question based on the kind of triple, which is a semantically equivalent representation of the question, while most of the open-domain QA systems classify questions according to their answer target. For example, in Wu et al.'s ILQUA system [53] these are categories like person, location, date, region, or subcategories like lake, river. In AquaLog, the triple contains information not only about the answer expected or focus, which is what we call the generic term or ground element of the triple, but also about the relationships between the generic term and the other terms participating in the question (each relationship is represented in a different triple). Different queries formed by "what is", "who", "where", "tell me", etc. may belong to the same triple category, as they can be different ways of asking the same thing. An efficient system should therefore group together equivalent question types [25], independently of how the query is formulated; e.g., a basic generic triple represents a relationship between a ground term and an instance, in which the expected answer is a list of elements of the type of the ground term that occur in the relationship.

The best results of the TREC9 [16] competition were obtained by the FALCON system, described in Harabagiu et al. [23]. In FALCON, the answer semantic categories are mapped into categories covered by a Named Entity Recognizer. When the question concept indicating the answer type is identified, it is mapped into an answer taxonomy. The top categories are connected to several word classes from WordNet. In an example presented in [23], FALCON identifies the expected answer type of the question "what do penguins eat" as food, because "it is the most widely used concept in the glosses of the subhierarchy of the noun synset eating, feeding". All nouns (and lexical alterations) immediately related to the concept that determines the answer type are considered among the keywords. Also, FALCON gives a cached answer if a similar question has already been asked before; a similarity measure is calculated to see if the given question is a reformulation of a previous one. A similar approach is adopted by the learning mechanism in AquaLog, where the similarity is given by the context stored in the triple.

Semantic web searches face the same problems as open-domain systems with regard to dealing with heterogeneous data sources on the Semantic Web. Guha et al. [22] argue that "in the Semantic Web it is impossible to control what kind of data is available about any given object [...]. Here there is a need to balance the variation in the data availability by making sure that at a minimum certain properties are available for all objects. In particular, data about the kind of object and how it is referred to, i.e. rdf:type and rdfs:label". Similarly, AquaLog makes simple assumptions about the data, which is understood as instances or concepts belonging to a taxonomy and having relationships with each other. In the TAP system [22], search is augmented with data from the SW. To determine the concept denoted by the search query, the first step is to map the search term to one or more nodes of the SW (this might return no candidate denotations, a single match or multiple matches). A term is searched by using its rdfs:label or one of the other properties indexed by the search interface. In ambiguous cases it chooses based on the popularity of the term (frequency of occurrence in a text corpus, the user profile, the search context) or by letting the user pick the right denotation. The node which is the selected denotation of the search term provides a starting point. Triples in the vicinity of each of these nodes are collected. As Ref. [22] argues:

"the intuition behind this approach is that proximity in the graph reflects mutual relevance between nodes. This approach has the advantage of not requiring any hand-coding, but has the disadvantage of being very sensitive to the representational choices made by the source on the SW [...]. A different approach is to manually specify for each class object of interest the set of properties that should be gathered. A hybrid approach has most of the benefits of both approaches."

In AquaLog, the vocabulary problem is solved (ontology-triples mapping) not only by looking at the labels, but also by looking at the lexically related words and the ontology (context of the triple terms and relationships); in the last resort, ambiguity is solved by asking the user.
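The TAP-style lookup described above (map a search term to nodes via rdfs:label, then collect the triples in the vicinity of each node) can be sketched with a toy in-memory triple store. The URIs and data below are invented for illustration and do not come from TAP or AquaLog:

```python
# Toy triple store: (subject, predicate, object) tuples (invented data).
triples = [
    ("kmi:enrico-motta", "rdfs:label", "Enrico Motta"),
    ("kmi:enrico-motta", "rdf:type", "kmi:researcher"),
    ("kmi:enrico-motta", "kmi:has-research-interest", "kmi:semantic-web-area"),
    ("kmi:akt", "kmi:has-project-member", "kmi:enrico-motta"),
]


def denotations(term: str) -> list:
    """Map a search term to candidate nodes via rdfs:label
    (TAP's first step; may yield zero, one, or several nodes)."""
    return [s for s, p, o in triples
            if p == "rdfs:label" and o.lower() == term.lower()]


def vicinity(node: str) -> list:
    """Collect triples in which the node occurs, on the intuition that
    proximity in the graph reflects mutual relevance."""
    return [t for t in triples if node in (t[0], t[2])]


for node in denotations("enrico motta"):
    for t in vicinity(node):
        print(t)
```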

9.3. Open-domain QA systems using triple representations

Other NL search engines, such as AskJeeves [4] and EasyAsk [20], exist which provide NL question interfaces to the Web but retrieve documents, not answers. AskJeeves relies on human editors to match question templates with authoritative sites, in contrast to systems such as START [31], REXTOR [32] and AnswerBus [54], whose goal is also to extract answers from text. START focuses on questions about geography and the MIT InfoLab. AquaLog's relational data model (triple-based) is somewhat similar to the approach adopted by START, called "object-property-value". The difference is that instead of properties we are looking for relations between terms, or between a term and its value. Using an example presented in Ref. [31], "what languages are spoken in Guernsey": for START the property is "languages" between the Object "Guernsey" and the value "French"; for AquaLog it will be translated into a relation "are spoken" between a term "language" and a location "Guernsey".

The system described in Litkowski [36], called DIMAP, extracts "semantic relation triples" after the document is parsed and the parse tree is examined. The DIMAP triples are stored in a database in order to be used to answer the question. The semantic relation triple described consists of a discourse entity (SUBJ, OBJ, TIME, NUM, ADJMOD), a semantic relation that "characterizes the entity's role in the sentence", and a governing word, which is "the word in the sentence that the discourse entity stood in relation to". The parsing process generated an average of 9.8 triples per sentence. The same analysis was done for each question, generating on average 3.3 triples per sentence, with one triple for each question containing an unbound variable corresponding to the type of question. DIMAP-QA converts the document into triples; AquaLog uses the ontology, which may be seen as a collection of triples. One of the current AquaLog limitations is that the number of triples is fixed for each query category, although the AquaLog triples change during their life cycle. However, the performance is still high, as most of the questions can be translated into one or two triples. Apart from that, the AquaLog triple is very similar to the DIMAP semantic relation triples. AquaLog has a triple for a relationship between terms, even if the relationship is not explicit; DIMAP has a triple for a discourse entity. A triple is generally equivalent to a logical form (where the operator is the semantic relation, though it is not strictly required). The discourse entities are the driving force in DIMAP triples; key elements (key nouns, key verbs, and any adjective or modifier noun) are determined for each question type. The system categorized questions in six types: time, location, who, what, size and number questions. In AquaLog, the discourse entity or the unbound variable can be equivalent to any of the concepts in the ontology (it does not affect the category of the triple), the category of the triple being the driving force which determines the type of processing.

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word ("who", "when", "where"). In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parse tree is flattened into a set of triples and certain relations can be made explicit by adding links to the parse tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
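These answer-type rules lend themselves to a simple lookup. The sketch below illustrates the idea with the categories named in the text; the rule tables and function are our own illustration, not PiQASso code:

```python
# Wh-words map directly to answer-type categories; "how <adjective>"
# questions need the adjective to disambiguate (as described for PiQASso).
WH_RULES = {
    "who": {"person", "organization"},
    "when": {"time"},
    "where": {"location"},
}
HOW_RULES = {
    "many": "quantity", "much": "quantity",
    "long": "time", "old": "time",
}


def answer_type(question: str) -> set:
    """Return the candidate answer-type categories for a question."""
    words = question.lower().rstrip("?").split()
    if words[0] == "how" and len(words) > 1:
        return {HOW_RULES.get(words[1], "unknown")}
    return WH_RULES.get(words[0], {"unknown"})


print(answer_type("How many calories are in a Big Mac?"))
print(answer_type("Who discovered penicillin?"))
```

Note that, as the text observes, the "who" rule deliberately keeps the ambiguity between person and organization unresolved.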

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6], the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of the EU project MOSES, with the “explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web”. As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support, and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; the query classification is guided by the equivalent semantic representations, or triples. The mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a “federation” of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But, similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: “since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations”.
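The mapping step described above, converting the elements of a triple into entry-points to the ontology, can be illustrated with a minimal sketch. The labels and threshold are hypothetical, and the real service also draws on WordNet and a learning component:

```python
# Minimal sketch (not AquaLog's API): each element of a linguistic
# triple becomes an ontology "entry point" by exact label match, with
# a fuzzy string-metric fallback; unresolved terms would trigger
# WordNet expansion or a user dialogue.
from difflib import SequenceMatcher

ONTOLOGY_LABELS = {"researcher", "has-research-interest", "semantic-web-area"}

def entry_point(term, labels=ONTOLOGY_LABELS, threshold=0.6):
    term_n = term.lower().replace(" ", "-")
    if term_n in labels:                       # exact label match
        return term_n
    best = max(labels, key=lambda l: SequenceMatcher(None, term_n, l).ratio())
    if SequenceMatcher(None, term_n, best).ratio() >= threshold:
        return best                            # close enough string match
    return None                                # fall back to other services

linguistic_triple = ("researchers", "interest", "semantic web area")
mapped = [entry_point(t) for t in linguistic_triple]
print(mapped)
```

Here the subject and object resolve directly, while the under-specified relation is left for the relation similarity service or the user to settle.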

The knowledge-based approach described in Ref. [12] is to “augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users’ questions which are outside the scope of the prewritten text”. It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
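The superclass lookup described in Ref. [12] can be sketched as follows; the concept hierarchy and question types here are invented for illustration:

```python
# Toy sketch: question types hang off concepts, and the types
# applicable to a concept are those attached to the concept itself
# or to any of its superclasses.
SUPERCLASS = {"phd-student": "researcher", "researcher": "person", "person": None}
QUESTION_TYPES = {
    "person": ["who-is"],
    "researcher": ["what-does-X-research"],
}

def applicable_question_types(concept):
    """Walk up the superclass chain, collecting attached question types."""
    types = []
    while concept is not None:
        types += QUESTION_TYPES.get(concept, [])
        concept = SUPERCLASS.get(concept)
    return types

print(applicable_question_types("phd-student"))
# ['what-does-X-research', 'who-is']
```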

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, “the grammar is to a large extent general purpose; it can be used for other domains and for other applications”), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
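The disambiguation policy just described, trying domain knowledge first and consulting the user only when ambiguity survives, can be sketched as follows (the helper names are hypothetical, not AquaLog's API):

```python
# Sketch of the disambiguation flow: domain knowledge prunes candidate
# readings; only unresolved ambiguity is passed to the user.
def disambiguate(candidates, domain_filter, ask_user):
    """candidates: possible ontology readings of a term; domain_filter
    prunes readings incompatible with the triple's context; ask_user is
    invoked only if more than one (or no) reading survives."""
    viable = [c for c in candidates if domain_filter(c)]
    if len(viable) == 1:
        return viable[0]          # solved by domain knowledge alone
    return ask_user(viable or candidates)

choice = disambiguate(
    ["akt-project", "akt-publication"],
    domain_filter=lambda c: c.endswith("project"),
    ask_user=lambda options: options[0],
)
print(choice)  # akt-project: domain context resolved it, no dialogue needed
```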

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies), extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base, in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge, and the availability of semantic markup offered by the SW, to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as “what are the homepages of researchers who have an interest in the Semantic Web?”, we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term “researchers” on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], “open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation”.

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also, depending on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and how they should be resolved, it groups different questions that can be presented by equivalent triples. For instance, the query “Who works in akt?” is the same as “Which are the researchers involved in the akt project?”.

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).
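To illustrate the triple-based classification, the two example queries above collapse to the same relation and object once their vocabulary is normalized. A toy sketch follows; the synonym table is invented, and AquaLog's real mapping uses its similarity services rather than a static dictionary:

```python
# Toy sketch: surface variants of a query reduce to equivalent triples,
# so they fall into the same query class.
SYNONYMS = {
    "works in": "has-project-member",
    "involved in": "has-project-member",
    "who": "person/organization",
    "researchers": "researcher",
}

def normalize(subject, relation, obj):
    """Map each triple element to its canonical vocabulary."""
    return (SYNONYMS.get(subject, subject),
            SYNONYMS.get(relation, relation),
            obj)

q1 = normalize("who", "works in", "akt")
q2 = normalize("researchers", "involved in", "akt")
print(q1[1:] == q2[1:])  # both reduce to the relation/object pair
```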

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog’s requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms and relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time, in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

wh-generic term
  Q: Who are the researchers in the semantic web research area?
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉

  Q: What are the phd students working for buddy space?
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

  Q: Name the planet stories that are related to akt.
  Linguistic triple: 〈planet stories, related to, akt〉
  Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

wh-unknown term
  Q: Show me the job title of Peter Scott.
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉

  Q: What are the projects of Vanessa?
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

wh-unknown relation
  Q: Are there any projects about semantic web?
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉

  Q: Where is the Knowledge Media Institute?
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: No relations found in the ontology

Description
  Q: Who are the academics?
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉

  Q: What is an ontology?
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
  Q: Is Liliana a phd student in the dip project?
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉

  Q: Has Martin Dzbor any interest in ontologies?
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

wh-3 terms
  Q: Does anybody work in semantic web on the dotkom project?
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triple: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉

  Q: Tell me all the planet news written by researchers in akt.
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triple: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

wh-3 terms (first clause)
  Q: Which projects on information extraction are sponsored by epsrc?
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

wh-3 terms (unknown relation)
  Q: Is there any publication about magpie in akt?
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triple: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

wh-unknown term (with clause)
  Q: What are the contact details from the KMi academics in compendium?
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triple: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

wh-combination (and)
  Q: Does anyone has interest in ontologies and is a member of akt?
  Linguistic triple: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triple: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

wh-combination (or)
  Q: Which academics work in akt or in dotcom?
  Linguistic triple: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triple: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

wh-combination conditioned
  Q: Which KMi academics work in the akt project sponsored by epsrc?
  Linguistic triple: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triple: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉

  Q: Which KMi academics working in the akt project are sponsored by epsrc?
  Linguistic triple: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triple: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉

  Q: Give me the publications from phd students working in hypermedia.
  Linguistic triple: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triple: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

wh-generic with wh-clause
  Q: What researchers who work in compendium have interest in hypermedia?
  Linguistic triple: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triple: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

wh-patterns
  Q: What is the homepage of Peter who has interest in semantic web?
  Linguistic triple: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triple: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉

  Q: Which are the projects of academics that are related to the semantic web?
  Linguistic triple: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triple: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

Note: all examples use the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


1 s and

qmnpotoatCbgaw

dsSabapiaowwsmtbtppptta

tlta

9

A

WotAt

Mildquoealdquoevtldquo

eadtOawitgettlttplihsttdengautct

sba

00 V Lopez et al Web Semantics Science Service

uestion concept indicating the answer type is identified it isapped into an answer taxonomy The top categories are con-

ected to several word classes from WordNet In an exampleresented in [23] FALCON identifies the expected answer typef the question ldquowhat do penguins eatrdquo as food because ldquoit ishe most widely used concept in the glosses of the subhierarchyf the noun synset eating feedingrdquo All nouns (and lexicallterations) immediately related to the concept that determineshe answer type are considered among the keywords Also FAL-ON gives a cached answer if the similar question has alreadyeen asked before a similarity measure is calculated to see if theiven question is a reformulation of a previous one A similarpproach is adopted by the learning mechanism in AquaLoghere the similarity is given by the context stored in the tripleSemantic web searches face the same problems as open

omain systems in regards to dealing with heterogeneous dataources on the Semantic Web Guha et al [22] argue that ldquoin theemantic Web it is impossible to control what kind of data isvailable about any given object [ ] Here there is a need toalance the variation in the data availability by making sure thatt a minimum certain properties are available for all objects Inarticular data about the kind of object and how is referred toe rdftype and rdfslabelrdquo Similarly AquaLog makes simplessumptions about the data which is understood as instancesr concepts belonging to a taxonomy and having relationshipsith each other In the TAP system [22] search is augmentingith data from the SW To determine the concept denoted by the

earch query the first step is to map the search term to one orore nodes of the SW (this might return no candidate denota-

ions a single match or multiple matches) A term is searchedy using its rdfslabel or one of the other properties indexed byhe search interface In ambiguous cases it chooses based on theopularity of the term (frequency of occurrence in a text cor-us the user profile the search context) or by letting the userick the right denotation The node which is the selected deno-ation of the search term provides a starting point Triples inhe vicinity of each of these nodes are collected As Ref [22]rgues

ldquothe intuition behind this approach is that proximity inthe graph reflects mutual relevance between nodes Thisapproach has the advantage of not requiring any hand-codingbut has the disadvantage of being very sensitive to the rep-resentational choices made by the source on the SW [ ] Adifferent approach is to manually specify for each class objectof interest the set of properties that should be gathered Ahybrid approach has most of the benefits of both approachesrdquo

In AquaLog the vocabulary problem is solved (ontologyriples mapping) not only by looking at the labels but also byooking at the lexically related words and the ontology (con-ext of the triple terms and relationships) and in the last resortmbiguity is solved by asking the user

3 Open-domain QA systems using triple representations

Other NL search engines such as AskJeeves [4] and Easysk [20] exist which provide NL question interfaces to the

mnwi

Agents on the World Wide Web 5 (2007) 72ndash105

eb but retrieve documents not answers AskJeeves reliesn human editors to match question templates with authori-ative sites systems such as START [31] REXTOR [32] andnswerBus [54] whose goal is also to extract answers from

extSTART focuses on questions about geography and the

IT infolab AquaLogrsquos relational data model (triple-based)s somehow similar to the approach adopted by START calledobject-property-valuerdquo The difference is that instead of prop-rties we are looking for relations between terms or betweenterm and its value Using an example presented in Ref [31]

what languages are spoken in Guernseyrdquo for START the prop-rty is ldquolanguagesrdquo between the Object ldquoGuernseyrdquo and thealue ldquoFrenchrdquo for AquaLog it will be translated into a rela-ion ldquoare spokenrdquo between a term ldquolanguagerdquo and a locationGuernseyrdquo

The system described in Litkowski [36] called DIMAPxtracts ldquosemantic relation triplesrdquo after the document is parsednd the parse tree is examined The DIMAP triples are stored in aatabase in order to be used to answer the question The seman-ic relation triple described consists of a discourse entity (SUBJBJ TIME NUM ADJMOD) a semantic relation that ldquochar-

cterizes the entityrsquos role in the sentencerdquo and a governing wordhich is ldquothe word in the sentence that the discourse entity stood

n relation tordquo The parsing process generated an average of 98riples per sentence The same analysis was for each questionenerating on average 33 triples per sentence with one triple forach question containing an unbound variable corresponding tohe type of question DIMAP-QA converts the document intoriples AquaLog uses the ontology which may be seen as a col-ection of triples One of the current AquaLog limitations is thathe number of triples is fixed for each query category althoughhe AquaLog triples change during its life cycle However theerformance is still high as most of the questions can be trans-ated into one or two triples Apart from that the AquaLog triples very similar to the DIMAP semantic relation triples AquaLogas a triple for relationship between terms even if the relation-hip is not explicit DIMAP has a triple for discourse entity Ariple is generally equivalent to a logical form (where the opera-or is the semantic relation though is not strictly required) Theiscourse entities are the driving force in DIMAP triples keylements (key nouns key verbs and any adjective or modifieroun) are determined for each question type The system cate-orized questions in six types time location who what sizend number questions In AquaLog the discourse entity or thenbound variable can be equivalent to any of the concepts inhe ontology (it does not affect the category of the triple) theategory of the triple being the driving force which determineshe type of processing

PiQASso [5] uses a "coarse-grained question taxonomy" consisting of person, organization, time, quantity and location as basic types, plus 23 WordNet top-level noun categories. The answer type can combine categories: for example, in questions made with "who", the answer type is a person or an organization. Categories can often be determined directly from a wh-word: "who", "when", "where". In other cases additional information is needed. For example, with "how" questions the category is found from the adjective following "how": "how many" or "how much" for quantity, "how long" or "how old" for time, etc. The type of a "what 〈noun〉" question is normally the semantic type of the noun, which is determined by WNSense, a tool for classifying word senses. Questions of the form "what 〈verb〉" have the same answer type as the object of the verb. "What is" questions in PiQASso also rely on finding the semantic type of a query word. However, as the authors say, "it is often not possible to just look up the semantic type of the word because lack of context does not allow identifying (sic) the right sense". Therefore, they accept entities of any type as answers to definition questions, provided they appear as the subject in an "is-a" sentence. If the answer type can be determined for a sentence, it is submitted to the relation matching filter; during this analysis the parser tree is flattened into a set of triples, and certain relations can be made explicit by adding links to the parser tree. For instance, in Ref. [5] the example "man first walked on the moon in 1969" is presented, in which "1969" depends on "in", which in turn depends on "moon". Attardi et al. propose short-circuiting the "in" by adding a direct link between "moon" and "1969", following a rule that says that "whenever two relevant nodes are linked through an irrelevant one, a link is added between them".
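The answer-type rules described above can be sketched roughly as follows (the rule tables are illustrative fragments, not PiQASso's implementation, and the WNSense lookup needed for "what 〈noun〉" questions is omitted):

```python
# Coarse-grained answer-type rules of the kind PiQASso uses: wh-words map
# straight to a category (possibly a combination of categories); "how"
# needs the adjective that follows it.
WH_RULES = {"who": {"person", "organization"},
            "when": {"time"},
            "where": {"location"}}
HOW_RULES = {"many": {"quantity"}, "much": {"quantity"},
             "long": {"time"}, "old": {"time"}}

def answer_type(question: str) -> set:
    words = question.lower().rstrip("?").split()
    wh = words[0]
    if wh in WH_RULES:
        return WH_RULES[wh]
    if wh == "how" and len(words) > 1:
        return HOW_RULES.get(words[1], set())
    # Fall back: would require semantic typing of the focus noun (WNSense).
    return set()

print(sorted(answer_type("Who wrote Hamlet?")))       # ['organization', 'person']
print(sorted(answer_type("How many moons has Mars?")))  # ['quantity']
```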

9.4. Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6] the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations, or triples, and the mapping process converts the elements of the triple into entry points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services. The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
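The lookup of applicable question types along the concept hierarchy can be sketched as follows (the toy KB and question types are invented for illustration; Ref. [12] does not publish this code):

```python
# Toy KB: concept -> (superclass, question types attached directly to it).
# A concept inherits the question types of all of its superclasses.
KB = {
    "entity":     (None,     ["what is X?"]),
    "person":     ("entity", ["where does X work?"]),
    "researcher": ("person", ["what does X research?"]),
}

def applicable_question_types(concept):
    """Question types attached to the concept or to any superclass."""
    types = []
    while concept is not None:
        parent, attached = KB[concept]
        types.extend(attached)
        concept = parent
    return types

print(applicable_question_types("researcher"))
# ['what does X research?', 'where does X work?', 'what is X?']
```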

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose: it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".

Both open-domain QA systems and AquaLog classify queries: in AquaLog, based on the triple format; in open domain systems, based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term (a)
- "Who are the researchers in the semantic web research area?"
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
- "What are the phd students working for buddy space?"
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉
- "Name the planet stories that are related to akt"
  Linguistic triple: 〈planet stories, related to, akt〉
  Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term
- "Show me the job title of Peter Scott"
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉
- "What are the projects of Vanessa?"
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
- "Are there any projects about semantic web?"
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
- "Where is the Knowledge Media Institute?"
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
- "Who are the academics?"
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉
- "What is an ontology?"
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
- "Is Liliana a phd student in the dip project?"
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
- "Has Martin Dzbor any interest in ontologies?"
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
- "Does anybody work in semantic web on the dotkom project?"
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
- "Tell me all the planet news written by researchers in akt"
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
- "Which projects on information extraction are sponsored by epsrc?"
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
- "Is there any publication about magpie in akt?"
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
- "What are the contact details from the KMi academics in compendium?"
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
- "Does anyone has interest in ontologies and is a member of akt?"
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
- "Which academics work in akt or in dotcom?"
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
- "Which KMi academics work in the akt project sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
- "Which KMi academics working in the akt project are sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
- "Give me the publications from phd students working in hypermedia"
  Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
- "What researchers who work in compendium have interest in hypermedia?"
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉


Appendix A (Continued)

Wh-patterns
- "What is the homepage of Peter, who has interest in semantic web?"
  Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
- "Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327-330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29-81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PiQASso: Pisa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13-16, 2001, pp. 599-607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43-51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225-249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005 (to appear).
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275-300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306-315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1-19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157-166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10-22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 10 February 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12-15, 2003, pp. 149-157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24-27, 2002.


s and

cmfst〈ldquotobstasatttldquosavb

9

omitaacEiattsotthaTrtKcTaetaAd

dwldquoors

tattboqataAippsvuh

9

Ld[cabdicLippt[itccAbiid

V Lopez et al Web Semantics Science Service

ategory is found from the adjective following ldquohowrdquo (ldquohowanyrdquo or ldquohow muchrdquo) for quantity ldquohow longrdquo or ldquohow oldrdquo

or time etc The type of ldquowhat 〈noun〉rdquo question is normally theemantic type of the noun which is determined by WNSense aool for classifying word senses Questions of the form ldquowhatverb〉rdquo have the same answer type as the object of the verbWhat isrdquo questions in PiQASso also rely on finding the seman-ic type of a query word However as the authors say ldquoit isften not possible to just look up the semantic type of the wordecause lack of context does not allow identifying (sic) the rightenserdquo Therefore they accept entities of any type as answerso definition questions provided they appear as the subject inn ldquois-ardquo sentence If the answer type can be determined for aentence it is submitted to the relation matching filter during thisnalysis the parser tree is flattened into a set of triples and cer-ain relations can be made explicit by adding links to the parserree For instance in Ref [5] the example ldquoman first walked onhe moon in 1969rdquo is presented in which ldquo1969rdquo depends oninrdquo which in turn depends on ldquomoonrdquo Attardi et al proposehort circuiting the ldquoinrdquo by adding a direct link between ldquomoonrdquond ldquo1969rdquo following a rule that says that ldquowhenever two rele-ant nodes are linked through an irrelevant one a link is addedetween themrdquo

4 Ontologies in question answering

We have already mentioned that many systems simply use an ontology as a mechanism to support query expansion in information retrieval. In contrast with these systems, AquaLog is interested in providing answers derived from semantic annotations to queries expressed in NL. In the paper by Basili et al. [6] the possibility of building an ontology-based question answering system in the context of the semantic web is discussed. Their approach is being investigated in the context of EU project MOSES, with the "explicit objective of developing an ontology-based methodology to search, create, maintain and adapt semantically structured Web contents according to the vision of semantic web". As part of this project, they plan to investigate whether and how an ontological approach could support QA across sites. They have introduced a classification of the questions that the system is expected to support and see the content of a question largely in terms of concepts and relations from the ontology. Therefore, the approach and scenario have many similarities with AquaLog. However, AquaLog is an implemented on-line system with wider linguistic coverage; its query classification is guided by the equivalent semantic representations or triples, and its mapping process converts the elements of the triple into entry-points to the ontology and KB. Also, there is a clear differentiation between the linguistic component and the ontology-based relation similarity services.

The goal of the next generation of AquaLog is to use all available ontologies on the SW to produce an answer to a query, as explained in Section 8.5. Basili et al. [6] say that they will investigate how an ontological approach could support QA across a "federation" of sites within the same domain. In contrast, AquaLog will not assume that the ontologies refer to the same domain; they can have overlapping domains or may refer to different domains. But similarly to [6], AquaLog has to deal with the heterogeneity introduced by ontologies themselves: "since each node has its own version of the domain ontology, the task of passing a question from node to node may be reduced to a mapping task between (similar) conceptual representations".

The knowledge-based approach described in Ref. [12] is to "augment on-line text with a knowledge-based question-answering component [...] allowing the system to infer answers to users' questions which are outside the scope of the prewritten text". It assumes that the knowledge is stored in a knowledge base and structured as an ontology of the domain. Like many of the systems we have seen, it has a small collection of generic question types which it knows how to answer. Question types are associated with concepts in the KB. For a given concept, the question types which are applicable are those which are attached either to the concept itself or to any of its superclasses. A difference between this system and others we have seen is the importance it places on building a scenario, part assumed and part specified by the user via a dialogue in which the system prompts the user with forms and questions, based on an answer schema that relates back to the question type. The scenario provides a context in which the system can answer the query. The user input is thus considerably greater than we would wish to have in AquaLog.
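The lookup of applicable question types just described (those attached to the concept itself or to any of its superclasses) can be sketched as follows; the hierarchy and question type names are invented stand-ins, not the actual KB of Ref. [12]:

```python
# Hypothetical sketch of the lookup in Ref. [12]: the question types
# applicable to a concept are those attached to the concept itself or to
# any of its superclasses. Hierarchy and type names are invented.

SUPERCLASS = {"phd-student": "researcher", "researcher": "person", "person": None}
QUESTION_TYPES = {
    "researcher": ["what-research-interest"],
    "person": ["who-is", "where-works"],
}

def applicable_question_types(concept):
    """Walk up the superclass chain, collecting attached question types."""
    types = []
    while concept is not None:
        types.extend(QUESTION_TYPES.get(concept, []))
        concept = SUPERCLASS.get(concept)
    return types

# A phd-student inherits the types attached to researcher and person.
types = applicable_question_types("phd-student")
```

The walk up the chain is what makes a small collection of generic question types cover a whole concept hierarchy.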

9.5. Conclusions of existing approaches

Since the development of the first ontological QA system, LUNAR (a syntax-based system where the parsed question is directly mapped to a database expression by the use of rules [3]), there have been improvements in the availability of lexical knowledge bases, such as WordNet, and of shallow, modular and robust NLP systems like GATE. Furthermore, AquaLog is based on the vision of a Web populated by ontologically tagged documents. Many closed domain NL interfaces are very rich in NL understanding and can handle questions that are more complex than the ones handled by the current version of AquaLog. However, AquaLog has a very light and extendable NL interface that allows it to produce triples after only shallow parsing. The main difference between the systems is related to portability. Later NLDBI systems use intermediate representations; therefore, although the front end is portable (as Copestake [14] states, "the grammar is to a large extent general purpose; it can be used for other domains and for other applications"), the back end is dependent on the database, so normally longer configuration times are required. AquaLog, in contrast with closed domain systems, is completely portable. Moreover, the AquaLog disambiguation techniques are sufficiently general to be applied in different domains. In AquaLog, disambiguation is regarded as part of the translation process: if the ambiguity is not solved by domain knowledge, then the ambiguity is detected and the user is consulted before going ahead (to choose between alternative readings of terms, relations or modifier attachment).
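As a rough illustration of this policy (not AquaLog's actual API), disambiguation can be seen as a filter by domain knowledge followed by a fallback dialogue with the user:

```python
# Toy sketch of the disambiguation policy described above: candidates for an
# ambiguous term are first narrowed by domain knowledge (here a simple type
# constraint); if exactly one survives, translation proceeds silently,
# otherwise the ambiguity is surfaced to the user. All names are invented.

def disambiguate(candidates, type_constraint, ask_user):
    """Return a single ontology entity for an ambiguous query term."""
    narrowed = [c for c in candidates if type_constraint(c)]
    if len(narrowed) == 1:
        return narrowed[0]                       # solved by domain knowledge
    return ask_user(narrowed or candidates)      # detected: consult the user

# Two person candidates survive the constraint, so the "user" must choose.
candidates = [("peter-scott", "person"), ("peter-sharpe", "person")]
choice = disambiguate(candidates,
                      type_constraint=lambda c: c[1] == "person",
                      ask_user=lambda cs: cs[0])  # stand-in for a dialogue
```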

AquaLog is complementary to open domain QA systems. Open QA systems use the Web as the source of knowledge


and provide answers in the form of selected paragraphs (where the answer actually lies) extracted from very large, open-ended collections of unstructured text. The key limitation of an ontology-based system (as for closed-domain systems) is that it presumes the knowledge the system is using to answer the question is in a structured knowledge base in a limited domain. However, AquaLog exploits the power of ontologies as a model of knowledge and the availability of semantic markup offered by the SW to give precise, focused answers, rather than retrieving possible documents or pre-written paragraphs of text. In particular, semantic markup facilitates queries where multiple pieces of information (that may come from different sources) need to be inferred and combined together. For instance, when we ask a query such as "what are the homepages of researchers who have an interest in the Semantic Web?", we get the precise answer. Behind the scenes, AquaLog is not only able to correctly understand the question, but is also competent to disambiguate multiple matches of the term "researchers" on the SW and give back the correct answers by consulting the ontology and the available metadata. As Basili argues in Ref. [6], "open domain systems do not rely on specialized conceptual knowledge, as they use a mixture of statistical techniques and shallow linguistic analysis. Ontological Question Answering Systems [...] propose to attack the problem by means of an internal unambiguous knowledge representation".
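A toy version of this combination step, with invented data, shows how two sets of semantic statements are joined to answer the homepage query:

```python
# Toy illustration of combining multiple pieces of semantic markup, as in
# the "homepages of researchers who have an interest in the Semantic Web"
# example: one group of triples says who has the interest, another gives
# homepages, and the answer is their join. All data is invented.

TRIPLES = [
    ("enrico",  "has-research-interest", "semantic-web-area"),
    ("vanessa", "has-research-interest", "semantic-web-area"),
    ("john",    "has-research-interest", "databases"),
    ("enrico",  "has-web-address", "http://example.org/enrico"),
    ("vanessa", "has-web-address", "http://example.org/vanessa"),
]

def homepages_of_interested(triples, area):
    """Join interest triples with homepage triples over their shared subject."""
    interested = {s for s, p, o in triples
                  if p == "has-research-interest" and o == area}
    return sorted(o for s, p, o in triples
                  if p == "has-web-address" and s in interested)

answers = homepages_of_interested(TRIPLES, "semantic-web-area")
```

Because the two triples may live in different sources, it is this kind of join, rather than paragraph retrieval, that yields a precise answer.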

Both open-domain QA systems and AquaLog classify queries: AquaLog based on the triple format, open domain systems based on the expected answer (person, location) or on hierarchies of question types based on the type of answer sought (what, why, who, how, where, ...). The AquaLog classification includes not only information about the answer expected (a list of instances, an assertion, etc.) but also depends on the triples the questions generate (number of triples, explicit versus implicit relationships or query terms) and on how they should be resolved. It groups different questions that can be represented by equivalent triples. For instance, the query "Who works in akt?" is the same as "Which are the researchers involved in the akt project?".

We believe that the main benefit of an ontology-based QA system on the SW, when compared to other kinds of QA systems, is that it can use the domain knowledge provided by the ontology to cope with words apparently not found in the KB and to cope with ambiguity (mapping vocabulary or modifier attachment).

10. Conclusions

In this paper we have described the AquaLog question answering system, emphasizing its genesis in the context of semantic web research. The key ingredient of ontology-based QA systems, and of the most accurate non-ontological QA systems, is the ability to capture the semantics of the question and use it in the extraction of answers in real time. AquaLog does not assume that the user has any prior information about the semantic resource. AquaLog's requirement of portability makes it impossible to have any pre-formulated assumptions about the ontological structure of the relevant information, and the problem cannot be addressed by the specification of static mapping rules. AquaLog presents an elegant solution in which different strategies are combined together to make sense of an NL query with respect to the universe of discourse covered by the ontology. It provides precise answers to complex queries where multiple pieces of information need to be combined together at run time. AquaLog makes sense of query terms/relations expressed in terms familiar to the user, even when they appear not to have any match. Moreover, the performance of the system improves over time in response to a particular community jargon.

Finally, its ontology portability capabilities make AquaLog a suitable NL front-end for a semantic intranet, where a shared dynamic organizational ontology is used to describe resources. However, if we consider the Semantic Web in the large, there is a need to compose information from multiple resources that are autonomously created and maintained. Our future directions on AquaLog and ontology-based QA are evolving from relying on a single ontology at a time to opening up to harvest the rich ontological knowledge available on the Web, allowing the system to benefit from and combine knowledge from the wide range of ontologies that exist on the Web.

Acknowledgements

This work was funded by the Advanced Knowledge Technologies (AKT) Interdisciplinary Research Collaboration (IRC), the Designing Information Extraction for KM (DotKom) project and the Open Knowledge (OK) project. AKT is sponsored by the UK Engineering and Physical Sciences Research Council under grant number GR/N15764/01. DotKom was sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. OK is sponsored by the European Commission as part of the Information Society Technologies (IST) programme under grant number IST-2001-34038. The authors would like to thank Yuangui Lei for technical input and Marta Sabou for all her useful input and her valuable revision of this paper.

Appendix A. Examples of NL queries and equivalent triples

All ontology triples use the KMi ontology.

Wh-generic term:
  Q: Who are the researchers in the semantic web research area?
     Linguistic triple: 〈person/organization, researchers, semantic web research area〉
     Ontology triple: 〈researcher, has-research-interest, semantic-web-area〉
  Q: What are the phd students working for buddy space?
     Linguistic triple: 〈phd students, working, buddy space〉
     Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉
  Q: Name the planet stories that are related to akt.
     Linguistic triple: 〈planet stories, related to, akt〉
     Ontology triple: 〈kmi-planet-news-item, mentions-project, akt〉

Wh-unknown term:
  Q: Show me the job title of Peter Scott.
     Linguistic triple: 〈which is, job title, peter scott〉
     Ontology triple: 〈which is, has-job-title, peter-scott〉
  Q: What are the projects of Vanessa?
     Linguistic triple: 〈which is, projects, vanessa〉
     Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation:
  Q: Are there any projects about semantic web?
     Linguistic triple: 〈projects, semantic web〉
     Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
  Q: Where is the Knowledge Media Institute?
     Linguistic triple: 〈location, knowledge media institute〉
     Ontology triple: no relations found in the ontology

Description:
  Q: Who are the academics?
     Linguistic triple: 〈who/what is, academics〉
     Ontology triple: 〈who/what is, academic-staff-member〉
  Q: What is an ontology?
     Linguistic triple: 〈who/what is, ontology〉
     Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative:
  Q: Is Liliana a phd student in the dip project?
     Linguistic triple: 〈liliana, phd student, dip project〉
     Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
  Q: Has Martin Dzbor any interest in ontologies?
     Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
     Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms:
  Q: Does anybody work in semantic web on the dotkom project?
     Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
     Ontology triples: 〈person, has-research-interest, semantic-web-area〉, 〈person, has-project-member/has-project-leader, dot-kom〉
  Q: Tell me all the planet news written by researchers in akt.
     Linguistic triple: 〈planet news, written, researchers, akt〉
     Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉, 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause):
  Q: Which projects on information extraction are sponsored by epsrc?
     Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
     Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉, 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation):
  Q: Is there any publication about magpie in akt?
     Linguistic triple: 〈publication, magpie, akt〉
     Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉, 〈publication, mentions-project, akt〉

Wh-unknown term (with clause):
  Q: What are the contact details from the KMi academics in compendium?
     Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
     Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉, 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and):
  Q: Does anyone has interest in ontologies and is a member of akt?
     Linguistic triples: 〈person/organization, has interest, ontologies〉, 〈person/organization, member, akt〉
     Ontology triples: 〈person, has-research-interest, ontologies〉, 〈research-staff-member, has-project-member, akt〉

Wh-combination (or):
  Q: Which academics work in akt or in dotcom?
     Linguistic triples: 〈academics, work, akt〉, 〈academics, work, dotcom〉
     Ontology triples: 〈academic-staff-member, has-project-member, akt〉, 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned:
  Q: Which KMi academics work in the akt project sponsored by epsrc?
     Linguistic triples: 〈kmi academics, work, akt project〉, 〈which is, sponsored, epsrc〉
     Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈project, involves-organization, epsrc〉
  Q: Which KMi academics working in the akt project are sponsored by epsrc?
     Linguistic triples: 〈kmi academics, working, akt project〉, 〈kmi academics, sponsored, epsrc〉
     Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉, 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
  Q: Give me the publications from phd students working in hypermedia.
     Linguistic triples: 〈which is, publications, phd students〉, 〈which is, working, hypermedia〉
     Ontology triples: 〈publication, mentions-person/has-author, phd-student〉, 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause:
  Q: What researchers who work in compendium have interest in hypermedia?
     Linguistic triples: 〈researchers, work, compendium〉, 〈researchers, has-interest, hypermedia〉
     Ontology triples: 〈researcher, has-project-member, compendium〉, 〈researcher, has-research-interest, hypermedia〉

Wh-patterns:
  Q: What is the homepage of Peter who has interest in semantic web?
     Linguistic triples: 〈which is, homepage, peter〉, 〈person/organization, has interest, semantic web〉
     Ontology triples: 〈which is, has-web-address, peter-scott〉, 〈person, has-research-interest, semantic-web-area〉
  Q: Which are the projects of academics that are related to the semantic web?
     Linguistic triples: 〈which is, projects, academics〉, 〈which is, related, semantic web〉
     Ontology triples: 〈project, has-project-member, academic-staff-member〉, 〈academic-staff-member, has-research-interest, semantic-web-area〉
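The linguistic-to-ontology mapping shown in these examples can be mimicked with a hand-written lexicon; AquaLog itself derives such matches with string metrics, WordNet and its relation similarity service, so the entries below are purely illustrative stand-ins:

```python
# Toy reconstruction of the linguistic-to-ontology triple mapping
# illustrated above. The lexicon is hand-written; the real system derives
# such matches with string metrics, WordNet and its relation similarity
# service rather than from a fixed table.

LEXICON = {
    "phd students": "phd-student",
    "buddy space": "buddyspace",
    "working": "has-project-member/has-project-leader",
}

def map_triple(linguistic_triple):
    """Map each element of a linguistic triple to an ontology entry-point,
    leaving unrecognized elements unchanged."""
    subject, relation, obj = linguistic_triple
    return (LEXICON.get(subject, subject),
            LEXICON.get(relation, relation),
            LEXICON.get(obj, obj))

onto = map_triple(("phd students", "working", "buddy space"))
```

This reproduces the second wh-generic example: 〈phd students, working, buddy space〉 becomes 〈phd-student, has-project-member/has-project-leader, buddyspace〉.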

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL—an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases—an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ Finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. van Eijck, Discourse Representation Theory, to appear in: The 2nd Edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie—towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon—boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop of Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation 10, 2004. http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE—A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, in: The 12th Text Retrieval Conference (TREC 2003), NIST Special Publication 500-255, 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.


1 s and

atcoiqHotpuobqaBsmcmdmOtr

qbcwnittIatp

stoaa

1

asQtunsio

lreqomaentj

adHaaAalbo

A

ntpsCsmnmpwSp

At

w

W

Name the planet stories thatare related to akt

〈planet stories relatedto akt〉

〈kmi-planet-news-itemmentions-project akt〉

02 V Lopez et al Web Semantics Science Service

nd provide answers in the form of selected paragraphs (wherehe answer actually lies) extracted from very large open-endedollections of unstructured text The key limitation of anntology-based system (as for closed-domain systems) is thatt presumes the knowledge the system is using to answer theuestion is in a structured knowledge base in a limited domainowever AquaLog exploits the power of ontologies as a modelf knowledge and the availability of semantic markup offered byhe SW to give precise focused answers rather than retrievingossible documents or pre-written paragraphs of text In partic-lar semantic markup facilitates queries where multiple piecesf information (that may come from different sources) need toe inferred and combined together For instance when we ask auery such as ldquowhat are the homepages of researchers who haven interest in the Semantic Webrdquo we get the precise answerehind the scenes AquaLog is not only able to correctly under-

tand the question but is also competent to disambiguate multipleatches of the term ldquoresearchersrdquo on the SW and give back the

orrect answers by consulting the ontology and the availableetadata As Basili argues in Ref [6] ldquoopen domain systems

o not rely on specialized conceptual knowledge as they use aixture of statistical techniques and shallow linguistic analysisntological Question Answering Systems [ ] propose to attack

he problem by means of an internal unambiguous knowledgeepresentationrdquo

Both open-domain QA systems and AquaLog classifyueries in AquaLog based on the triple format in open domainased on the expected answer (person location) or hierar-hies of question types based on types of answer sought (whathy who how where ) The AquaLog classification includesot only information about the answer expected (a list ofnstances an assertion etc) but also depending on the tripleshey generate (number of triples explicit versus implicit rela-ionships or query terms) and how they should be resolvedt groups different questions that can be presented by equiv-lent triples For instance the query ldquoWho works in aktrdquo ishe same as ldquoWhich are the researchers involved in the aktrojectrdquo

We believe that the main benefit of an ontology-based QAystem on the SW when compared to other kind of QA sys-ems is that it can use the domain knowledge provided by thentology to cope with words apparently not found in the KBnd to cope with ambiguity (mapping vocabulary or modifierttachment)

0 Conclusions

In this paper we have described the AquaLog questionnswering system emphasizing its genesis in the context ofemantic web research The key ingredient to ontology-basedA systems and to the most accurate non-ontological QA sys-

ems is the ability to capture the semantics of the question andse it in the extraction of answers in real time AquaLog does

ot assume that the user has any prior information about theemantic resource AquaLogrsquos requirement of portability makest impossible to have any pre-formulated assumptions about thentological structure of the relevant information and the prob-

W

Agents on the World Wide Web 5 (2007) 72ndash105

em cannot be addressed by the specification of static mappingules AquaLog presents an elegant solution in which differ-nt strategies are combined together to make sense of an NLuery which respect to the universe of discourse covered by thentology It provides precise answers to complex queries whereultiple pieces of information need to be combined together

t run time AquaLog makes sense of query termsrelationsxpressed in terms familiar to the user even when they appearot to have any match Moreover the performance of the sys-em improves over time in response to a particular communityargon

Finally its ontology portability capabilities make AquaLogsuitable NL front-end for a semantic intranet where a sharedynamic organizational ontology is used to describe resourcesowever if we consider the Semantic Web in the large there isneed to compose information from multiple resources that areutonomously created and maintained Our future directions onquaLog and ontology-based QA are evolving from relying onsingle ontology at a time to opening up to harvest the rich onto-

ogical knowledge available on the Web allowing the system toenefit from and combine knowledge from the wide range ofntologies that exist on the Web

cknowledgements

This work was funded by the Advanced Knowledge Tech-ologies (AKT) Interdisciplinary Research Collaboration (IRC)he Designing Information Extraction for KM (DotKom)roject and the Open Knowledge (OK) project AKT is spon-ored by the UK Engineering and Physical Sciences Researchouncil under grant number GRN1576401 DotKom was

ponsored by the European Commission as part of the Infor-ation Society Technologies (IST) programme under grant

umber IST-2001034038 OK is sponsored by European Com-ission as part of the Information Society Technologies (IST)

rogramme under grant number IST-2001-34038 The authorsould like to thank Yuangui Lei for technical input and Martaabou for all her useful input and her valuable revision of thisaper

Appendix A. Examples of NL queries and equivalent triples

Wh-generic term
- "Who are the researchers in the semantic web research area?"
  Linguistic triple: 〈person/organization, researchers, semantic web research area〉
  Ontology triple(a): 〈researcher, has-research-interest, semantic-web-area〉
- "What are the phd students working for buddy space?"
  Linguistic triple: 〈phd students, working, buddy space〉
  Ontology triple: 〈phd-student, has-project-member/has-project-leader, buddyspace〉

Wh-unknown term
- "Show me the job title of Peter Scott"
  Linguistic triple: 〈which is, job title, peter scott〉
  Ontology triple: 〈which is, has-job-title, peter-scott〉
- "What are the projects of Vanessa?"
  Linguistic triple: 〈which is, projects, vanessa〉
  Ontology triple: 〈project, has-project-member/has-project-leader, vanessa〉

Wh-unknown relation
- "Are there any projects about semantic web?"
  Linguistic triple: 〈projects, semantic web〉
  Ontology triple: 〈project, addresses-generic-area-of-interest, semantic-web-area〉
- "Where is the Knowledge Media Institute?"
  Linguistic triple: 〈location, knowledge media institute〉
  Ontology triple: no relations found in the ontology

Description
- "Who are the academics?"
  Linguistic triple: 〈who/what is, academics〉
  Ontology triple: 〈who/what is, academic-staff-member〉
- "What is an ontology?"
  Linguistic triple: 〈who/what is, ontology〉
  Ontology triple: 〈who/what is, ontologies〉

Affirmative/negative
- "Is Liliana a phd student in the dip project?"
  Linguistic triple: 〈liliana, phd student, dip project〉
  Ontology triple: 〈liliana-cabral (phd-student), has-project-member, dip〉
- "Has Martin Dzbor any interest in ontologies?"
  Linguistic triple: 〈martin dzbor, has any interest, ontologies〉
  Ontology triple: 〈martin-dzbor, has-research-interest, ontologies〉

Wh-3 terms
- "Does anybody work in semantic web on the dotkom project?"
  Linguistic triple: 〈person/organization, works, semantic web, dotkom project〉
  Ontology triples: 〈person, has-research-interest, semantic-web-area〉 〈person, has-project-member/has-project-leader, dot-kom〉
- "Tell me all the planet news written by researchers in akt"
  Linguistic triple: 〈planet news, written, researchers, akt〉
  Ontology triples: 〈kmi-planet-news-item, has-author, researcher〉 〈researcher, has-project-member/has-project-leader, akt〉

Wh-3 terms (first clause)
- "Which projects on information extraction are sponsored by epsrc?"
  Linguistic triple: 〈projects, sponsored, information extraction, epsrc〉
  Ontology triples: 〈project, addresses-generic-area-of-interest, information-extraction〉 〈project, involves-organization, epsrc〉

Wh-3 terms (unknown relation)
- "Is there any publication about magpie in akt?"
  Linguistic triple: 〈publication, magpie, akt〉
  Ontology triples: 〈publication, mentions-project/has-key-publication/has-publication, magpie〉 〈publication, mentions-project, akt〉

Wh-unknown term (with clause)
- "What are the contact details from the KMi academics in compendium?"
  Linguistic triple: 〈which is, contact details, kmi academics, compendium〉
  Ontology triples: 〈which is, has-web-address/has-email-address/has-telephone-number, kmi-academic-staff-member〉 〈kmi-academic-staff-member, has-project-member, compendium〉

Wh-combination (and)
- "Does anyone has interest in ontologies and is a member of akt?"
  Linguistic triples: 〈person/organization, has interest, ontologies〉 〈person/organization, member, akt〉
  Ontology triples: 〈person, has-research-interest, ontologies〉 〈research-staff-member, has-project-member, akt〉

Wh-combination (or)
- "Which academics work in akt or in dotcom?"
  Linguistic triples: 〈academics, work, akt〉 〈academics, work, dotcom〉
  Ontology triples: 〈academic-staff-member, has-project-member, akt〉 〈academic-staff-member, has-project-member, dot-kom〉

Wh-combination conditioned
- "Which KMi academics work in the akt project sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, work, akt project〉 〈which is, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈project, involves-organization, epsrc〉
- "Which KMi academics working in the akt project are sponsored by epsrc?"
  Linguistic triples: 〈kmi academics, working, akt project〉 〈kmi academics, sponsored, epsrc〉
  Ontology triples: 〈kmi-academic-staff-member, has-project-member, akt〉 〈kmi-academic-staff-member, has-affiliated-person, epsrc〉
- "Give me the publications from phd students working in hypermedia"
  Linguistic triples: 〈which is, publications, phd students〉 〈which is, working, hypermedia〉
  Ontology triples: 〈publication, mentions-person/has-author, phd-student〉 〈phd-student, has-research-interest, hypermedia〉

Wh-generic with wh-clause
- "What researchers who work in compendium have interest in hypermedia?"
  Linguistic triples: 〈researchers, work, compendium〉 〈researchers, has-interest, hypermedia〉
  Ontology triples: 〈researcher, has-project-member, compendium〉 〈researcher, has-research-interest, hypermedia〉

Wh-patterns
- "What is the homepage of Peter who has interest in semantic web?"
  Linguistic triples: 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉
  Ontology triples: 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
- "Which are the projects of academics that are related to the semantic web?"
  Linguistic triples: 〈which is, projects, academics〉 〈which is, related, semantic web〉
  Ontology triples: 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

(a) Using the KMi ontology.
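The mappings in the table above can be pictured with a minimal data-structure sketch. This is purely illustrative and not AquaLog's actual API: the class names, the tiny lexicons, and the `resolve` function below are hypothetical stand-ins for the system's real relation similarity service, which combines string metrics, WordNet, and the ontology itself. A linguistic triple holds the terms extracted from the question; a resolution step maps those terms onto the vocabulary of the target ontology.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A 〈subject, relation, object〉 triple, linguistic or ontological."""
    subject: str
    relation: str
    obj: str


# Hypothetical mini-lexicons standing in for the ontology-driven
# similarity matching (string metrics, WordNet, learned mappings).
CLASS_LEXICON = {"researchers": "researcher",
                 "academics": "academic-staff-member"}
RELATION_LEXICON = {("researcher", "researchers in"): "has-research-interest",
                    ("academic-staff-member", "work"): "has-project-member"}


def resolve(linguistic: Triple) -> Triple:
    """Map a linguistic triple onto ontology vocabulary (toy version).

    Unknown terms and relations pass through unchanged; the real system
    would instead ask the user or consult its learning component.
    """
    subject = CLASS_LEXICON.get(linguistic.subject, linguistic.subject)
    relation = RELATION_LEXICON.get((subject, linguistic.relation),
                                    linguistic.relation)
    return Triple(subject, relation, linguistic.obj)


# "Who are the researchers in the semantic web research area?"
lt = Triple("researchers", "researchers in", "semantic-web-area")
print(resolve(lt))
# Triple(subject='researcher', relation='has-research-interest', obj='semantic-web-area')
```

The two-stage shape (question terms first, ontology vocabulary second) mirrors the two columns of the table: disambiguation failures stay visible as linguistic triples that could not be resolved, as in the "Where is the Knowledge Media Institute?" row.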

References

[1] AKT Reference Ontology. http://kmi.open.ac.uk/projects/akt/ref-onto/index.html
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves. http://www.ask.co.uk
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), NIST, Gaithersburg, MD, November 13–16, 2001, pp. 599–607.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001. http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003. http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, in: Encyclopedia of Language and Linguistics, 2nd ed., Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC 2003), Lecture Notes in Computer Science 2870, Springer-Verlag, 2003.
[20] EasyAsk. http://www.easyask.com
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library). http://sourceforge.net/projects/jwordnet
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004. http://www.w3.org/TR/owl-features/
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF. http://www.w3.org/RDF/
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK. http://gate.ac.uk
[50] W3C, OWL Web Ontology Language Guide. http://www.w3.org/TR/2003/CR-owl-guide-20030818/
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project. http://plainmoor.open.ac.uk/webonto
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication, in: The 12th Text Retrieval Conference (TREC), 2003, pp. 255–500.
[54] Z. Zheng, The answer bus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

• AquaLog: An ontology-driven question answering system for organizational semantic intranets
  • Introduction
  • AquaLog in action: illustrative example
  • AquaLog architecture
    • AquaLog triple approach
    • Linguistic component: from questions to query triples
      • Classification of questions
      • Linguistic procedure for basic queries
      • Linguistic procedure for basic queries with clauses (three-term queries)
      • Linguistic procedure for combination of queries
      • The use of JAPE grammars in AquaLog: illustrative example
    • Relation similarity service: from triples to answers
      • RSS procedure for basic queries
      • RSS procedure for basic queries with clauses (three-term queries)
      • RSS procedure for combination of queries
      • Class similarity service
    • Answer engine
    • Learning mechanism for relations
      • Architecture
      • Context definition
      • Context generalization
      • User profiles and communities
  • Evaluation
    • Correctness of an answer
    • Query answering ability
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Reformulation of questions
      • Performance
    • Portability across ontologies
      • AquaLog assumptions over the ontology
      • Conclusions and discussion
      • Results
      • Analysis of AquaLog failures
      • Similarity failures and reformulation of questions
  • Discussion: limitations and future lines
    • Question types
    • Question complexity
    • Linguistic ambiguities
    • Reasoning mechanisms and services
    • New research lines
  • Related work
    • Closed-domain natural language interfaces
    • Open-domain QA systems
    • Open-domain QA systems using triple representations
    • Ontologies in question answering
    • Conclusions of existing approaches
  • Conclusions
  • Acknowledgements
  • Examples of NL queries and equivalent triples
  • References
Page 32: AquaLog: An ontology-driven question answering system for ...technologies.kmi.open.ac.uk/aqualog/aqualog_journal.pdfAquaLog is a portable question-answering system which takes queries

s and

A

w

S

W

w

A

W

D

W

W

A

I

H

w

D

t

w

W

w

I

w

W

w

d

w

W

w

W

W

G

w

W

compendium haveinterest inhypermedia

〈researchershas-interesthypermedia〉

compendium〉〈researcherhas-research-interest

V Lopez et al Web Semantics Science Service

ppendix A (Continued )

h-unknown term Linguistic triple Ontology triple

how me the job titleof Peter Scott

〈which is job titlepeter scott〉

〈which ishas-job-titlepeter-scott〉

hat are the projectsof Vanessa

〈which is projectsvanessa〉

〈project has-project-memberhas-project-leader vanessa〉

h-unknown relation Linguistic triple Ontology triple

re there any projectsabout semanticweb

〈projects semanticweb〉

〈projectaddresses-generic-area-of-interestsemantic-web-area〉

here is theKnowledge MediaInstitute

〈location knowledge mediainstitute〉

No relations found inthe ontology

escription Linguistic triple Ontology triple

ho are theacademics

〈whowhat is academics〉

〈whowhat is academic-staff-member〉

hat is an ontology 〈whowhat is ontology〉

〈whowhat is ontologies〉

ffirmativenegative Linguistic triple Ontology triple

s Liliana a phdstudent in the dipproject

〈liliana phd studentdip project〉

〈liliana-cabral(phd-student)has-project-member dip〉

as Martin Dzbor anyinterest inontologies

〈martin dzbor has anyinterest ontologies〉

〈martin-dzborhas-research-interestontologies〉

h-3 terms Linguistic triple Ontology triple

oes anybody workin semantic web onthe dotkom project

〈personorganizationworks semantic webdotkom project〉

〈personhas-research-interestsemantic-web-area〉〈person has-project-memberhas-project-leader dot-kom〉

ell me all the planetnews written byresearchers in akt

〈planet news writtenresearchers akt〉

〈kmi-planet-news-itemhas-author researcher〉〈researcher has-project-memberhas-project-leader akt〉

h-3 terms (firstclause)

Linguistic triple Ontology triple

hich projects oninformationextraction aresponsored by eprsc

〈projects sponsoredinformationextraction epsrc〉

〈project addresses-generic-area-of-interestinformation-extraction〉〈projectinvolves-organizationepsrc〉

h-3 terms (unknownrelation)

Linguistic triple Ontology triple

s there anypublication about

〈publication magpie akt〉

publicationmentions-projecthas-key-

magpie in akt publicationhas-publication akt〉〈publicationmentions-project akt〉

Agents on the World Wide Web 5 (2007) 72ndash105 103

h-unknown term(with clause)

Linguistic triple Ontology triple

hat are the contactdetails from theKMi academics incompendium

〈which is contactdetails kmiacademicscompendium〉

〈which ishas-web-addresshas-email-addresshas-telephone-numberkmi-academic-staff-member〉〈kmi-academic-staff-memberhas-project-membercompendium〉

h-combination (and) Linguistic triple Ontology triple

oes anyone hasinterest inontologies and is amember of akt

〈personorganizationhas interestontologies〉〈personorganizationmember akt〉

〈personhas-research-interestontologies〉 〈research-staff-memberhas-project-memberakt〉

h-combination (or) Linguistic triple Ontology triple

hich academicswork in akt or indotcom

〈academics workakt〉 〈academicswork dotcom〉

〈academic-staff-memberhas-project-memberakt〉 〈academic-staff-memberhas-project-memberdot-kom〉

h-combinationconditioned

Linguistic triple Ontology triple

hich KMiacademics work inthe akt projectsponsored byepsrc

〈kmi academics workakt project〉 〈which issponsored epsrc〉

〈kmi-academic-staff-memberhas-project-memberakt〉 〈projectinvolves-organizationepsrc〉

hich KMiacademics workingin the akt projectare sponsored byepsrc

〈kmi academicsworking akt project〉〈kmi academicssponsored epsrc〉

〈kmi-academic-staff-memberhas-project-memberakt〉 〈kmi-academic-staff-memberhas-affiliated-personepsrc〉

ive me thepublications fromphd studentsworking inhypermedia

〈which ispublications phdstudents〉 〈which isworking hypermedia〉

publicationmentions-personhas-author phd student〉〈phd-studenthas-research-interesthypermedia〉

h-generic withwh-clause

Linguistic triple Ontology triple

hat researcherswho work in

〈researchers workcompendium〉

〈researcherhas-project-member

hypermedia〉

1 s and

A

2

W

W

R

[

[

[

[

[

[[

[

[

[[

[

[

[

[

[

[[

[[

[

[

[

[

[

[

[

[

[

04 V Lopez et al Web Semantics Science Service

ppendix A (Continued )

-patterns Linguistic triple Ontology triple

hat is the homepageof Peter who hasinterest in semanticweb

〈which is homepagepeter〉〈personorganizationhas interest semanticweb〉

〈which ishas-web-addresspeter-scott〉 〈personhas-research-interestsemantic-web-area〉

hich are the projectsof academics thatare related to thesemantic web

〈which is projectsacademics〉 〈which isrelated semantic web〉

〈projecthas-project-memberacademic-staff-member〉〈academic-staff-memberhas-research-interestsemantic-web-area〉

a Using the KMi ontology

eferences

[1] AKT Reference Ontology httpkmiopenacukprojectsaktref-ontoindexhtml

[2] I Androutsopoulos GD Ritchie P Thanisch MASQUESQLmdashan effi-cient and portable natural language query interface for relational databasesin PW Chung G Lovegrove M Ali (Eds) Proceedings of the 6thInternational Conference on Industrial and Engineering Applications ofArtificial Intelligence and Expert Systems Edinburgh UK Gordon andBreach Publishers 1993 pp 327ndash330

[3] I Androutsopoulos GD Ritchie P Thanisch Natural language interfacesto databasesmdashan introduction Nat Lang Eng 1 (1) (1995) 29ndash81

[4] AskJeeves AskJeeves httpwwwaskcouk[5] G Attardi A Cisternino F Formica M Simi A Tommasi C Zavat-

tari PIQASso PIsa question answering system in Proceedings of thetext Retrieval Conferemce (Trec-10) 599-607 NIST Gaithersburg MDNovember 13ndash16 2001

[6] R Basili DH Hansen P Paggio MT Pazienza FM Zanzotto Ontolog-ical resources and question data in answering in Workshop on Pragmaticsof Question Answering held jointly with NAACL Boston MA May 2004

[7] T Berners-Lee J Hendler O Lassila The semantic web Sci Am 284 (5)(2001)

[8] J Burger C Cardie V Chaudhri et al Tasks and Program Structuresto Roadmap Research in Question amp Answering (QampA) NIST Tech-nical Report 2001 httpwwwaimitedupeoplejimmylin0ApapersBurger00-Roadmappdf

[9] RD Burke KJ Hammond V Kulyukin Question answering fromfrequently-asked question files experiences with the FAQ finder systemTech Rep TR-97-05 Department of Computer Science University ofChicago 1997

10] J Chu-Carroll D Ferrucci J Prager C Welty Hybridization in questionanswering systems in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

12] P Clark J Thompson B Porter A knowledge-based approach to question-answering in In the AAAI Fall Symposium on Question-AnsweringSystems CA AAAI 1999 pp 43ndash51

13] WW Cohen P Ravikumar SE Fienberg A comparison of string distancemetrics for name-matching tasks in IIWeb Workshop 2003 httpwww-2cscmuedusimwcohenpostscriptijcai-ws-2003pdf

14] A Copestake KS Jones Natural language interfaces to databases KnowlEng Rev 5 (4) (1990) 225ndash249

15] H Cunningham D Maynard K Bontcheva V Tablan GATE a frame-work and graphical development environment for robust NLP tools and

applications in Proceedings of the 40th Anniversary Meeting of the Asso-ciation for Computational Linguistics (ACLrsquo02) Philadelphia 2002

16] De Boni M TREC 9 QA track overview17] AN De Roeck CJ Fox BGT Lowden R Turner B Walls A natu-

ral language system based on formal semantics in Proceedings of the

[

[

Agents on the World Wide Web 5 (2007) 72ndash105

International Conference on Current Issues in Computational LinguisticsPengang Malaysia 1991

18] Discourse Represenand then tation Theory Jan Van Eijck to appear in the2nd edition of the Encyclopedia of Language and Linguistics Elsevier2005

19] M Dzbor J Domingue E Motta Magpiemdashtowards a semantic webbrowser in Proceedings of the 2nd International Semantic Web Con-ference (ISWC2003) Lecture Notes in Computer Science 28702003Springer-Verlag 2003

20] EasyAsk httpwwweasyaskcom21] C Fellbaum (Ed) WordNet An Electronic Lexical Database Bradford

Books May 199822] R Guha R McCool E Miller Semantic search in Proceedings of the

12th International Conference on World Wide Web Budapest Hungary2003

23] S Harabagiu D Moldovan M Pasca R Mihalcea M Surdeanu RBunescu R Girju V Rus P Morarescu Falconmdashboosting knowledgefor answer engines in Proceedings of the 9th Text Retrieval Conference(Trec-9) Gaithersburg MD November 2000

24] L Hirschman R Gaizauskas Natural Language question answering theview from here Nat Lang Eng 7 (4) (2001) 275ndash300 (special issue onQuestion Answering)

25] EH Hovy L Gerber U Hermjakob M Junk C-Y Lin Question answer-ing in Webclopedia in Proceedings of the TREC-9 Conference NISTGaithersburg MD 2000

26] E Hsu D McGuinness Wine agent semantic web testbed applicationWorkshop on Description Logics 2003

27] A Hunter Natural language database interfaces Knowl Manage (2000)28] H Jung G Geunbae Lee Multilingual question answering with high porta-

bility on relational databases IEICE Trans Inform Syst E86-D (2) (2003)306ndash315

29] JWNL (Java WordNet library) httpsourceforgenetprojectsjwordnet30] J Kaplan Designing a portable natural language database query system

ACM Trans Database Syst 9 (1) (1984) 1ndash1931] B Katz S Felshin D Yuret A Ibrahim J Lin G Marton AJ McFar-

land B Temelkuran Omnibase uniform access to heterogeneous data forquestion answering in Proceedings of the 7th International Workshop onApplications of Natural Language to Information Systems (NLDB) 2002

32] B Katz J Lin REXTOR a system for generating relations from nat-ural language in Proceedings of the ACL-2000 Workshop of NaturalLanguage Processing and Information Retrieval (NLP(IR)) 2000

33] B Katz J Lin Selectively using relations to improve precision in ques-tion answering in Proceedings of the EACL-2003 Workshop on NaturalLanguage Processing for Question Answering 2003

34] D Klein CD Manning Fast Exact Inference with a Factored Model forNatural Language Parsing Adv Neural Inform Process Syst 15 (2002)

35] Y Lei M Sabou V Lopez J Zhu V Uren E Motta An infrastructurefor acquiring high quality semantic metadata in Proceedings of the 3rdEuropean Semantic Web Conference Montenegro 2006

36] KC Litkowski Syntactic clues and lexical resources in question-answering in EM Voorhees DK Harman (Eds) InformationTechnology The Ninth TextREtrieval Conferenence (TREC-9) NIST Spe-cial Publication 500-249 National Institute of Standards and TechnologyGaithersburg MD 2001 pp 157ndash166

37] V Lopez E Motta V Uren PowerAqua fishing the semantic web inProceedings of the 3rd European Semantic Web Conference MonteNegro2006

38] V Lopez E Motta Ontology driven question answering in AquaLog inProceedings of the 9th International Conference on Applications of NaturalLanguage to Information Systems Manchester England 2004

39] P Martin DE Appelt BJ Grosz FCN Pereira TEAM an experimentaltransportable naturalndashlanguage interface IEEE Database Eng Bull 8 (3)(1985) 10ndash22

40] D Mc Guinness F van Harmelen OWL Web Ontology LanguageOverview W3C Recommendation 10 2004 httpwwww3orgTRowl-features

41] D Mc Guinness Question answering on the semantic web IEEE IntellSyst 19 (1) (2004)

s and

[[

[

[

[[

[

[

[

[[

V Lopez et al Web Semantics Science Service

42] TM Mitchell Machine Learning McGraw-Hill New York 199743] D Moldovan S Harabagiu M Pasca R Mihalcea R Goodrum R Girju

V Rus LASSO a tool for surfing the answer net in Proceedings of theText Retrieval Conference (TREC-8) November 1999

45] M Pasca S Harabagiu The informative role of Wordnet in open-domainquestion answering in 2nd Meeting of the North American Chapter of theAssociation for Computational Linguistics (NAACL) 2001

46] AM Popescu O Etzioni HA Kautz Towards a theory of natural lan-guage interfaces to databases in Proceedings of the 2003 InternationalConference on Intelligent User Interfaces Miami FL USA January

12ndash15 2003 pp 149ndash157

47] RDF httpwwww3orgRDF48] K Srihari W Li X Li Information extraction supported question-

answering in T Strzalkowski S Harabagiu (Eds) Advances inOpen-Domain Question Answering Kluwer Academic Publishers 2004

[

Agents on the World Wide Web 5 (2007) 72ndash105 105

49] V Tablan D Maynard K Bontcheva GATEmdashA Concise User GuideUniversity of Sheffield UK httpgateacuk

50] W3C OWL Web Ontology Language Guide httpwwww3orgTR2003CR-owl-guide-0030818

51] R Waldinger DE Appelt et al Deductive question answering frommultiple resources in M Maybury (Ed) New Directions in QuestionAnswering AAAI Press 2003

52] WebOnto project httpplainmooropenacukwebonto53] M Wu X Zheng M Duan T Liu T Strzalkowski Question answering

by pattern matching web-proofing semantic form proofing NIST Special

Publication in The 12th Text Retrieval Conference (TREC) 2003 pp255ndash500

54] Z Zheng The answer bus question answering system in Proceedings ofthe Human Language Technology Conference (HLT2002) San Diego CAMarch 24ndash27 2002

  • AquaLog An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action illustrative example
    • AquaLog architecture
      • AquaLog triple approach
        • Linguistic component from questions to query triples
          • Classification of questions
            • Linguistic procedure for basic queries
            • Linguistic procedure for basic queries with clauses (three-terms queries)
            • Linguistic procedure for combination of queries
              • The use of JAPE grammars in AquaLog illustrative example
                • Relation similarity service from triples to answers
                  • RSS procedure for basic queries
                  • RSS procedure for basic queries with clauses (three-term queries)
                  • RSS procedure for combination of queries
                  • Class similarity service
                  • Answer engine
                    • Learning mechanism for relations


Appendix A (Continued)

NL query | Linguistic triple | Ontology triple
"What is the homepage of Peter, who has interest in the semantic web?" | 〈which is, homepage, peter〉 〈person/organization, has interest, semantic web〉 | 〈which is, has-web-address, peter-scott〉 〈person, has-research-interest, semantic-web-area〉
"Which are the projects of academics that are related to the semantic web?" | 〈which is, projects, academics〉 〈which is, related, semantic web〉 | 〈project, has-project-member, academic-staff-member〉 〈academic-staff-member, has-research-interest, semantic-web-area〉

a Using the KMi ontology.
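The mapping shown in the table above — from the relation as worded by the user to a relation actually defined in the ontology — can be sketched in code. The fragment below is purely illustrative: the candidate relation names are taken from the table, but the matching heuristic, threshold, and function names are our own invention, not AquaLog's implementation (which combines string metrics [13], WordNet and an ontology-based relation similarity service).

```python
# Illustrative sketch only: pick the ontology relation whose name is most
# similar to the relation term extracted from the user's question.
from difflib import SequenceMatcher

# Candidate relations, as in the KMi ontology examples of Appendix A.
ONTOLOGY_RELATIONS = [
    "has-web-address",
    "has-research-interest",
    "has-project-member",
]

def normalize(term: str) -> str:
    """Lowercase and strip hyphens so surface forms become comparable."""
    return term.lower().replace("-", " ").strip()

def best_relation(query_relation: str, candidates=ONTOLOGY_RELATIONS) -> str:
    """Return the candidate most similar to the user's wording.

    A substring hit (e.g. 'interest' inside 'has-research-interest') is
    boosted above the plain edit-distance score; the 0.9 boost is an
    arbitrary illustrative threshold.
    """
    q = normalize(query_relation)
    scored = []
    for cand in candidates:
        c = normalize(cand)
        score = SequenceMatcher(None, q, c).ratio()
        if q in c:
            score = max(score, 0.9)
        scored.append((score, cand))
    return max(scored)[1]

print(best_relation("interest"))  # → "has-research-interest"
```

In the real system a low-confidence match would not be resolved silently: AquaLog falls back on WordNet synonyms and, if ambiguity remains, asks the user, feeding the choice to its learning mechanism.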

References

[1] AKT Reference Ontology, http://kmi.open.ac.uk/projects/akt/ref-onto/index.html.
[2] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, MASQUE/SQL: an efficient and portable natural language query interface for relational databases, in: P.W. Chung, G. Lovegrove, M. Ali (Eds.), Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, Edinburgh, UK, Gordon and Breach Publishers, 1993, pp. 327–330.
[3] I. Androutsopoulos, G.D. Ritchie, P. Thanisch, Natural language interfaces to databases: an introduction, Nat. Lang. Eng. 1 (1) (1995) 29–81.
[4] AskJeeves, http://www.ask.co.uk.
[5] G. Attardi, A. Cisternino, F. Formica, M. Simi, A. Tommasi, C. Zavattari, PIQASso: PIsa question answering system, in: Proceedings of the Text Retrieval Conference (TREC-10), pp. 599–607, NIST, Gaithersburg, MD, November 13–16, 2001.
[6] R. Basili, D.H. Hansen, P. Paggio, M.T. Pazienza, F.M. Zanzotto, Ontological resources and question answering, in: Workshop on Pragmatics of Question Answering, held jointly with NAACL, Boston, MA, May 2004.
[7] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Sci. Am. 284 (5) (2001).
[8] J. Burger, C. Cardie, V. Chaudhri, et al., Tasks and Program Structures to Roadmap Research in Question & Answering (Q&A), NIST Technical Report, 2001, http://www.ai.mit.edu/people/jimmylin/papers/Burger00-Roadmap.pdf.
[9] R.D. Burke, K.J. Hammond, V. Kulyukin, Question answering from frequently-asked question files: experiences with the FAQ finder system, Tech. Rep. TR-97-05, Department of Computer Science, University of Chicago, 1997.
[10] J. Chu-Carroll, D. Ferrucci, J. Prager, C. Welty, Hybridization in question answering systems, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[12] P. Clark, J. Thompson, B. Porter, A knowledge-based approach to question-answering, in: The AAAI Fall Symposium on Question-Answering Systems, CA, AAAI, 1999, pp. 43–51.
[13] W.W. Cohen, P. Ravikumar, S.E. Fienberg, A comparison of string distance metrics for name-matching tasks, in: IIWeb Workshop, 2003, http://www-2.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf.
[14] A. Copestake, K.S. Jones, Natural language interfaces to databases, Knowl. Eng. Rev. 5 (4) (1990) 225–249.
[15] H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan, GATE: a framework and graphical development environment for robust NLP tools and applications, in: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, 2002.
[16] M. De Boni, TREC 9 QA track overview.
[17] A.N. De Roeck, C.J. Fox, B.G.T. Lowden, R. Turner, B. Walls, A natural language system based on formal semantics, in: Proceedings of the International Conference on Current Issues in Computational Linguistics, Penang, Malaysia, 1991.
[18] J. Van Eijck, Discourse Representation Theory, to appear in: the 2nd edition of the Encyclopedia of Language and Linguistics, Elsevier, 2005.
[19] M. Dzbor, J. Domingue, E. Motta, Magpie: towards a semantic web browser, in: Proceedings of the 2nd International Semantic Web Conference (ISWC2003), Lecture Notes in Computer Science 2870/2003, Springer-Verlag, 2003.
[20] EasyAsk, http://www.easyask.com.
[21] C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database, Bradford Books, May 1998.
[22] R. Guha, R. McCool, E. Miller, Semantic search, in: Proceedings of the 12th International Conference on World Wide Web, Budapest, Hungary, 2003.
[23] S. Harabagiu, D. Moldovan, M. Pasca, R. Mihalcea, M. Surdeanu, R. Bunescu, R. Girju, V. Rus, P. Morarescu, Falcon: boosting knowledge for answer engines, in: Proceedings of the 9th Text Retrieval Conference (TREC-9), Gaithersburg, MD, November 2000.
[24] L. Hirschman, R. Gaizauskas, Natural language question answering: the view from here, Nat. Lang. Eng. 7 (4) (2001) 275–300 (special issue on Question Answering).
[25] E.H. Hovy, L. Gerber, U. Hermjakob, M. Junk, C.-Y. Lin, Question answering in Webclopedia, in: Proceedings of the TREC-9 Conference, NIST, Gaithersburg, MD, 2000.
[26] E. Hsu, D. McGuinness, Wine agent: semantic web testbed application, in: Workshop on Description Logics, 2003.
[27] A. Hunter, Natural language database interfaces, Knowl. Manage. (2000).
[28] H. Jung, G. Geunbae Lee, Multilingual question answering with high portability on relational databases, IEICE Trans. Inform. Syst. E86-D (2) (2003) 306–315.
[29] JWNL (Java WordNet Library), http://sourceforge.net/projects/jwordnet.
[30] J. Kaplan, Designing a portable natural language database query system, ACM Trans. Database Syst. 9 (1) (1984) 1–19.
[31] B. Katz, S. Felshin, D. Yuret, A. Ibrahim, J. Lin, G. Marton, A.J. McFarland, B. Temelkuran, Omnibase: uniform access to heterogeneous data for question answering, in: Proceedings of the 7th International Workshop on Applications of Natural Language to Information Systems (NLDB), 2002.
[32] B. Katz, J. Lin, REXTOR: a system for generating relations from natural language, in: Proceedings of the ACL-2000 Workshop on Natural Language Processing and Information Retrieval (NLP&IR), 2000.
[33] B. Katz, J. Lin, Selectively using relations to improve precision in question answering, in: Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.
[34] D. Klein, C.D. Manning, Fast exact inference with a factored model for natural language parsing, Adv. Neural Inform. Process. Syst. 15 (2002).
[35] Y. Lei, M. Sabou, V. Lopez, J. Zhu, V. Uren, E. Motta, An infrastructure for acquiring high quality semantic metadata, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[36] K.C. Litkowski, Syntactic clues and lexical resources in question-answering, in: E.M. Voorhees, D.K. Harman (Eds.), Information Technology: The Ninth Text REtrieval Conference (TREC-9), NIST Special Publication 500-249, National Institute of Standards and Technology, Gaithersburg, MD, 2001, pp. 157–166.
[37] V. Lopez, E. Motta, V. Uren, PowerAqua: fishing the semantic web, in: Proceedings of the 3rd European Semantic Web Conference, Montenegro, 2006.
[38] V. Lopez, E. Motta, Ontology driven question answering in AquaLog, in: Proceedings of the 9th International Conference on Applications of Natural Language to Information Systems, Manchester, England, 2004.
[39] P. Martin, D.E. Appelt, B.J. Grosz, F.C.N. Pereira, TEAM: an experimental transportable natural-language interface, IEEE Database Eng. Bull. 8 (3) (1985) 10–22.
[40] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004, http://www.w3.org/TR/owl-features/.
[41] D. McGuinness, Question answering on the semantic web, IEEE Intell. Syst. 19 (1) (2004).
[42] T.M. Mitchell, Machine Learning, McGraw-Hill, New York, 1997.
[43] D. Moldovan, S. Harabagiu, M. Pasca, R. Mihalcea, R. Goodrum, R. Girju, V. Rus, LASSO: a tool for surfing the answer net, in: Proceedings of the Text Retrieval Conference (TREC-8), November 1999.
[45] M. Pasca, S. Harabagiu, The informative role of WordNet in open-domain question answering, in: 2nd Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), 2001.
[46] A.M. Popescu, O. Etzioni, H.A. Kautz, Towards a theory of natural language interfaces to databases, in: Proceedings of the 2003 International Conference on Intelligent User Interfaces, Miami, FL, USA, January 12–15, 2003, pp. 149–157.
[47] RDF, http://www.w3.org/RDF/.
[48] K. Srihari, W. Li, X. Li, Information extraction supported question-answering, in: T. Strzalkowski, S. Harabagiu (Eds.), Advances in Open-Domain Question Answering, Kluwer Academic Publishers, 2004.
[49] V. Tablan, D. Maynard, K. Bontcheva, GATE: A Concise User Guide, University of Sheffield, UK, http://gate.ac.uk.
[50] W3C, OWL Web Ontology Language Guide, http://www.w3.org/TR/2003/CR-owl-guide-20030818/.
[51] R. Waldinger, D.E. Appelt, et al., Deductive question answering from multiple resources, in: M. Maybury (Ed.), New Directions in Question Answering, AAAI Press, 2003.
[52] WebOnto project, http://plainmoor.open.ac.uk/webonto.
[53] M. Wu, X. Zheng, M. Duan, T. Liu, T. Strzalkowski, Question answering by pattern matching, web-proofing, semantic form proofing, NIST Special Publication 500-255, in: The 12th Text Retrieval Conference (TREC), 2003.
[54] Z. Zheng, The AnswerBus question answering system, in: Proceedings of the Human Language Technology Conference (HLT 2002), San Diego, CA, March 24–27, 2002.

  • AquaLog: An ontology-driven question answering system for organizational semantic intranets
    • Introduction
    • AquaLog in action: illustrative example
    • AquaLog architecture
      • AquaLog triple approach
      • Linguistic component: from questions to query triples
        • Classification of questions
        • Linguistic procedure for basic queries
        • Linguistic procedure for basic queries with clauses (three-term queries)
        • Linguistic procedure for combination of queries
        • The use of JAPE grammars in AquaLog: illustrative example
      • Relation similarity service: from triples to answers
        • RSS procedure for basic queries
        • RSS procedure for basic queries with clauses (three-term queries)
        • RSS procedure for combination of queries
        • Class similarity service
      • Answer engine
      • Learning mechanism for relations
        • Architecture
        • Context definition
        • Context generalization
        • User profiles and communities
    • Evaluation
      • Correctness of an answer
      • Query answering ability
        • Conclusions and discussion
          • Results
          • Analysis of AquaLog failures
          • Reformulation of questions
          • Performance
      • Portability across ontologies
        • AquaLog assumptions over the ontology
        • Conclusions and discussion
          • Results
          • Analysis of AquaLog failures
          • Similarity failures and reformulation of questions
    • Discussion: limitations and future lines
      • Question types
      • Question complexity
      • Linguistic ambiguities
      • Reasoning mechanisms and services
      • New research lines
    • Related work
      • Closed-domain natural language interfaces
      • Open-domain QA systems
      • Open-domain QA systems using triple representations
      • Ontologies in question answering
      • Conclusions of existing approaches
    • Conclusions
    • Acknowledgements
    • Examples of NL queries and equivalent triples
    • References