THE UNIVERSITY OF NEW SOUTH WALES
Document Management and Retrieval
for Specialised Domains: An Evolutionary User-Based Approach
Mihye Kim
A thesis submitted to The School of Computer Science and Engineering
The University of New South Wales Sydney Australia
in fulfilment of the requirements for the degree of Doctor of Philosophy
March 2003
Copyright © 2003 by Mihye Kim. All rights reserved.
Certificate of Originality
I hereby declare that this submission is my own work and to the best of my knowledge it
contains no materials previously published or written by another person, nor material which to a
substantial extent has been accepted for the award of any other degree or diploma at UNSW or
any other educational institution, except where due acknowledgment is made in the thesis. Any
contribution made to the research by others, with whom I have worked at UNSW or elsewhere,
is explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except to
the extent that assistance from others in the project's design and conception or in style,
presentation and linguistic expression is acknowledged.
(Signed) Mihye Kim 17/07/2003
Abstract
Browsing marked-up documents by traversing hyperlinks has probably become the most
important means by which documents are accessed, both via the World Wide Web (WWW) and
organisational Intranets. However, there is a pressing demand for document management and
retrieval systems to deal appropriately with the massive number of documents available. There
are two classes of solution: general search engines, whether for the WWW or an Intranet, which
make little use of specific domain knowledge, or hand-crafted specialised systems, which are
costly to build and maintain.
The aim of this thesis was to develop a document management and retrieval system suitable for
small communities as well as individuals in specialised domains on the Web. The aim was to
allow users to easily create and maintain their own organisation of documents while ensuring
continual improvement in the retrieval performance of the system as it evolves. The system
developed is based on the free annotation of documents by users and is browsed using the
concept lattice of Formal Concept Analysis (FCA). A number of annotation support tools were
developed to aid the annotation process so that a suitable system evolved. Experiments were
conducted in using the system to assist in finding staff and student home pages at the School of
Computer Science and Engineering, University of New South Wales.
Results indicated that the annotation tools provided a good level of assistance, so that documents
were easily organised, and that a lattice-based browsing structure evolving in an ad hoc fashion
provided good retrieval efficiency. An interesting result suggested that although
an established external taxonomy can be useful in proposing annotation terms, users appear to
be very selective in their use of the terms proposed. Results also supported the hypothesis that the
concept lattice of FCA helped take users beyond a narrow search to find other useful
documents. In general, lattice-based browsing was considered a more helpful method than
Boolean queries or hierarchical browsing for searching a specialised domain.
We conclude that the concept lattice of Formal Concept Analysis, supported by annotation
techniques, is a useful way of supporting the flexible open management of documents required
by individuals, small communities and in specialised domains. It seems likely that this approach
can be readily integrated with other developments such as further improvements in search
engines and the use of semantically marked-up documents, and provide a unique advantage in
supporting autonomous management of documents by individuals and groups – in a way that is
closely aligned with the autonomy of the WWW.
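For readers unfamiliar with Formal Concept Analysis, the following minimal sketch (with invented page names and annotation keywords, not data from the thesis) illustrates how formal concepts, the nodes of a concept lattice, can be derived by brute force from a small document-keyword context:

```python
from itertools import combinations

# Hypothetical toy context for illustration only: documents are objects,
# annotation keywords are attributes.
context = {
    "page_A": {"machine learning", "knowledge acquisition"},
    "page_B": {"machine learning", "data mining"},
    "page_C": {"data mining"},
}
all_keywords = set().union(*context.values())

def common_keywords(docs):
    # A': the keywords shared by every document in `docs`.
    if not docs:
        return set(all_keywords)
    return set.intersection(*(context[d] for d in docs))

def docs_with(keywords):
    # B': the documents annotated with every keyword in `keywords`.
    return {d for d, ks in context.items() if keywords <= ks}

# Enumerate formal concepts by closing every subset of documents.
concepts = set()
for r in range(len(context) + 1):
    for docs in combinations(context, r):
        intent = common_keywords(set(docs))
        extent = docs_with(intent)
        concepts.add((frozenset(extent), frozenset(intent)))

for extent, intent in sorted(concepts, key=lambda c: (len(c[0]), sorted(c[0]))):
    print(sorted(extent), "<->", sorted(intent))
```

Each printed pair is a formal concept: a maximal set of documents together with exactly the keywords they all share. Ordering these pairs by extent inclusion yields the lattice used for browsing. Note that the thesis builds its lattice incrementally (Chapter 5) rather than by this exponential enumeration.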
Acknowledgments
There are many people I should thank who helped me to come to this moment. First of all, I
would like to express my special gratitude to my supervisor Prof. Paul Compton for his
guidance, ideas and insights towards this thesis. I am also grateful for his financial support,
encouragement and help. This thesis would not have been possible without him. I must also
thank Dr ByeongHo Kang who suggested this project to me. He sowed the seed of this thesis.
I would like to thank staff and research students who participated in the experimental study.
Their contribution to this thesis is invaluable.
I would like to especially thank Dr Rex Kwok for his generous time in discussing and reading
drafts, and offering suggestions and encouragement. I would also like to thank Dr Bao Vo for
helping with the formalisation of mathematical formulas, and Pamela Mort and Victor Jauregui for their
feedback on drafts. In particular, I thank Jane Brennan for her sharing and encouragement, with lots of
love and friendship. Additionally, I would like to thank the following people for their support,
help and friendship: A/Prof. Achim Hoffmann, Prof. Norman Foo, Dr Rodrigo Martinez-Bejar,
Dr Ashesh J. Mahidadia, Dr YoungJu Yho, Hendra Suryanto, SeungYeol Yoo, Tri Minh Cao,
Son Bao Pham, Julian Kerr, Sue Lewis, Angela Finlayson and Abdus Khan. Many thanks are
also due to many people in the School of Computer Science and Engineering, the University of
New South Wales, Australia.
I am especially grateful to all my friends Veronica, Gemma, Cecilia, Lucy and her husband
Stephen, Maria, Ameleta and her husband Albino for their much love and prayer. I would also
like to give my special thanks to Fr. Augustine for his love, prayer and encouragement. I must
thank my family - my father and mother, brothers and sisters, nieces and nephews - for their
endless love, prayer and emotional and financial support.
I would like to express my gratitude to all members of my community AFI for their love and
support. In particular, I am grateful for their understanding and acceptance of my studying for such a
long time, living alone here in Australia.
Finally, I thank you - Lord and Holy Mary - for Your endless Love and Grace which allowed
me to persevere, and thank you for Your Companionship in moments of uncertainty, anxiety
and conflict throughout my long journey.
I dedicate this thesis to Loving God.
Table of Contents
Chapter 1 Introduction
  1.1 Document Management and Retrieval for Specialised Domains
  1.2 The Aim of this Thesis
  1.3 The Structure of this Thesis

Chapter 2 Document Management and Retrieval in Literature
  2.1 General Approach
    2.1.1 Boolean Query
    2.1.2 Clustering
  2.2 Ontological Approach
    2.2.1 A Notion of Ontology
    2.2.2 Types of Ontologies
    2.2.3 The Issues relevant to Ontologies
  2.3 Formal Concept Analysis Approach
  2.4 Proposed Approach
  2.5 Chapter Summary

Chapter 3 Document Management for Retrieval with Ripple-Down Rules
  3.1 Ripple-Down Rules
    3.1.1 Background of RDR
    3.1.2 Basics of RDR
    3.1.3 Strengths of RDR
    3.1.4 Limitations of RDR
  3.2 A Help Desk System with Ripple-Down Rules
    3.2.1 Overview of the System
    3.2.2 Keywords and Help Questions
    3.2.3 Knowledge Structure
    3.2.4 Knowledge Acquisition
    3.2.5 Search Methods
    3.2.6 Optimising Process of a Rule Tree
  3.3 Conclusion and Discussion
  3.4 Chapter Summary

Chapter 4 Formal Concept Analysis
  4.1 Basic Notions of FCA
    4.1.1 Formal Context
    4.1.2 Formal Concept
  4.2 Concept Lattice
    4.2.1 Construction of a Concept Lattice
    4.2.2 Algorithms for Constructing a Concept Lattice
  4.3 Conceptual Scaling
  4.4 FCA for Information Retrieval
    4.4.1 Godin et al. Approach
    4.4.2 Carpineto and Romano Approach
    4.4.3 FaIR Approach
    4.4.4 Cole et al. Approach
    4.4.5 Proposed Approach
  4.5 Chapter Summary

Chapter 5 A Formal Framework of Document Management and Retrieval for Specialised Domains
  5.1 Basic Notions of the System
    5.1.1 Formal Context
    5.1.2 Formal Concept
    5.1.3 Concept Lattice
  5.2 Incremental Construction of a Concept Lattice
    5.2.1 Basic Definitions of the Algorithms
    5.2.2 Description of the Algorithms
  5.3 Document Management
    5.3.1 Phase One: Reusing Terms in the System
    5.3.2 Phase Two: Using Imported Terms from Taxonomies
    5.3.3 Phase Three: Using co-occurred Terms in the Lattice
    5.3.4 Phase Four: Identifying related Documents
    5.3.5 Phase Five: Adding New Terms
    5.3.6 Phase Six: Logging Users’ Queries
  5.4 Document Retrieval
    5.4.1 Browsing the Lattice Structure
    5.4.2 Entering a Boolean Query
  5.5 Conceptual Scaling
    5.5.1 Conceptual Scaling for a Many-valued Context
    5.5.2 Conceptual Scaling for a One-valued Context
  5.6 Chapter Summary

Chapter 6 Implementation
  6.1 Overview of the System
  6.2 Basic Environment of the System
  6.3 Presentation of the System
    6.3.1 Domain of Research Interests in a Computer Science School
      6.3.1.1 Document Annotation
      6.3.1.2 System Maintenance by a Knowledge Engineer
      6.3.1.3 Document Retrieval and Browsing
    6.3.2 Domain of Proceedings Papers
  6.4 Chapter Summary

Chapter 7 Experimental Evaluation
  7.1 Experimental Design
  7.2 Experimental Results
    7.2.1 Annotation Mechanisms
      7.2.1.1 Users’ Annotation Activities
      7.2.1.2 Survey: Questionnaire on the Annotation Mechanisms
    7.2.2 Ontology Evolution
    7.2.3 Lattice-based Browsing
      7.2.3.1 Browsing Structure
      7.2.3.2 Survey: Questionnaire on Lattice-based Browsing
  7.4 Chapter Summary

Chapter 8 Discussion and Conclusion
  8.1 Motivation
  8.2 Summary of Results
    8.2.1 Annotation Mechanisms
    8.2.2 Lattice-based Browsing
    8.2.3 Web-based System
    8.2.4 Imported Ontologies
  8.3 Expectations for Other Domains
  8.4 Future Work
    8.4.1 Ontologies
    8.4.2 Annotation Support
    8.4.3 Integration with Other Techniques
    8.4.4 Security and Extension
  8.5 Conclusion

Appendix
  A.1 Retrieval Performance on the Queries in Table 7.11
  A.2 Chi-Square (χ²) Matrix for Table 7.17
  A.3 Critical Values of the Chi-Square Distribution used in Chapter 7

Bibliography
List of Figures

Figure 2.1. An infrastructure of the Semantic Web
Figure 2.2. An instantiation example of the ontologies (an annotated home page)
Figure 2.3. A search result using the ontological browser of KA2
Figure 2.4. Top-level categories of Cyc (adapted from Lenat and Guha 1990)
Figure 3.1. An example of the knowledge structure for the help system
Figure 3.2. The result documents by each search method with the keyword “printer”
Figure 3.3. An optimising process of a rule tree
Figure 4.1. The concept lattice of the formal context in Table 4.1
Figure 4.2. A scale context for the attribute price (Sprice) in Table 4.4 and its concept lattice
Figure 4.3. A scale context for the attribute transmission (Strans) and its concept lattice
Figure 4.4. Concept lattice for the derived context in Table 4.5
Figure 4.5. Combined scales for price and transmission using a nested line diagram
Figure 4.6. An example of a line diagram (extracted from Groh et al. 1998)
Figure 5.1. A concept lattice of the formal context C in Table 5.1
Figure 5.2. The annotating process of keywords for a document
Figure 5.3. Examples of hierarchies extracted from taxonomies
Figure 5.4. A lattice £(D′, K′, I′) of the formal context C′ from Figure 5.1
Figure 5.5. An example of a lattice structure
Figure 5.6. Partially ordered multi-valued attributes for the domain of research interests
Figure 5.7. Examples of nested structures corresponding to concepts
Figure 5.8. An example of pop-up and pull-down menus for the nested structure of a concept
Figure 5.9. A conceptual scale for the grouping name “databases”
Figure 6.1. Architecture of the system
Figure 6.2. An example of a browsing structure
Figure 6.3. An example for the annotation of a home page
Figure 6.4. An example of selecting topics from other researchers
Figure 6.5. An example of displaying possible relevant topics for the page being annotated
Figure 6.6. An example of relevant pages with the page being annotated
Figure 6.7. An example of identifying related pages
Figure 6.8. An example of adding new terms
Figure 6.9. An example of editing grouping names
Figure 6.10. A snapshot of browsing the top-level concepts
Figure 6.11. An example of a browsing structure
Figure 6.12. An example of the main features of the lattice browsing interface
Figure 6.13. An example of a textword search
Figure 6.14. An example of the nested structure of a concept
Figure 6.15. The search result with the selection of nested items
Figure 6.16. An example of the search result extended by a taxonomy
Figure 6.17. An example of a search result and a hierarchical clustering
Figure 6.18. An example of navigating the concept lattice
Figure 6.19. An example of a nested structure for a grouping
Figure 7.1. Questionnaire used for the annotation mechanisms
Figure 7.2. An example of a different view on the hierarchies of terms
Figure 7.3(a). Examples of the browsing structure that evolved
Figure 7.3(b). Examples of the browsing structure that evolved
Figure 7.3(c). Examples of the browsing structure that evolved
Figure 7.4. The first and second questions used in the survey of lattice-based browsing
Figure 7.5. The third and fourth questions used in the survey of lattice-based browsing
Figure 7.6. The questionnaire results on “What did you find?”
List of Tables

Table 4.1. Formal context for a part of “the Animal Kingdom”
Table 4.2. A procedure of finding formal concepts from the context in Table 4.1
Table 4.3. Summary for the time complexity and polynomial delay of algorithms
Table 4.4. An example of a many-valued context for a part of a “used car market”
Table 4.5. A realised scale context for the scale price in Figure 4.2
Table 5.1. A part of the formal context in the proposed system
Table 5.2. An example of the many-valued context for the domain of research interests
Table 5.3. Examples of groupings for scales in the one-valued context
Table 7.1. Number of pages annotated
Table 7.2. Task for each phase of the annotation process
Table 7.3. Number of terms added at each phase for 59 home pages
Table 7.4. Examples of abbreviation classes registered to the system
Table 7.5. The questionnaire results on the annotation mechanisms
Table 7.6. The questionnaire results on the research topics supported
Table 7.7. Cross-distribution between the number of topics on the list and their generality
Table 7.8. Cross-distribution between the number of topics on the list and their appropriateness
Table 7.9. Cross-distribution between appropriateness and helpfulness of the listed topics
Table 7.10. The percentage of the selected terms among the relevant taxonomy terms
Table 7.11. Document retrieval using various taxonomies
Table 7.12. The respondents’ information
Table 7.13. The purpose of the use of the system
Table 7.14. The questionnaire results on retrieval performance
Table 7.15. A cross table with respondents and the reasons they failed for retrieval
Table 7.16. Cross-distribution between the used search methods and the number of steps taken
Table 7.17. User opinion on search methods for domain-specific document retrieval
Table 7.18. Cross-distribution between lattice-based and hierarchical browsing choices
Table 7.19. Cross-distribution between lattice-based browsing and Boolean query choices
Table 7.20. The questionnaire results on the system performance and the user interface
Chapter 1
Introduction
1.1. Document Management and Retrieval for Specialised Domains
The World Wide Web is taking over as the main means of providing information.
Keeping pace with this evolution of the Web, there is a great demand from
organisations as well as individuals for document management and retrieval systems for
specialised domains on the Web. Better organisation of documents can make it easier
for users to readily find the information they want.
There are many feature-packed, high-end commercial search applications based on
conventional information retrieval mechanisms, such as Alta Vista, Infoseek and
Excite (Sullivan 2000; Platt 1998). Such software can be used to index the
information on local Web sites, and current document management and retrieval
systems of organisations depend greatly on such retrieval applications. Extraordinary
progress has been made, to the point that general search engines now serve
innumerable requests for finding information on the Web.
Despite improvements in this area (e.g., Google and Teoma), specific queries for
information remain very frustrating. The only search terms the user can think of may
occur in a myriad of other contexts, and perhaps do not even occur in some relevant
documents. The obvious problem with general search systems is in finding the
particular documents that are relevant to one’s interest, query or particular task of the
moment. Another major problem with information retrieval is the difficulty of
finding or setting appropriate keywords when one fails to get a search result (Rousseau
et al. 1998). In addition, general search engines make no use of domain knowledge and
force users to look at a linear display of loosely organised search results.
Some search systems support a better browsing interface (e.g., Alta Vista, Yahoo and
the Open Directory Project) using a handcrafted organisation of documents. However,
such systems are costly to build and maintain. More recently, dynamic clustering search
engines (e.g., Vivisimo and WiseNut) have emerged with an automatic document
clustering feature. These systems also make no use of domain knowledge, emphasising
only syntactic analysis between queries and documents, as general search engine
mechanisms do. The inability to fully analyse the content of a document semantically can
result in low precision.
In view of this, general retrieval mechanisms may not always be the ideal tool to use
when trying to find specific information, particularly for specialised domains. There is a
need to develop a new approach for domain-specific search mechanisms, rather than
simply using conventional document retrieval systems. The new mechanism should
manage documents for specialised domains of organisations so that they can be readily
and precisely retrieved. It should be easy to build for a given set of documents and be
able to deal with changes easily.
In response to the problems of general search engines, there are new research initiatives
such as the Semantic Web community portal [1] and the W3C Web Ontology Working
Group [2], which have the goal of enriching the information in documents for better
retrieval. The biggest emerging research area is the use of ontologies to explore the
potential of associating Web content with explicit meaning [3]. Underlying these research
initiatives is the belief that to make full use of the resources on the World Wide Web,
documents would have to be marked up according to agreed ontological standards. This
means that improved search and better organisation of documents will only be possible
by encoding machine processable semantics in the context of the documents using
ontologies. The HyperText Markup Language (HTML) does not support specifying
1. http://www.semanticweb.org/. There are many sub-research groups. Refer to the following web sites: http://DAML.SemanticWeb.org/, http://Ontobroker.SemanticWeb.org/, http://Protege.SemanticWeb.org/, http://OntoWeb.SemanticWeb.org/ and others (2002).
2. http://www.w3.org/TR/webont-req/ (2002).
3. For the discussion here an ontology is simply an agreed naming and description convention for the domain.
semantics, but only formatting and hyperlinks. As a consequence, the research
initiatives have developed new representation techniques for documents to encode
semantics based on ontology representation schemes such as XML/S4, RDF/S5, OIL6,
DAML7 and DAML+OIL. Some semantic searches are already available on the Web8
for special purposes.
In general, this approach assumes knowledge engineers or ontology developers will
build ontologies first for a specific subject domain. It then requires users to gain some
mastery of the particular ontology to annotate documents to fit into the ontologies, or
uses specialists in the ontologies to do the annotation. There is also research into
automatic mark-up, but this is a longer term goal. Ontology-based retrieval is intended
to allow users to access information more accurately and explicitly. There are likely to
be considerable practical advantages to even very large communities committing to
specific ontologies, and part of the education process would be to learn the relevant
ontologies. For knowledge management on the Web - facilitating knowledge sharing
and reuse - the contribution of the ontological approach deserves attention. Moreover,
ontology-based reasoning can power advanced information access and navigation by
deriving new concepts automatically based on implied inter-ontology relationships.
Despite the practical advantages of a community committing to an ontology, there is
also a view that any knowledge structure is a construct. Clancey (1993a; 1997)
suggested that when experts are asked to indicate how they solve a problem, they
construct an answer rather than recall their problem solving method. There has been a
wide range of philosophical discussion on this topic, broadly known as situated
cognition. According to Peirce (1931) “knowledge is always under construction,
4. XML/S (eXtensible Markup Language/Schema): http://www.w3.org/XML (2002).
5. RDF/S (Resource Description Framework/Schema): http://www.w3.org/RDF (2002).
6. OIL (Ontology Inference Layer): http://www.ontoknowledge.org/oil (2002).
7. DAML (DARPA Agent Markup Language): http://www.daml.org (2002).
8. The SHOE Semantic Search engine: http://www.cs.umd.edu/projects/plus/SHOE/search/ (2002); DAML Semantic Search: http://plucky.teknowledge.com/daml/damlquery.jsp (2002); Knowledge Acquisition Community Search (KA2 Initiative): http://ka2portal.aifb.uni-karlsruhe.de/ (2002).
incomplete and continuously assured by human discourse within an intersubjective
community of communication”. Knowledge acquisition systems which try to take
account of the constructed nature of knowledge include Personal Construct Psychology
(Gaines and Shaw 1990) and Ripple-Down Rules (Compton and Jansen 1990). Ripple-
Down Rules in particular emphasises the evolutionary and changing nature of
knowledge. Based on this philosophical perspective, we would like to explore a new
approach for a Web-based document management and retrieval system for specialised
domains.
1.2. The Aim of this Thesis
An alternative approach for specialised domains or specific Web sites may be to allow
users to create their own organisation of documents and to assist them in ensuring
improvement of the system’s performance as it evolves. The aim of this thesis is to
develop a system suitable for organisations as well as individuals to incrementally and
easily build and maintain their own document management system. It should allow free
annotation of documents by multiple users and should continue to evolve both in the
structure of browsing and in the retrieval performance. This means that the browsing
scheme for document retrieval should evolve as the users annotate their documents.
Another aim is to explore the possibilities of document management systems that do not commit to a priori ontologies or expect all documents to be annotated according to them. The aim is systems that support users in annotating a document however they like, with the ontology evolving accordingly. Rather than being totally ad hoc, the system should assist users in making extensions to the developing ontology that are, in some way, improvements. This approach does not, however, preclude the inclusion of ontologies imported from elsewhere, but only as a resource that the user is free to use, partially or fully.
The system proposed here uses the lattice-based browsing structure supported by the
concept lattice of Formal Concept Analysis (Wille 1982). Lattice-based browsing is
automatically and incrementally constructed, and is used as the basic structure for
retrieval processes. To incrementally improve the search performance of the system,
knowledge acquisition techniques are developed by reusing terms used by others and
terms imported from other taxonomies. The goal of incremental development is similar
to both Ripple-Down Rules (Compton and Jansen 1990) and Repertory Grids (Gaines
and Shaw 1990). The key strategy of the proposed system is to incorporate the
advantages of the concept lattice of Formal Concept Analysis appropriate for browsing,
while keeping the incremental aspects of Ripple-Down Rules.
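To make the lattice-based approach concrete, the following is a minimal sketch of the core operations of Formal Concept Analysis: the derivation operators between object sets and attribute sets, and the closure that turns an annotation into a formal concept. The toy context, page names and function names are illustrative assumptions only, not the thesis implementation.

```python
# A formal context maps objects (here, home pages) to attribute sets
# (annotation terms). A formal concept is a pair (extent, intent) in
# which each component determines the other. Hypothetical toy data.
context = {
    "page_A": {"machine-learning", "knowledge-acquisition"},
    "page_B": {"machine-learning", "neural-networks"},
    "page_C": {"knowledge-acquisition"},
}

def extent(attrs):
    """Objects having every attribute in attrs."""
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    """Attributes shared by every object in objs."""
    sets = [context[o] for o in objs]
    return set.intersection(*sets) if sets else set.union(*context.values())

def concept_of(attrs):
    """Close an attribute set into a formal concept (extent, intent)."""
    e = extent(set(attrs))
    return e, intent(e)
```

Browsing then amounts to moving between such concepts, ordered by extent inclusion, rather than down a single-parent tree.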
In summary, the aim of this thesis is to develop a Web-based document management
system for fairly small communities in specialised domains supporting incremental
development of the system over time. The system is based on free annotation of
documents by users and assists the users in ensuring improvement of the system’s
performance as it evolves. The browsing structure is collaboratively created and
maintained over time by multiple users as an ontology (taxonomy) develops. The main
focus is an emphasis on incremental development and evolution of the system.
To evaluate the proposed approach, experiments were conducted in the application
domain of annotating researchers’ home pages9 according to their research interests
in the School of Computer Science and Engineering, University of New South Wales.
There are around 150 research staff and students in the School who generally have
home pages indicating their research interests. The aim was to allow staff and students
to freely annotate their pages so that they can be found appropriately within an evolving
lattice of research topics. The goal was a system to assist prospective students and
potential collaborators in finding research relevant to their interests.
We have also set up a system10 that allows users to annotate papers from the on-line
Banff Knowledge Acquisition Proceedings. The aim of this system was to provide some
comparability with the ontological approaches such as the KA2 initiative11.
9. http://pokey.cse.unsw.edu.au/servlets/RI.
10. http://pokey.cse.unsw.edu.au/servlets/Search.
11. http://ka2portal.aifb.uni-karlsruhe.de/ (2002).
1.3. The Structure of this Thesis
Chapter 2 discusses the current state of document management for information retrieval.
Firstly, a review of the current general Web search engines including document
clustering is carried out. Secondly, we review ontological approaches that aim to better
organise documents to support not only better search results but also better reasoning
with documents. Thirdly, information retrieval based on Formal Concept Analysis
(FCA) is briefly introduced as the core technique of the proposed system is based on
FCA. Finally, the proposed approach for a domain-specific document management and
retrieval system is briefly outlined.
The first attempt at incremental development of document management systems in this
study was based on the techniques of Ripple-Down Rules (RDR). Thus, Chapter 3
provides an overview of RDR with its background and basics, including its strengths
and limitations. Secondly, an automatic help desk system where RDR was used for
document management and retrieval is presented. Then, the issues relevant to the RDR
help desk system are addressed.
Chapter 4 introduces the basic idea of Formal Concept Analysis (FCA) including formal
contexts, formal concepts, concept lattices and conceptual scaling. A number of
algorithms in the literature for constructing a lattice are also presented. Here lattice-
based models for information retrieval where FCA has been applied are reviewed in
detail.
Chapter 5 presents a theoretical framework for the Web-based domain-specific
document management and retrieval system that we propose. Firstly, basic notions of
the proposed system are defined and an incremental algorithm we have developed for
building a concept lattice is provided. Secondly, annotation mechanisms, which
cooperate with the knowledge acquisition mechanisms as a way of document
management, are presented. Thirdly, lattice-based document retrieval both by browsing
a concept lattice and using a Boolean query interface is described. Finally, conceptual
scaling to associate a lattice browsing structure with an ontological structure is
presented.
Chapter 6 describes systems implemented on the World Wide Web to demonstrate the
value of the proposed approach. The first system is for obtaining research interests in
the School of Computer Science and Engineering, University of New South Wales
(UNSW). The second is a system that gives access to the on-line Banff Knowledge
Acquisition Proceedings papers.
Chapter 7 presents the experimental results of using the system to find staff and student
home pages for research interests at the School of Computer Science and Engineering,
UNSW. For the experiment the system was made available on the School Web site and
all users’ activities both for searching and annotating their home pages were recorded.
Finally, Chapter 8 provides a brief summary of the thesis. We then conclude by outlining possible directions for further development of the research presented in this
thesis.
Chapter 2
Document Management and Retrieval in Literature
This chapter presents the current state of research into document management for
information retrieval. The ultimate objective of document management is to organise
documents in a better way so users can easily search for the information they want.
The World Wide Web plays a role as a means of organising documents for retrieval. The HyperText Markup Language (HTML) is currently the basic representation language for
documents on the Web. Documents are presented in the HTML format and managed for
retrieval using a variety of information retrieval techniques. HTML is essentially a text
stream with special codes embedded. These codes are a standard protocol for
presentation on the screen of a Web browser, rather than encoding machine processable
semantic information. Information is described in natural languages in HTML
documents. This simplicity has made possible its dramatic success within a short period.
Moreover, lay persons without any computer background knowledge can create HTML
documents.
But this simplicity also limits its further growth. As indicated earlier, there has been
extraordinary progress made in the development of general Web search engines that are
able to access the stored information on the Web. Despite improvements in this area,
using specific queries to get information remains very frustrating. Most problems with
the current search engines are due to the limitations of natural language processing. The
complete extraction of semantic meaning that the authors embed in natural languages is
still impracticable.
In response to this problem, many new research initiatives have been set up to enrich
Web resources by developing new representation techniques such as XML(S), RDF(S),
OIL, DAML and DAML+OIL. This is the emerging Semantic Web aiming to encode
machine processable semantics based on these representation techniques. Here,
ontologies play the role of the backbone of the Semantic Web (next version of the
Web). These research initiatives believe that to make full use of the resources on the
World Wide Web, documents have to be marked up according to agreed ontological
standards to support more accurate information. There is a great expectation for the
value of this approach: “We are going to build a brain of and for humankind” (Fensel
and Musen 2001, pp. 25).
This chapter reviews both general Web search mechanisms and ontology-based
retrieval, starting with the current general Web search engines including document
clustering in Section 2.1. There is a variety of research on information retrieval systems.
The research includes developing better statistical mechanisms, indexing techniques,
stemming, and clustering algorithms, supporting more diverse search options and
logical operations. It also addresses improvements to the user interface and user feedback, visualisation of information, personalisation and natural language processing.
The focus of this review is to present the advantages and disadvantages of typical search
engine mechanisms in general, rather than looking at every one of these techniques. In
Section 2.2, ontological approaches that aim to better organise documents for better
search results are presented, including addressing the issues with ontologies from the
perspective of the Semantic Web. In Section 2.3, information retrieval based on Formal
Concept Analysis (Wille 1982) is briefly introduced, as the proposed approach is based
on Formal Concept Analysis. Finally, a brief outline of the proposed approach for a
Web document management system for retrieval is presented in Section 2.4.
2.1. General Approach
A Web search engine is a software program that takes a search query from a user, and
finds information corresponding to the user’s query from numerous servers on the
Internet (He 1998). Each search engine has knowledge of the Web and attempts to
provide the required information in response to a user’s information needs. Web
searching can be considered as another form of information retrieval, because most of
the techniques used in current Web search engines are drawn from Information
Retrieval. Web searching deals with semi-structured data (HTML/XML) and a dynamic
collection of documents, whereas Information Retrieval deals with unstructured data
(plain text) and generally a static collection of documents.
Search engines collect information on Web pages using Web crawlers (robots) or by
user submission. Then, search systems normalise words contained in a collection of
documents using automatic algorithms, and index the documents based on the
normalised words usually using a vector space model (Salton and McGill 1983) or a
probabilistic model (Turtle and Croft 1991). Information is recollected and reindexed at
regular intervals.
The major research issues in this area are (1) document representation, (2) query
representation, and (3) retrieval methods. Document representation aims to represent the
set of documents by capturing the “essences” for fast checking. Query representation
provides various options to assist users in obtaining better results. The retrieval finds all
documents that are similar to the query and constructs a list of results according to their
apparent relevance. Search engines differ primarily in how they handle each of these
issues. The ultimate objective of this area is to improve the efficiency and effectiveness
of search engines so that they discriminate relevant documents associated with a query
from all other documents in the database.
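As a rough illustration of the vector-space indexing and retrieval just described, the sketch below turns documents and a query into term-frequency vectors and ranks documents by cosine similarity. The two documents are invented; a real engine would add idf weighting, stemming, stop-word removal and index structures on top of this.

```python
# Hedged sketch of vector-space retrieval (after Salton and McGill 1983):
# documents and the query become term-weight vectors, and documents are
# ranked by cosine similarity. Raw term frequency is used for brevity.
import math
from collections import Counter

docs = {
    "d1": "web search engines index web documents",
    "d2": "formal concept analysis builds a concept lattice",
}

def vector(text):
    """Bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(v1, v2):
    dot = sum(v1[t] * v2[t] for t in v1)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def rank(query):
    """Document ids ordered by decreasing similarity to the query."""
    q = vector(query)
    return sorted(docs, key=lambda d: cosine(q, vector(docs[d])), reverse=True)
```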
Broadly speaking there are two ways in which a user interacts with search engines
(document retrieval systems). In one, the user formulates a specific query and some
documents are retrieved in response. In the second approach, the documents are grouped
and the document groups are organised into a structure that can be browsed (document
clustering). The user searches for documents by navigating this organised structure.
2.1.1. Boolean Query
A search engine usually has an empty input box which allows a user to enter a specific
query in a sequence of keywords. The search engine then finds a list of Web sites
(URLs) that are relevant to the user’s query through its database. These sites may be
ranked according to their relevance to the query. This process is normally iterative in
that the user refines the query on the basis of the documents retrieved by each query.
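The mechanics of such keyword matching can be sketched with an inverted index, where each term maps to the set of documents containing it and a conjunctive (AND) query intersects those sets. The three documents are hypothetical, and real engines layer ranking on top of this Boolean core.

```python
# Minimal sketch of Boolean retrieval over an inverted index.
# Document texts are invented for illustration only.
docs = {
    1: "ripple down rules for knowledge acquisition",
    2: "formal concept analysis for document retrieval",
    3: "document management with ripple down rules",
}

# Build the inverted index: term -> set of document ids containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Documents containing every query term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```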
The ideal would be that specific queries would always produce the most relevant
documents because the user interface is easy to use and can cover the diverse levels of
user knowledge and retrieval skills. As indicated previously, despite improvements in
this area (e.g., Google12 and Teoma13), finding relevant documents on the Web or even a
single site, remains a frustrating task. The only search terms the user can think of occur in a myriad of other contexts. It is frequently difficult to get a search right despite
setting up an apparently specific and appropriate query. The normalised words extracted
using a variety of statistical mechanisms do not always concisely represent the meaning
of the documents due to the limitations of natural language processing. HTML does not
support specifying semantics, only formatting and hyperlinks. In addition, general
search engines force the user to look at a linear display of loosely organised results.
As a result, documents are indexed in a classification scheme or Web directory in many
information retrieval systems. This is often emphasised as a necessity in Information
Retrieval for organising documents (Dewey Decimal System14) and for novice users
who do not know precisely what they want or how to get it (Conklin 1987; Landauer et
al. 1982; Thompson and Croft 1989; Oddy 1977). With browsing15, users can quickly
explore the search domains and can easily acquire domain knowledge (Marchionini and
Shneiderman 1988).
The following section presents how documents can be clustered (categorised) for
browsing, and reviews current clustering search engines.
12. Google (http://www.google.com) is a Web search engine and implements a ranking algorithm based on listing the most popular Web sites first. It is a simple technique based on the assumption that those are most likely to be the sites someone is searching for, improving search performance dramatically.
13. Teoma (http://www.teoma.com) is a new search engine similar to Google, but Teoma uses subject-specific popularity, not just general popularity like Google. Subject-specific popularity ranks a site based on the number of same-subject pages that reference it. This means that Teoma generates a list of subjects from the results of a query, and then analyses the relationships of sites within each subject. Teoma also presents a set of refinement terms to allow users to clarify their queries. However, it is not yet proven that Teoma produces better results than Google.
14. The Dewey Decimal System (http://www.oclc.org/dewey/) is the most widely used library classification system in the world. It has been used for over a century.
15. Browsing is a navigation process over given structures to reach the target information or knowledge.
2.1.2. Clustering
The concept of clustering has been investigated for as long as there have been libraries
(Kowalski 1997). It has proved an important tool for constructing a taxonomy of a
domain by the grouping of closely related documents (Faloutsos and Oard 1995; Frakes
and Baeza-Yates 1992; Salton and McGill 1983). Clustering is also used for a
classification scheme (Duda and Hart 1973) and has been suggested as a method for
formulating browsing (Cutting et al. 1993).
There are two ways in which clustering is constructed with information retrieval
systems: pre-clustering and post-clustering. Document clustering has been traditionally
examined based on pre-clustering (Van Rijsbergen 1979). In this approach, clustering is
performed on all documents in advance and constructs a classification scheme (subject
categories). Documents are then located in relation to the subject categories by a
similarity measure between the subject terms and the content of documents. Most
directory systems used on the Web (Alta Vista, Excite, Yahoo and so on) follow this
paradigm. But such manual clustering systems are costly to build and maintain.
Another approach of clustering is based on post-clustering (Croft 1978; Cutting et al.
1992; Leouski and Croft 1996; Charikar et al. 1997; Zamir and Etzioni 1998). In this
approach, the clustering is applied on the returned documents corresponding to a user’s
query so that it produces more precise results than a pre-clustering approach (Hearst and
Pedersen 1996; Zamir and Etzioni 1999). Clustering search engines such as Vivisimo16
and WiseNut17 follow this paradigm.
These search engines are obviously a huge leap forward in dynamic Web clustering and
help users when they are exploring very broad subjects, or when they are looking for
something obscure. This is because clustering search engines automatically and
16. Vivisimo (http://vivisimo.com/) is a clustering search engine. It analyses the snippets (titles, URLs and short descriptions) in the search results of a query, and clusters the results into hierarchical sub-categories. By clicking on a sub-category, a user can get a result page showing only the selected category.
17. WiseNut (http://www.wisenut.com/) is also a clustering search engine similar to Vivisimo, but not nearly as well done as Vivisimo (http://websearch.about.com/library/searchtips/bltotd010905.htm, 2002).
dynamically organise search results of a query into hierarchical sub-categories and these
categories can make it easier for the users to refine their query.
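A crude sketch of this post-clustering idea: snippets returned for a query are grouped under a shared content term, which then serves as a sub-category label. Systems such as Vivisimo use far more sophisticated phrase-based algorithms; the snippets, stop-word list and labelling rule below are invented purely for illustration.

```python
# Naive post-clustering sketch: group result snippets by the first
# content term they share with another snippet in the result set.
from collections import defaultdict

STOPWORDS = {"a", "an", "the", "for", "of", "in", "with", "and", "to"}

snippets = [
    "an introduction to neural networks",
    "neural networks in speech recognition",
    "formal concept analysis of a lattice",
    "browsing a concept lattice",
]

def cluster(snips):
    groups = defaultdict(list)
    for s in snips:
        terms = [t for t in s.lower().split() if t not in STOPWORDS]
        # label the snippet with its first content term that also occurs
        # in some other snippet; fall back to its first term
        shared = [t for t in terms
                  if any(t in o.lower().split() for o in snips if o != s)]
        groups[(shared or terms)[0]].append(s)
    return dict(groups)
```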
However, the efficiency of search performance is still in question, because these approaches also make no use of domain knowledge, emphasising only syntactic
analysis between queries and documents. The words in documents or snippets do not
always represent the meaning of the documents. Some may be relevant, but others will
not be. In addition, the words which represent the meaning of the documents do not
always exist in the content of the documents. Moreover, most clustering only focuses
on grouping closely related documents into the same cluster (class) and building a one- or two-level hierarchical tree structure in which each cluster has exactly one parent.
That is, the clustering only formulates relationships between parent and child classes,
but does not formulate the relationship between classes in the different branches of the
hierarchy. This can cause the problem of category mismatch18 (Furnas et al. 1983)
where one wrong decision can be critical in failing to find the right documents, and
contributes to the low performance of these techniques. If one goes down the wrong
path one must go back up the hierarchy and start again. There is no mechanism for
navigating to other clusters, as there is only a simple taxonomy structure.
2.2. Ontological Approach
In response to the problems of general search engines, new research initiatives have
been set up to enrich Web resources to allow better retrieval. Many researchers consider
that full use of the Web is only possible by encoding machine processable semantics in
the content of the Web presented in the HTML format. Here, ontologies play the role of
the semantics for the Web resources. In this approach, ontologies are built first for a
domain or a specific subject area, and documents in the domain are annotated based on
the ontologies. Then, ontology-based retrieval is supported based on the annotated
ontologies to enable more accurate searches. This allows a simple Boolean search to
extend to complex higher-order searches. For example, “find all companies which had a profit increase in 2002 that was less than their profit increase in 2001”. The structures of
18 A category mismatch is a violation of the default correspondence between categories at different levels
of representation.
ontologies are also utilised as a browsing scheme. One of the main aims of this
approach is to facilitate the sharing of information between communities as well as
individuals within the groups.
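To make the kind of higher-order query quoted above concrete, here is a hedged sketch in which ontology annotations are reduced to (subject, property, value) triples and the profit-increase question is answered over them. The company names, property names and figures are entirely invented.

```python
# Hypothetical triple store standing in for ontology-annotated data.
triples = [
    ("AcmeCo", "profitIncrease2001", 12.0),
    ("AcmeCo", "profitIncrease2002", 8.0),
    ("BetaLtd", "profitIncrease2001", 3.0),
    ("BetaLtd", "profitIncrease2002", 5.0),
]

def prop(subject, name):
    """Value of a property for a subject."""
    return next(v for s, p, v in triples if s == subject and p == name)

def companies_with_smaller_2002_increase():
    """Companies whose 2002 profit increase was below their 2001 increase."""
    subjects = {s for s, _, _ in triples}
    return sorted(s for s in subjects
                  if prop(s, "profitIncrease2002") < prop(s, "profitIncrease2001"))
```

The point is that once annotations carry typed, machine-processable values rather than free text, such comparisons become simple queries.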
The biggest emerging research area with ontologies is the Semantic Web19 for exploring
the potential of associating Web content with explicit meaning. Figure 2.1 shows an
ontology infrastructure for the Semantic Web. It evolves with Web-based ontology
representation languages such as XML/S, RDF/S, OIL, DAML and DAML+OIL. The
Web Ontology Working Group20 has also been founded by the W3C consortium to
construct a standard Ontology Web Language (OWL) for the emerging Semantic Web.
One hundred and ninety four ontologies covering a wide range of topics are available on
the DARPA Web site21. An ontology-based search is also available for DAML
annotated Web pages (Li et al. 2002)22.
Figure 2.1. An infrastructure of the Semantic Web.
19. “The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation” (Berners-Lee et al. 2001). In other words, the Semantic Web (http://www.semanticweb.org/) is a vision for the future of the Web to power more explicit Web search by sharing and integrating information available on the Web. Ontologies are the backbone of the Semantic Web.
20. http://www.w3.org/TR/webont-req (work in progress, 2002).
21. http://www.daml.org/ontologies/ (work in progress, 2002).
22. http://plucky.teknowledge.com/daml/damlquery.jsp/ (2002).
[Figure 2.1 depicts: ontologies, expressed in ontology representation languages and built with annotation tools or manually, linking Web pages to annotated Web pages and metadata repositories; inference engines and a user interface then serve queries and results to the end user.]
A more specific example of this type of activity is the KA2 initiative23 (Benjamins and
Fensel 1998; Benjamins et al. 1999; Staab et al. 2000). KA2 starts out with ontologies
appropriate to the domain of knowledge acquisition with the expectation that people in
the community will annotate documents according to those ontologies. These same
users should also be able to use the ontologies to retrieve documents entered by others,
or to use the structures of the ontologies for browsing. The KA2 initiative has eight sub-ontologies: organisations, projects, persons, research-topics, publications, events,
research-products and research-groups. Each ontology has its own classes, sub-classes,
attributes, values and relations. There are some hierarchical relationships between
classes and sub-classes. The ontologies are described with the ontology representation
language OIL24 and DAML+OIL25. Figure 2.2 shows an example of the annotated page
of a researcher based on the ontologies26. A demonstration system is also available at
the Web site: http://ka2portal.aifb.uni-karlsruhe.de/. The system supports knowledge
retrieval to access more accurate information based on the annotated pages for the
knowledge acquisition community.
There seem to be many potential benefits from ontologies in performing high quality
semantic searches. Ontology-based retrieval can empower advanced information access
and navigation by deriving new concepts automatically based on implied inter-ontology
relationships (automated reasoning). Users will be able to conduct more accurate
searches, and to find and learn more than they expected. For example, suppose that a
user is looking for the address of a certain person (here “Steffen Staab” shown in Figure
2.2) using the ontological browser of KA2. The system may present more information
than the user expected such as the person’s affiliation, e-mail, research interests,
projects which were annotated based on the ontologies, as shown in Figure 2.3.
23. This research aims at intelligent knowledge retrieval from the Web. Another objective of the initiative is to gain better insight into distributed ontological engineering processes. The researchers chose “the knowledge acquisition community” as the ontology domain to model.
24. http://ontobroker.semanticweb.org/ontologies/swrc-onto-2000-09-10.oil (2002).
25. http://ontobroker.semanticweb.org/ontologies/swrc-onto-2001-12-11.daml (2002).
26. This is extracted from the Web site (http://www.aifb.uni-karlsruhe.de/~sst/) by viewing the source code of the page (2002).
Figure 2.2. An instantiation example of the ontologies (an annotated home page).
<HTML>
<HEAD>
<TITLE>Steffen Staab - Main</TITLE>
<META name="DC.Creater" content="Steffen Staaf">
….
<!--
<rdf:RDF
  xmlns:rdf = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:swrc = "http://www.semanticweb.org/ontologies/swrc-onto-2000-09-10.daml#"
  xmlns:ai = "https://www.daml.org/actionitems/actionitems-20000905.rdfs#">
  <rdf:Description about="https://www.daml.org/actionitems/12.rdf">
    <ai:actionByDamlParticipant>
      <ai:Action>
        <ai:state>closed</ai:state>
        <ai:status>http://aifb.uni-karlsruhe.de/WBS/sst/index.html</ai:status>
        <ai:date>2000-10-23</ai:date>
        <ai:by>[email protected]</ai:by>
      </ai:Action>
    </ai:actionByDamlParticipant>
  </rdf:Description>
  <swrc:Lecturer rdf:ID="person:sst">
    <swrc:name>Steffen Staab</swrc:name>
    <swrc:email>[email protected]</swrc:email>
    <swrc:phone>+49-(0)721-608 4751</swrc:phone>
    <swrc:fax>+49-(0)721-608 6580</swrc:fax>
    <swrc:homepage>http://www.aifb.uni-karlsruhe.de/WBS/sst/index.html</swrc:homepage>
    <swrc:organizerOrChairOf rdf:resource="event:OL_ECAI-2000_Workshop"/>
  </swrc:Lecturer>
  <rdf:Event rdf:ID="event:OL_ECAI-2000_Workshop">
    <swrc:date>2000-08-20</swrc:date>
    <swrc:location>Berlin, Germany</swrc:location>
    <swrc:eventTitle>Ontology Learning 2000 === Workshop at ECAI-2000</swrc:eventTitle>
  </rdf:Event>
</rdf:RDF>
-->
</HEAD>
<BODY>
…..
</BODY>
</HTML>
Figure 2.3. A search result using the ontological browser of KA2.
However, it still remains an unproven conjecture that ontological approaches will
enhance search capabilities (Uschold 2002). Semantic querying capabilities are active
areas of research, but the computational properties of such a query language, both
theoretical and empirical, are yet to be determined (Horrocks 2002). There are also a
number of critical issues relating to ontological approaches that need to be addressed.
These issues are discussed in Section 2.2.3.
To review ontological approaches for knowledge management and retrieval more fully,
the definition of an ontology and what the goals are that people pursue in ontology
communities will be examined. Secondly, the types of ontologies based on a standard
for categorising ontologies will be discussed. Finally, the issues relevant to ontologies
such as ontology construction, the knowledge acquisition bottleneck and the user
interface with the evolution of ontologies will be observed.
2.2.1. A Notion of Ontology
Recently, ontology has become a major subject of interest as a powerful way to express
the nature of a domain. In the knowledge engineering community, ontologies have also
become popular due to the growing importance of knowledge integration, sharing and
reuse in a formal and task independent way. What ontologies are is still a debated issue.
Various definitions of ontology have been presented in the literature. The most cited
definition of an ontology in the knowledge engineering community is as follows: An
ontology is an explicit specification of a conceptualisation (Gruber 1993).
A conceptualisation is an abstract, simplified view of the world: the process of identifying the abstract objects, concepts and other entities presumed to exist in a certain domain, together with the relationships that hold among them (Genesereth and Nilsson 1987). Any real world situation can be considered a particular instantiation of an
ontology. In addition, any knowledge-based system requires some representation of the
world over which it reasons. A central part of knowledge representation for a domain (a
part of the world) is based on elaborating a conceptualisation (Valente and Breuker
1996) and building an ontology. Elaborating a conceptualisation is an essential
component for knowledge representation tasks, because conceptualisations abstract
which things are relevant to be represented and which are not (Davis et al. 1993).
Guarino (1997) cited and reviewed a number of definitions trying to establish a
comprehensive definition of an ontology. In a later paper, he refines Gruber’s definition
(1993) by making clear the difference between an ontology and a conceptualisation as
follows:
“An ontology is a logical theory accounting for the intended meaning of
a formal vocabulary, i.e. its ontological commitment to a particular
conceptualization of the world. The intended models of a logical
language using such a vocabulary are constrained by its ontological
commitment. An ontology indirectly reflects this commitment (and the
underlying conceptualization) by approximating these intended models”
(Guarino 1998, p. 7).
Most researchers agree that an ontology must include a vocabulary and its definitions, even though there is no consensus on a more detailed characterisation, and the definitions are often vague (Heflin 2001). Typically, a formal ontology consists of
terms, definitions and formal axioms relating them together (Gruber 1993). The
definitions associate the names of entities in the world, such as classes, relations, functions and constraints, with descriptions of what those names mean.
Guarino (1995) underlined the necessity for formal ontological principles based on the
interdisciplinary perspective within the knowledge engineering community. First of all,
he pointed out the principles of formal ontologies based on the modelling view of
knowledge acquisition proposed by Clancey, “ the primary concern of knowledge
engineering is modelling systems in the world, not replicating how people think”
(Clancey 1993b, pp.34). In other words, a knowledge base must be the result of a
modelling activity relating to an external environment, rather than a repository of
knowledge extracted from an expert’s mind. Gaines (1993), Gruber (1995) and
Schreiber et al. (1993) hold a similar view.
Following this perspective, formal ontologies aim to make conceptual modelling less dependent on particular perspectives. Another principle of formal ontologies is to facilitate communication between diverse communities. Additionally, ontologies support knowledge sharing (Musen 1992; Gruber 1993; Gruber 1995; Pirlein and Studer 1995): ontologies can share and reuse other ontologies, or at least parts of them, for a variety of different purposes. If a well-developed ontology exists, another ontology can use it without having to remodel it.
Depending on the underlying definitions adopted, ontologies can be distinguished into different types. The general types of ontologies are described next, following the views of Laresgoiti et al. (1996), Studer et al. (1998) and van Heijst et al. (1997).
2.2.2. Types of Ontologies
Ontologies can be identified under four major categories, namely generic ontologies,
representation ontologies, domain ontologies, and application ontologies, depending on
their generalisation levels or the subject of the conceptualisation. Each ontology category is briefly described using an example. This, however, is not a standard for categorising ontologies; there are also other ways of describing ontologies, such as
information ontology, enterprise ontology, method ontology, upper-level ontology,
lower-level ontology, taxonomical ontology and others.
Generic Ontologies
General ontologies are also referred to as upper-level ontologies or as core ontologies
(van Heijst et al. 1997). These ontologies usually represent general world knowledge. In
the upper-level ontologies, a taxonomy tends to be the central part of the ontologies.
Terms in the world are typically organised in a taxonomy, even when there is some
disagreement about the hierarchy among ontology researchers. All upper-level
ontologies try to categorise the same world, but they are very different at their top-level
(Noy and Hafner 1997).
Cyc27 and Sowa’s ontology (2000) can be considered as typical generic ontologies.
WordNet28, one of the most well developed lexical ontologies, can also be classified in
this category. Figure 2.4 shows the top level of the Cyc hierarchy.
27 To create a general ontology for commonsense knowledge, Cyc (http://www.cyc.com/) was founded by Doug Lenat in 1994 (Lenat 1995; Lenat and Guha 1990; also see the Web site: http://www.cyc.com/cyc-2-1/cover.html). The knowledge base is built upon a core of over 1,000,000 hand-entered assertions (or “rules”) designed to capture a large portion of consensus knowledge about the world.
28 http://www.cogsci.princeton.edu/~wn/ (2002).
Figure 2.4. Top-level categories of Cyc (adapted from Lenat and Guha 1990).
Representation Ontologies
Such ontologies provide a representational framework without committing to any
particular domain. An example of this category is the Frame ontology (Gruber 1993),
which allows users to define the concepts of a modelled domain (frames, slots, relations and constraints on the slots). Users can build a knowledge base by instantiating the concepts they define. Ontolingua’s Frame Ontology29 is the most representative such ontology and for some years has been considered a standard in the ontology community.
Domain Ontologies
Domain ontologies specify the knowledge for a particular type of domain, such as a medical, electronic or other domain, and generalise over the application tasks in that domain. The KA2 initiative (Benjamins et al. 1999; Staab et al. 2000) can be categorised as this type of ontology. Ontologies built to facilitate the Semantic Web can also be categorised as domain ontologies.
29 Ontolingua is the ontology building language used by the Ontolingua Server (Farquhar et al. 1997; also see the Web site: http://www-ksl-svc.stanford.edu:5915/).
Application Ontologies
An application ontology is an ontology used by a particular application containing all
the definitions required for knowledge modelling in the application. It also contains the
information structures for building an application system. Typically, application
ontologies are related to the particular tasks of the application. People can construct an
application ontology adapted to a particular task at hand by importing from existing
ontologies. That is, an application ontology can be formulated by adapting and merging existing domain and generic ontologies to suit a particular task and domain.
Within each major categorisation of ontologies above, the description levels of the
ontologies are diverse. For example, the Open Directory Project, one of the world’s biggest taxonomies, is categorised as a generic ontology, but only the definitions of the terms used in this directory system are established. Most application ontologies for
enterprise applications simply use the structure of the domain ontologies (classes,
subclasses and attributes), even though many are turning to a more formal ontology to
accurately share information and interact between communities. The ontology that is
defined by the Web-Ontology Working Group30 to facilitate the Semantic Web requires
specification of classes, attributes and their relationships. This is much simpler than a
formal ontology which is required within ontology communities.
2.2.3. The Issues relevant to Ontologies
Many researchers believe that improved search is only possible by using ontologies to
encode machine processable semantics in the content of the documents. There are likely
to be considerable practical advantages through using various specialised reasoning
services. The explicit representation of the semantics underlying Web pages and
resources should enable intelligent access of heterogeneous and distributed knowledge,
and a qualitatively better level of service (Ding et al. 2002).
30 The working group (http://www.w3.org/TR/webont-req, 2002) has not reached consensus on all topics as open issues are still under discussion (work in progress). But the currently specified requirements for an ontology are its classes (general things), the relationships that can exist among things and the properties (attributes) those things may have.
Although ontologies promise to solve many knowledge management and retrieval
problems on the Web and can play a key role in the Semantic Web, these promises
contain many assumptions such as well constructed ontologies, well annotated pages,
knowledge annotation mechanisms synchronised with ontology evolution, sophisticated
semantic querying capabilities and others.
Ontology Construction
To facilitate communication between agents and people based on ontologies, first of all,
a standard ontology definition and language are required. There are a number of
different ontology representation languages such as RDF, OIL, DAML, DAML+OIL.
However, the meaning of the term ontology is often vague and still there is no widely
accepted formal definition of an ontology, even though communities are trying to
specify common consensus ontologies. Communities try to follow the principles of ontologies when constructing them, but in reality ontologies are highly varied. Applications must commit to the same consensus ontologies in order to share meaning and to give inference engines access to the explicit knowledge sanctioned by those ontologies. To address this issue, an ontology working group has been formed to develop a W3C standard ontology language (OWL). However, more flexible mechanisms may be preferable to enforcing the use of a standard language; translation mechanisms can be an alternative route to compatibility between existing ontology languages.
Other important issues currently facing ontology research communities relate to
ontology evolution, ontology extension and ontology divergence (Ding et al. 2002;
Heflin 2001; W3C technical report31 2002). Knowledge is constantly changing so
ontologies will change over time. Thus, the management of ontology change is
necessary for consistency with the corresponding changes to knowledge and
information. A Web ontology language and inference engine must accommodate
ontology evolution. One prominent aim of ontologies is to facilitate knowledge sharing
and reuse. A large ontology can be developed by combining, adding and refining
existing ontologies. To achieve this aim, ontologies must use the same terms and
31 http://www.w3.org/TR/webont-req (work in progress, 2002).
axioms to model similar concepts and must manage ontology extension. Inference
engines also need to refer to the content of the extended ontology concepts. But most
current ontology systems do not accommodate extension (Heflin 2001). When agents need to develop an application, they can use existing ontologies, but these are often insufficient and not easily merged with each other. Issues relating to ontology extension concern how agents can extend existing ontologies, and how inference engines take into account extended ontologies that may be critical for the knowledge organisation. Thus, an ontology must be designed to adapt well to, and complement, other ontologies when potential applications are considered.
Ontology communities try to build a standard ontology for a domain, but the existence
of diverse ontologies for the same domain is unavoidable. Different people can build
different ontologies for the same domain. When an agent builds a domain-specific
ontology for an application, the agent can use shared ontologies, but the extension of an
existing ontology is often needed. The same applies when multiple agents build similar applications in the same domain. As a consequence, application ontologies for the same domain can be diverse. Therefore, integration mechanisms will be necessary to accommodate ontology divergence: two different terms can have the same meaning, and the same term can have different meanings. The four state
conditions (consensus, correspondence, conflict and contrast) of Gaines and Shaw
(1989) for shared knowledge construction should be considered when different
ontologies are integrated.
Knowledge Acquisition Bottleneck
One of the major issues relating to ontologies is annotating the content of documents with ontological terms to provide machine-processable semantics. In theory, automated
annotation tools (e.g., AeroDAML32) may overcome the knowledge acquisition
bottleneck. However, due to the limitations of NLP (Natural Language Processing),
complete automatic annotation is unrealistic (Heflin 2001). Semi-automatic methods,
where human annotators are involved in the annotation process based on techniques
32 AeroDAML (http://ubot.lockheedmartin.com/ubot/hotdaml/aerodaml.html/) is a knowledge markup
tool that automatically generates DAML annotation from Web pages (Kogut and Holmes 2001).
from natural language processing, machine learning and information extraction, may be
the optimal solution. A number of semi-automatic semantic annotation tools (e.g.,
OntoAnnotate33, OntoMat34, and SHOE knowledge annotator35) are available. However,
ontology evolution can cause inconsistency between the ontologies and the contents of
annotated documents or meta-data. Ontologies evolve so the annotation process may
need to evolve to synchronise with the corresponding changes to ontologies.
User Interface
Another important area is the query interface. By using an ontological browser, users
may not need to know complete ontologies. However, users are still required to understand the available ontological terms sufficiently to formulate a query that can exploit the knowledge implicit in the ontologies. Examples can be seen in the DARPA and
KA2 initiative Web sites36. The users may not want to look at the ontological terms and
notions just to form a query. They may prefer to find information with just one or two
query words at first and then refine their query if they are not satisfied with the search
results in a typical retrieval fashion, rather than looking for ontological terms at the first
stage. Of course, keyword search is often inadequate and a parametric search can be more useful, but there is a significant usability issue in requiring users to specify pairs of attributes and values to build a query. Thus, how the
context of the ontological query interface can be designed to allow users to learn about
the contents of ontologies in order to create the desired query is also a challenge.
A user interface based on ontological structures can be useful when users know exactly
what they want to find. On the other hand, if the information one is seeking is not
represented in the ontology, or one does not understand the relation of the ontology to
one’s query, he or she has the same problem as with general search engines. The ideal approach would be to support a combined mechanism that allows users to choose among methods such as a semantic ontological search, an ontological browsing interface,
33 http://www.ontoprise.com/ (2002).
34 http://annotation.semanticweb.org/ontomat/index.html (2002).
35 http://www.cs.umd.edu/projects/plus/SHOE/KnowledgeAnnotator.html (2002).
36 http://plucky.teknowledge.com/daml/damlquery.jsp and http://ka2portal.aifb.uni-karlsruhe.de/ (2002).
and a typical retrieval interface using Boolean queries and browsing of subject categories. Even though the ontological approach can allow users to access explicit and exact information by browsing the structures of ontologies, the user may still require search by Boolean querying or the classification retrieval used in general search engines.
Note that not all issues relating to ontological approaches are discussed in this section.
Knowledge inconsistency encoded in resources, scalability to the Web, ontology
interoperability and ontology learning are also important issues relating to ontology
approaches. Some issues described above are relevant to the realisation of the Semantic
Web vision, rather than domain-specific knowledge management and retrieval that is
the aim of this thesis.
The study of ontologies deals with the a priori nature of reality in order to capture universally valid knowledge (Guarino 1995). We believe that most ontology issues arise from this assumption. However, in spite of these issues, there seem to be many potential benefits
from ontologies in facilitating the sharing of knowledge between and within
communities, as well as in performing high quality semantic searches.
Despite the practical advantages of a community committing to ontologies, there is also
a view that any knowledge structure is a construct, which should be allowed to evolve
over time (Compton and Jansen 1990) as indicated in Chapter 1. Peirce (1931) noted
that knowledge is always under construction and incomplete. Situated cognition
suggests that when experts are asked to indicate how they solve a problem, they
construct an answer rather than recall their problem solving method (Clancey 1993a).
Personal Construct Psychology (Gaines and Shaw 1990) and Ripple-Down Rules
(Compton and Jansen 1990) also account for the constructed nature of knowledge. We
can also expect that, increasingly, explicit knowledge will only emerge during interactive and iterative communication involving some sort of system (Stumme et al. 1998). Based on this philosophical background, we would like to
explore a new approach for a Web-based domain-specific document management and
retrieval system. This approach focuses on incremental construction of knowledge in
the context of its use based on the situated cognition view.
Rather than committing to a priori ontologies and expecting that all documents will be
annotated according to the ontologies, the aim of this thesis is to explore the
possibilities of a system where a user can annotate a document however they like and
that the ontologies emerge from this. Rather than this being totally ad hoc, we would
like the system to assist the user to make extensions to the emerging ontologies that are
improvements. We are not concerned with automated or semi-automated ways of
discovering an ontology appropriate to a document or corpus (Aussenac-Gilles et al.
2000; Maedche and Staab 2000). Despite the potential of such approaches, from our
more deconstructionist perspective, we are more interested in the role of the reader or
user interpreting documents and deciding on their annotation and the development of an
ontology. The user here may be the individual user, an expert for a specialised domain
or a small community.
However, this does not preclude the inclusion of ontologies either constructed by an
expert or ontologies imported from elsewhere, as part of the ontological structure
preferred by the user. We do not propose a completely ad hoc evolution of an ontology.
It is perfectly sensible for the individual user or group to be influenced by existing
ontological standards, and interfaces should support this. However, rather than being
locked into conforming to a standard, the user should be free to use all, small fragments,
or none of the ontology as best suits their purpose. A new ontology will emerge as a result, and this itself may become a useful ontology for other groups.
2.3. Formal Concept Analysis Approach
Another approach is based on lattice-based information retrieval using Formal Concept
Analysis (FCA - Wille 1982). This has not yet been widely applied to information
retrieval. In this approach, documents are annotated with a set of controlled terms by
experts or automatic algorithms. Then, using FCA the documents are indexed into a
lattice structure that can be used for browsing. In FCA, a concept is specified by an extension as well as an intension: the extension of a concept is formed by all objects to which the concept applies, and the intension consists of all attributes shared by those objects. These concepts form a lattice structure, where each node is specified by a set of
objects and the attributes they share. As one progresses down the lattice more attributes
are added and so each node covers fewer objects. The lattice can be quite sparse and
have a range of structures, as a node is added only where the attributes at the node
distinguish the objects from those at another node. The mathematics for this is well
established and FCA has been successfully applied to a wide range of applications in
medicine, psychology, libraries, software engineering and ecology, and to a variety of
methods for data analysis, information retrieval, and knowledge discovery in databases.
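As a concrete illustration of these definitions, the formal concepts of a small context can be computed directly from the extension/intension operators. This is a minimal sketch only; the toy documents, terms and helper names are invented for illustration and do not come from any system described in this thesis:

```python
from itertools import combinations

# Toy formal context: documents (objects) annotated with terms (attributes).
context = {
    "doc1": {"retrieval", "lattice"},
    "doc2": {"retrieval", "ontology"},
    "doc3": {"retrieval", "lattice", "ontology"},
}

all_attrs = set().union(*context.values())

def intent(objects):
    """Intension: the attributes common to every object in 'objects'."""
    return set.intersection(*(context[o] for o in objects)) if objects else set(all_attrs)

def extent(attrs):
    """Extension: all objects possessing every attribute in 'attrs'."""
    return {o for o, a in context.items() if attrs <= a}

def concepts():
    """Enumerate all formal concepts (extent, intent) by closing every
    subset of objects. Fine for toy data; practical systems use faster
    algorithms such as Ganter's NextClosure."""
    found = set()
    for r in range(len(context) + 1):
        for combo in combinations(context, r):
            i = intent(set(combo))
            found.add((frozenset(extent(i)), frozenset(i)))
    return found

# Larger extents (more general concepts) sit higher in the lattice.
for e, i in sorted(concepts(), key=lambda c: -len(c[0])):
    print(sorted(e), "<->", sorted(i))
```

In the resulting ordering, moving from a concept to one below it adds attributes and covers fewer documents, which is exactly the refinement behaviour of the lattice described above.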
A number of researchers have advanced this lattice structure for document retrieval
(Godin et al. 1993; Carpineto and Romano 1996a; Carpineto and Romano 1996b; Priss
2000b). Several researchers have also studied the lattice-based information retrieval
with graphically represented lattices for specific domains such as libraries, flight
information, e-mail management and real-estate advertisements (Rock and Wille 2000;
Eklund et al. 2000; Cole and Stumme 2000; Cole and Eklund 2001).
The mathematics of Formal Concept Analysis can be considered as a machine learning algorithm which can facilitate automatic document clustering. In other words, FCA can be considered as an incremental clustering algorithm based on post-clustering. A
key difference between FCA techniques and the general clustering algorithms in IR is
that the mathematical formulas of FCA produce a concept lattice which provides all
possible generalisation and specialisation relationships between document sets and
attribute sets. This means that a concept lattice can represent conceptual hierarchies
which are inherent in the data of a particular domain. Thus, the lattice encodes all minimal refinements and minimal enlargements for a query (Godin et al. 1995): following an edge downward corresponds to a minimal refinement of the query, and following an edge upward to a minimal enlargement. In addition,
the hierarchical tree structure, in which each cluster has exactly one parent, can also be
embedded into the lattice structure.
Another difference with FCA is in the method of clustering documents. FCA produces a lattice structure for browsing, in which each node can have multiple parents and children; this can be a superior structure to a hierarchical tree. The lattice structure allows one to
navigate down to a node by one path, and if a relevant document is not found one can
go back up another path rather than simply starting again. When one navigates down a
hierarchy one tries to pick the best child at each step. If the right document is not found
it is difficult to know what to do next, because one has already made the best guesses
possible at each decision point. However, with a lattice, the ability to go back up via
another pathway opens up new decisions, which one has not previously considered.
A more detailed explanation of how the basic theories of FCA are applied to
information retrieval will be presented in Chapter 4. The previous work on information
retrieval using FCA as well as the differences between the previous work and the
proposed system will also be examined.
2.4. Proposed Approach
The proposed approach uses Formal Concept Analysis (FCA) for domain-specific
document management and retrieval in order to support lattice-based browsing. In other
words, the core of the technology in the proposed system is FCA. The difference in the
proposed approach is mainly in the way the system is used rather than its underlying
FCA basis. The main focus of the proposed system is an emphasis on incremental
development and evolution, and knowledge acquisition tools to support these.
The system is aimed at multiple users being able to add and amend document
annotations whenever they choose. The users are also assisted in finding appropriate
annotations. This results in the automatic generation of a lattice-based browsing system
from the terms used for annotations. The users can immediately view the concept lattice
that incorporates their documents and further decide whether the terms they assigned for
the documents are appropriate. If the browsing structure does not suit the group who annotated the documents, it can evolve rapidly and easily. The browsing structure here is increasingly referred to as an ontology (or taxonomy), which evolves accordingly as users annotate documents in whichever way they like.
The main differences between the previous work on FCA and the proposed system will
be presented in Chapter 4. The main features of the proposed system and these details
will be presented in Chapter 5.
2.5. Chapter Summary
There has been extraordinary progress in the development of Web retrieval systems
improving search performance dramatically. There has also been a huge leap forward in
automatic document clustering allowing users to find information faster and helping
them especially when they are looking for something obscure.
Recently there has been great interest in having documents conform to ontological
standards. The goals that the ontology approach pursues along with its notions and
issues were presented. There are likely to be considerable practical advantages through
various specialised reasoning services, but overall it remains an unproven conjecture
that ontological approaches will enhance search capabilities. There are also critical
issues requiring further research to realise a true semantic search.
As an alternative approach, the possibilities of document management systems that do not commit to a priori ontologies were explored. This does not prevent the inclusion of existing ontologies; rather, the aim is to explore the possibilities of a system where users can annotate their documents in whichever way they like and ontologies evolve accordingly. Based on this assumption, an alternative approach based on the lattice-
based browsing structure of Formal Concept Analysis was proposed.
The first attempt at incremental development of document management systems in this thesis was based on Ripple-Down Rules (RDR) techniques; the next chapter therefore presents the RDR approach with its strengths and limitations.
Chapter 3
Document Management for Retrieval
with Ripple-Down Rules37
Web-based document retrieval systems for a specialised domain (a help desk system)
were developed based on the Ripple-Down Rules (RDR) techniques (Kang et al. 1997;
Kim et al. 1999). The systems are based on a combination of standard information
retrieval techniques and the RDR knowledge acquisition technique. They were the first attempt at incremental development of document management systems in this study. This approach to document management has seen some commercial use in help desk support38.
The help system of Kim et al. (1999) allows simple incremental maintenance of the
system’s knowledge so that the search performance of the system can be improved over
time. The idea behind the system is that when a user fails to find a suitable document,
the system would send an expert a log of the interaction. The RDR mechanism then
assists the expert to add new keywords so that the correct document will be found next
time.
Ripple-Down Rules is an attempt to address knowledge acquisition from a situated
cognition perspective (Compton and Jansen 1990). The central idea is that experts are
good at creating justifications for why one conclusion should be given rather than
another. It has been successfully applied to a range of tasks: knowledge reuse, heuristic
search, configuration, machine learning, fuzzy reasoning and others. One of the
significant strengths of RDR is that knowledge acquisition and maintenance are simple
tasks. With this incremental knowledge acquisition mechanism and robust maintenance
methodology, the RDR mechanism has been applied to a help desk system to manage
37 This work was developed for a project in a course work master’s degree (Kim 1999) and has been
partially reported here. This work followed earlier work (Kang et al. 1997). 38 Byeong Kang, personal communication.
help documents. It is essential for a help desk system to have some sort of mechanism
such as RDR that allows for easy incremental development and improvement of the
system if it fails to deliver in particular situations.
Section 3.1 gives an overview of Ripple-Down Rules with its background and basics
including its strengths and limitations. Section 3.2 presents an automated help desk
system where RDR was used for document management and retrieval. Finally the issues
relevant to the RDR help desk system are discussed.
3.1. Ripple-Down Rules
3.1.1. Background of RDR
A major criticism of the early work on knowledge-based systems (KBS) and the traditional software engineering approaches comes from the situated cognition perspective, which claims that interaction with an expert has been misunderstood.
cognition suggests that when experts are asked to indicate how they solve a problem,
they construct an answer rather than recall their problem solving method (Clancey
1993a). In particular it seems that they construct an answer to justify that their solution
to the problem is appropriate and that this justification depends on the context in which
it is asked (Compton and Jansen 1990).
A simple example is that when a clinician is asked why they believe a patient has a
certain disease, they will often indicate the symptoms that distinguish the case from
other diagnoses the questioner may be considering. This results in a quite different
explanation for different questioners and also for the same questioner on different
occasions. This results in the maintenance problems that occur with expert systems: that
the knowledge provided by an expert is never precise enough or complete enough, even
when the knowledge in the domain itself is not developing (Compton et al. 1989). The
problem is only exacerbated when, as always occurs, the domain itself is evolving.
To address knowledge acquisition from a situated cognition perspective, Compton and Jansen (1990) invented Ripple-Down Rules (RDR). It is an
effective knowledge acquisition and representation methodology which allows a domain
expert to acquire and maintain knowledge without the help of knowledge engineers.
The original motivation was that experts are good at creating justifications for why one conclusion should be given rather than another (Compton et al. 1989). RDR itself organises the knowledge, and knowledge acquisition and maintenance are easily achieved. In the RDR method, the expert is only required to identify features that differentiate a new case being added from other stored cases already correctly handled, without considering the structure of the KB. The emphasis on asking experts about differences is very similar to the use of differences in Personal Construct Psychology (Gaines and Shaw 1990).
3.1.2. Basics of RDR
In an RDR framework the task of the expert is to check the output of the developing
KBS. If the expert disagrees with the KBS conclusions, it is because they have
identified some data in the input which suggests an alternative conclusion. The features
or data and the conclusion they suggest can be organised as a rule. However, this rule
was provided in the context of a particular mistake, so the knowledge base is structured
so that this rule is reached only in the same context; that is, if the same sequence of
rules leading to the same mistake is activated again.
There are a number of ways of structuring a KBS in this way to make it suitable for
various tasks. A key feature of any RDR system is that since rules are added because of
cases, any cases that prompt the addition of a rule are stored. None of the stored cases,
which are already handled by the other rules in the system, should cause the new rule to
fire. There are a number of ways to ensure this, but a key method is simply to present
to the expert a previous case that is covered by the rule and ask the expert to select
further features that distinguish the cases. This process is repeated, and even with a
very large KBS the expert needs to consider only two or three of the stored cases.
RDR systems have been implemented in a wide range of application areas achieving
great success in real world problems. The first industrial demonstration of this approach
was the PEIRS system which provided clinical interpretations for reports of pathology
testing (Edwards et al. 1993). The approach has also been adapted to a range of tasks:
multiple classification (Kang et al. 1995), control (Shiraz and Sammut 1997), reuse of
knowledge (Richards and Compton 1997b), heuristic search (Beydoun and Hoffmann
1997; 1998a) and configuration (Compton et al. 1998). There are a number of other
lines of RDR research integrating RDR with machine learning (Shiraz and Sammut
1998), fuzzy reasoning (Martinez-Bejar, Shiraz et al. 1998) and the discovery of
ontologies (Suryanto and Compton 2000; 2001).
The first RDR approach assumed a single conclusion for each case (Single Classification
RDR - SCRDR). This produces a decision list with an if-true/if-false structure: a
binary tree with a rule at each node. Every node can have branches to two other
rules: one to a true node (an exception branch) and another to a false node. If a case
fires a rule, then its child rule (true branch) is evaluated; otherwise, its sibling
rule (false branch) is evaluated. The conclusion for a case is the conclusion of the
last satisfied rule on the path to a leaf node.
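The SCRDR inference procedure just described can be sketched in a few lines. This is a minimal illustration, not code from any RDR implementation; the class, function and attribute names here are ours, and the example rules are hypothetical.

```python
# Minimal sketch of SCRDR inference. Each node holds a condition (a
# predicate over a case), a conclusion, and two branches: an exception
# (true) branch and a sibling (false) branch.

class SCRDRNode:
    def __init__(self, condition, conclusion, if_true=None, if_false=None):
        self.condition = condition    # predicate: case -> bool
        self.conclusion = conclusion
        self.if_true = if_true        # exception branch
        self.if_false = if_false      # sibling branch

def scrdr_infer(node, case):
    """Return the conclusion of the last satisfied rule on the path."""
    conclusion = None
    while node is not None:
        if node.condition(case):
            conclusion = node.conclusion  # may be refined by an exception
            node = node.if_true
        else:
            node = node.if_false
    return conclusion

# A root rule that always fires, with one exception rule (hypothetical).
root = SCRDRNode(lambda c: True, "default",
                 if_true=SCRDRNode(lambda c: c.get("temp", 0) > 38, "fever"))
print(scrdr_infer(root, {"temp": 39}))  # fever
print(scrdr_infer(root, {"temp": 36}))  # default
```

Note how a later exception rule silently overrides the conclusion of its parent only in the context where the parent fired, which is exactly the "rule in context" behaviour described above.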
To extend RDR to multiple conclusions, MCRDR (Multiple Classification RDR) was
developed using an n-ary tree (Kang et al. 1995). MCRDR deals with tasks where
multiple independent classifications are required. In MCRDR, every rule can only have
exception nodes. If a case satisfies a rule, then all its children are evaluated. This
process will continue until there are no more child nodes to be evaluated or none of the
child rules are satisfied by the case. Conclusions are given from the last satisfied rule in
each path.
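The MCRDR evaluation just described can be sketched analogously over an n-ary tree. Again the names are illustrative, not from the thesis: every child of a satisfied rule is evaluated, and the conclusion of the last satisfied rule on each path is collected.

```python
# Minimal sketch of MCRDR inference over an n-ary rule tree.

class MCRDRNode:
    def __init__(self, condition, conclusion, children=()):
        self.condition = condition    # predicate: case -> bool
        self.conclusion = conclusion  # may be None ("no class")
        self.children = list(children)

def mcrdr_infer(node, case):
    """Collect the conclusion of the last satisfied rule on each path."""
    conclusions, refined = [], False
    for child in node.children:
        if child.condition(case):
            conclusions.extend(mcrdr_infer(child, case))
            refined = True
    if not refined and node.conclusion is not None:
        # No child refines this rule: it is the last satisfied rule here.
        conclusions.append(node.conclusion)
    return conclusions

# Root -> (a -> class 1, refined by a & b -> class 4) and (b -> class 2).
root = MCRDRNode(lambda c: True, None, [
    MCRDRNode(lambda c: "a" in c, "class 1",
              [MCRDRNode(lambda c: "b" in c, "class 4")]),
    MCRDRNode(lambda c: "b" in c, "class 2"),
])
print(mcrdr_infer(root, {"a", "b"}))  # ['class 4', 'class 2']
```

For the case {a, b}, the refinement "class 4" replaces "class 1" on the first path, while "class 2" survives on the independent second path, giving multiple conclusions.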
Fuzzy RDR was developed to model and represent fuzzy domain knowledge for
propagating uncertainty values in an RDR knowledge base (Martinez-Bejar, Shiraz et
al. 1998; Martinez-Bejar et al. 1999; see also the Fuzzy RDR Web site39).
More recently, RDR was extended to Nested RDR (NRDR) to facilitate incremental
acquisition of search knowledge where some attributes are not known a priori (Beydoun
and Hoffmann 1997; 1998a; 1999). NRDR uses a single classification RDR structure
and applies more generally to problems. In this structure, a concept is defined by a
separate SCRDR tree. The defined concept can then be used to define other concepts.

39 Fuzzy RDR Web site: http://www.cse.unsw.edu.au/~tmc/Fuzzy/index1.html (2002).
That is, the conditions of a rule in an RDR tree can be provided by input data or by an
RDR tree (a concept). When a condition of a rule includes concept(s), the Boolean value
of the condition is calculated in a backward chaining mode. The conclusion for a case
ends up in one path because NRDR is based on a single classification which is either
true or false. Every concept has a dependency list to prevent circularity and recursive
definitions. The dependency list is also used to conduct a consistency check in the KB.
An equivalent system is MCRDR with repeat inference, which has been used in
configuration (Compton et al. 1998) and room allocation (Richards and Compton 1999).
This has been generalised (Compton and Richards 1999).
3.1.3. Strengths of RDR
A significant strength of RDR is that knowledge acquisition and maintenance are easily
achieved. RDR itself organises the knowledge and the expert is only required to identify
features that differentiate between a new case being added and the other stored cases
already correctly handled, without considering the structure of the KB.
In the RDR method, a rule is only added to the system when a case has been given a
wrong conclusion. Any cases that have prompted knowledge acquisition are stored
along with the knowledge base. RDR does not allow the expert to add any rules which
would result in any of these stored cases being given different conclusions, except by
a specific override. This means that the consistency of the existing rules is
maintained (verification and validation) and that the system improves incrementally.
The level of evaluation in RDR systems varies, but they have invariably shown very
simple and highly efficient knowledge acquisition. RDR systems for the task of
providing interpretative comments for medical chemical pathology reports are now
available commercially. Results from this experience have not yet been published, but
confirm that very large knowledge bases (> 7000 rules) can be built and maintained
very easily by pathologists with little computing experience or knowledge (Pacific
Knowledge Systems, personal communication).
The other critical finding from the RDR evaluations is that this form of knowledge
acquisition results in compact and efficient knowledge bases. It might be expected
that incremental addition of knowledge, where knowledge is only ever added as a
refinement and never changed, would result in very large knowledge bases with much
repeated knowledge. However, simulation studies show that the sizes of the knowledge
bases are comparable to those produced by machine learning (Compton et al. 1995; Kang
et al. 1998), and there is a significant increase in size only when the expert makes
rule choices at random. In studies on a human-developed MCRDR knowledge base (~3000
rules), only 10% compression could be achieved (Suryanto et al. 1999).
3.1.4. Limitations of RDR
Despite great success in a wide range of application areas, the current RDR-based
systems have been criticised for their limitations in supplying an explicit model of the
domain knowledge (Richards and Compton 1997b; Martinez-Bejar, Benjamins et al.
1998; Beydoun and Hoffmann 1997). This means that the RDR methodology does not
support higher-level models, especially abstraction hierarchies. RDR assumes a simple
attribute value representation of the world and supports only rules rather than
inheritance or other deductive reasoning from an ontology.
To address this lack, some work has already been done on evolving hierarchies in
parallel with RDR (ROCH; Martinez-Bejar, Benjamins et al. 1998), discovering
abstraction hierarchies from MCRDR (MCRDR/FCA; Richards and Compton 1997b)
and modelling domain knowledge with simultaneous knowledge acquisition (NRDR:
Beydoun and Hoffmann 1998b). However, more research will be needed to develop the
full potential of RDR based on an ontology concept.
Another limitation of RDR is repetition within the knowledge base (Beydoun 2000;
Richards 1998). However, as shown by Suryanto et al. (1999), this repetition is not a
serious impediment for RDR.
3.2. A Help Desk System with Ripple-Down Rules40
In many areas, help desk services of various forms are provided to assist users in
solving computer related problems. Conventional automated help desk systems (HDS)
assist an organisation in automating the help desk process of handling and resolving
reported problems.
The World Wide Web has today taken over as the main means of providing information,
and users themselves try to solve their problems by searching for information on the
Web. As a consequence, an automated help desk needs to support an information
retrieval mechanism that makes it easier for users to find what they are looking for.
When users cannot resolve their requests, they may report their problems to the
experts who manage the help desk. The automated HDS also needs to be easily
maintained, as the knowledge environment is invariably dynamic. This means that the
help desk system should provide a powerful search and retrieval mechanism as well as a
robust maintenance methodology. We undertook a study to develop such a help desk
system by applying RDR to HDS. This study extended earlier work in which RDR was
proposed for help desk information retrieval (Kang et al. 1997) but which did not
actually present incremental maintenance of the system. Most of the theoretical
background of this study drew on that work.
An automated help desk is essentially a knowledge-based system because it is related to
a user’s problem-solving task. A Case-Based Reasoning (CBR) approach has been
proposed as more appropriate for building help desk systems (Barletta 1993a; Barletta
1993b; Simoudis and Miller 1991). Ripple-Down Rules (RDR) is grounded on a similar
philosophy to CBR and can be considered as a system which emphasises both cases and
expert knowledge. Previous RDR systems were based on simple attribute value data. In
the current work, a case is a document and its keywords: a situation close to the
conventional CBR application. Thus, we believed that the RDR methodology could be
applied to a help desk system.
40 This section is largely taken from the paper “Kim, M., Compton, P. and Kang, B. H. (1999).
Incremental Development of a Web Based Help Desk System, Proceedings of the 4th Australian
Knowledge Acquisition Workshop (AKAW99), University of NSW, Sydney, 13-29”.
3.2.1. Overview of the System
A prototype help system was developed using Multiple Classification RDR (MCRDR)
to build and maintain the knowledge base of the system. Help documents were extracted
from the Frequently Asked Questions page41 maintained by the Help Desk of the School
of Computer Science and Engineering, University of New South Wales.
A user can report their problems to a human expert through the system and the expert
can refine the knowledge base to deal correctly with the user’s problems. A log of the
user session is available for this purpose. The extensions to deal with knowledge
acquisition also resulted in changes to the user interaction, particularly in the expert
assigning further concepts (keywords) to documents and designing questions to assist
the user in specifying the concepts they were interested in.
The system has two main functions. One is for expert(s) to build and maintain a
knowledge base for the help documents. The expert can conduct their own search with a
set of keywords to judge whether the retrieved documents have been incorrectly
classified, whether keywords are missing from documents, or whether a new document
needs to be added. When adding a new document, the expert distinguishes the keywords
of the new document from those of the retrieved documents satisfied by the keyword set
the expert used for the search. After adding the new document, if a set of cornerstone
cases exists, the expert must differentiate the cornerstone cases from the current
case42 by adding new keyword(s).
The second function is for users to find the help documents constructed in the
knowledge base. A user can search for information using one of the search methods
supported by the system: “By Keyword”, “By Interaction” and “By Keyword and
Interaction”. When the user is not satisfied with the retrieved documents, they can
report their problem through the system. The reported problem is stored as a new
unsolved case. An expert can then refine the knowledge base by diagnosing the
reported problems, and the knowledge base is gradually improved.
41 http://www.cse.unsw.edu/faq/index.html (2002).
42 The new document and its keywords comprise the current case.
3.2.2. Keywords and Help Questions
This help desk system uses the concept of a “keyword” to represent the meaning of the
help documents. A keyword is a representative word that expresses the meaning, purpose
or role of a document in some way. The keyword may or may not occur in the content
of a document. The key issue here is that human expert(s) decide the keywords for each
help document, rather than deriving them automatically using machine learning
techniques. This would be a very large task if the expert had to decide on keywords
for a whole body of documents.
With an RDR approach however, keywords are only added to a document when it has
failed to be retrieved or retrieved inappropriately, and there are not already appropriate
keywords to construct rules for the document to be retrieved correctly. As the expert
adds keywords, s/he is also shown documents which might be retrieved by the same
keywords, and the expert is asked to add keywords that distinguish the document that
should be retrieved and the documents being inappropriately retrieved. Because of the
contextual nature of the task, this is trivial for experts. Both RDR and Personal
Construct Psychology (Gaines and Shaw 1990) are based on the fact that people find
making distinctions between objects in a particular context very easy. Of course,
machine learning techniques could assist the experts in assigning appropriate keywords
to the help documents.
When documents are very similar, it is difficult to retrieve appropriate documents by
automated information retrieval algorithms. However, human beings have little trouble
in generating keywords that distinguish documents. The requirement, as for any expert
system, is that experts have sufficient mastery of the domain to make reasonable
distinctions between documents.
Because keywords are created as abstracted words, the expert can attach a help
question or an explanatory sentence to each keyword to assist users. The idea of the
help questions is to give an explanation (or definition) of each keyword in a form
closer to the way people think. For example, the keyword “change_login_shell” can be
given a help question such as “How can I change my login shell?” to help users.
3.2.3. Knowledge Structure
The knowledge base of the system is stored in an MCRDR tree structure in which each
node of the rule tree corresponds to a rule with a classification (i.e., a document). Figure
3.1 shows an example of the knowledge structure for the system. In an MCRDR tree,
rules are allowed to have one or more conditions. For this study, we preferred to use
rules with only a single condition43. Thus, if a document has more than one keyword,
the second assigned keyword for the document becomes the child node of the first
assigned keyword on the rule tree and so on. As a result of this, nodes which have no
classification (i.e., no document) can exist in this structure. Note that the keywords of
the help documents are used for the rule conditions of the knowledge base in the system.
In Figure 3.1, “a, b, c, d, e, f, g” are the keywords of the documents and are used for the
conditions of the rule tree. Each node of the rule tree corresponds to a rule with a
condition. A classification is essentially a link to an HTML document. The highlighted
boxes represent rules that are satisfied for the test case with keywords {a, b, f}.
Figure 3.1. An example of the knowledge structure for the help system.

Rule 0: root
    Rule 1: if a then class 1 (Document 1)
        Rule 4: if c then class 3 (a ^ c; Document 3)
        Rule 5: if b then class 4 (a ^ b; Document 4)
            Rule 6: if f then class 5 (a ^ b ^ f; Document 5)
        Rule 7: if e then class 6 (a ^ e; Document 6)
    Rule 2: if b then class 2 (Document 2)
    Rule 3: if d then no class (no document)
        Rule 8: if g then class 6 (d ^ g; Document 6)

Test case: {a, b, f}. The highlighted boxes are the rules satisfied by this case
(rules 1, 2, 5 and 6).

43 Note that there can exist a number of different strategies for storing rules in the MCRDR structure. The
main reason for choosing this strategy (using only a single condition per rule) was the option of using the
knowledge structure for browsing. If rules with conjunctions of conditions were built, rules would likely be
added towards the top of the knowledge structure (i.e., as child rules of the root). Such a flat tree
structure would not be suitable for browsing.
With MCRDR, all the rules in the first level of the rule tree for the given case (rule 1, 2
and 3 in Figure 3.1) are evaluated. Then, MCRDR evaluates the rules at the next level
that are refinements of the rule satisfied at the top level and so on. Rules 1 and 2 are
satisfied by the test case {a, b, f} so that the next rules to be evaluated are rules 4, 5 and
7 (i.e., refinements of rule 1). The process will stop when there are no more child nodes
to be evaluated or none of these refinement rules are satisfied by the case. It can end up
with more than one path for a particular case.
3.2.4. Knowledge Acquisition
An RDR knowledge base is built and maintained through the procedure of acquiring a
correct classification, automatically deciding on a new rule’s location and acquiring rule
conditions. The knowledge acquisition for this help system is achieved in the same way
as the standard RDR knowledge acquisition mechanism. A new case is added to the
knowledge base when a user’s query is not satisfactorily handled (i.e., the case has
been classified incorrectly or the case does not exist in the rule tree).
In this help desk system, when a user is not satisfied with the retrieved documents of
their query, they can report their problem through the system. The reported problem is
then passed on to a human expert as a new unsolved case with the log of information on
the query used for the search and the documents retrieved by the query, and any free-
text comments from the user outlining their problem. The expert can refine the
knowledge base by an analysis of the reported problem and if necessary by e-mailing or
talking to the user to find out what they really want. Since it is unlikely that all
users will report their problems, the system also logs all users’ search activities.
By analysing these search activities periodically, the expert can further refine the
knowledge base.
In MCRDR knowledge acquisition, the system asks the expert to input or select
conditions (keywords) and a conclusion (document) for the case44. After this procedure,
the system searches all the stored cases which can satisfy the given conditions. If the
system finds cases satisfying the conditions, the expert should distinguish between the
44 Again a case consists of a document and its keywords.
keywords of the new case and the keywords of the existing documents (cornerstone
cases). The expert will be asked to select features (keywords) which can distinguish
between the cornerstone cases and the new case. This process will be repeated until
there is no cornerstone case which satisfies the new rule. Finally, the new case is stored
in the rule tree as the refinement case of the previously wrongly retrieved document.
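The cornerstone-case check described above can be sketched as a simple loop. This is a minimal illustration with names of our own invention; the "expert" is simulated by a callback that picks a keyword present in the new case but not in the conflicting cornerstone case.

```python
# Minimal sketch of the cornerstone-case check: a new rule's conditions are
# tightened until no stored case that is already handled correctly would
# also fire the rule.

def acquire_rule(new_case, conditions, cornerstones, pick_difference):
    """Tighten `conditions` until no cornerstone case satisfies them."""
    conditions = set(conditions)
    while True:
        conflicts = [c for c in cornerstones if conditions <= c["keywords"]]
        if not conflicts:
            return conditions
        # Ask the (simulated) expert for a keyword distinguishing the
        # new case from the first conflicting cornerstone case.
        conditions.add(pick_difference(new_case, conflicts[0]))

new_case = {"keywords": {"printer", "quota", "colour"}}   # hypothetical
stored = [{"keywords": {"printer", "quota"}}]             # handled correctly
rule = acquire_rule(
    new_case, {"printer"}, stored,
    lambda new, old: next(iter(new["keywords"] - old["keywords"])))
print(sorted(rule))  # ['colour', 'printer']
```

The loop terminates because each round adds a keyword that the conflicting cornerstone case lacks, so that case can never satisfy the tightened conditions again.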
3.2.5. Search Methods
Users can retrieve the help documents using three different search methods: “By
Keyword”, “By Interaction” and “By Keyword and Interaction” (combined). The keyword
method is based on general information retrieval mechanisms. The interaction method
is based on the MCRDR tree structure; that is, the inference process of MCRDR is
utilised as a browsing mechanism. The last method is the combination of the keyword
and interaction methods.
Keyword Method
With the keyword method, a user can select keywords provided by the system and/or
can enter any textwords. The system provides a list of keywords that have been used as
the keywords of the help documents. Both the rule conditions of the knowledge base as
well as the contents of the documents are searched for the user’s query. The system uses
simple keyword search techniques based on the basic Boolean operators (disjunction
and conjunction).
For example, in Figure 3.2, when the user specifies “printer” as a search term, the
documents corresponding to rules 6, 4, 8 and 9 will be retrieved. The document of rule
6 is retrieved because the content of the document contains the query term “printer”.
On the other hand, the documents of rules 4, 8 and 9 are retrieved because the
condition of rule 4 is satisfied by the query. If any conditions are satisfied by the
search term, the conclusions (documents) of those rules and their refinement (child)
rules are all selected together and presented to the user. This produces useful
candidate documents even when the search term is not included in the contents of the
help documents. If the user wants more specific documents among the search results,
s/he can select further keywords with the conjunction operator.
Interaction Method
If the user is not good at identifying keywords or knows little about the domain, s/he
may want to be guided by the system to get the information in a similar way to the
directory or classification scheme of general search engines. The system here,
however, depends on interaction rather than an index structure. This means that the
interaction is
guided by the inference process of MCRDR. The user interacts with the system by
selecting keywords listed by the system.
For example, when the user tries to find some documents using this method in Figure
3.2, the system will initially show the conditions (account, email, www, and printer) of
the top-level rules (rules 1, 2, 3, and 4). The user can select some of these conditions to
continue their search. Suppose that the conditions of rules 3 and 4 are selected. Then,
the system will produce documents 3 and 4 as a search result and will show the
conditions of rules 8 and 9 (refinements of rule 4) as possible further refinements.
This process is repeated until the user finds the documents they were looking for or
there are no more child rules (refinement rules).
Figure 3.2. The result documents for each search method with the keyword “printer”.
The highlighted boxes (rules 6, 4, 8, and 9) show the rules resulting from the keyword search.
The shadowed boxes (rules 1, 2, 3, and 4) are the first-level rules shown with the interaction
method. When a user tries to find the documents by the combined method with the keyword
“printer”, the grey coloured boxes (rules 1 and 4) will be shown as the first-level rules.
Rule No: Condition -> Conclusion
0: Root -> no conclusion
1: account -> doc1
2: email -> doc2
3: www -> doc3
4: printer -> doc4
5: disk_quota -> doc5
6: print_quota -> doc6
7: change_login_shell -> doc7
8: cancel_job -> doc8
9: color_printer -> doc9
10: bash -> doc10
11: csh -> doc11
12: ksh -> doc12
Combined Method
The last method is the combination of the keyword and interaction methods. In the
combined method, the system first finds documents by the keyword method (i.e., from
both the contents of the help documents and the rule conditions of the KB). Then, the
system reorganises the MCRDR rule tree using the conditions that lead to the documents
satisfying the user query, and guides the user based on the conditions in the
reorganised rule tree in the same way as the interaction method above.
In Figure 3.2, when the user tries to find documents with the keyword “printer” using
the combined method, the conditions of the grey coloured boxes (rules 1 and 4) will be
shown as the first-level rules to be reviewed. Here, rules 2 and 3 are truncated, so
the combined method reduces the number of interactions and the number of conditions to
be reviewed by the user compared to the interaction method alone.
3.2.6. Optimising Process of a Rule Tree
When a user finds documents using the combined method, the system optimises the rule
tree to reduce the number of conditions to be reviewed by the user, and the number of
interactions between the user and the system. Figure 3.3 shows an optimising process
for a rule tree.
Figure 3.3. An optimising process of a rule tree.
(a): The original rule tree. (b): The optimised rule tree after deleting irrelevant paths from (a).
(c): A shortened rule tree from (b). (d): An alternative shortened rule tree from (b).
We suppose that Figure 3.3(a) is the original MCRDR rule tree. The shaded rules
correspond to the keywords the user specifies. By ignoring the rule paths where no
cases were selected, the subset of the MCRDR rule tree can be obtained as shown in
Figure 3.3(b). Through this optimising process of the rule tree, the number of conditions
that the user has to check is reduced.
The user can interact with this optimised rule tree (Figure 3.3(b)) to find documents.
However, unnecessary interaction between the user and the system can still occur with
this tree, as nodes without the relevant keywords may still be included. For instance,
documents 5, 8 and 2 are selected by the keyword search, but this does not imply that
all the conditions of the rules on the paths to them are satisfied. This means that
some conditions may not need to be checked. For example, the condition in rule 6 may
contribute nothing to the search interaction, and it is simpler to ask only about the
condition in rule 8.
Consequently, to reduce the number of interactions, the system regenerates the
optimised rule tree into the shortened form shown in Figure 3.3(c). There are a number
of different ways to shorten the optimised rule tree; Figure 3.3(d) is an alternative
shortened tree. The difference between Figure 3.3(c) and Figure 3.3(d) is the number
of interactions to be reviewed by the user. Figure 3.3(d) saves one more interaction
than Figure 3.3(c), but the number of conditions to be checked at the top level is
increased. If all the selected rules are made child rules of the root, as in Figure
3.3(d) (except where a parent and its children are all selected, in which case the
original rule tree structure is kept), the number of conditions to be checked
increases again. The greater the size of the knowledge base, the more serious this
problem becomes.
To address this problem, we used the following strategy: all rules not selected by the
specified keywords are truncated, except for child rules of the root node that have
fired. Under this strategy, rule 1 is selected and rule 6 is truncated in Figure
3.3(b). When a parent and its child node are both selected, the original rule tree
structure is kept. For example, in Figure 3.3(b), if rules 6 (parent node of rule 8)
and 8 (child node of rule 6) are both selected, the parent-child relationship is kept.
Through this process, the number of interactions and the number of conditions checked
by users can be reduced. However, options to explore other branches of the rule tree
can be lost in this optimised and shortened tree; that is, there is a trade-off
between the number of interactions saved and possibly relevant branches lost. This
trade-off arises when a user formulates a query with inappropriate keywords. However,
options to explore other branches of the rule tree can be retained with a suitable
browsing interface (e.g., using different colours for folders).
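The truncation strategy just described can be sketched as follows. This is a minimal illustration of the basic case only (it does not model the special treatment of root children that fired); the tree below is hypothetical, chosen so that rule 6 is an unselected parent of selected rule 8, and all names are ours.

```python
# Minimal sketch of rule-tree shortening: unselected rules are truncated,
# and a selected rule whose parent was truncated is re-attached to its
# nearest surviving ancestor (ultimately the root). When a parent and its
# child are both selected, their original relationship is kept.

def shorten(children, selected, node=0, survivor=0, out=None):
    """Rebuild the tree keeping only selected rules (plus the root)."""
    if out is None:
        out = {survivor: []}
    for child in children.get(node, []):
        if child in selected:
            out.setdefault(survivor, []).append(child)
            out.setdefault(child, [])
            shorten(children, selected, child, child, out)
        else:
            # Truncated: any selected descendants attach to `survivor`.
            shorten(children, selected, child, survivor, out)
    return out

# Hypothetical tree: rule 6 (child of rule 2) is unselected but its child
# rule 8 is selected, so rule 8 is promoted to hang directly under rule 2.
children = {0: [1, 2, 9], 1: [3, 7], 2: [6, 5], 6: [10, 8]}
selected = {1, 2, 5, 8}   # rules hit by the user's keywords
print(shorten(children, selected))
# {0: [1, 2], 1: [], 2: [8, 5], 8: [], 5: []}
```

The shortened tree asks the user only about selected conditions: rule 6 disappears and rule 8 is asked directly under rule 2, which is the behaviour motivated above.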
The implementation and interfaces of the system can be found in Kim et al. (1999).
3.3. Conclusion and Discussion
This study has taken a step in the direction of finding a new approach to information
retrieval maintenance based on incrementally developing a knowledge base. The earlier
work of Kang et al. (1997) utilised an RDR structure for information retrieval and
developed search methods based on the RDR structure. These suggestions have been
refined in this study focusing on an incremental knowledge acquisition process and a
prototype system for the FAQ page of a School of Computer Science. The central
insight in this study is that an RDR system can be used as a mechanism for indexing
and retrieving documents, whether these are actual documents or expert opinions
especially constructed for the KBS. The second major insight that motivated this
application is that the same problems of situated cognition that apply to conventional
KBS also apply to building document retrieval systems for particular domains.
This RDR help desk system allows users to enter words that might occur in a document.
It then retrieves documents relevant to the words entered. If only one document is
retrieved the search is over. If not, an interactive session with the user is commenced.
The documents retrieved are all “conclusions” of various rules in the system, so a
subset of the rules needed to refine the search can be identified via the documents
retrieved. The system then asks the user to provide further information to identify
which rule applies. For example, if the user is seeking information on a printer
queue, they might be asked whether they want to delete a document from the queue, or
to estimate when their material might be printed. The further information requested
identifies which rule conditions are satisfied and which rules can fire. Of course,
further “textwords” can be added during the interaction to narrow the search.
If the interaction fails to deliver an appropriate document, the query is referred to a
human expert - the actual human operator of a manned help desk or their supervisor.
The user can transmit a free text comment to the expert indicating the nature of their
query and if necessary there can be an interaction between the user and the expert to
clarify the nature of the query. This is not an onerous mechanism and is used in all
sorts of circumstances. In addition, the expert is provided with a trace of the previous
interaction with the automated help desk. This should be more than enough information
to allow the expert to add rules to ensure the correct document is retrieved next time.
The rules simply ask the user whether they are interested in a particular concept and go
through a series of concepts until the document is retrieved.
However, there are a number of questions still to be answered. Firstly, the system
with the RDR techniques has not yet been evaluated in routine use, even though there
is some commercial use in help desk support45. Thus, a strong conclusion about the
ergonomic suitability of the method cannot be made. Similarly, knowledge acquisition
has not yet been carried out based on logged and referred queries. There are again
significant ergonomic issues: whether the experts find the information provided
adequate for writing further rules, and whether they are willing to add new rules in a
timely manner so that users do not become disillusioned. From the experience with RDR
elsewhere, we have every expectation that rules will be easy to add, but this needs to
be evaluated. What we
developed was a prototype demonstration that this type of information retrieval and
information retrieval maintenance is possible.
Secondly, the initial mode of interacting with the system is for a user to enter a
conjunction of terms. If this does not produce the correct documents it is followed by an
interactive session where the system leads the user through a dialogue to refine their
search. It is anticipated that it would also be helpful to have some natural language
45 Byeong Kang, personal communication.
processing (NLP), particularly for the initial entry. We believe that again RDR may
usefully be applied to this as we are dealing with narrow domains and therefore a
reasonably compact language, even when the user is naive and does not know many of
the domain terms. Research in this area will need to focus on refining the natural
language text form of the input query into a more standard set of features to match the
cases in the RDR case base. Natural language forms have been used in diagnostic
systems (Anick 1993; Barletta 1993b; Burke et al. 1997; Katz 1997).
Thirdly, it is likely that we may further refine the MCRDR structure for this type of
domain (information retrieval). Initially RDR were developed for domains with
attribute-value data with perhaps hundreds of attributes with numerical data and a
smaller number of attributes with a small number of enumerated values. Here the
keywords represent a potentially huge number of Boolean attributes. However, one of
the key assumptions underlying this work was to develop a help desk system for fairly
small and specialised domains. The goal of the system was to make it easy to develop a
specialist information retrieval system for these domains, as a general search engine
could not devote the effort to getting the right terms for each domain.
There is an important contrast between the HDS and previous RDR applications which
needs to be addressed. With previous RDR systems, all the relevant data was known at
the start of the inference and what mattered was the conclusion rather than the inference
path by which it was reached. Here, however, the structure of RDR is used for browsing
(information seeking tasks) to guide the user interaction. That is, the main difference
between the HDS and previous RDR applications is that the user is supposed to see and
browse the knowledge structure of RDR. The RDR approach was initially developed for
knowledge acquisition for knowledge based systems (Compton and Jansen 1990). It
has been applied to a range of tasks, but is best known for its use in providing clinical
interpretations for Chemical Pathology reports (Edwards et al. 1993). In Chemical
Pathology all the data is provided by a laboratory information system, so that reports
can be generated without user involvement; while the task of the expert in adding rules
is simply to identify significant features in the data.
On the other hand, information retrieval requires user interaction and there are problems
with this. The user can either enter keywords or respond to queries about keywords.
Users often prefer some sort of browsing mechanism rather than responding to queries.
As well, the ordering of the keywords presented in RDR reflects the historical
development of the system, not the most natural order for the user. Although, as
demonstrated in other RDR work, RDR greatly assists context-specific knowledge
acquisition, it does not organise the knowledge in a way that is suitable for browsing.
One might consider that an evolving RDR system produces a type of hierarchy and that
this will be adequate, as experts tend to provide general rules first (Suryanto et al.
1999). However, this does not necessarily mean that documents that are more general or
more introductory will be found higher up the tree, or that neighbouring documents in
the tree are necessarily appropriate neighbours.
3.4. Chapter Summary
By developing the help desk system, the possibility of a new way of information
retrieval was demonstrated, where an expert can rapidly build and maintain an
information retrieval or help desk system in his or her area of expertise based on the
RDR techniques. The methodology of RDR for information retrieval is somewhat
different from the use of RDR in other areas. Here, the structure of RDR is used to
guide the user interaction. However, even though RDR greatly assists incremental and
context-specific knowledge acquisition and provides a robust maintenance process, it
does not organise the knowledge in a way that is suitable for browsing.
Therefore, more studies need to be conducted to explore some mechanisms for
reorganising the RDR tree to make it appropriate for browsing. Hierarchical structures
for organising relations between terms in a domain and associated information, or
structures for organising documents, are increasingly referred to as ontologies. Thus, in
the longer term we believe that such a hierarchy will need to be reorganised and that
many different structures will be possible depending on different ontological
frameworks. A proper browsing scheme will be required to access these different
ontologies.
The next chapter presents the basic notions of Formal Concept Analysis, which is the
core technology in the proposed system. The key strategy of the proposed system is to
incorporate the advantages of the concept lattice of Formal Concept Analysis (FCA)
appropriate for browsing, while keeping the incremental aspects of Ripple-Down Rules
(RDR). FCA has previously been used with RDR expert systems as an explanation tool
(Richards and Compton 1997a).
Chapter 4
Formal Concept Analysis
Formal Concept Analysis (FCA) was developed by Rudolf Wille in 1982 (Wille 1982).
It is a theory of data analysis which identifies conceptual structures among data sets
based on the philosophical understanding of a “concept” as a unit of thought comprising
its extension and intension as a way of modelling a domain (Wille 1982; Ganter and
Wille 1999). The extension of a concept is formed by all objects to which the concept
applies and the intension consists of all attributes existing in those objects. These
generate a conceptual hierarchy for the domain by finding all possible formal concepts
which reflect a certain relationship between attributes and objects. The resulting
subconcept-superconcept relationships between formal concepts are expressed in a
concept lattice which can be seen as a semantic net providing “hierarchical conceptual
clustering of the objects… and a representation of all implications between the
attributes” (Wille 1992, p. 493). The implicit and explicit representation of the data
allows a meaningful and comprehensible interpretation of the information.
The method of FCA has been successfully applied to a wide range of applications in
medicine (Cole and Eklund 1996b), psychology (Spangenberg et al. 1999), ecology
(Brüggemann et al. 1997), civil engineering (Kollewe et al. 1994), software engineering
(Lindig and Snelting 2000; Snelting 2000), library (Rock and Wille 2000) and
information science (Eklund et al. 2000). A variety of methods for data analysis and
knowledge discovery in databases have also been proposed based on the techniques of
FCA (Stumme et al. 1998; Hereth et al. 2000; Wille 2001). Information Retrieval is also
a typical application area of FCA (Godin et al. 1993; Carpineto and Romano 1996a;
Priss 2000b; Cole and Stumme 2000; Cole and Eklund 2001).
This chapter is organised as follows: Section 4.1 introduces the basic notions of Formal
Concept Analysis. Section 4.2 describes the concept lattice of FCA, and surveys a
number of algorithms in the literature for constructing a concept lattice from a context.
Conceptual scaling, a technique for dealing with many-valued contexts is explained in
Section 4.3. Finally, Section 4.4 reviews lattice-based information retrieval, as the aim
of this thesis is to develop domain-specific document retrieval mechanisms based on the
FCA techniques.
4.1. Basic Notions of FCA
This section describes the basic notions of FCA such as formal contexts and formal
concepts. The formulas used in this chapter closely adhere to the notions of Wille
(1982) and Ganter and Wille (1999). The adjective “formal” emphasises that Formal
Concept Analysis methods deal with mathematical notions (Ganter and Wille 1999,
p. 17). Here, the words context and concept are used to denote a formal context and a
formal concept, respectively.
4.1.1. Formal Context
The most basic data structure of FCA is a formal context K := (G, M, I) which consists
of two sets G and M, and a binary relation I between G and M. The elements of G and
M are called the objects and attributes of the context, respectively. The relation I
indicates whether an object g has an attribute m by the relationship (g, m) ∈ I, which is
sometimes written gIm.
Table 4.1 shows an example of the formal context (G, M, I) for a part of “the Animal
Kingdom”. Here, the objects are animals and the attributes are the properties of the
objects. A context is normally represented by a cross table with the object names in the
rows and the attribute names in the columns.
In Table 4.1, the context (G, M, I) consists of a set of objects G = {cheetah, tiger, giraffe,
ostrich, penguin} and a set of attributes M = {has hair, has feathers, eats meat, has dark
spots, can swim, has long neck}, where the relation I is {(cheetah, has hair), (cheetah,
eats meat), (cheetah, has dark spots), …, (ostrich, has feathers), (ostrich, has long neck),
(penguin, has feathers), (penguin, can swim)}.
Table 4.1. Formal context for a part of “the Animal Kingdom”.

             a         b             c          d               e         f
             has hair  has feathers  eats meat  has dark spots  can swim  has long neck
1  Cheetah   X                       X          X
2  Tiger     X                       X
3  Giraffe   X                                  X                         X
4  Ostrich             X                                                  X
5  Penguin             X                                        X

A symbol “X” designates that a particular object has the corresponding attribute.
4.1.2. Formal Concept
Formal concepts reflect a relationship between objects and attributes. A formal concept
is defined as a pair (X, Y) where X is the set of objects and Y is the set of attributes.
The set X is called the extent and the set Y is called the intent of the concept (X, Y).
The following derivation operators are used to compute formal concepts of a context.
For any set X ⊆ G and any set Y ⊆ M, X′ and Y′ are defined correspondingly as
follows: X′ := {m ∈ M | ∀g ∈ X: (g, m) ∈ I} and Y′ := {g ∈ G | ∀m ∈ Y: (g, m) ∈ I}. Then,
a formal concept is formulated as a pair (X, Y) with X ⊆ G, Y ⊆ M, X′ = Y and Y′ = X.
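The derivation operators and the concept condition can be sketched directly on the animal context of Table 4.1. The following Python fragment is only an illustrative toy (the names `prime_objects` and `prime_attributes` are ours, not from the thesis or any FCA library):

```python
# The animal context of Table 4.1: each object maps to its attribute set.
CONTEXT = {
    "Cheetah": {"has hair", "eats meat", "has dark spots"},
    "Tiger":   {"has hair", "eats meat"},
    "Giraffe": {"has hair", "has dark spots", "has long neck"},
    "Ostrich": {"has feathers", "has long neck"},
    "Penguin": {"has feathers", "can swim"},
}
OBJECTS = set(CONTEXT)
ATTRIBUTES = set().union(*CONTEXT.values())

def prime_objects(X):
    """X' : all attributes shared by every object in X (X' = M for empty X)."""
    return set.intersection(*(CONTEXT[g] for g in X)) if X else set(ATTRIBUTES)

def prime_attributes(Y):
    """Y' : all objects that possess every attribute in Y."""
    return {g for g in OBJECTS if Y <= CONTEXT[g]}

def is_formal_concept(X, Y):
    """(X, Y) is a formal concept iff X' = Y and Y' = X."""
    return prime_objects(X) == set(Y) and prime_attributes(Y) == set(X)

print(sorted(prime_objects({"Cheetah", "Tiger"})))  # ['eats meat', 'has hair']
print(is_formal_concept({"Cheetah", "Tiger"}, {"has hair", "eats meat"}))  # True
```

Note that ({Cheetah}, {has hair}) is not a concept: {Cheetah}′ also contains "eats meat" and "has dark spots", so the pair fails the closure condition.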
The formulas 4.1 and 4.2 can be used to construct all formal concepts of a context,
denoted by 𝔅(G, M, I). First, all row-intents {g}′ with g ∈ G (formula 4.1) or all
column-extents {m}′ with m ∈ M (formula 4.2) are obtained. Then, all their
intersections are found so that all extents X′ or all intents Y′ of the formal concepts of K
can be determined. Following this, the intents of all determined extents are computed.
Note that there are a number of different algorithms in the literature.

X′ = ∩g∈X {g}′   (4.1)        Y′ = ∩m∈Y {m}′   (4.2)

Table 4.2 shows an example of how all the concepts can be drawn from the context (G,
M, I) in Table 4.1 based on formula 4.2. The detailed process is as follows. Note that
this process is based on the formulae of Wille (1982).
Table 4.2. A procedure of finding formal concepts from the context in Table 4.1.

(a)
Step   Intent   Extent
1               {1, 2, 3, 4, 5}
2      a        {1, 2, 3}
3      b        {4, 5}, { }
4      c        {1, 2}
5      d        {1, 3}, {1}
6      e        {5}
7      f        {3, 4}, {3}, {4}

(b)
Step   Intent                          Extent
1      { }                             {1, 2, 3, 4, 5}
2      {a}                             {1, 2, 3}
3      {b}, {a, b, c, d, e, f}         {4, 5}, { }
4      {a, c}                          {1, 2}
5      {a, d}, {a, c, d}               {1, 3}, {1}
6      {b, e}                          {5}
7      {f}, {a, d, f}, {b, f}          {3, 4}, {3}, {4}
Procedure 1: Formulate an extent containing the set of objects G representing the
largest concept of K. Then, perform Procedure 2 for each attribute m.
Procedure 2: Find the set of objects X which contains the attribute m. Following that,
check whether any extent in the list is equivalent to X. If an equivalent extent of X does
not exist in the list, the set X is added as an extent of the attribute. Next, the intersection
of X and all extents calculated in previous steps, is determined. When the intersection
set does not exist in the list, the set is also added as an extent of the attribute. Table
4.2(a) shows the result of Procedures 1 and 2.
Procedure 3: Then, for each extent X in Table 4.2(a), its intent Y ← {m ∈ M | gIm for all
g ∈ X} is determined. Table 4.2(b) shows the result of this step. Now we have 11
formal concepts for the context (G, M, I) in Table 4.1.
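Procedures 1–3 can be sketched in a few lines of Python, starting from the column-extents {m}′ of Table 4.1 and closing them under intersection. This is an illustrative toy using the attribute letters a–f from the table, not the thesis implementation:

```python
# Column-extents {m}' of the Table 4.1 context: attribute -> objects having it.
COLUMNS = {
    "a": {1, 2, 3},  # has hair
    "b": {4, 5},     # has feathers
    "c": {1, 2},     # eats meat
    "d": {1, 3},     # has dark spots
    "e": {5},        # can swim
    "f": {3, 4},     # has long neck
}
G = {1, 2, 3, 4, 5}

# Procedure 1: start from the largest extent G.
# Procedure 2: for each attribute, add its column-extent and every new
# intersection with the extents found so far.
extents = [G]
for m, col in COLUMNS.items():
    candidates = [col] + [col & x for x in extents]
    for x in candidates:
        if x not in extents:
            extents.append(x)

# Procedure 3: the intent of an extent X is every attribute shared by all of X.
def intent(X):
    return {m for m, col in COLUMNS.items() if X <= col}

concepts = [(x, intent(x)) for x in extents]
print(len(concepts))  # 11, matching Table 4.2
```

Running this reproduces the eleven concepts of Table 4.2(b), including the top concept ({1, 2, 3, 4, 5}, { }) and the bottom concept ({ }, {a, b, c, d, e, f}).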
4.2. Concept Lattice
The formal concepts of a context K are expressed in a concept lattice which provides
hierarchical conceptual clustering of the objects and a representation of all implications
between the attributes (Wille 1992).
Figure 4.1. The concept lattice of the formal context in Table 4.1.
4.2.1. Construction of a Concept Lattice
The concept lattice is the basic conceptual structure of FCA, in which the concepts are
ordered by inclusion of their extents (dually, by reverse inclusion of their intents). To form a concept lattice, hierarchical
subconcept-superconcept relations between all the formal concepts need to be found.
This is formalised by (X1, Y1) ≤ (X2, Y2) : ⇔ X1 ⊆ X2 (⇔Y2 ⊆ Y1) where (X1, Y1) is
called a subconcept of (X2, Y2) and (X2, Y2) is called a superconcept of (X1, Y1). The
relation ≤ is called the hierarchical order of the concepts. The set of all the formal
concepts of the context (G, M, I) with this ordered relation is a complete lattice in which
the infimum and the supremum are given by formulas 4.3 and 4.4.

∧i∈I (Xi, Yi) = ( ∩i∈I Xi , ( ∪i∈I Yi )″ )   (4.3)        ∨i∈I (Xi, Yi) = ( ( ∪i∈I Xi )″ , ∩i∈I Yi )   (4.4)
The complete hierarchical subconcept-superconcept relation is called the concept lattice
of the context (G, M, I) denoted by £(G, M, I). The line diagram in Figure 4.1 shows the
concept lattice of the context K in Table 4.1. Each node represents a formal concept (X,
Y). Not only all relations between objects and attributes but also all relations between
objects and between attributes can easily be observed through this lattice.
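The order relation and formulas 4.3 and 4.4 can be illustrated on the animal context (objects numbered 1–5, attributes a–f as in Table 4.1). The helper names below are ours, and the meet/join are computed pairwise only, as a sketch:

```python
# Animal context of Table 4.1: object number -> its intent (attributes a-f).
CONTEXT = {
    1: {"a", "c", "d"},  # Cheetah
    2: {"a", "c"},       # Tiger
    3: {"a", "d", "f"},  # Giraffe
    4: {"b", "f"},       # Ostrich
    5: {"b", "e"},       # Penguin
}
M = {"a", "b", "c", "d", "e", "f"}

def ext(Y):   # Y' : objects having all attributes in Y
    return {g for g, atts in CONTEXT.items() if Y <= atts}

def intn(X):  # X' : attributes common to all objects in X
    return set.intersection(*(CONTEXT[g] for g in X)) if X else set(M)

def leq(c1, c2):
    """(X1, Y1) <= (X2, Y2) iff X1 is a subset of X2."""
    return c1[0] <= c2[0]

def meet(c1, c2):  # formula 4.3: intersect extents, re-derive the intent
    X = c1[0] & c2[0]
    return (X, intn(X))

def join(c1, c2):  # formula 4.4: intersect intents, re-derive the extent
    Y = c1[1] & c2[1]
    return (ext(Y), Y)

tigers = ({1, 2}, {"a", "c"})   # ({cheetah, tiger}, {has hair, eats meat})
spotted = ({1, 3}, {"a", "d"})  # ({cheetah, giraffe}, {has hair, has dark spots})
inf = meet(tigers, spotted)     # the cheetah concept
sup = join(tigers, spotted)     # the "has hair" concept
print(leq(inf, tigers) and leq(inf, spotted))  # True: infimum lies below both
print(leq(tigers, sup) and leq(spotted, sup))  # True: supremum lies above both
```

Here the meet of the two concepts is ({1}, {a, c, d}), the cheetah concept, and their join is ({1, 2, 3}, {a}), the "has hair" concept, both of which appear as nodes in Figure 4.1.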
[Figure 4.1 labels each node with its extent and intent, from ({cheetah, tiger, giraffe,
ostrich, penguin}, { }) at the top, through concepts such as ({cheetah, tiger}, {has hair,
eats meat}), ({cheetah, giraffe}, {has hair, has dark spots}) and ({ostrich, penguin},
{has feathers}), down to ({ }, {has hair, has feathers, eats meat, has dark spots, can
swim, has long neck}) at the bottom.]
4.2.2. Algorithms for Constructing a Concept Lattice
Computing a concept lattice is an important issue and has been widely studied to
develop more efficient algorithms. As a consequence, a number of batch algorithms
(Chein 1969; Ganter 1984; Bordat 1986; Ganter and Reuter 1991; Kuznetsov 1993;
Lindig 1999) and incremental algorithms (Norris 1978; Dowling 1993; Godin et al.
1995; Carpineto and Romano 1996a; Ganter and Kuznetsov 1998; Nourine and
Raynaud 1999; Stumme et al. 2000; Valtchev and Missaoui 2001) exist in the literature.
Batch algorithms build formal concepts and a concept lattice from the whole context in
a bottom-up approach (from the maximal extent or intent to the minimal one) or a top-
down approach (from the minimal extent or intent to the maximal one). Incremental
algorithms gradually reformulate the concept lattice starting from a single object with its
attribute set.
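The incremental idea can be sketched very compactly: the concept intents of a context are exactly the full attribute set M together with all intersections of object intents, so the concept set can be maintained one object at a time. The following is only a simplified illustration of this idea, not a reproduction of any of the published algorithms:

```python
# Simplified incremental construction on the animal context of Table 4.1:
# maintain the set of concept intents (closed under intersection) and, for each
# new object, add its intent plus all intersections with the intents known so far.
M = frozenset("abcdef")
OBJECT_INTENTS = {
    "Cheetah": frozenset("acd"),
    "Tiger":   frozenset("ac"),
    "Giraffe": frozenset("adf"),
    "Ostrich": frozenset("bf"),
    "Penguin": frozenset("be"),
}

intents = {M}  # M is the intent of the empty extent
for g, row in OBJECT_INTENTS.items():
    # union with the new intent and all its intersections with existing intents
    intents |= {row} | {row & y for y in intents}

print(len(intents))  # 11 concept intents, matching Table 4.2
```

Real incremental algorithms such as Godin et al. (1995) additionally maintain the lattice edges during each update, which is where most of their complexity lies; the sketch above recovers only the concept set.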
Godin et al. (1995) demonstrated that even the simplest and least efficient incremental
algorithms outperformed all the batch algorithms in their experimental comparative
study. Recently, Kuznetsov and Ob’edkov (2001)46 conducted another comparative
study, both theoretical and experimental. The study found that the performance of
algorithms depends on the properties of input data such as the size of contexts and the
density of contexts. Results indicated that the Godin algorithm (Godin et al. 1995) was a
good choice in the case of small and sparse contexts. On the other hand, the Bordat
algorithm (1986) showed a good performance for large, average density contexts.
However, when the set of objects was small, the Bordat algorithm was several times
slower than other algorithms. The study also indicated that the Kuznetsov algorithm
(1993) and the Norris algorithm (1978) should be used for large and dense contexts.
More recently, the Titanic algorithm (Stumme et al. 2000) and the Valtchev algorithm
(Valtchev and Missaoui 2001) have been released and have not been included in the
above comparative study. The Titanic algorithm is based on data mining techniques for
computing frequent item sets. The experimental results showed that the algorithm is
46 The algorithms of Chein (1969), Norris (1978), Ganter (1984), Bordat (1986), Kuznetsov (1993),
Dowling (1993), Godin (1995), Lindig (1999), and Nourine (1999) were used for the comparison.
faster than Ganter’s Next-Closure (Ganter and Reuter 1991) for the whole data set under
normal conditions.
The Valtchev algorithm extended the Godin et al. algorithm (1995) based on two
scenarios. In the first scenario, the algorithm updates the initial lattice by considering
new objects one at a time. In the second one, it builds the partial lattice over the new
object set first and then merges it with the initial lattice. The first algorithm showed an
improvement of the Godin et al. algorithm (1995) and was suggested for small sets of
objects. On the other hand, the second was suggested as the right choice for medium-
size sets.
Table 4.3 shows a summary of the time complexity and polynomial delay of
algorithms47. |G| denotes the number of objects, |M| the number of attributes and |L| the
size of the concept lattice. In the Godin algorithm (1995), µ designates an upper bound
on |f({x})|, where f({x}) denotes the set of objects associated with the attribute x.
When there is a fixed upper bound µ, the time complexity of this algorithm is O(|G|).
The Nourine algorithm (1999) is half-incremental: it incrementally constructs the
concept set, but formulates the lattice graph in batch.
Table 4.3. Summary of the time complexity and polynomial delay of algorithms.

Algorithm           Incremental   Time complexity        Polynomial delay
Ganter (1984)                     O(|G|²|M||L|)          O(|G|²|M|)
Bordat (1986)                     O(|G||M|²|L|)          O(|G||M|²)
Kuznetsov (1993)                  O(|G|²|M||L|)          O(|G|³|M|)
Lindig (1999)                     O(|G|²|M||L|)          O(|G|²|M|)
Norris (1978)            X        O(|G|²|M||L|)          N/A
Dowling (1993)           X        O(|G|²|M||L|)          N/A
Godin (1995)             X        O(2^(2µ)|G|)           N/A
Nourine (1999)           X        O((|G| + |M|)|G||L|)   N/A
Titanic (2000)           X        O(|G|²|M||L|)          N/A
Valtchev (2001)          X        O((|G| + |M|)|G||L|)   N/A
47 Note that not all of the algorithms are indicated in the table.
The main issue in computing all the formal concepts of a context is how to generate
all the concepts without repeatedly generating the same concept. There are a number
of techniques to avoid such repetition: dividing the set of concepts into several parts,
using a hash function, maintaining an auxiliary tree structure or using an attribute cache.
Kuznetsov and Ob’edkov (2001) noted that an empirical comparison of algorithms is
not an easy task for a number of reasons. First of all, algorithms described by authors
are often unclear, leading to misinterpretations. Secondly, the data structures of the
algorithms and their realisations are often not specified. Another issue is related to the
setting up of consensus data sets to use as test beds. The context parameters such as an
average attribute set associated with an object and vice versa, or the size and density of
contexts should be considered. The test environments such as programming languages,
implementation techniques, and platforms are also crucial factors which influence the
performance of algorithms. For example, Valtchev and Missaoui (2001) indicated that
the Nourine algorithm (1999) is the most efficient batch algorithm48, while Kuznetsov
and Ob’edkov (2001) indicated that the Nourine algorithm is not the fastest algorithm,
even in the worst case49. Kuznetsov and Ob’edkov (2001) noted that this result was
probably caused by different implementation techniques. These are the main reasons for
the existence of quite a lot of algorithms in the literature.
4.3. Conceptual Scaling
Conceptual scaling was introduced in order to deal with many-valued attributes
(Ganter and Wille 1989; Ganter and Wille 1999). An application domain usually
comprises more than one attribute, each with a range of values, so there is a need to
handle many-valued attributes in a context. In addition, there is often a need to
analyse (or interpret) concepts with regard to interrelationships between attributes in a
domain. This is the main motivation for conceptual scaling.
48 H. Delugach and G. Stumme (Eds.): ICCS 2001, p. 302.
49 E. Mephu Nguifo et al. (Eds.): CLKDD'01, p. 43.
For instance, the domain of a “used car market” consists of a number of attributes such
as price, year built, maker, colour, body type, transmission and others, each attribute
with its own set of values. Such attributes can be presented together in a context called
a many-valued context. Then, when one is interested in analysing “used cars” with
regard to an interrelationship between certain attributes in the many-valued context,
the attributes of interest can be combined into a concept lattice.
A many-valued context is defined as K = (G, M, W, I) which consists of sets G, M, W
and a ternary relation I between G, M and W (I ⊆ G × M × W). The elements of G, and
M are called the objects and attributes of K respectively, and the elements of W attribute
values. The notation (g, m, w) ∈ I indicates that the attribute m has the value w for the
object g.
A many-valued context can be represented in a table which is labelled by the objects in
the rows and by the attributes in the columns. Table 4.4 shows an example of a many-
valued context for the domain of a “used car market”. The context (G, M, W, I) consists
of a set of objects G = {car1, car2, car3, car4, car5, car6}, a set of attributes M = {maker,
transmission, body type, colour, price} and a set of attribute values W = {DAEWOO,
HYUNDAI, KIA, SSANGYONG, Auto, Manual, Hatch back, … , $5,000, $7,000,
$9,000, $11,000, $14,000, $16,000}. An entry in row g and in column m designates the
attribute value w.
Table 4.4. An example of a many-valued context for a part of a “used car market”.
Maker Transmission Body type Colour Price
Car 1 DAEWOO Auto Hatch back White $5,000
Car 2 HYUNDAI Manual Sedan Silver $7,000
Car 3 KIA Auto Convertible Burgundy $9,000
Car 4 DAEWOO Manual Sedan Red $11,000
Car 5 HYUNDAI Auto Coupe Black $14,000
Car 6 SSANGYONG Auto Wagon Silver $16,000
Each attribute of the many-valued context can be transformed into a one-valued context
called a conceptual scale. Then, the scales are joined together as a way of interpreting
the concepts of objects. This interpretation process is called conceptual scaling.
A conceptual scale for a particular attribute m ∈ M of a many-valued context is defined
as a context Sm := (Gm, Mm, Im) where Mm ⊆ W is a set of values of the attribute m in the
many-valued context K = (G, M, W, I) and Gm ⊆ Mm.
Figures 4.2 and 4.3 show examples of scales for the attributes price and transmission in
Table 4.4. A scale context is equivalent to a one-valued context. In a scale context, both
the rows and columns of the table are usually headed by the values of the scale attribute
(e.g., Figure 4.3). However, any expression or interpretation of the values of attributes
can be used to make it easier to define a scale especially for numerical attributes (e.g.,
Figure 4.2). The expressions can denote a range of values for the values of the scale
attribute. To represent these expressions, Cole and Eklund (2001) introduce a function
called the composition operator: for an attribute m, a map Wm → Gm where Wm = {w ∈ W
| ∃g ∈ G : (g, m, w) ∈ I}. This maps the values of the attribute m to scale objects.
Price                Cheap   Mid-range   Expensive
<$5,000               X
$5,000 - $8,000       X         X
$8,000 - $12,000                X
$12,000 - $15,000               X            X
>$15,000                                     X
Figure 4.2. A scale context for the attribute price (Sprice) in Table 4.4 and its concept lattice.
Note that the scale context for the attribute “price” uses expressions rather than attribute
values (i.e., ≤$8,000 = cheap, ≥$8,000 & ≤$15,000 = mid-range, ≥$15,000 = expensive). A
symbol “X” designates that the row value corresponds to the column value.
Transmission   Auto   Manual
Auto             X
Manual                   X
Figure 4.3. A scale context for the attribute transmission (Strans) and its concept lattice.
Table 4.5. A realised scale context for the scale price in Figure 4.2.
Object Cheap Mid-range Expensive
Car1 X
Car2 X X
Car3 X
Car4 X
Car5 X X
Car6 X
Figure 4.4. Concept lattice for the derived context in Table 4.5.
Then, a realised scale can be derived from the scales and the many-valued context when
a diagram is requested at run time. Table 4.5 shows an example of this realised scale
context for the attribute price. This is formulated from the scale for the attribute price in
Figure 4.2, and its objects and values in the many-valued context in Table 4.4. In
essence, the derived context is equivalent to a formal context presented in Section 4.1.1.
Figure 4.4 shows the derived concept lattice for the realised scale context in Table 4.5.
A realised scale can be combined into this concept lattice to analyse concepts according
to an interrelationship between two scales.
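Deriving the realised scale of Table 4.5 amounts to mapping each car's price value through the scale rows of Figure 4.2 to the one-valued attributes cheap, mid-range and expensive. The sketch below is illustrative only; the interval boundaries are an assumption chosen to reproduce Table 4.5, including the overlapping bands:

```python
# Price values of the many-valued context in Table 4.4.
PRICES = {"Car1": 5000, "Car2": 7000, "Car3": 9000,
          "Car4": 11000, "Car5": 14000, "Car6": 16000}

# Rows of the price scale in Figure 4.2 as (low, high, scale attributes).
# Boundary handling is a hypothetical choice made to match Table 4.5.
SCALE_ROWS = [
    (0,     5000,      {"cheap"}),
    (5001,  8000,      {"cheap", "mid-range"}),
    (8001,  12000,     {"mid-range"}),
    (12001, 15000,     {"mid-range", "expensive"}),
    (15001, 10**9,     {"expensive"}),
]

def realise(prices, rows):
    """Derive the realised (one-valued) scale context from price values."""
    out = {}
    for car, price in prices.items():
        for lo, hi, atts in rows:
            if lo <= price <= hi:
                out[car] = set(atts)  # the row's attributes become the car's
                break
    return out

REALISED = realise(PRICES, SCALE_ROWS)
# Car2 ($7,000) falls in the overlapping band and is both cheap and mid-range,
# as in Table 4.5; Car6 ($16,000) is expensive only.
```

The resulting one-valued context is exactly Table 4.5, from which the concept lattice of Figure 4.4 is computed as in Section 4.1.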
Figure 4.5 shows a combination of two scales in a lattice structure using a nested line
diagram. The outer structure is the scale of price and the nested inner structure is the
scale of transmission. Concepts of the many-valued context can be interpreted in this
combined concept lattice. For instance, it can be read that there is no “manual” car in
the “expensive” concept and no “auto” car in the “ cheap | mid-range” concept. Note that
the small grey vertex indicates that there is no object which satisfies the attribute value
in the vertex, as opposed to the black vertex.
Figure 4.5. Combined scales for price and transmission using a nested line diagram.
More than one attribute in a many-valued context can be combined in a scale.
Conceptual scaling is also used with one-valued contexts in order to reduce the
complexity of the visualisation (Stumme 1999; Cole and Stumme 2000). In this case,
scales are applied for grouped vertical slices of a large context. More cases for the use
of conceptual scaling can be referred to (Stumme 1999; Cole and Stumme 2000; Cole
and Eklund 2001) and TOSCANA50 (Vogt et al. 1991; Kollewe et al. 1994; Vogt and
Wille 1995).
4.4. FCA for Information Retrieval
Formal Concept Analysis has numerous applications for data analysis and information
retrieval in fields such as medicine (Cole and Eklund 1996b), psychology (Spangenberg
and Wolff 1991; Strahringer and Wille 1993; Spangenberg et al. 1999), ecology
(Brüggemann, Schwaiger et al. 1995; Brüggemann, Zelles et al. 1995; Brüggemann et
al. 1997), social science (Ganter and Wille 1989), and political science (Vogt et al.
1991). There are also applications of FCA in civil engineering (Kollewe et al. 1994),
software engineering (Lindig 1995; Snelting 1996; Lindig and Snelting 2000; Snelting
50 TOSCANA is a software tool set for the visualisation of data with nested line diagrams and for
navigating and retrieving objects in databases.
http://www.mathematik.tu-darmstadt.de/ags/ag1/Software/Toscana/Welcome_en.html (2002).
2000), linguistics (Grosskopf and Harras 1998), libraries (Rock and Wille 2000), and
information science (Eklund et al. 2000). Most of these application systems elaborate
the standard software tool, TOSCANA, which has been developed for analysing and
exploring data based on the methods of FCA. There are a number of other lines of FCA
research for knowledge representation with Conceptual Graphs (Wille 1997; Mineau et
al. 1999), text data mining (Groh and Eklund 1999) and knowledge discovery in
databases (Stumme et al. 1998; Hereth et al. 2000; Wille 2001).
Information Retrieval is one typical application area of FCA. A strong feature that
makes FCA applicable to the field of information retrieval is that FCA can produce a
visible concept lattice, which shows the inherent structure among data in a lattice so that
it can be used as a classification system. A concept lattice of FCA represents the
generalisation and specialisation relationships between document sets and attribute sets.
Thus, the lattice of FCA can represent conceptual hierarchies for the applied domain.
Moreover, it can be superior to the hierarchical tree structure as the lattice gives all
minimal refinements and minimal enlargements for a query (Godin et al. 1995). In
addition, the hierarchical tree structure, in which each cluster has exactly one parent,
can also be embedded into the lattice structure.
With these advantages, a number of researchers have proposed the lattice structure for
document retrieval (Godin et al. 1993; Carpineto and Romano 1996a; 1996b). More
recently several researchers have also studied lattice-based information retrieval with
graphically represented lattices along with nested line diagrams (Cole and Stumme
2000; Priss 2000b; Cole and Eklund 2001).
4.4.1. Godin et al. Approach
Godin et al. (1993) studied the advantage of the lattice method against hierarchical
classification, and also evaluated retrieval performance by comparing the lattice
structure with a manually built hierarchical classification and a conventional Boolean
retrieval method. The performance of hierarchical classification retrieval showed
significantly lower recall compared to the lattice-based retrieval and Boolean querying.
No significant performance difference was found between the lattice-based retrieval and
Boolean querying, but the lattice structure was suggested as being an attractive
alternative because of the potential advantage of lattice browsing. The experiments were
performed on a small database extracted from a catalogue of films assigning a set of
controlled terms manually to each film in the database. The prototype interface was
implemented on a standard screen for a Macintosh microcomputer using window, menu
and dialog interface tools and viewing only direct neighbours in the lattice.
4.4.2. Carpineto and Romano Approach
Carpineto and Romano (1995; 1996b) determined that the performance of lattice
retrieval was comparable to or better than Boolean retrieval on a medium-sized database
for a computer engineering collection which was assigned controlled terms manually.
They (Carpineto and Romano 1996a) also extended their study using a thesaurus as
background knowledge in formulating a browsing structure of FCA and presented
experimental evidence of a substantial improvement after the introduction of the
thesaurus. The interface developed by Carpineto and Romano (1995; 1996b) showed the
lattice graph using a similar fisheye view technique (Furnas 1986)51 of individual nodes
on a stand-alone Symbolic Lisp Machine. A Boolean query interface (Carpineto and
Romano 1998) was also supported to move directly to a relevant portion of the lattice
from a user’s query. It allowed users to navigate the lattice dynamically and made it
easy to refine the query.
4.4.3. FaIR Approach
More recently, FCA has been used for document retrieval culminating in a faceted
information retrieval system (FaIR) that incorporates a lattice-based faceted thesaurus
(Priss 2000a; 2000b). In this approach, a thesaurus is predefined for an applied domain
and divided into a number of facets52, called a faceted thesaurus. A portion of a
hierarchy in the thesaurus can be a facet and it is represented with a lattice rather than a
51 A fisheye view (Furnas 1986) is a technique to view a specific portion of information in great detail
while also showing the context that contains the detail. 52 “Facets are relational structure consisting of units, relations and other facets selected for a certain
purpose” (Priss 2000a).
hierarchical tree structure. Then, documents are indexed into the concepts of the facet
lattices by mapping the keywords of the documents to the facet concepts (i.e., by
mapping functions). A document can be indexed into more than one facet lattice.
Documents are retrieved by selecting a facet and a concept of the facet. A main facet
lattice is then provided with the retrieved documents. Other facet lattices relevant to the
retrieved documents, if any, are also displayed along with the main facet lattice.
main advantage of FaIR is that large retrieval sets can be divided into smaller sets (i.e.,
facets) in a retrieval display. In addition, a set of concepts can be retrieved in response
to a query from conceptual relationships among terms that are inherent to the domain
thesaurus. The FaIR system described in Priss (2000b) is under development
and a navigation interface has not yet been published.
4.4.4. Cole et al. Approach
The focus in Godin et al. (1993), and Carpineto and Romano (1995; 1996b) was to
examine the advantages and capabilities of lattice-based retrieval against conventional
Boolean querying and hierarchical classification retrieval. Cole et al. (Cole and Stumme
2000; Cole et al. 2000; Cole and Eklund 2001) have further developed more precise
browsing mechanisms by combining conceptual scales using nested line diagrams for an
e-mail management system (CEM) and a real estate system.
CEM (Cole and Stumme 2000; Cole et al. 2000) uses the concept lattice of FCA to
organise and browse e-mails rather than a typical tree structure. It is based on
TOSCANA, but a user can maintain and update an e-mail collection instead of a
knowledge engineer. Each e-mail is assigned a set of catchwords. A hierarchy (a
partially ordered set) is formed from the more general catchwords. Even though the
hierarchy is represented by a tree, the embedded structure is a concept lattice. E-mails
are managed based on the hierarchy. A cluster in the hierarchy can be a scale (i.e.,
default scale) and other scales can also be formulated to group related catchwords
together. In assigning catchwords to an e-mail, the system identifies catchwords
relevant to the e-mail from the general catchwords used for the hierarchy. The
catchwords in the clusters of the hierarchy which include the identified relevant
catchwords are added automatically as the catchwords for the e-mail. Other specific
catchwords are also added to the e-mail. In response to a user’s query, a virtual folder
represented in a lattice structure is formulated with a collection of e-mail documents
retrieved. Then, the user can navigate e-mails in the conceptual space by a scale in a
simple line diagram or by combining scales in nested line diagrams.
CEM has been extended into a system for real-estate advertisements (Cole and Eklund
2001). CEM used conceptual scaling to deal with a one-valued context (i.e., the single
attribute catchword), whereas the real estate system used it for a many-valued context
(i.e., many attributes such as number of bedrooms, rental price, views and others). In
this system, attributes and their values for the advertisements are pre-defined, and are
presented in an ordered hierarchy. Scales are also predefined for each attribute in the
hierarchy. In mapping advertisements to the hierarchy, the system parses the contents of
real-estate advertisements in an HTML file, and extracts object information based on
the values of the predefined attributes for the advertisements. Then, the system maps
objects to the hierarchy based on the extracted information. The hierarchy becomes the
main navigation space. A user can also navigate in a conceptual space by combining
two scales in a nested line diagram.
4.4.5. Proposed Approach
Incremental Development
The main difference between the previous work and our work is an emphasis on
incremental development and evolution for a document management and retrieval
system (Kim and Compton 2001b; 2001c). The main aim of our study is a browsing
mechanism for retrieval that can be collaboratively created and maintained, in which
users evolve their own organisation of documents but are assisted in doing so, to
facilitate improvement of the search performance of the system as it evolves.
Web-based System
Another difference is that our focus is on a Web-based system (Kim and Compton 2000;
2001a) using a hypertext representation of the links to a node, but without a graphical
display of the overall lattice. Lin (1997) discussed how visualisation through a graphical
interface could enhance information retrieval. In fact, except for the Godin et al.
method, all navigation mechanisms in the applications of FCA are devoted to exploring
the lattice graph itself. Figure 4.6 shows a typical navigation space in the FCA
approach. Even though we agree that the lattice diagram can be a useful tool to analyse
and explore the whole map of a domain, we anticipate that most Web users are
unfamiliar or uncomfortable with concept lattice diagrams, and that viewing the whole
lattice diagram will also remain a problem. Accordingly, we have developed a Web-
based lattice display using hyperlinks and URLs. We believe that the hyperlink
technique is a fairly natural simplification of the lattice display that loses none of the
advantages of FCA. It is also comprehensible and natural for Web users. The
browsing system developed by Cole and Eklund (2001) is also Web-based, but it is
implemented as a stand-alone application using line diagrams and it can only browse the
lattice with pre-defined scales53.
Figure 4.6. An example of a line diagram (extracted from Groh et al. 1998).
53 http://meganesia.int.gu.edu.au/cgi-bin/projects/rentalFCA/BrowseREC.pl?context=region&map=y
(2002).
Integration with General Information Retrieval Mechanisms
We have also integrated a number of information retrieval mechanisms into lattice
browsing. Firstly, a Boolean query interface is combined with the FCA browsing
interface in a similar way to Carpineto and Romano (1998). The approach of Carpineto
and Romano is simply to move to a relevant portion of the lattice with a user’s query. In
our approach, a number of information retrieval techniques are combined with the query
interface, such as eliminating stopwords, stemming and expanding the user’s query based on
synonyms and abbreviations. In other words, a user can formulate a query by entering
any textwords, as in a conventional Boolean query interface. Then, the system normalises
the user query by eliminating stopwords and stemming, and extends the query based on
abbreviations and synonyms. Following that, the system identifies the most relevant
portion in the lattice for the query. The user can navigate the relevant documents
starting with the portion of the lattice.
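As an illustration, the normalisation steps just described can be sketched in Python. The stopword list, the crude suffix-stripping stand-in for a stemmer, and the abbreviation/synonym table below are illustrative placeholders, not the system's actual resources.

```python
# Sketch of the query-normalisation pipeline: stopword removal,
# stemming, and expansion via abbreviations/synonyms.
# STOPWORDS, SYNONYMS and stem() are illustrative placeholders.

STOPWORDS = {"the", "a", "an", "of", "for", "and", "in", "on"}
SYNONYMS = {"ai": "artificial intelligence", "ka": "knowledge acquisition"}

def stem(word):
    # Crude suffix stripping standing in for a real stemming algorithm.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalise_query(text):
    # Eliminate stopwords, stem the remaining textwords, then expand
    # the query with any known synonyms or abbreviation expansions.
    terms = [stem(w) for w in text.lower().split() if w not in STOPWORDS]
    expanded = set(terms)
    for t in terms:
        if t in SYNONYMS:
            expanded.add(SYNONYMS[t])
    return expanded
```

A query is then matched against the lattice using the normalised term set rather than the raw text.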
Secondly, a textword search is supported. This is invoked automatically to identify the
relevant documents from the context when the system fails to get a result from the
lattice nodes. The system formulates a sub-lattice with the results which contain the
user’s query in their context (a set of documents) and their keywords (a set of
keywords). Navigation can be carried out on this sub-lattice.
Conceptual Scaling
Conceptual scaling is also supported to allow users to get more specific results and to
search relevant documents by a relationship between the domain attributes and the
keywords of documents. One might consider that conceptual scaling is similar to the
optimising process of a rule tree described in Chapter 3. However, in essence,
conceptual scaling gives a view of a lattice formed from objects that have the specified
attribute-value pairs. On the other hand, tree optimisation restructures the rule tree
to reduce the number of conditions to be reviewed by the user, and the number of
interactions between the user and the system.
In the proposed approach, a many-valued context with the obvious attributes for the
evolved domain is formulated and relevant values in a one-valued context of the
keyword sets are grouped (Kim and Compton 2001a; 2001b). A nested structure is then
automatically derived on the fly from the attributes of the many-valued context and the
grouping names of the one-valued context corresponding to search results. That is, a
concept lattice is built using the keyword sets of the resulting documents in response to
a user’s query as an outer structure and from this a nested structure is produced. A user
can navigate recursively among the nested attributes in regard to the interrelationship
between the outer structure (keywords) and the nested attributes.
The e-mail management system and real estate system of Cole et al. (Cole and Stumme
2000; Cole and Eklund 2001) also support conceptual scaling. The techniques of
conceptual scaling of Cole et al. and our approach are quite similar. In the e-mail
management system (CEM), scaling is applied to a one-valued context for the attribute
catchword. As indicated earlier, a user can define a hierarchy on more general
catchwords and e-mails are managed based on this hierarchy. A cluster in the hierarchy
can be a scale (i.e., default scale). In addition, the user can establish scales to group
together related catchwords. Then, the user begins their search by requesting a scale
they have defined or a default scale in a single line diagram. The user can also navigate
e-mails by combining two scales in a nested line diagram.
In the real estate system (Cole and Eklund 2001), attributes and their values for the
advertisements are pre-defined, and are ordered in a hierarchy. Objects and their
attributes are managed in a many-valued context, and the objects are mapped into the
hierarchy. Then, the hierarchy becomes the main navigation space and serves as a general
hierarchical classification system. Scales are predefined for each attribute and its values
in the many-valued context. A user can navigate in a conceptual space by combining
two scales in a nested line diagram. For example, if the user is interested in observing
fully-furnished mid-range properties, they can combine the scale “price” with a scale
for “furnished” in the conceptual space.
In contrast, in our approach, a concept lattice is dynamically built with the
annotated documents and their keywords as an outer structure. This outer lattice
structure along with the keywords set becomes the main navigation space. Then, a
many-valued context is defined with attributes for the evolving domain based on a
partially ordered hierarchy among the attributes. The hierarchy can be considered as an
ontological structure of the evolving domain described with the most obvious attribute.
Each attribute and its values in the many-valued context become a scale. A knowledge
engineer (or a user) can also group relevant values in the keywords of documents and a
grouping becomes a scale. The groupings can be defined whenever they are required. A
nested structure is then constructed dynamically and automatically from the search
results of a corresponding concept of the outer lattice at run time. That is, all relevant
scales for the search results are extracted from the scales of the many-valued context
and the groupings of the one-valued context, and are included in a nested structure using
pop-up and pull-down menus. A menu structure is incorporated with the hierarchy of
the many-valued context and/or the hierarchy of groupings. The user can navigate
recursively among the nested attributes by observing the interrelationship between the
attributes as well as the outer structure. The system supports a link to the nested
structure of a concept of the outer lattice.
The main difference between the approach of Cole et al. and the proposed approach is
that the systems of Cole et al. start with a predefined hierarchy which becomes the
basic structure for managing objects and the main navigation space. On the other hand, in
the proposed approach a lattice structure develops automatically as the system evolves.
This lattice structure is derived from the keywords annotated to documents, and the
lattice evolves into the basic structure for indexing documents and the main browsing space.
Another difference is that the systems of Cole et al. display a selected scale in a line
diagram and combine two scales using a nested line diagram in a conceptual space,
whereas the proposed system includes all scales relevant to a search result at run time,
in pop-up and pull-down menus, as a nested structure of the outer structure.
Finally, the systems of Cole et al. are implemented as stand-alone applications, whereas
our focus is on both single-user and multi-user applications in Web environments, with
Web users organising objects identified by their URLs. The browsing
system developed by Cole and Eklund (2001) is also Web-based, in that it gets material
from the Web and it can only browse the lattice with pre-defined scales. However, the
advantages of both methods can depend on the properties of the applied domains. The
approach of Cole et al. can be a good choice where an ontology can be imported or can
be easily constructed for the application. In addition, their graphical user interface (a
line diagram) can be useful for a reasonably small domain.
4.5. Chapter Summary
This chapter presented an overview of the basic theories of Formal Concept Analysis
and its application areas especially focusing on the field of information retrieval.
Algorithms for computing all concepts of a formal context and its concept lattice were
surveyed. A variety of algorithms exist in the literature. There is no “best” algorithm;
rather, the performance of an algorithm depends on the properties of the input data, such as the
size and density of contexts (Kuznetsov and Ob’edkov 2001). The main issues for
computing all the formal concepts of a context are related to how to generate all the
concepts of a context and how to avoid repetitive generation of the same concept.
Conceptual scaling was introduced in order to deal with many-valued attributes as well
as to reduce the complexity of visualisation in one-valued contexts. It was successfully
demonstrated in TOSCANA. The resulting conceptual hierarchy allows users to have a
structured overview of their queries, and to interpret the concepts based on the
interrelationship between the attributes.
The method of Formal Concept Analysis has been applied to a wide range of application
fields. Lattice-based information retrieval using FCA is one of those areas. Its
significant advantage for information retrieval is that the mathematical formulae of FCA
can produce a conceptual structure which provides all possible generalisation and
specialisation relations among the concept nodes so that it can be used as a browsing
scheme. Lattice-based models for information retrieval in the literature were reviewed
by addressing similarities and differences with the approach that we propose. The main
difference between the previous work and our work is an emphasis on incremental
development and evolution, and knowledge acquisition tools to support these for
domain-specific document retrieval systems. A further difference is that our focus is on
a Web-based system managing documents distributed across the Web. The details of the
proposed approach will be given in Chapter 5.
Chapter 5
A Formal Framework of
Document Management and Retrieval for Specialised Domains
This chapter presents a theoretical framework for a domain-specific document
management and retrieval system that we propose. This is based on Formal Concept
Analysis (FCA) and is aimed at a Web-based system for organisations in specialised
domains. This approach allows users themselves to freely annotate their documents and
to find appropriate annotations for new documents. Any relevant documents can be
managed by annotating them with whatever terms the users or authors prefer. This results
in the automatic generation of a browsing system, in contrast to the predetermined
taxonomical ontologies typically used for browsing in information retrieval.
The main focus is on incremental development and evolution, and we provide
knowledge acquisition tools to support this. The knowledge acquisition mechanisms
encourage reuse of terms used by others and of terms imported from other taxonomies. A
conceptual lattice-based browsing structure for retrieval is automatically and
incrementally created and maintained from the annotation of users. Document retrieval
is based on navigating this lattice structure. The browsing structure is scaled (conceptual
scaling) with the evolving ontological structure of the domain to allow more specific
results or to group relevant documents together. We have previously described the main
features of this system (Kim and Compton 2000; 2001a; 2001b; 2001c).
Section 5.1 defines the basic notions of the system such as formal contexts, formal
concepts and concept lattice. The definitions and formulas in this chapter closely adhere
to the basic work of Ganter and Wille (1999). Here, the words context and concept are
used to mean a formal context and a formal concept respectively, as in Chapter 4.
Section 5.2 introduces an incremental algorithm we have developed for building a
concept lattice. Section 5.3 presents how documents can be managed by users
themselves cooperating with the knowledge acquisition tools we propose. Section 5.4
describes document retrieval in the proposed approach using both browsing (of the
concept lattice) and a Boolean query interface. Section 5.5 presents conceptual scaling
both in a many-valued context and a one-valued context.
5.1. Basic Notions of the System
5.1.1. Formal Context
The most basic data structure of Formal Concept Analysis is a formal context. In the
original formulation of FCA, an object was implicitly assumed to have some sort of
unity or identity so that the attributes applied to the whole object (e.g., a dog has four
legs). Clearly documents do not have the sort of unity where attributes will necessarily
apply to the whole document. Any sort of keyword or attribute approach to document
management has the same problem. As well, it is envisaged that many documents will
increasingly be structured, with URLs addressing multiple sections. However, in order to use
FCA, it is assumed that documents correspond to objects and the keywords or terms
attached to documents by a user constitute attribute sets. A formal context is defined for
the system that we propose as follows:
Definition 1: A document-based formal context is a triple C = (D, K, I) where D is a set
of documents (objects), K is a set of keywords (attributes) and I ⊆ D × K is a binary
relation which indicates whether k is a keyword of a document d. If k is a keyword of d,
this is written dIk or (d, k) ∈ I.
A context is represented by a cross table with the document names in the rows and the
keyword names in the columns. Table 5.1 shows an example of the formal context of C
where the set of documents D is {1, 2, 3, 4, 5}54, the set of keywords K is {artificial
intelligence, knowledge acquisition, machine learning, behavioural cloning, knowledge
engineering, knowledge representation, belief revision, ontology} and the relation I is
{(1, artificial intelligence), (1, knowledge acquisition), (1, machine learning), (1,
behavioural cloning), (2, artificial intelligence), ..., (4, knowledge representation), (4,
belief revision), (5, artificial intelligence), (5, knowledge acquisition), (5, knowledge
representation), (5, ontology)}.

54 Here, numbers are used to indicate document names or URLs for reasons of convenience.

Table 5.1. A part of the formal context in the proposed system.

     Artificial    Knowledge    Machine   Behavioural  Knowledge    Knowledge       Belief
     Intelligence  Acquisition  Learning  Cloning      Engineering  Representation  Revision  Ontology
1         X             X           X          X
2         X             X                                   X
3         X                         X          X
4         X                                                               X             X
5         X             X                                                 X                       X

A symbol “X” designates that a particular document has the corresponding keywords.
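For illustration only, the context of Table 5.1 can be written directly as a mapping from documents to their keyword sets; the variable and function names below are ours, not part of the proposed system.

```python
# The document-based formal context of Table 5.1 as a mapping from
# document names to keyword sets; (d, k) ∈ I exactly when k ∈ context[d].
context = {
    1: {"artificial intelligence", "knowledge acquisition",
        "machine learning", "behavioural cloning"},
    2: {"artificial intelligence", "knowledge acquisition",
        "knowledge engineering"},
    3: {"artificial intelligence", "machine learning", "behavioural cloning"},
    4: {"artificial intelligence", "knowledge representation",
        "belief revision"},
    5: {"artificial intelligence", "knowledge acquisition",
        "knowledge representation", "ontology"},
}

def has_keyword(d, k):
    """dIk: whether k is a keyword of document d."""
    return k in context.get(d, set())
```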
5.1.2. Formal Concept
Formal concepts reflect the relationships between documents and keywords. A formal
concept for the proposed system is defined as follows:
Definition 2: A formal concept of the context C = (D, K, I) is defined as a pair (X, Y)
such that X ⊆ D, Y ⊆ K, X′ = Y and Y′ = X, where X ↦ X′ := {k ∈ K | ∀d ∈ X: (d, k) ∈ I}
and Y ↦ Y′ := {d ∈ D | ∀k ∈ Y: (d, k) ∈ I}. X is called the extent and Y is called
the intent of the concept (X, Y).
To construct a conceptual structure, it is necessary to find all formal concepts of the
context C. The following formula is used to construct all concepts of C:

X′ = ∩d∈X {d}′   (for X ⊆ D)

First, the extents of all intents {d}′ with d ∈ D are determined. Then the intents of all
determined extents are computed. The set of all formal concepts of C is designated by
𝔅(D, K, I); 𝔅(C) denotes the shortened form of 𝔅(D, K, I).
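As a sketch of this construction, the concepts of a small context can be enumerated by closing the attribute extents under intersection, which yields every extent. This is one standard FCA construction consistent with the definitions above, not necessarily the implementation used in this thesis; the context is assumed to be a mapping from documents to keyword sets.

```python
# Enumerate all formal concepts of a context {document: keyword-set}
# by closing the attribute extents {k}′ under intersection.

def intent_of(context, docs):
    """X′ for a non-empty document set X: the keywords common to X."""
    sets = [context[d] for d in docs]
    return set.intersection(*sets) if sets else set()

def all_concepts(context):
    top = frozenset(context)            # extent of the top concept
    extents = {top}
    keywords = {k for ks in context.values() for k in ks}
    for k in keywords:
        attr_extent = frozenset(d for d, ks in context.items() if k in ks)
        # Intersections of extents are extents, so this closure step
        # eventually generates every extent of the context.
        extents |= {attr_extent & e for e in extents}
    extents.discard(frozenset())        # the empty extent is handled lazily
    return {(e, frozenset(intent_of(context, e))) for e in extents}
```

On the context of Table 5.1 this yields the eight non-empty concepts shown in Figure 5.1.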
5.1.3. Concept Lattice
The concept lattice is the conceptual structure of FCA. To build a concept lattice it is
necessary to find the subconcept-superconcept relationships between all the formal
concepts in 𝔅(D, K, I). This is formalised by (X1, Y1) ≤ (X2, Y2) :⇔ X1 ⊆ X2 (⇔ Y2 ⊆ Y1),
where (X1, Y1) is called a subconcept of (X2, Y2) and (X2, Y2) is called a superconcept
of (X1, Y1). The relation ≤ is called the hierarchical order of the concepts. The set of all
formal concepts of the context, ordered by this subconcept-superconcept relationship, is
called the concept lattice of the context C, denoted by £(D, K, I).
The line diagram in Figure 5.1 shows the concept lattice of C in Table 5.1. Each node
represents a formal concept (X, Y) where X is the set of documents and Y is the set of
keywords. In the proposed application, this structure is reformulated incrementally and
automatically by the addition of a new document with a set of keywords or by refining
the existing keywords of the documents. A more detailed explanation will be given in
the following section.
Figure 5.1. A concept lattice of the formal context C in Table 5.1. The nodes of the
diagram are:

({1, 2, 3, 4, 5}, {Artificial intelligence})
({1, 2, 5}, {Artificial intelligence, Knowledge acquisition})
({1, 3}, {Artificial intelligence, Machine learning, Behavioural cloning})
({4, 5}, {Artificial intelligence, Knowledge representation})
({1}, {Artificial intelligence, Knowledge acquisition, Machine learning, Behavioural cloning})
({2}, {Artificial intelligence, Knowledge acquisition, Knowledge engineering})
({4}, {Artificial intelligence, Knowledge representation, Belief revision})
({5}, {Artificial intelligence, Knowledge acquisition, Knowledge representation, Ontology})
({}, {All keywords})
5.2. Incremental Construction of a Concept Lattice
Incremental methods are used to generate a concept lattice starting from a single
document with its keywords set. The concept lattice is updated whenever a new
document is added with a set of keywords or the keywords of existing documents are
refined. The incremental algorithms in the literature focus on adding a new object into
the lattice. However, in the proposed application, users can refine the set of keywords of
their documents at any time if they desire. As a consequence, we could not directly use
the algorithms in the literature, so we chose to develop further incremental algorithms
to construct a concept lattice for our specific situation. However, we are not aiming to
prove the correctness of the algorithms, nor making any claims of greater efficiency
than other algorithms; rather, the algorithms present a detailed description of the
implemented approach. For proofs of correctness of incremental algorithms, refer to
Godin et al. (1994). Note that the study of Godin et al. (1994) only addressed the cases
of adding concepts, not for refining concepts already in the concept lattice of FCA.
5.2.1. Basic Definitions of the Algorithms
Suppose an existing formal context C = (D, K, I) where D is a set of documents, K is a
set of keywords and I is a binary relation between D and K. Recall that a formal concept
of a context C is a pair (X, Y) where X is the extent and Y is the intent of the concept.
The set of all formal concepts and the concept lattice of (D, K, I) are denoted as 𝔅(C)
and £(C), respectively. Now, let ext(𝔅(C)) be the set of all extents and int(𝔅(C)) be the
set of all intents of 𝔅(C). A revised formal context of C is defined as follows for adding
a document and for refining the keyword set of an existing document.
Definition 3: Let C = (D, K, I) be a formal context, δ be a document and Γ be the set of
keywords of δ. For adding a new document δ (∉ D), the revised formal context of C is
defined as C+ = (D+, K+, I+) where D+ = D ∪ {δ}, K+ = K ∪ Γ and I+ = I ∪ {(δ, k) | k ∈ Γ}.
In the case of refining the keywords of an existing document δ (∈ D), the set of
documents remains unchanged (D+ = D). The set of keywords K+ = (K \ {k ∈ K | there
is no d ∈ D such that d ≠ δ and (d, k) ∈ I}) ∪ Γ, and I+ = (I \ {(δ, k) ∈ I | k ∈ K}) ∪
{(δ, k) | k ∈ Γ}.
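A minimal sketch of Definition 3, with the context held as a mapping from documents to keyword sets (the function names are illustrative): adding a document extends D and I, while refining replaces δ's incidences. Keywords no longer attached to any document simply stop appearing in the induced keyword set, which corresponds to dropping them from K+.

```python
# Sketch of Definition 3: the revised context C+ for adding a new
# document and for refining an existing one. A context is held as
# {document: keyword-set}; K and I are induced from this mapping.

def add_document(context, delta, gamma):
    """C+ for a new document delta with keyword set gamma."""
    revised = {d: set(ks) for d, ks in context.items()}
    revised[delta] = set(gamma)     # D+ = D ∪ {δ}; I+ gains (δ, k) for k ∈ Γ
    return revised

def refine_document(context, delta, gamma):
    """C+ when the keywords of an existing document delta are refined."""
    revised = {d: set(ks) for d, ks in context.items()}
    revised[delta] = set(gamma)     # replace δ's incidences; D+ = D
    return revised

def keyword_set(context):
    """K as induced by the incidence relation."""
    return set().union(*context.values()) if context else set()
```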
5.2.2. Description of the Algorithms
Algorithm 1 describes the incremental algorithm for adding a new document δ with a
set of keywords Γ. Firstly, all possible new concepts arising from the new case (the
document and its keywords) are computed using the procedure computeNewConcepts.
When computeNewConcepts is completed, all formal concepts of the revised context C+
have been computed, resulting in 𝔅(C+).
Secondly, the procedure reconstructLattice (Algorithm 2) is performed to reformulate
the subconcept and superconcept relationships for all formal concepts whose intent
includes at least one element of Γ of the new document δ. This results in a new lattice
£(D+, K+, I+) of C+.
_____________________________________________________________________________________
Input:   C+ = (D+, K+, I+) - the revised context of (D, K, I)
         £(C) - the concept lattice of (D, K, I)
         Γ - the set of keywords of the new document δ
Output:  £(C+) - the concept lattice of the revised context C+ = (D+, K+, I+)

Procedure addDocument(C+, £(C), Γ)
1   Begin
2     𝔅(C) ← the set of all concepts contained in £(C);
3     𝔅(C+) ← computeNewConcepts(C+, 𝔅(C), Γ);
4     £(C+) ← reconstructLattice(£(C), 𝔅(C+), Γ);
5     Return £(C+);
6   End
_____________________________________________________________________________________

Algorithm 1. The algorithm for adding a new document.
_____________________________________________________________________________________
Input:   £ - a concept lattice
         ℑ - a set of concepts
         Γ - a set of keywords
Output:  £+ - the revised concept lattice of £

Procedure reconstructLattice(£, ℑ, Γ)
1   Begin
2     For each formal concept (X, Y) ∈ ℑ do
3       If Y ∩ Γ ≠ φ then
4         £+ ← reconstruct the superconcepts and subconcepts of the concept (X, Y) in £;
5       End if
6     End for
7     Return £+;
8   End
_____________________________________________________________________________________

Algorithm 2. Reconstruction of the concept lattice.
In the procedure computeNewConcepts (Algorithm 3), the process starts by formulating
a pair (X, Y) where Y is Γ and X is the document set associated with Γ in the revised
context C+. The procedure addOneConcept is then performed to determine whether
(X, Y) is a new concept of C+. Secondly, the following process is applied for each
element γ of Γ.
A pair (X1, Y1) is constructed where X1 is the set of documents which is associated with
the element γ and Y1 is the set of keywords associated with X1. Then, the procedure
addOneConcept is performed to determine whether the concept (X1, Y1) can be a new
concept of the revised context C+. Next, the intersection of X1 with the extent sets of ℑ+
is obtained by the definition intersect(X1) = {X1 ∩ E | E ∈ ext(ℑ+)} \ ext(ℑ+). Then, for
each element X2 ∈ intersect(X1) satisfying the condition: X2 ∉ ext(ℑ+), a concept (X2,
Y2) is formulated where Y2 is the set of keywords associated with X2, and the procedure
addOneconcept is performed for the concept (X2, Y2).
To incorporate a given pair (X, Y) into the concepts of C+, the procedure
addOneConcept (Algorithm 4) first determines whether Y is an intent of the given set
of concepts ℑ. If Y is a member of int(ℑ), then the extent X′ is obtained such that (X′, Y)
∈ ℑ. If the cardinality of X′ is less than the cardinality of X, the existing concept (X′, Y)
_____________________________________________________________________________________
Input:   C+ = (D+, K+, I+) - the revised context of C = (D, K, I)
         ℑ - the set of concepts of C
         Γ - the set of keywords of the new document δ
Output:  ℑ+ - the set of concepts of C+

Procedure computeNewConcepts(C+, ℑ, Γ)
1   Begin
2     Y ← Γ; X ← {d ∈ D+ | dI+k for all k ∈ Y};
3     ℑ+ ← addOneConcept(ℑ, (X, Y));
4     For each γ ∈ Γ do
5       X1 ← {d ∈ D+ | dI+γ}; Y1 ← {k ∈ K+ | dI+k for all d ∈ X1};
6       ℑ+ ← addOneConcept(ℑ+, (X1, Y1));
7       intersect(X1) = {X1 ∩ E | E ∈ ext(ℑ+)} \ ext(ℑ+);
8       For each X2 ∈ intersect(X1) do
9         Y2 ← {k ∈ K+ | dI+k for all d ∈ X2};
10        ℑ+ ← addOneConcept(ℑ+, (X2, Y2));
11      End for
12    End for
13    Return ℑ+;
14  End
_____________________________________________________________________________________

Algorithm 3. Construction of concepts connected with the new document.
_____________________________________________________________________________________
Input:   ℑ - a set of concepts
         (X, Y) - a concept where X is the extent and Y is the intent
Output:  ℑ+ - a revised set of concepts of ℑ

Procedure addOneConcept(ℑ, (X, Y))
1   Begin
2     ℑ+ ← ℑ;
3     If X ≠ {} then55
4       If Y ∈ int(ℑ) then
5         Let X′ ∈ ext(ℑ) such that (X′, Y) ∈ ℑ;
6         If |X′| < |X| then
7           ℑ+ ← ℑ \ {(X′, Y)} ∪ {(X, Y)};
8         End if
9       Else
10        ℑ+ ← ℑ ∪ {(X, Y)};
11      End if
12    End if
13    Return ℑ+;
14  End
_____________________________________________________________________________________

Algorithm 4. Insertion of a new concept into the set of all concepts.
is eliminated from ℑ and the given concept (X, Y) is added into ℑ. Otherwise, the new
concept (X, Y) is not incorporated into ℑ. When Y is not a member of int(ℑ), the concept
(X, Y) is simply appended as a member of ℑ.
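In Python, addOneConcept can be sketched by holding the concept set as a hash table from intents to extents, in line with the hash-table implementation mentioned in the complexity discussion at the end of this section; the function name and representation are ours, not the thesis's code.

```python
# Sketch of addOneConcept (Algorithm 4). The concept set ℑ is a dict
# mapping each intent (frozenset of keywords) to its extent (frozenset
# of documents), so the int(ℑ) membership test is an O(1) lookup.

def add_one_concept(concepts, extent, intent):
    if not extent:                      # empty extents handled lazily
        return concepts
    known = concepts.get(intent)
    if known is None:
        concepts[intent] = extent       # new intent: append the concept
    elif len(known) < len(extent):
        concepts[intent] = extent       # replace (X′, Y) with the larger (X, Y)
    return concepts
```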
Algorithm 5 describes the procedure for refining the keyword set of an existing
document. The main difference between Algorithm 5 (refining an existing document)
and Algorithm 1 (adding a new document) comes from the deleted keyword set of the
refined document δ (see difference between Algorithm 3 and 6). Algorithm 6
(refineConcepts) is started by computing the keyword set deleted from δ (Line 2: Γ2 ←
Γ1 \ Γ). If the set is empty, the rest of Algorithm 6 is essentially the same as Algorithm
3. If the deleted set is not empty, then Algorithm 7 is performed first to refine the
concepts which include the deleted keywords. The concepts are eliminated or revised
according to the cardinality of their extent.
55 In the Basic Theorem of FCA the concept ({}, {all attributes}) is added as an element of the given
concept set. However, in our implementation, the node of the concept ({}, {all attributes}) is added when
the lattice is constructed only if it is necessary. Thus, we do not consider the case where its extent is empty.
_____________________________________________________________________________________
Input:   C+ = (D+, K+, I+) - the revised context of (D, K, I)
         £(C) - the concept lattice of (D, K, I)
         δ - a refined document
         Γ - the new set of keywords of δ
         Γ1 - the old set of keywords of δ
Output:  £(C+) - the lattice of the revised context C+ = (D+, K+, I+)

Procedure refineDocument(C+, £(C), δ, Γ, Γ1)
1   Begin
2     𝔅(C) ← the set of all concepts contained in £(C);
3     𝔅(C+) ← refineConcepts(C+, 𝔅(C), δ, Γ, Γ1);
4     £(C+) ← reconstructLattice(£(C), 𝔅(C+), Γ ∪ Γ1);
5     Return £(C+);
6   End
_____________________________________________________________________________________

Algorithm 5. The algorithm for refining an existing document.
_____________________________________________________________________________________
Input:   C+ = (D+, K+, I+) - a revised context of C = (D, K, I)
         ℑ - a set of concepts
         δ - the refined document
         Γ - the new set of keywords of δ
         Γ1 - the old set of keywords of δ
Output:  ℑ+ - a refined set of concepts of ℑ

Procedure refineConcepts(C+, ℑ, δ, Γ, Γ1)
1   Begin
2     ℑ+ ← ℑ;
3     Γ2 ← Γ1 \ Γ;
4     If Γ2 ≠ φ then
5       ℑ+ ← refineConceptsForDeletedTerms(ℑ+, δ, Γ2);
6     End if
7     Y ← Γ; X ← {d ∈ D+ | dI+k for all k ∈ Y};
8     ℑ+ ← addOneConcept(ℑ+, (X, Y));
9     Γ3 ← Γ \ Γ1;
10    For each γ ∈ Γ3 do
11      X1 ← {d ∈ D+ | dI+γ}; Y1 ← {k ∈ K+ | dI+k for all d ∈ X1};
12      ℑ+ ← addOneConcept(ℑ+, (X1, Y1));
13      intersect(X1) = {X1 ∩ E | E ∈ ext(ℑ+)} \ ext(ℑ+);
14      For each X2 ∈ intersect(X1) do
15        Y2 ← {k ∈ K+ | dI+k for all d ∈ X2};
16        ℑ+ ← addOneConcept(ℑ+, (X2, Y2));
17      End for
18    End for
19    Return ℑ+;
20  End
_____________________________________________________________________________________

Algorithm 6. Refinement of concepts connected with the refined document.
_____________________________________________________________________________________
Input:  ℑ  - A set of concepts
        δ  - The refined document
        Γ2 - A set of keywords
Output: ℑ+ - A refined set of concepts of ℑ

Procedure refineConceptsForDeletedTerms(ℑ, δ, Γ2)
1  Begin
2    ℑ+ ← ℑ;
3    For each concept (X, Y) ∈ ℑ
4      If Y ∩ Γ2 ≠ φ & δ ∈ X then
5        If |X| = 1 then
6          Y′ ← Y \ Γ2;
7          ℑ+ ← ℑ+ \ {(X, Y)} ∪ {(X, Y′)};
8        Else if |X| > 1 then
9          X′ ← X \ {δ};
10         If X′ ∉ ext(ℑ) then
11           ℑ+ ← ℑ+ \ {(X, Y)} ∪ {(X′, Y)};
12         Else
13           ℑ+ ← ℑ+ \ {(X, Y)};
14         End if
15       End if
16     End if
17   End for
18   Return ℑ+;
19 End
_____________________________________________________________________________________
Algorithm 7. Refinement of the concepts which include the deleted keywords.
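As a concrete illustration, the deletion step of Algorithm 7 can be sketched in Python as follows. The function and variable names, and the set-of-pairs representation of the concept set, are our own illustrative choices rather than the thesis implementation.

```python
def refine_concepts_for_deleted_terms(concepts, doc, deleted):
    """Remove or revise concepts whose intent mentions a deleted keyword.

    `concepts` is a set of (extent, intent) pairs of frozensets.
    """
    extents = {extent for extent, _ in concepts}  # ext(concepts)
    refined = set(concepts)
    for extent, intent in concepts:
        if intent & deleted and doc in extent:
            if len(extent) == 1:
                # Only the refined document supports the concept: keep it,
                # but drop the deleted keywords from its intent.
                refined.discard((extent, intent))
                refined.add((extent, intent - deleted))
            else:
                reduced = extent - {doc}
                refined.discard((extent, intent))
                if reduced not in extents:
                    # Shrink the extent; the intent still holds for the
                    # remaining documents.
                    refined.add((reduced, intent))
                # Otherwise the reduced extent already exists, so the
                # concept is simply eliminated.
    return refined

concepts = {
    (frozenset({"d1"}), frozenset({"logic", "ai"})),
    (frozenset({"d1", "d2"}), frozenset({"ai"})),
    (frozenset({"d2"}), frozenset({"ai", "agents"})),
}
out = refine_concepts_for_deleted_terms(concepts, "d1", frozenset({"logic"}))
```

Here the keyword "logic" is deleted from document d1, so the concept ({d1}, {logic, ai}) is revised to ({d1}, {ai}) while the other concepts are untouched.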
We briefly present the worst-case time complexity of the algorithms. In the worst case,
addOneConcept() is O(|ℑ|), where |ℑ| denotes the number of formal concepts (i.e., the
size of the concept lattice). However, as int(ℑ) is implemented by a hash table mapping
keys from int(ℑ) to values in ext(ℑ), on average this function is likely to be more
efficient, i.e., O(1). There is also a tradeoff between time and space costs. The
worst-case complexity of computeNewConcepts() is O(|ℑ|²|Γ|), where |Γ| is the number
of keywords of the new document. Therefore, the worst-case time complexity of
adding a new document - addDocument() - is O(|ℑ|³|Γ|). In practice, the complexity of
addDocument() is estimated at approximately O(|ℑ|²|Γ|), as the average complexity of
addOneConcept() is O(1). For refining an existing document, the worst-case time
complexity of refineDocument() is O(|ℑ|⁴|Γ|), because there is a further For loop to
refine the concepts which include the deleted keywords of the refined document. On
average, however, this can be reduced to O(|ℑ|³|Γ|) through the use of a hash table in
our implementation. Note that the complexity of the incremental algorithms given in
Chapter 4 is for adding a new object.
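The average-case O(1) behaviour of addOneConcept() comes from the hash-table representation. A minimal Python sketch, with illustrative names rather than the thesis implementation, might look as follows:

```python
# A dict keyed by intents (frozensets of keywords) with extents as values,
# so the "does this concept already exist?" check is an average O(1) hash
# lookup instead of a linear scan over all concepts.

def add_one_concept(concepts, extent, intent):
    """Insert the concept (extent, intent) unless it is already present."""
    if intent not in concepts:          # average O(1) hash lookup
        concepts[intent] = extent
    return concepts

lattice = {}
add_one_concept(lattice, frozenset({"d4", "d5"}), frozenset({"kr", "ai"}))
add_one_concept(lattice, frozenset({"d5"}), frozenset({"kr", "ai", "ontology"}))
# Re-adding an existing concept leaves the table unchanged.
add_one_concept(lattice, frozenset({"d4", "d5"}), frozenset({"kr", "ai"}))
```

The space cost of the hash table is the time/space tradeoff mentioned above.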
5.3. Document Management
In this proposed approach, users themselves evolve their own organisation of documents
based on the free annotation of their documents with a set of keywords. They can also
refine the keywords of existing documents at will. When a user assigns keywords to a
document, some keywords may be missed; these may be prompted by stored documents
or domain knowledge. Thus, it is appropriate to have certain knowledge acquisition
mechanisms to improve the search performance of the system as it evolves.
Figure 5.2 shows the overview of the annotation process of keywords for a document.
Figure 5.2. The annotation process of keywords for a document.
[Flowchart: the user makes an initial assignment of keywords by selecting terms from the list provided and/or entering any terms; the system displays all keywords used by others and terms from imported taxonomies (alphabetically, or as used by various annotators), together with terms suggested from taxonomies and terms that co-occur in the lattice with the initially assigned keywords; the user can view relevant documents, hierarchical views of the taxonomies and the concept relationships (lattice structure), and may assign further terms; if another document sits at the same node, the user differentiates the current document from the previous document by selecting and/or adding terms, until no more documents remain at the node; the system gets a location of a node in the lattice for the document and reformulates the concept lattice to include it; the user may re-assign keywords until satisfied; finally, documents that may be relevant to the newly added keywords are extracted from the lattice and passed to a knowledge engineer.]
A number of knowledge acquisition techniques are used to suggest possible annotations
during this annotation process. In most expert system applications, knowledge
acquisition for conceptual modelling has been a fairly centralised process. In contrast,
our system, rather than an expert, guides the users (or annotators) to capture missed
concepts by suggesting possible keywords. The annotation process is divided into a
number of phases, each of which is presented in detail in the following sections.
5.3.1. Phase One: Reusing Terms in the System
The system is expected to be collaboratively developed and maintained over time in
Web-based distributed environments by authors (users). Hence, it is crucial to guide
them to use controlled vocabularies, if they exist, to help ensure consistency of the
system. This is accomplished by reusing and sharing terms already in the system.
When a user annotates their document with a set of keywords, the user is provided with
an alphabetical list of the keywords already available in the system that have been
added by others. The system also provides a list of keywords grouped by annotator.
The user can select keywords from these lists or can enter further terms, which in turn
will be available to future users. The system also displays domain terms from imported
taxonomies in a list combined with the terms used by others.
5.3.2. Phase Two: Using Imported Terms from Taxonomies
Information retrieval often incorporates the use of thesauri or taxonomies as background
knowledge to extend a user’s query. A number of researchers (Carpineto and Romano
1996a; Cole and Eklund 1996a; Priss 2000a) have utilised a domain thesaurus for their
lattice retrieval processes and presented experimental evidence that adding a thesaurus
to a concept lattice improves its retrieval performance.
However, there is little point in considering inheritance as part of a reasoning
mechanism with taxonomies, because commonly available taxonomies, thesauri and
classifications are too general. When such a taxonomy is applied to a particular domain,
the inheritance between the terms in the hierarchy of the taxonomy is often not
transitive.
To observe the situation, we start with the subsumption hierarchy of a taxonomy, which
consists of a partially ordered set (T, ≤) and a context (D, T, J). The context assigns the
related taxonomy terms (∈ T) to the documents in the set D, where T is the set of terms
in the taxonomy. When the taxonomy is involved in an information retrieval process,
the following compatibility condition56 is assumed for the subsumption hierarchy:

∀d ∈ D, k, l ∈ T: (d, k) ∈ J, k ≤ l ⟹ (d, l) ∈ J
The problem is that this compatibility condition does not always hold when the
taxonomy is used for a specific application domain. For example, suppose there is a
document d ∈ D and two terms (logic < knowledge representation) ∈ T from Figure 5.3
(a), and assume d is associated with the term logic, i.e., (d, logic) ∈ J. Then d may or
may not be relevant to the term knowledge representation, i.e., (d, knowledge
representation) ∈ J or (d, knowledge representation) ∉ J. Logic is a typical method for
knowledge representation and reasoning, so d may be relevant only to some reasoning
mechanisms using logic, but not to knowledge representation techniques in general, or
vice versa.
As another example, assume there is a document d ∈ D and two terms (data mining <
artificial intelligence) from Figure 5.3 (c), and suppose d is related to the term data
mining, i.e., (d, data mining) ∈ J. The term artificial intelligence may or may not be
coupled to the document d; rather, d might be related to data mining with database
techniques to discover relationships between data items or association rules from
databases. Therefore, reasoning by transitivity does not always improve retrieval
performance and can even give poor results (Nikolai 1999).
56 Stumme (1999) and Nikolai (1999) describe this problem in their papers. We explain the problem
following the compatibility notion of Stumme. Note that this situation will be different in a well-defined
ontological taxonomy which is built to hold the semantics of an "is-a" relation (Guarino and Welty 2000).
Figure 5.3. Examples of hierarchies extracted from taxonomies.
(a) ACM Computing Classification System57.
(b) ASIS&T thesaurus for Information Science58. (c) Open Directory Project59.
However, thesauri and taxonomies have significant implications for understanding,
sharing, reuse and integration. Their use as background knowledge for a reasoning
mechanism is hence common.
The only browsing mechanism we propose is FCA, so there is little point in considering
a separate reasoning mechanism. Thus, we also import a number of taxonomies as
others do, but we try to solve the compatibility problem described above. Consequently,
we import only the domain terms from the relevant thesauri or taxonomies into the
evolving domain; the inheritance between the terms of an ordered set is a question for
knowledge acquisition and remains in the conceptual structure of FCA. In other words,
in our approach the value is in suggesting all parents of a term in a list, rather than in
the hierarchy itself.
57 ACM (Association for Computing Machinery) -
http://www.acm.org/class/1998/ccs98.html (2002). 58 ASIS&T (American Society for Information Science and Technology) -
http://www.asis.org/Publications/Thesaurus/isframe.htm (2002). 59 http://dmoz.org (2002).
When a new document is added and the user enters a term that occurs in the taxonomy,
all parents of the term up the hierarchy (i.e., predecessors) are also displayed. Any of
these terms can be selected by the user and added to the document as keywords. This
means that the various superclasses are considered as Boolean attributes. The user can
judge whether the suggested terms are applicable to their document, and is free to select
none, one or some of these terms without considering any hierarchies of the terms.
Then, FCA constructs a concept lattice which holds inheritance between terms in the
lattice hierarchies in relation to documents.
There are good reasons for having taxonomies, but the initial motivation for developing
mechanisms for combining both inheritance and heuristic reasoning was probably to
avoid a user having to enter the entire taxonomy. For example, if the user had already
entered “dog” it would be inappropriate to ask them to enter “mammal” and “animal” as
well to allow the appropriate rules to fire. Here however we are not asking the end-user
browsing the system to enter such terms, but the user (annotator) who is entering a
document, and the hierarchy is used to suggest which terms may be relevant.
Figure 5.3 shows partial examples of the taxonomies we import. The ACM computing
classification system and ASIS&T thesaurus for Information Science have been
imported from commonly available Web sites. A hybrid called UNSW has been
developed based on the Open Directory Project, the KA2 community Web60, and the
research areas at the School of Computer Science and Engineering61, UNSW.
The reason for importing from a number of different ontologies is that none of the
taxonomies or thesauri have the same hierarchy for the same area as seen in Figure 5.3.
Clearly, it is essential to develop a mechanism which can make the existing taxonomies
suitable for an application domain, with more emphasis on the significance of context.
60 The research topics of the KA2 portal; http://ka2portal.aifb.uni-karlsruhe.de/ (2002). 61 http://www.cse.unsw.edu.au/school/research/research2.html and
http://www.cse.unsw.edu.au/school/research/currresearch.html (2002).
Algorithm 8 describes the process of importing terms from taxonomies. For instance,
suppose that the term "ontologies" is included in the newly added document; the system
then extracts all parents of the term up the hierarchy (i.e., predecessors) of a taxonomy
(here "artificial intelligence" and "knowledge representation" from the hierarchy of
Figure 5.3(c)). Next, the term "knowledge engineering" is triggered from the hierarchy
of Figure 5.3(b) as the parent of the term "knowledge representation". This process is
repeated until no further terms are provided by the hierarchies. Then, the union of the
parent terms is suggested to the user in a list. The user is free to select any terms from
the list to be added to the document.
________________________________________________________________________________________________________
Input:  Γ - The set of keywords of a newly added document
Output: U - The union set of imported terms

Procedure importTerms(Γ)
1  Begin
2    U ← { }
3    For each taxonomy do
4      Ui ← { }
5      For each element γ of Γ do
6        Find all parents of the term up the taxonomy hierarchy
7        U′ ← Union set of the parent terms
8        Ui ← Ui ∪ U′
9        For each term t of U′ do
10         Repeat until the parent terms are empty
11           Find all parents of the term up the taxonomy hierarchy
12           U″ ← Union set of the parent terms
13           Ui ← Ui ∪ U″
14         End repeat
15       End for
16     End for
17     U ← U ∪ Ui \ Γ
18   End for
19   Return U
20 End
________________________________________________________________________________________________________
Algorithm 8. The algorithm for importing terms from taxonomies.
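Following the worked example above, Algorithm 8 can be sketched in Python. The sketch assumes each taxonomy is given as a dict from a term to its direct parents and, as in the worked example, lets a parent found in one taxonomy trigger further parents in another; all names and the toy hierarchies (mirroring Figure 5.3) are illustrative.

```python
def import_terms(keywords, taxonomies):
    """Collect all ancestors of the given keywords across all taxonomies."""
    ancestors = set()
    frontier = set(keywords)
    while frontier:                          # repeat until no new parents appear
        parents = set()
        for taxonomy in taxonomies:
            for term in frontier:
                parents |= set(taxonomy.get(term, ()))
        frontier = parents - ancestors       # avoid revisiting terms
        ancestors |= parents
    return ancestors - set(keywords)         # suggested terms, Γ excluded

odp = {"ontologies": ["knowledge representation"],
       "knowledge representation": ["artificial intelligence"]}
asist = {"knowledge representation": ["knowledge engineering"]}
terms = import_terms({"ontologies"}, [odp, asist])
```

Starting from "ontologies", the walk yields "knowledge representation", then "artificial intelligence" and "knowledge engineering", matching the example in the text.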
5.3.3. Phase Three: Using Co-occurring Terms in the Lattice
FCA is also used for knowledge acquisition (Wille 1992; Erdman 1998; Stumme 1998)
and knowledge discovery in databases (Stumme et al. 1998; Hereth et al. 2000; Wille
2001) to discover concepts and rules related to objects and their attributes. This
approach is based on a strong idea of context, with its use of parent-child relations
between concepts in a graphically represented concept lattice. In common with most
knowledge acquisition techniques, its power is in the way it presents relationships
across the whole domain, and most FCA work attempts to display the whole lattice, as
noted in Chapter 4. The general principle is still to give the expert a view of the whole
domain so that all relevant concepts will be included.
We have argued that experts more easily provide concepts that distinguish between
cases (Compton and Jansen 1990). The expert’s attention is focused on relevant cases
when the system misapplies a concept to a case. The expert is then asked to distinguish
between this case and a case the system retrieves where the concept was appropriate.
This is a more strongly situated view of knowledge acquisition with more emphasis on
the significance of context. Motivated by considering specific objects and cases, we
have implemented FCA in a similar way to both RDR and Repertory Grids. When a
document is added keywords that co-occur with the keywords the user has assigned, are
suggested by taking into account specific documents and keywords in the lattice.
However, the user can navigate the whole lattice while adding a new document, as a
view of the overall relationships between objects and attributes is also important.
The following describes how all co-occurring keywords are retrieved and then ranked
for possible relevance before being presented to the annotator. To obtain keywords that
are related to the added case in the lattice, the following definitions are used:
Definition 4: Let C = (D, Κ, Ι) be a formal context and Γ be a set of keywords (Γ ⊆ Κ).
Then the set of documents associated with Γ is defined to be ∆Γ = {d ∈ D | ∃k ∈ Γ such
that (d, k) ∈ Ι} .
∆Γ is introduced to get a set of documents, which have at least one keyword of Γ. If Γ is
a singleton (i.e., Γ= { γ} ), then we will abbreviate ∆γ = ∆Γ = {d ∈ D | (d, γ) ∈ Ι} .
Definition 5: Let C = (D, K, I) be a formal context. A function ƒ from D to 2K is defined
as ƒ: D � 2K such that ƒ(d) = {k ∈ K | (d, k) ∈ Ι} .
That is, ƒ(d) returns the set of keywords of d. Let the new document be δ (∉ D) with
the set of keywords Γ. We formulate the sub-formal context C′ = (D′, K′, I′) with D′ =
∆Γ ∪ {δ}, where ∆Γ is as given in Definition 4, and K′ = ∪{ƒ(d) | d ∈ D′}, where ƒ is
the function in Definition 5. In order to get keywords already associated with δ, we first
obtain the set of keywords associated with ∆Γ as ƒ(∆Γ) = ∪{ƒ(d) | d ∈ ∆Γ} from the
context C′.
Now the set of co-occurring keywords is defined as ℜ = ƒ(∆Γ) - Γ. Then, the function W
introduced below is used for each keyword k of ℜ to compute the number of common
keywords of Γ with the keywords of all the documents that have the keyword k in C′.

Definition 6: A function W from 2K × ℜ to the set of natural numbers N is defined as
follows: W: 2K × ℜ → N such that W(Γ, k) = Σ_{d ∈ ∆k} |ƒ(d) ∩ Γ|, where |X| is the
cardinality of X.
Let us have a look at this process in more detail. A user can annotate his/her document
with a set of keywords by entering any terms or selecting given terms. The system
displays all the keywords used by other annotators to facilitate sharing and reuse. After
this initial assignment, the user can view the other terms that co-occur with the terms
s/he has provided and can annotate the document with these further terms if desired.
The terms are presented to the user ordered by their frequency and normalised for the
number of terms at the node, and their “closeness” to the node to which the document is
assigned by the user’s initial choice of terms in the conceptual hierarchy (i.e., weight).
More precisely, an ordered set of documents and a set of keywords which might be
relevant to the new document are obtained. A sub-lattice £′ (D′, K′, I′) of the formal
context C′ described above is then constructed. This step is divided into two stages. In
the first stage, the relevant documents are obtained, ordered by their similarity with the
new document. Given a new document δ, we are interested in finding the set of
documents that share some commonalties. We formulate a formal concept ζ = ({ δ} ,
ƒ(δ)) with the newly added document δ and its keywords set Γ. Starting from the
concept ζ we recursively go up to the direct superconcepts in the lattice to find the next
level of relevant documents. This procedure is continued until the superconcept reaches
the top node of the lattice.
Figure 5.4. A lattice £(D′, K′, I′) of the formal context C′ from Figure 5.1.
For instance, suppose that there is a concept lattice as shown in Figure 5.1, and a new
document δ (6) is added together with its set of keywords Γ = {knowledge representation,
ontology, knowledge management}. Then, we formulate the sub-context C′ = (D′, K′,
I′) where D′ = ∆Γ ∪ {δ} = {4, 5, 6}, K′ = ∪{ƒ(d) | d ∈ D′} = {artificial intelligence,
knowledge acquisition, knowledge representation, belief revision, ontology, knowledge
management} and I′ is a binary relation between D′ and K′. The sub-lattice £(D′, K′, I′)
of the context C′ can be constructed as shown in Figure 5.4. The grey coloured box
indicates the formal concept ζ. From the lattice we can get document "5", as it exists in
the direct superconcept of ζ in the lattice, and as such is the most relevant to document
"6". Next, document "4" is obtained. Finally, an ordered set of documents {5, 4}
relevant to document "6" in the lattice is obtained. The ordered documents are then
viewed by the user along with the relevant features.
At the second stage, we first elicit the terms that co-occur in the lattice with the terms
the user has provided (ℜ). Secondly, a weight for each co-occurring term is calculated
by Definition 6 (W). Next, the terms are ordered by their calculated weights and the
ordered list is presented to the user with the weights. For example, let the new document
δ be "6" and the set of keywords Γ of δ be {knowledge representation, ontology,
knowledge management}. Then, we get the set of documents associated with Γ, ∆Γ = {4,
5}, by Definition 4 from the sub-context C′ = (D′, K′, I′) as shown in Figure 5.4. Next,
[Figure 5.4 contains the following concept nodes, from the top of the lattice down:
({4, 5, 6}, {Knowledge representation})
({4, 5}, {Knowledge representation, Artificial intelligence})
({5, 6}, {Knowledge representation, Ontology})
({4}, {Knowledge representation, Artificial intelligence, Belief revision})
({5}, {Knowledge representation, Artificial intelligence, Ontology, Knowledge acquisition})
({6}, {Knowledge representation, Ontology, Knowledge management})
({ }, {All keywords})]
the set of keywords which are associated with ∆Γ is obtained: ƒ(∆Γ) = {artificial
intelligence, knowledge acquisition, knowledge representation, belief revision,
ontology} by Definition 5. Following that, we obtain the set of co-occurring terms ℜ =
ƒ(∆Γ) - Γ = {artificial intelligence, knowledge acquisition, belief revision}; the set of
terms in ℜ is a candidate for expanding the keywords already associated with δ. Then,
for each element of ℜ, a weight is calculated by Definition 6 as follows: W(Γ,
artificial intelligence) = 3, W(Γ, knowledge acquisition) = 2 and W(Γ, belief revision) = 1.
Now, an ordered list is presented to the user with the weights. The user can select any
keywords that are relevant. Through this process, the user can capture relevant
keywords while adding a new document. Of course, none of the mechanisms here are
seen by the user, who sees only a ranked list of keywords. The user can also view the
sub-lattice and the relevant documents for each of the co-occurring terms during this
process.
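The weighting of this example can be sketched in Python as follows; the dict-based context and the function names are illustrative assumptions, not the thesis implementation.

```python
f = {  # f(d): keywords of each document in the sub-context C' (Figure 5.4)
    4: {"knowledge representation", "artificial intelligence", "belief revision"},
    5: {"knowledge representation", "artificial intelligence", "ontology",
        "knowledge acquisition"},
}
gamma = {"knowledge representation", "ontology", "knowledge management"}

# R = f(Delta_Gamma) - Gamma: candidate co-occurring terms.
candidates = set().union(*f.values()) - gamma

# W(Gamma, k): over all documents carrying k, count keywords shared with Gamma.
def weight(k):
    return sum(len(kw & gamma) for kw in f.values() if k in kw)

ranking = sorted(candidates, key=weight, reverse=True)
weights = {k: weight(k) for k in candidates}
```

Running this reproduces the weights of the example: artificial intelligence scores 3, knowledge acquisition 2 and belief revision 1, so artificial intelligence heads the ranked list.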
5.3.4. Phase Four: Identifying Related Documents
In the Ripple-Down Rules (RDR) method, an expert is only required to identify features
that differentiate between a new case being added and the other stored cases already
correctly handled. This is the main technique of knowledge acquisition in RDR, and it
is similar to the use of differences in Personal Construct Psychology (Gaines and Shaw
1990). A rule is only added to the system when a case has been given a wrong
conclusion. Any cases that have prompted knowledge acquisition are stored along with
the knowledge base. RDR does not allow the expert to add any rules which would result
in any of these stored cases being given different conclusions from those stored unless
the expert explicitly agrees to this. We import this RDR technique to help a user find
appropriate keywords for documents.
When the assignment of keywords for a document is complete, the document can be
located at more than one node in the lattice. One node in particular is unique and has the
largest intent among the nodes where the document is located. If there is another
document already at that node, the user adding the new document is presented with the
previous document and asked to include keywords that distinguish the documents. The
user can choose to leave the two documents together with the same keywords.
Ultimately, however, every document is unique and offers different resources from
other documents, and should probably be annotated to indicate the differences. The
approach used is derived from RDR, but the location of the document is determined by
FCA rather than by the history of development as in RDR.
In the RDR approach, when a new rule is added, all stored cases that can reach the
parent rule (cornerstone cases) are retrieved. Then the user is required to construct a rule
which distinguishes between the new case and the cornerstone cases until it excludes all
cornerstone cases. In this document retrieval system, a case which has the same set of
keywords as the new document becomes the equivalent of a cornerstone case. If a
cornerstone case exists, the system displays all the keywords used by other annotators.
The user should select at least one different feature (keyword) from the deployed
keywords or specify a new term to distinguish the cornerstone case(s) from the new
case. This process is continued until the user is satisfied. A key difference from RDR is
that RDR rules allow negations so that a child rule may be more general than its parent.
This allows the historical dependencies of the rules to be maintained. Since the FCA
lattice is constantly regenerated, it is reasonable for a distinguishing term to be added to
a cornerstone case rather than the new case if preferred. However, this may also be
referred to the owner of the original document or to a system manager.
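The cornerstone-case check described above can be sketched as follows; the names and the dict-based context are illustrative assumptions rather than the thesis implementation.

```python
# A stored document with exactly the same keyword set as the new document
# plays the role of an RDR cornerstone case; annotation is only accepted
# once at least one keyword distinguishes the new document from it.

def cornerstone_cases(context, new_keywords):
    """Stored documents whose keyword set equals the new document's."""
    return [d for d, kw in context.items() if kw == new_keywords]

def is_distinguished(context, new_keywords):
    """True once no stored document shares the exact keyword set."""
    return not cornerstone_cases(context, new_keywords)

context = {"doc1": {"fca", "retrieval"}, "doc2": {"fca", "lattice"}}

clash = cornerstone_cases(context, {"fca", "retrieval"})          # doc1 clashes
ok = is_distinguished(context, {"fca", "retrieval", "browsing"})  # now distinct
```

As in the text, the loop would repeat, displaying the clashing document(s), until `is_distinguished` holds or the user chooses to leave the documents together.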
5.3.5. Phase Five: Adding New Terms
Another mechanism of the annotation support tools to facilitate knowledge acquisition
is triggered when a new term is entered for a new document; this term may also apply to
other documents located at the parent nodes of the new node in the lattice. This problem
could be left until the system fails to provide an appropriate document for a later search
as in the RDR approach. However, in this proposed approach the system extracts those
relevant documents at the direct parent nodes in the lattice and passes them to a
knowledge engineer, who is able to examine whether the suggested documents should
have the new keyword. The following definitions are used in formulating the relevant
documents and their associated new terms:
Definition 7: Let £ = < V, ≤ > be a lattice. Given a node θ ∈ V, the set of direct parents
of θ denoted DP£ (θ) is defined as follows: DP£ (θ) = { α ∈ V | θ < α and there does not
exist any β ∈ V such that θ<β & β<α} .
Definition 8: Let £(C) be a concept lattice of the formal context C = (D, K, I) and δ be
the new document. For each document d ∈ D, the set of relevant keywords for d with
respect to δ, denoted Relδ(d), is defined as follows:

Relδ(d) = ƒ(δ) \ ∪{ Y | (X, Y) ∈ DP£(C)(({δ}, ƒ(δ))) and d ∈ X }
For instance, suppose that a new document 6 (δ) with the set of keywords Γ = {knowledge
representation, ontology, knowledge management} is added into the lattice shown in
Figure 5.1; the lattice structure will then be reformulated to cope with the new case.
Figure 5.4 can be seen as part of the reconstructed lattice, with a new node ζ = ({δ}, ƒ(δ))
coloured grey. Now, for the documents located in the direct parent node of ζ (here
document 5), we extract the relevant keywords with respect to δ by Definition 8: Relδ
(d) = Rel6(5) = {knowledge management}. The system then passes this case to a
knowledge engineer to determine whether or not document 5 should have the keyword
"knowledge management". Although it is possible to apply the term to higher parent
nodes, for convenience only direct parents are considered.
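Definition 8 and this example can be sketched in Python as follows, with the lattice given as a set of (extent, intent) pairs of frozensets; all names are illustrative, not the thesis implementation.

```python
def direct_parents(concepts, node):
    """Direct superconcepts of node: larger extents with none strictly between."""
    ext = node[0]
    above = [c for c in concepts if ext < c[0]]          # all superconcepts
    return [c for c in above
            if not any(c[0] > o[0] > ext for o in above)]  # none in between

def relevant_keywords(concepts, zeta):
    """Rel_delta(d) for each document d in a direct parent node of zeta."""
    delta_ext, f_delta = zeta
    rel = {}
    for extent, intent in direct_parents(concepts, zeta):
        for d in extent - delta_ext:
            # f(delta) minus the union of all covering intents containing d
            rel[d] = rel.get(d, f_delta) - intent
    return rel

concepts = {  # abbreviated lattice of Figure 5.4
    (frozenset({4, 5, 6}), frozenset({"kr"})),
    (frozenset({4, 5}), frozenset({"kr", "ai"})),
    (frozenset({5, 6}), frozenset({"kr", "ontology"})),
    (frozenset({4}), frozenset({"kr", "ai", "br"})),
    (frozenset({5}), frozenset({"kr", "ai", "ontology", "ka"})),
    (frozenset({6}), frozenset({"kr", "ontology", "km"})),
}
zeta = (frozenset({6}), frozenset({"kr", "ontology", "km"}))
rel = relevant_keywords(concepts, zeta)
```

For the lattice of Figure 5.4 this yields Rel6(5) = {"km"} (knowledge management), the candidate keyword passed to the knowledge engineer.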
5.3.6. Phase Six: Logging Users' Queries
Another mechanism is activated when the system cannot find a node in the lattice
matching a user query. In this case, the system sends a log file to a knowledge engineer,
who can decide, through an interface supported by the system, whether more
appropriate keywords are required for documents. If the knowledge engineer makes
such a decision, the system automatically sends an e-mail to the author(s) (the annotator
of the document) with a hyperlink which facilitates the refinement of the keywords of
the document. All interactions between the system and users are also logged to find
factors which may influence the search performance of the system.
5.4. Document Retrieval
Lattice-based retrieval is based on navigating the lattice structure of Formal Concept
Analysis. In this approach, the lattice is used as a basic data structure either for indexing
documents or for browsing. A node of the lattice consists of a concept with a pair (X, Y)
where X is the extent (a set of documents) and Y is the intent (a set of keywords) of the
concept. The intents of each concept are used for the indexing terms of the browsing
structure. Document retrieval in our approach followed this lattice-based model.
The central advantage of lattice browsing is that one can navigate down to a node by
one path, and if a relevant document is not found one can go back up another path rather
than simply starting again. When one navigates down a hierarchy, one tries to pick the
best child at each step. If the right document is not found, it is difficult to know what to
do next, because one has already made the best guesses possible at each decision point.
However, with a lattice, the ability to go back up via another pathway to the node opens
up new decisions which one has not previously considered. The conventional
hierarchical structure can also be embedded in this lattice structure.
Another strong feature of FCA for browsing is that the concept lattice holds the
inheritance hierarchical relationship among the evolved attributes (keywords) in the
lattice structure. The lattice also implies all minimal refinements and minimal
enlargements for a query at an edge in the lattice (Godin et al. 1995). This means that
following an edge upward (downward) corresponds to a minimal enlargement
(refinement) of the query at that edge. In other words, the intent (keywords) of each
node can be considered as a conjunctive query, and the extent (documents) of the node
is the search result for that query. Traversing edges upward from the query delivers all
minimal enlargements of the query in the lattice.
For example, let a user’s query be a∧c∧g∧h and the corresponding node (34, acgh) in
Figure 5.5. The conjunctive intents (a∧g∧h, a∧c) of the direct parents of the node
(acgh) are all minimal enlargements of the query (a∧c∧g∧h), and the conjunctive
intents (a∧c∧g∧h∧i, a∧b∧c∧g∧h) of the direct children nodes of the node are all
minimal refinements of the query. Therefore, the lattice structure can be used as a
refinement tool for users’ Boolean queries in the evolved domain.
More importantly, the lattice with a set of documents and their keyword sets is scaled
with an ontological structure for the attributes of the evolved domain (i.e., conceptual
scaling). This allows a user not only to get more specific results, but also to search
relevant documents by the interrelationship between the document keywords and the
domain attributes. A more detailed explanation of the use of conceptual scaling will be
presented in Section 5.5.
The user can also view the lattice using one of the imported taxonomies available in this
case - the ACM, ASIS&T, Open Directory Project and UNSW taxonomy introduced in
the previous section. The system recreates the lattice assuming that any document with a
term from the imported taxonomy also has all the parent terms for that term. One can
browse this lattice or alternatively one can navigate the lattice without any involvement
of a taxonomy at any stage.
Figure 5.5. An example of a lattice structure.
Numbers denote documents and letters indicate keywords. A node represents a
concept with a pair (X, Y) where X is the extent and Y is the intent of the concept.
[Figure 5.5 contains the following concept nodes, from the top of the lattice down:
(12345678, a)
(1234, ag), (12356, ab), (34678, ac), (5678, ad)
(234, agh), (123, abg), (36, abc), (678, acd), (56, abdf)
(34, acgh), (7, acde), (6, abcdf)
(3, abcgh), (4, acghi)
({}, abcdeghi)]
5.4.1. Browsing the Lattice Structure
A user can interact with the system starting from the root of the lattice exploring the
relationships of the concepts from vertex to vertex of the lattice without any particular
query being provided. We simplify the lattice display by showing only direct neighbour
nodes using hyperlinks. The children and parents are hypertext links and a user
navigates these links by clicking on a parent or child node. We can see how navigation
is carried out, with a simple lattice structure as shown in Figure 5.5.
Suppose that the user's query is "a". The system will display the set of documents of the concept "a"62 in a result space and will show the concepts "ag", "ac", "ab" and "ad" as more specialised nodes of "a" in a browsing space. Only direct neighbours of the node are displayed. If the concept "ac" is chosen, the system will show its parent concept "a" and its child concepts "acgh", "abc" and "acd". Next, if the concept "abc" is selected, the parent concepts "ac" and "ab", and the child concepts "abcgh" and "abcdf" will be displayed. The user can again navigate up or down at this stage, or can move to the root of the lattice.
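The neighbour computation behind this navigation can be sketched as follows. This is an illustrative Python sketch rather than the thesis's Java implementation; concepts are identified by their intents from Figure 5.5, and direct children (parents) are the minimal supersets (maximal subsets) among those intents.

```python
# Illustrative sketch: concepts identified by their intents (Figure 5.5).
INTENTS = [set(s) for s in [
    "a", "ag", "ac", "ab", "ad",
    "agh", "acgh", "acd", "abc", "abg",
    "abdf", "acde", "abcgh", "acghi", "abcdf",
    "abcdeghi",
]]

def children(intent):
    """Direct lower neighbours: minimal proper supersets of the intent."""
    supers = [i for i in INTENTS if intent < i]
    return [i for i in supers if not any(intent < j < i for j in supers)]

def parents(intent):
    """Direct upper neighbours: maximal proper subsets of the intent."""
    subs = [i for i in INTENTS if i < intent]
    return [i for i in subs if not any(i < j < intent for j in subs)]
```

With this list, `children(set("a"))` yields the intents ag, ac, ab and ad, and `children(set("ac"))` yields acgh, abc and acd, matching the navigation steps above.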
5.4.2. Entering a Boolean Query
The user can formulate a query by entering any text words in a conventional Boolean
query interface or selecting terms from a list given by the system, and can navigate the
lattice structure starting with a node covering the user’s query. A set of words can be
separated by commas (,) assuming the AND Boolean operator. The query is normalised.
In other words, firstly all stopwords63 are eliminated from the query. Secondly, the
terms in the query are stemmed using the stemming classes64. Following this, the system
62 Precisely speaking, "a" is the intent of the concept (12345678, a), but here we simply refer to a concept by its intent alone. Note that the intents of concepts are used as indexing terms for a lattice.
63 A knowledge engineer built a stopword list referring to a number of stopword lists available on the Web.
64 The stemming classes were downloaded from http://ciir.cs.umass.edu/whatsnew/stemming.html (2000). The entire 5.5-gigabyte TREC 1-5 collection was used to create the stemming classes by merging the Porter and K-Stem stemming algorithms, which gave the overall best result in the TREC-6 experiments. The purpose of the stemming is mainly to deal with the plural problem rather than sophisticated morphological processing. We also added terms related to our test domains.
identifies a relevant node in the lattice with the normalised query and directly moves to
the relevant portion. Note that when a lattice is formulated we also normalise the terms
in each intent of the concepts in the lattice to match them with the normalised query.
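The normalisation step can be sketched as below. The actual stopword list and the merged Porter/K-Stem stemming classes are not reproduced here; an illustrative stopword subset and a crude plural-stripping stemmer stand in for them (a Python sketch, not the deployed code).

```python
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "for", "to"}  # illustrative subset

def stem(term):
    # Crude plural handling only; the thesis uses merged Porter/K-Stem classes.
    if term.endswith("ies"):
        return term[:-3] + "y"
    if term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term

def normalise(query):
    """Normalise a comma-separated query: lowercase, drop stopwords, stem."""
    terms = [t.strip().lower() for t in query.split(",")]
    terms = [t for t in terms if t and t not in STOPWORDS]
    return [stem(t) for t in terms]
```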
To find a relevant node for the user's query in the lattice, recall that a formal concept is a pair (X, Y) where X is the extent (a set of documents) and Y is the intent (a set of keywords) of the concept. The set of all formal concepts in the concept lattice of a context C is denoted 𝔅(C), and the set of all intents of 𝔅(C) is denoted int(𝔅(C)). Now let a user's query be Q. If a concept c ∈ 𝔅(C) satisfies the following conditions, then c is the relevant portion for the query Q in the lattice of C:
(i) Q ⊆ int(c)
(ii) For each set of keywords (intent) α ∈ int(𝔅(C)), if Q ⊆ α, then int(c) ⊆ α
For instance, if we take the lattice shown in Figure 5.5 and a user's query "a∧c", then the node (34678, ac) will be the starting point of the navigation with the query. The system will display the set of documents (3, 4, 6, 7, 8) of the concept "ac" in a result space, and its parent concept "a" and its child concepts "acgh", "abc" and "acd" in a navigation space. With a query "a∧b∧d", the node (56, abdf) will be the starting point of the navigation.
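Since the set of concepts whose intents contain Q always has a least intent in a complete concept lattice (the intent of the concept (Q′, Q″)), conditions (i) and (ii) amount to choosing the candidate with the smallest intent. A Python sketch over the Figure 5.5 concepts (illustrative only):

```python
# Concepts of the Figure 5.5 lattice as (extent, intent) pairs.
CONCEPTS = [(set(ext), set(intent)) for ext, intent in [
    ("12345678", "a"), ("1234", "ag"), ("34678", "ac"), ("12356", "ab"),
    ("5678", "ad"), ("234", "agh"), ("34", "acgh"), ("678", "acd"),
    ("36", "abc"), ("123", "abg"), ("56", "abdf"), ("7", "acde"),
    ("3", "abcgh"), ("4", "acghi"), ("6", "abcdf"), ("", "abcdeghi"),
]]

def relevant_node(query):
    """Smallest-intent concept whose intent contains the query (conditions i-ii)."""
    candidates = [c for c in CONCEPTS if query <= c[1]]
    if not candidates:
        return None  # fall back to full-text search and a sub-lattice
    return min(candidates, key=lambda c: len(c[1]))
```

For the query a∧c this returns the concept (34678, ac), and for a∧b∧d it returns (56, abdf), as described above.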
A relevant portion may not exist in the lattice with a given query Q. In this case,
documents are retrieved which contain the query anywhere in the contents of the
documents (text words search), and the system formulates a sub-lattice using the result
documents and their keywords. Navigation can be done on this sub-lattice.
To provide more flexible retrieval options, we also display the keywords which subsume the user's query if such keywords exist. For instance, suppose that a user's query is "compiler". The system will display the search results for the query, and also display keywords that subsume the query, such as "compiler construction", "compiler techniques", "dynamic compilers" and "incremental compilers" (⊇ compiler). These are shown as a list and are hypertext links to the appropriate part of the lattice. The
standard retrieval mechanism on the concept lattice can be considered as phrase searching65 combined with the AND Boolean operator. For example, take the node ({1, 2, 5}, {artificial intelligence, knowledge acquisition}) from Figure 5.1. The documents 1, 2 and 5 can be regarded as the search results for the query consisting of the two phrases "artificial intelligence" and "knowledge acquisition" in conjunction (i.e., "artificial intelligence" AND "knowledge acquisition"). Thus, displaying the subsuming keywords of a user's query is useful when the query is part of a phrased keyword.
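The subsuming keywords can be found with a simple containment test over the keyword vocabulary; a minimal sketch (the keyword list is hypothetical):

```python
def subsuming_keywords(query, keywords):
    """Keywords of which the query is a part (e.g. one word of a phrase)."""
    q = query.lower()
    return [k for k in keywords if q in k.lower() and k.lower() != q]

# Hypothetical keyword vocabulary for the "compiler" example.
KEYWORDS = ["compiler construction", "compiler techniques",
            "dynamic compilers", "incremental compilers", "databases"]
```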
5.5. Conceptual Scaling
Conceptual scaling has been introduced in order to deal with many-valued attributes
(Ganter and Wille 1989; 1999). According to the basic theory of conceptual scales of
FCA, each attribute, or a combination of more than one attribute of the many-valued
context, can be transformed into a one-valued context. The derived one-valued context
is called a conceptual scale. Then, if one is interested in analysing the interrelationship
between attributes, s/he can choose and combine the conceptual scales which contain
the required attributes. This process is called conceptual scaling.
More fundamentally, conceptual scaling deals with many-valued Boolean attributes which involve multiple inheritance within a one-valued context of FCA. The essence of conceptual scaling is to impose a single inheritance hierarchy on this; equivalently, some of the Boolean attributes are reorganised as mutually exclusive values of some unnamed attributes. Either way, there is recognition that a group of Boolean attributes is mutually exclusive. In conceptual scaling, one selects one of the mutually exclusive attributes from a set, and a sub-lattice containing these values is shown. Several attribute selections can be made at the same time to give the sub-lattice. Existing attributes can be used as the parent of a group of mutually exclusive attributes, or new names for the grouping can be created.
65 Phrase searching is a feature which allows a user to find documents containing certain phrases. When
phrase searching is used, only documents which contain the phrase are retrieved.
There are two ways in which we use conceptual scales in the proposed system. Firstly, a
user or a system manager can group a set of keywords used for the annotation of
documents. The groupings are then used for conceptual scaling. Secondly, other
ontological information can also be used where readily available (e.g., person, academic
position, research group and so on). These correspond to the type of more structured
ontological information used in the system such as KA2 (http://ka2portal.aifb.uni-
karlsruhe.de/). The key point of the proposed approach is flexible, evolving ontological information, but there is no problem with using more fixed information where available. We have included such information in conceptual scaling for interest and completeness.
These conceptual scales allow a user to get more specific results and to reduce the
complexity of the visualisation of the browsing structure as well as to search relevant
documents by the interrelationship between the domain attributes and the keywords of
documents.
An intended purpose of conceptual scaling is to support a hybrid browsing approach by connecting an outer structure built from the keyword sets of documents (taxonomies) and an inner nested structure built from ontological attributes (ontological structure). The ideal would be to support both approaches simultaneously, because the organisation of background knowledge, not only through the vocabularies in taxonomies but also through ontological structures in the form of properties, would be useful for navigating information.
It should be noted that, in referring to the term “ontology” here, we are neither dealing
with a formal ontology which uses relations, constraints, and axioms, nor providing
automated reasoning based on implied inter-ontology relationships. Rather our aim is a
browsing mechanism suitable for specialised domains.
Conceptual scaling will be explained using examples for the domain of research
interests. In the domain of research interests, D is the set of home pages and K is the set
of research topics for a context (D, K, I). However, the words documents and keywords are also used to denote home pages (or simply pages) and research topics (or simply topics), respectively.
5.5.1. Conceptual Scaling for a Many-valued Context
A many-valued context is defined as a formal context C = (D, M, W, I) where D is a set of documents, M is a set of attributes, and W is a set of attribute values. I is a ternary relation between D, M and W which indicates that a document d has the attribute value w for the attribute m. We formulate a concept lattice with a set of documents and their
keywords as shown in Figure 5.1. This lattice structure is the main browsing space, but
is also an outer structure. Other attributes in a many-valued context are then scaled into
a nested structure of the outer structure at retrieval time.
Table 5.2 is an example of a many-valued context in the domain of research interests.
The attributes in the many-valued context can be represented in a partially ordered
hierarchy as shown in Figure 5.6. The attribute “position” in Table 5.2 is located as a
subset of the attribute “person” in the hierarchy. To explain this in a more formal way,
the following definition is provided.
Definition 9: Let Sp be a super-attribute and Sc be a sub-attribute. There is a binary
relation ℜ called the “has-value” relation on Sp and Sc such that (p, c) ∈ ℜ where p ∈ Sp
and c ∈ Sc if and only if c is a sub-attribute value of p.
For example, the has-value relation ℜ on the attributes “person” and “position” is: ℜ =
{ (academic staff, professor), (academic staff, associate professor), …, (research staff,
research assistant), …, (research student, Ph.D. student), (research student, ME
student)} from Figure 5.6. This hierarchy of the many-valued context with the relation
ℜ is scaled into a nested structure using pop-up and pull-down menus.
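The has-value relation ℜ of Definition 9 might be encoded directly as a set of pairs; the following Python sketch (an assumed representation, with values from Figure 5.6) shows how the pull-down menu under a super-attribute value would be derived:

```python
# Assumed encoding of the has-value relation R (Definition 9), Figure 5.6 values.
HAS_VALUE = {
    ("academic staff", "professor"), ("academic staff", "associate professor"),
    ("academic staff", "senior lecturer"), ("academic staff", "lecturer"),
    ("research staff", "research assistant"), ("research staff", "research associate"),
    ("research student", "ph.d. student"), ("research student", "me student"),
}

def sub_values(p):
    """Values shown in the pull-down menu under super-attribute value p."""
    return sorted(c for (q, c) in HAS_VALUE if q == p)
```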
Table 5.2. An example of the many-valued context for the domain of research interests.
             | Research group          | Sub-group of AI       | Person            | Position
Researcher1  | Artificial intelligence | Knowledge Acquisition | Academic staff    | Professor
Researcher2  | Computer systems        | .                     | Research staff    | Research associate
Researcher3  | Networks                | .                     | Academic staff    | Associate professor
Researcher4  | Databases               | .                     | Academic staff    | Senior lecturer
Researcher5  | Software engineering    | .                     | Research students | Ph.D. student
Researchers can be the objects of the context as they are the instances of the home pages.
Figure 5.6. Partially ordered multi-valued attributes for the domain of research interests.
Figure 5.7 shows examples of inner browsing structures corresponding to concepts of
the outer lattice. A nested structure is constructed dynamically from the extent (home
pages) of a corresponding concept of the outer lattice incorporating the ontological
hierarchy. When a user assigns a set of topics for their page, the page is also
automatically annotated with the values of the attributes in the many-valued context. A
default home page for individual researchers is provided at the School Web site as well
as every researcher has a login account at the School. We make to use this login account
when a user annotates their home page. This provides the default home page address of
the user. The page is an HTML file in a standard format including the basic information
of the researcher such as their first name, last name, e-mail address, position and others.
The system parses the HTML file and extracts the values for the pre-defined attributes.
From the attributes and their extracted values, we formulate a nested structure for a
concept of the lattice at retrieval time.
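The per-value page counts shown in the nested structure can be derived from the extent of the selected concept; a sketch with hypothetical parsed attribute records:

```python
from collections import Counter

# Hypothetical records, one per home page in the extent of the selected concept.
EXTENT_PAGES = [
    {"person": "academic staff", "position": "professor"},
    {"person": "academic staff", "position": "lecturer"},
    {"person": "research student", "position": "ph.d. student"},
]

def nested_counts(pages, attribute):
    """Count pages per value of an attribute, as shown in the nested menus."""
    return Counter(p[attribute] for p in pages)
```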
Figure 5.6 hierarchy:
School
    Biomedical Engineering
    Computer Science and Engineering
    …
Research Groups
    Artificial Intelligence
        Machine Learning
        Knowledge Systems
        Knowledge Acquisition
        Robotics
    Bioinformatics
    Computer Systems
    Databases
    Networks
    Software Engineering
Person
    Academic Staff
        Professor
        Associate Professor
        Senior Lecturer
        Lecturer
        Associate Lecturer
        …
    Research Staff
        Research Assistant
        Research Associate
        Research Fellow
        …
    Research Student
        Ph.D. Student
        ME Student
        …
Note that attributes which do not exist in the default home page can also be used for conceptual scaling. The user will need to be prompted to annotate the values of those attributes when they assign a set of keywords to their document. Here we recognise that there will be significant issues relating to the annotation bottleneck, as in the ontological approaches presented in Chapter 2. However, in the proposed approach, the user is not required to understand notions of ontology such as relations, constraints and axioms, as in an ontological approach. The user will be given a simple interface for selecting values by clicking or for filling in a series of text boxes.
Figure 5.7. Examples of nested structures corresponding to concepts.
This shows nested structures corresponding to the concepts "artificial intelligence" and "artificial intelligence, machine learning" of the outer structure, which is constructed from a set of home pages and their topics. Numbers in the lattice and in brackets indicate the number of pages corresponding to the concept of the lattice and to the attribute value, respectively. Here, the nested structure is presented as a hierarchy deploying all embedded inner structures, but the structure is implemented using pop-up and pull-down menus as shown in Figure 5.8.

Nested structure for "artificial intelligence, machine learning" (7 pages):
School
    Research Groups (7): Artificial Intelligence (6), Databases (1)
    Person (7):
        Academic Staff (4): Professor (1), Associate Professor (1), Lecturer (1), Associate Lecturer (1)
        Research Staff (2): Visiting Fellow (2)
        Research Student (1): Ph.D. Student (1)

Nested structure for "artificial intelligence" (37 pages):
School
    Research Groups (37): Artificial Intelligence (34), Databases (1), Software Engineering (2)
    Person (37):
        Academic Staff (15): Professor (4), Associate Professor (1), Senior Lecturer (4), Lecturer (1), Associate Lecturer (5)
        Research Staff (6): Research Associate (2), Research Fellow (2), Visiting Fellow (2)
        Research Student (18): Ph.D. Student (18)

[Outer lattice sketch: nodes labelled "Artificial Intelligence" (37), "Machine Learning" (16) and "Artificial Intelligence, Machine Learning" (7), with further nodes "Database Applications", "Image Processing" and "Learning" and their page counts.]
A user can navigate recursively among the nested attributes observing the
interrelationship between the attributes and the outer structure. By selecting one of the
nested items, the user can moderate the cardinality of the display. Again, the structure
with the most obvious attributes can be partly equivalent to the ontological structure of
the domain and consequently is considered as an ontological browser which is
integrated into the lattice structure with the keywords set.
Figure 5.8 shows an example of pop-up and pull-down menus for the nested structure of the concept "artificial intelligence" in Figure 5.7. Menu ① appears when a user clicks on the concept "artificial intelligence". Each item of menu ① is equivalent to a scale in the many-valued context. Suppose that the user selects the attribute Person in menu ①; the system will then display a sub-menu of the attribute as shown in menu ②. Note that the menu items of ② are the values of the attribute Person in the many-valued context in Table 5.2. If we assume that the menu item "academic staff" is selected, then menu ③ will appear. The menu items of ③ are the values of the attribute Position that are in the binary relation ℜ with "academic staff" by Definition 9. The search results change according to the selection of a menu item.
Figure 5.8. An example of pop-up and pull-down menus for the nested structure of a concept.
[Menu ①: Research Group, Person; menu ② under Person: Academic Staff (15), Research Staff (6), Research Student (18); menu ③ under Academic Staff: Professor (4), Associate Professor (1), Senior Lecturer (4), Lecturer (1), Associate Lecturer (5); under Research Group: Artificial Intelligence (34) with Cognitive Science (5), Machine Learning (12), Knowledge Acquisition (12), Knowledge Systems (8), Robotics (5); Databases (1); Software Engineering (2).]
5.5.2. Conceptual Scaling for a One-valued Context
Conceptual scaling is also applied to group relevant values in the keyword sets used for
the annotation of documents. The groupings are determined as required, and their scales
are derived on the fly when a user’s query is associated with the groupings. This means
that the relevant group name(s) is included into the nested structure dynamically at run
time. Table 5.3 shows examples of groupings for scales in the one-valued context for
the attribute keyword.
Table 5.3. Examples of groupings for scales in the one-valued context.
Grouping names (generic terms) | Members of the grouping
RDR                   | FRDR, MCRDR, NRDR, SCRDR
Sisyphus              | Sisyphus-I, Sisyphus-II, Sisyphus-III, Sisyphus-IV, Sisyphus-V
Knowledge acquisition | Knowledge acquisition methodologies, Knowledge acquisition tools, Incremental knowledge acquisition, Automatic knowledge acquisition, Web-based knowledge acquisition, …
Computer programming  | Concurrent programming, Functional programming, Logic programming, Object-oriented programming, …
Programming languages | Concurrent languages, Knowledge representation languages, Logic languages, Object-oriented languages, …
Databases             | Deductive databases, Distributed databases, Mobile databases, Multimedia databases, Object-oriented databases, Relational databases, Spatial databases, Semistructured databases
Natural language      | Natural language processing, Natural language understanding
Web                   | Web applications, Web searching, Web services, Web operating systems, …
XML                   | XML applications, XML tools, …
…                     | …
Applied to a one-valued context, the following definition is provided:
Definition 10: Let a formal context C = (D, K, I) be given. A set G ⊆ K is a set of
grouping names (generic terms) of C if and only if for each keyword k ∈ K, either k ∈ G
or there exists some generic term κ ∈ G such that k is a sub-term of κ. We define S = K \ G and a relation gen ⊆ G × S such that (g, s) ∈ gen if and only if s is a sub-term of g.
Then, when a user's query is qry ∈ G, a sub-formal context C′ = (D′, K′, I′) of (D, K, I) is formulated where K′ = {k ∈ K | k = qry or (qry, k) ∈ gen}, D′ = {d ∈ D | ∃k ∈ K′ and dIk} and I′ = {(d, k) ∈ D′ × K′ | (d, k) ∈ I} ∪ {(d, qry) | d ∈ D′ and qry ∈ K′ ∩ G}. For instance, suppose that there are groupings as shown in Table 5.3 and a user's query "databases". The query databases ∈ G, so a sub-context C′ is constructed to include a scale of the grouping name databases, and a lattice of C′ is built. The user can then navigate this lattice of C′.
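Definition 10's sub-context construction can be sketched as follows; the gen relation and the document annotations are illustrative stand-ins based on Table 5.3:

```python
# Illustrative gen relation (grouping name -> sub-terms) and annotations.
GEN = {
    "databases": ["deductive databases", "mobile databases", "multimedia databases",
                  "semistructured databases", "spatial databases"],
}
DOCS = {  # document -> keyword set (hypothetical)
    "d1": {"deductive databases"},
    "d2": {"multimedia databases", "spatial databases"},
    "d3": {"machine learning"},
}

def sub_context(qry):
    """Sub-formal context C' = (D', K', I') for a grouping-name query (Def. 10)."""
    K = {qry} | set(GEN.get(qry, []))
    D = {d for d, ks in DOCS.items() if ks & K}
    I = {(d, k) for d in D for k in DOCS[d] & K} | {(d, qry) for d in D}
    return D, K, I
```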
Figure 5.9 shows an example of a scale with the grouping name "databases". The grouping name is embedded as an item of the nested structure along with the other scales from the many-valued context in the previous section. There are 10 documents with the concept "Databases" in the lattice, and the node (Databases, 10) embeds the scales as shown in menu ①. The scale "Databases" was derived from the groupings in the one-valued context, while the other scales (items) were derived from the many-valued context (i.e., domain attributes). A user can read that there is one document related to "deductive databases", two documents with "multimedia databases", and so on. By selecting an item of sub-menu ②, the user can restrict the retrieved documents to those associated with the selected sub-term.
Figure 5.9. A conceptual scale for the grouping name "databases".
[Lattice excerpt: "Databases" (10) with neighbouring concepts "Database Applications" (6), "Electronic Commerce", "Knowledge Discovery", "Data Mining" and "Data mining, Database applications" (4); menu ①: School, Research Group, Person, Databases; sub-menu ② under Databases: Deductive databases (1), Mobile databases (1), Multimedia databases (2), Semistructured databases (2), Spatial databases (1).]
The reason for formulating a sub-formal context C′ is that the lattice used for the outer
structure (with a set of documents and their keywords) does not include a node which
subsumes all documents related to the set of sub-terms of a grouping name, because the
documents associated with the sub-terms of a grouping may or may not be related to the
grouping name (i.e., generic term). Thus, we formulate the context C′ to have a relation
between the grouping name and the documents which are associated with at least one of
the sub-terms of the grouping. A lattice is then derived from the context C′.
As a consequence, a node which contains all documents associated with the members of the evolved grouping name is contained in the sub-lattice. In this approach, we may lose the advantage of lattice-based browsing that allows a user to navigate the whole lattice, freely exploring the domain knowledge, because the space of navigation is limited to the sub-lattice. The system therefore supports a link to start navigation in the whole lattice at any stage. The following method can also be used as an alternative.
A knowledge engineer/user can set up or change the groupings using a supported tool
(i.e., ontology editor) whenever it is required. When a grouping name with a set of sub-
terms is added, the system gets the set of documents that are associated with at least one
of the sub-terms of the grouping name. Then, the context C is refined to have a binary
relation between the grouping term and the documents related to the sub-terms of the
grouping term. Next, the lattice of C is reformulated when any change in C is made. If a
grouping name is changed, it is replaced with the changed one in the context C and its
lattice.
In the case of removal of a grouping in the hierarchy, no change is made in the context
C. With this mechanism, the outer lattice can always embed a node which can assemble
all documents associated with the sub-terms of a grouping. That is, the groupings play
the role of intermediate nodes in the lattice to scale the relevant values. Groupings can
be formed with more than one level of hierarchy. This means that a sub-term of a
grouping can be a grouping of other sub-terms.
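This alternative, refining the context C itself when a grouping is added, can be sketched as below (the document-keyword mapping is hypothetical):

```python
def add_grouping(docs, name, sub_terms):
    """Refine context C: every document annotated with one of the sub-terms
    also gets the grouping name, so the outer lattice gains a node for it."""
    subs = set(sub_terms)
    for keywords in docs.values():
        if keywords & subs:
            keywords.add(name)
    return docs
```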
5.6. Chapter Summary
We presented an incremental domain-specific document management and retrieval
system based on lattice-based browsing of Formal Concept Analysis and outlined the
functionality of the system that we proposed. We focused on a Web document
management system for small communities in specialised domains based on free
annotation of documents by users. Another main focus was an emphasis on incremental
development and evolution of the system. A number of knowledge acquisition
techniques were developed to suggest possible annotations, including suggesting terms
from external ontologies. The lattice used for browsing was incrementally constructed as users annotated their documents, and served as the basic structure for retrieval.
Document retrieval for end-users is based on browsing this lattice structure. Users can
interact with the system starting from the root of the lattice and exploring the
relationships of the concepts from vertex to vertex of the lattice without any particular
query being provided. The lattice display was simplified by showing only direct
neighbour lattice nodes using hyperlinks for a Web-based system. The user can also
formulate a query by entering any text words in a conventional Boolean query interface
or selecting terms from a list supported by the system, and can navigate the lattice
structure starting with a node covering the user’s query.
More importantly, the lattice was combined with a hierarchical ontological structure to
allow a nested structure at retrieval time dynamically, referred to as conceptual scaling.
In essence the conceptual scales give a view of a lattice formed from objects that have
specified attribute value pairs. Conceptual scaling was also used in a one-valued context
(i.e., the attribute keyword) to group relevant values in the keywords set. The groupings
are determined as required, and their scales are derived on the fly when a user’s query is
associated with the groupings.
The user can also view the lattice using one of the imported taxonomies available. This
recreated the lattice assuming that any object with an attribute from the imported
taxonomy also has all the parent terms for that term.
To demonstrate the value of the proposed approach, we conducted experiments in the
domain of research topics in the School of Computer Science and Engineering (CSE),
University of New South Wales (UNSW). We also set up a system that allows users to
annotate papers from the on-line Banff Knowledge Acquisition Proceedings. The
systems are presented in the next chapter.
Chapter 6
Implementation
Prototypes have been implemented on the World Wide Web to demonstrate and
evaluate the proposed approach. The first system is intended to assist in finding research
topics and researchers in the School of Computer Science and Engineering (CSE),
University of New South Wales (UNSW) 66. The goal was a system to assist prospective
students and potential collaborators in finding research relevant to their interests. There
are around 150 research staff and students in the School who generally have home pages
indicating their research topics. The system allows staff and students to freely annotate
their home pages so that they can be found within an evolving lattice of research topics.
The second implementation is a system67 that gives access to the on-line Banff
Knowledge Acquisition Proceedings with around 200 publications in recent years68.
The system will be described mainly with reference to the domain of research interests.
In the domain of research interests, a document corresponds to a home page and a set of
keywords is a set of research topics.
Section 6.1 gives an overview of the system we propose. Section 6.2 outlines the basic
environment of the system. The implementation with the domain of research interests is
described in Section 6.3. We present how documents associated with the annotation
mechanisms can be managed by users themselves, and how the annotated documents
can be searched using both browsing and Boolean queries. The system for the domain
of the Banff Knowledge Acquisition Proceedings is presented briefly in Section 6.3.2.
66 URLs of the system: http://www.cse.unsw.edu.au/search.html and http://www.cse.unsw.edu.au/school/research/index.html, pointing to http://pokey.cse.unsw.edu.au/servlets/RI.
67 http://pokey.cse.unsw.edu.au/servlets/Search.
68 KAW96, KAW98 and KAW99 (http://ksi.cpsc.ucalgary.ca/KAW/, 2000).
6.1. Overview of the System
Figure 6.1 shows the architecture of the system we developed for a domain-specific
document management and retrieval system. The system has two main functions - a
“document management engine” and a “document retrieval engine”.
The “document management engine” builds and maintains knowledge bases for
documents and a concept lattice for browsing. Users themselves annotate their own
documents with a set of keywords using knowledge acquisition mechanisms (i.e.,
annotation support tools) that aim to capture the concepts which are missed or unknown
when the keywords are first assigned for a document.
When a user annotates their document, they can select keywords already used in the
system which have been added by others or enter further textwords which in turn will be
available to future users. In other words, the user is provided with a list of keywords
already available. After an initial selection, the user can view other terms that are
imported from other taxonomies. The system extracts all parents of the term up the
hierarchies of taxonomies which are related to the initial selection and presents them to
the user. The system also indicates keywords that have been used together with the
keywords already selected for other documents in the lattice structure. Through these
and further knowledge acquisition steps, the initial keywords can be refined.
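Two of these suggestion steps, collecting ancestor terms from an imported taxonomy and collecting keywords co-used with the current selection, can be sketched as follows (the taxonomy fragment and annotations are hypothetical):

```python
# Illustrative fragment of an imported taxonomy: term -> parent term.
TAXONOMY_PARENT = {
    "knowledge acquisition": "artificial intelligence",
    "artificial intelligence": "computing methodologies",
}

def ancestor_terms(term):
    """All parents of a term up the taxonomy hierarchy."""
    out = []
    while term in TAXONOMY_PARENT:
        term = TAXONOMY_PARENT[term]
        out.append(term)
    return out

def co_used(selected, annotations):
    """Keywords used together with already-selected keywords on other documents."""
    out = set()
    for ks in annotations.values():
        if selected & ks:
            out |= ks - selected
    return sorted(out)
```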
Figure 6.1. Architecture of the system.
[Components: a user interface with general and browsing interfaces (queries and results); knowledge acquisition tools (i.e., annotation support tools) through which users annotate documents; a document management engine and a document retrieval engine; knowledge bases holding documents with their keyword sets ({doc1; k1, k2, …}), the concept lattice, stemming classes, stopwords, logs, the domain ontology and imported ontologies (i.e., taxonomies).]
Then, the case (a document with a set of keywords) is added into the system, triggering
the update of the concept lattice which is used as a basic data structure for indexing
documents and browsing in our approach. This concept lattice is incrementally and
automatically reformulated whenever a new case is added or existing cases are changed.
Figure 6.2(a) shows an example of a lattice.
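The reformulation can be illustrated with a naive full recomputation of the concepts from the document-keyword mapping, closing the set of extents under intersection. The deployed system updates the lattice incrementally; this Python sketch is for exposition only:

```python
def concepts(docs):
    """All formal concepts of the context given by a doc -> keyword-set mapping.
    Extents are the intersections of attribute extents with the full set."""
    attr_ext = {}
    for d, ks in docs.items():
        for k in ks:
            attr_ext.setdefault(k, set()).add(d)
    extents = {frozenset(docs)}
    for ext in attr_ext.values():
        extents |= {e & frozenset(ext) for e in extents}
    out = []
    for e in sorted(extents, key=len, reverse=True):
        # Intent = keywords common to all documents in the extent.
        intent = set(attr_ext) if not e else set.intersection(*(set(docs[d]) for d in e))
        out.append((e, frozenset(intent)))
    return out
```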
The second main function is a “document retrieval engine” for finding documents
constructed in the concept lattice. The user can browse the lattice structure to find
information. The user can also formulate a query by entering any textwords in a
conventional information retrieval fashion or by selecting a keyword from those that
had been used for annotating the documents. If a keyword has been selected or
textwords identify some keywords, the system identifies the appropriate node and
displays it together with its direct neighbours. The user can start navigation from this
node.
If the system does not include a node with the given keywords, it displays a sub-lattice
which covers documents that contain the textwords anywhere in the document. The user
can navigate this sub-lattice, and also transfer to the same node in the overall lattice. If
the textwords entered do not correspond to a node, the system also sends a log file to an
expert so they can decide if more appropriate keywords are required for the documents.
A knowledge engineer can define the attributes of the evolved domain with a partially ordered hierarchy among the attributes. This requires a prior domain ontology in the same way as (KA)2 and is included in our system only for completeness. We suggest that it be used only for the most obvious attributes rather than for implementing a fully developed ontology. When a user annotates his/her document, the system then automatically extracts the values of the defined attributes from the content of the annotated document. The attributes and their values are accessed via nested browsing as shown in Figure 6.2(c). The concept lattice with a set of documents and their keyword sets becomes the outer structure as shown in Figure 6.2(b) and serves as the main navigation space. The structure of Figure 6.2(c) is nested in a corresponding concept of the outer lattice on the fly. That is, nested browsing is constructed dynamically at run time from documents belonging to a corresponding concept of the outer lattice, based on the structure of the attributes defined. This provides conceptual scaling between the domain attributes and the search results with the keyword set. It allows the user to obtain more specific search results, reducing the complexity of the navigation space. For instance, the user can read that there is a researcher whose research topic is "Artificial intelligence" and her position is "Professor".

Figure 6.2. An example of a browsing structure.
(a) Lattice structure. (b) Indexing of the lattice. (c) Nested structure69. (d) A home page (URL)70.
Once again, the system can be explored at: http://pokey.cse.unsw.edu.au/servlets/RI and
http://pokey.cse.unsw.edu.au/servlets/Search. The following section outlines the basic
environment of the system.
69 Numbers in parentheses indicate the number of documents corresponding to the attribute value.
70 A document is connected to an HTML page.
(b)
({ Artificial intelligence} , { doc1, doc2, doc3} )
({ Knowledge acquisition} , { doc1, doc2, doc4} )
({ Arti ficial intelligence, Knowledge acquisition,
Ripple Down Rules, Knowledge-based systems} ,
{ doc1} )
({ Artificial intelligence, Knowledge acquisition,
Ripple-Down Rules, Machine learning} ,
{ doc3} )
({ Arti ficial intelligence, Knowledge acquisition,
Formal Concept Analysis, Ontology} ,
{ doc4} )
({ Arti ficial intelligence, Knowledge acquisi tion} ,
{ doc1, doc2} )
(a)
(c)
URL
113
6.2. Basic Environment of the System
The system was developed with Java, JavaScript and Java Servlets (Java CGI: Common
Gateway Interface). The internal structure of the system comprises a Web server and an
interface environment based on a client/server architecture. The server is written in Java
as a CGI-style library (Java Servlets). The interface is based on HTML supported by a
Web browser such as Netscape 4.0 or Internet Explorer 5.0 or higher.
Security
Anyone can access and browse the lattice to find information within the system.
Annotation, however, is restricted: only staff and research students of the School of
Computer Science and Engineering, UNSW can annotate pages with research topics,
since the only documents the system provides access to are the home pages of these
staff and students. We use the School's local Unix accounts to authenticate users for
annotation, which also provides a default home page address for each user. This
security scheme is specific to this application and different approaches will be required
for other applications.
Annotation Mechanism
Only the annotations and the URLs of the pages are stored on our local server. Further
development of the project would probably look at encoding annotations within the
document itself as well as storing them on the server. This would require serving a new
version of the document, marked up with the annotations held on the server.
Browsing Structure Generation
The system has an automatic document clustering feature which creates its clusters
using terms taken from the annotated documents. We construct a conceptual lattice
browsing structure which relates documents and clusters (keywords) as well as showing
relationships among documents and among keywords. The system updates the browsing
structure (concept lattice) whenever a new document is added with a set of keywords or
when the keywords of existing documents are refined. This is essential if users are to get
immediate feedback on the clusters that emerge from changes in annotation.
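The core of this step, computing the formal concepts of the document/keyword context, can be sketched as follows. This is an illustrative Python sketch, not the system's Java implementation, and the document and keyword names used with it are hypothetical. Every concept intent is an intersection of some documents' keyword sets, so closing the object intents under intersection yields the intents; each extent is then the set of documents whose keyword sets include the intent.

```python
def concepts(docs):
    """docs: doc id -> frozenset of keywords. Returns (intent, extent)
    pairs for every concept with a non-empty extent: the intents are the
    documents' keyword sets closed under intersection."""
    intents = set(map(frozenset, docs.values()))
    intents.add(frozenset.intersection(*intents))  # top concept intent
    changed = True
    while changed:  # close the intents under pairwise intersection
        changed = False
        for a in list(intents):
            for b in list(intents):
                if (a & b) not in intents:
                    intents.add(a & b)
                    changed = True
    # extent of an intent = all documents whose keywords include it
    return sorted((sorted(i), sorted(d for d, kw in docs.items() if i <= kw))
                  for i in intents)
```

Rebuilding after an annotation change then simply means re-running this computation on the updated keyword sets, which is feasible for the document volumes of a specialised domain.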
User Interface for Browsing
The system has a Web interface, and the lattice for browsing is simplified by showing
only direct neighbours in the lattice using hyperlink techniques. The children and
parents are hypertext links and the user navigates by clicking on parent and children
nodes. Hypertext links to documents associated with the current node are also shown
along with a brief summary of the page.
Knowledge Engineering Support
Although the system supports annotation by users without the intervention of a
knowledge engineer, it also supports the notion of a domain manager who can make
some behind-the-scenes changes to improve the functionality of the system. The role of
the knowledge engineer is thus reduced and can largely be taken over by the users.
6.3. Presentation of the System
Section 6.3.1 will present the system with reference to the domain of research interests
in detail. Following that, the system for the domain of the Banff Knowledge Acquisition
Proceedings will be described briefly in Section 6.3.2.
6.3.1. Domain of Research Interests in a Computer Science School
6.3.1.1. Document Annotation
A researcher can annotate their own home page with a set of research topics by
selecting among the topics already used or by freely specifying new topics through
the interfaces provided. When the researcher logs onto the system using the local Unix
account, the system authenticates the user. If the user is authenticated, the system extracts
the basic information of the user (e.g., name, phone number, fax number, e-mail address
and homepage address) from his/her default home page address and displays the
annotation screen as shown in Figure 6.3. Note that a default home page for individual
researchers is provided at the School Web site in an HTML file. The system parses the
HTML file of the user and extracts the values for the pre-defined attributes.
Figure 6.3. An example for the annotation of a home page.
When a researcher logs on to the system, the above screen will be displayed for the annotation
of the page. The rest of the screen displays the further topics used by other researchers plus
those contained in the imported taxonomies.
Topics are initially selected by clicking the checkbox in front of each term, and/or
entering any new topics. To assist in finding relevant topics from those already used,
the researcher can select from the topics used by other researchers with whom they may
share interests. This can be done via the link from other UNSW researchers in the
above screen. Some researchers would like to examine the annotated research topics of
their collaborators. The annotator (researcher) can see a list of topics based on each of
the selected researchers as shown in Figure 6.4.
Figure 6.4. An example of selecting topics from other researchers.
First the annotator needs to select researchers to view their research topics through a
given interface. The system then will display topics based on the selected researchers as
above. The annotator can choose topics by clicking the checkbox of each term s/he
would like to assign as topics. After the annotator has selected some terms, s/he is then
presented with a display of terms that are imported from other taxonomies and that co-
occur with the selected terms in the lattice as shown in Figure 6.5. The purpose is to
prompt them to consider groupings of terms used by others that may be related. Some
topics may be “made up” in collaboration with other researchers and/or research groups.
The researcher can annotate the page with these further terms if desired.
Figure 6.5. An example of displaying possible relevant topics for the page being annotated.
In the above screen, the hyperlink Hendra Suryanto (the researcher being annotated) is
connected to the annotator’s home page. The link relevant pages shows documents
ordered by a similarity with the research topics of the annotator and the link sub-lattice
shows a sub lattice of these pages. The taxonomy links (UNSW, ACM, ASIS) take one
to the hierarchy of each taxonomy and the research topic links listed take one to
documents (i.e., researcher pages with these topics). The numbers in parentheses
indicate the relevance weight of each topic to the annotator's initial choice of topics.
The terms suggested from the external taxonomies are extracted from the ACM
computing classification taxonomy and ASIS&T thesaurus for information science.
They are also extracted from the UNSW taxonomy which has been developed using the
hierarchical clusters of the Open Directory Project, the KA2 community Web site, and
the research areas at the School of Computer Science and Engineering, UNSW. When a
term from the initial assignment of topics occurs in one of these hierarchies, the system
shows all the ancestors of this term up the hierarchy (i.e., its predecessors). The
results from the various hierarchies are, however, merged into a single list.
The terms “Learning” and “Knowledge Engineering” in Figure 6.5 are the parents of
terms in the taxonomical hierarchies of the topics that the annotator had initially
assigned. Any of these terms can be selected by the annotator and added to the
document. Note that in an inheritance sequence, the user is free to pick any, or none,
of the parent terms up the hierarchy. For example, a general ancestor may be selected
while the immediate parent is omitted. Relationships between terms evolve dynamically
and are determined by Formal Concept Analysis, rather than being taken from
pre-existing hierarchical clusters built for general purposes.
By taking into account specific pages (documents) and topics in the lattice, other terms
are suggested that co-occur with the topics the annotator has assigned. These terms are
presented to the annotator ordered by a weight that is normalised by the number of
terms at the node and by the node's “closeness” to the node to which the page is
assigned by the annotator's initial choice of terms. Again the annotator simply clicks the check box
located in front of each term to select it.
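A minimal sketch of this kind of co-occurrence ranking follows. The exact weighting used by the system is given in Section 5.3.3; the normalisation below (topic overlap divided by the size of each co-occurring keyword set) is only a stand-in for the node-based weighting, and all names are hypothetical.

```python
def suggest_terms(docs, assigned, top_n=5):
    """Rank candidate co-occurring terms for an annotator.
    docs: doc id -> set of keywords; assigned: topics already chosen.
    A term scores higher the more often it co-occurs with the assigned
    topics, normalised by the size of each keyword set."""
    scores = {}
    for kw in docs.values():
        overlap = len(kw & assigned)
        if not overlap:
            continue  # this document shares no topic with the annotator
        for t in kw - assigned:
            scores[t] = scores.get(t, 0.0) + overlap / len(kw)
    # highest weight first; ties broken alphabetically
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
```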
At this stage the annotator can view the set of documents ordered by a similarity
measure71 in the lattice with the current page as shown in Figure 6.6. As well, the
annotator can observe a sub-lattice constructed of relevant documents by clicking a
hyperlink on this screen. The annotator can also view the pages listed alphabetically for
each of the related topics as well as the existing lattice structure. Through these
processes, the annotator may find other relevant topics s/he has missed.
71 See Section 5.3.3 for details.
Figure 6.6. An example of relevant pages with the page being annotated.
The hyperlinks are connected to the researchers’ home pages.
After these procedures, the page (document) can be located at more than one node in a
lattice. One node in particular is unique and has the largest intent among the nodes
where the page is located. If there is another page already at this node, the annotator is
presented with that previous page and given the opportunity to add topics that
distinguish their page from it. Figure 6.7 shows an example of this. The newly added
topics may in turn match a further page, so the process continues until no other page
has the same keyword set (topics) as their page. The annotator can instead choose to
leave the two pages together with the same topics. Ultimately, however, every home
page is unique and offers different resources from other pages, and probably should be
annotated to indicate the differences.
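This differentiation loop mirrors the cornerstone-case check of Ripple-Down Rules and can be sketched as follows. This is a hypothetical sketch: `ask_user` stands in for the interactive screen of Figure 6.7, and the names are illustrative only.

```python
def differentiate(lattice_docs, new_doc, new_topics, ask_user):
    """While some existing page has exactly the same topic set as the
    new page, show it to the annotator and ask for distinguishing
    topics. ask_user may return an empty set to accept the pages
    sharing one node. lattice_docs: doc id -> frozenset of topics."""
    topics = set(new_topics)
    while True:
        clash = next((d for d, t in lattice_docs.items() if t == topics), None)
        if clash is None:
            break
        extra = ask_user(clash, topics)  # annotator inspects the stored case
        if not extra:                    # annotator leaves the pages together
            break
        topics |= set(extra)
    lattice_docs[new_doc] = frozenset(topics)
    return topics
```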
Figure 6.7. An example of identifying related pages.
This shows a stored case (in the above screen called a cornerstone case) that matches the
current case being added. Topic(s) can be added to differentiate two cases by adding any new
term and/or selecting the terms in the check boxes as before.
When the above stage is complete, the concept lattice is automatically rebuilt and the
page (document) is located at a node of the lattice. The annotator can immediately view
the concept lattice that incorporates his/her page and further decide whether the set of
topics s/he assigned for the page is appropriate. The navigation process itself can be a
learning process for the annotator to capture and discover domain knowledge, and can
influence keyword choices.
6.3.1.2. System Maintenance by a Knowledge Engineer
Even though a user can annotate his/her page without the intervention of a knowledge
engineer (or manager), the system supports the use of a knowledge manager who can
make some changes to improve the functionality of the system.
The knowledge manager receives reports of all new terms entered, as it is possible that
pages located at parent nodes of the node with the new term should perhaps also be
annotated with this term. In other words, when a new term is entered for a new
document, the term may also appropriately apply to other documents already in the
system. In this case the system extracts the relevant documents (pages) at the direct
parent nodes of the new node in the lattice and passes them, with the new term(s), to the
manager. The manager decides whether it is appropriate to contact the owners of these
pages to see if they want to use the new annotation. Figure 6.8 shows an example of
this situation.
Figure 6.8. An example of adding new terms.
For example, when a new case is added (the researcher “Akara Prayote” and his
research interests), the topic “network fault diagnosis” may be relevant to the
researchers “Paul Compton” and “Abdus Khan” located at the parent node of the new
node. If the manager selects the suggested topic and clicks on the “Save” button, the
system will create an e-mail and send it to the researchers concerned to facilitate the
assignment of the suggested topic if desired.
Another mechanism is activated when the system cannot find a node in the lattice for a
user’s query. If a user searches with a term that is not a keyword used for the
annotations, a textword search is carried out. In this case a report is sent to the
knowledge manager as this may suggest that a new keyword needs to be added to the
system. If the manager decides to act on the case(s), the system automatically creates
an e-mail and sends it to the author (the annotator of the document). The e-mail
includes a hyperlink which can facilitate the refinement of the document's keywords if
desired.
As the system evolves, new terms are added. As a consequence, there is a need to
handle synonyms and abbreviations, and to group related terms together, in order to
extend users' queries. The knowledge manager has access to a tool (i.e., an ontology
editor) which allows him/her to identify abbreviations, synonyms or groupings. A fairly
simple and standard graphic editor is available for this task.
The screen in Figure 6.9 shows how relevant terms are grouped and edited. The
manager can set up partial hierarchies so that related terms can be grouped under a
common name. For example, the terms “Deductive Databases”, “Distributed
Databases”, “Mobile Databases”, “Object Oriented Databases” and “Relational
Databases” can be grouped under the name “Databases”. Then, when a user's query is
relevant to the term “Databases”, all the documents that include terms belonging to
the group “Databases” are retrieved, and a nested structure representing the group
hierarchy is also supported. Note that this is different from the search feature which
shows all keywords that contain a given sub-string (see Figure 5.9 in Chapter 5 and
Figure 6.11). The knowledge manager can also edit synonyms and abbreviations. If a
user's query uses one of these synonyms (or abbreviations), the system extends the
query based on the relevant synonym.
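The resulting query expansion can be sketched as follows. This is an illustrative sketch: the synonym and group tables stand in for those maintained with the ontology editor, and the example vocabulary is hypothetical.

```python
def expand_query(terms, synonyms, groups):
    """synonyms maps a variant (synonym or abbreviation) to its
    canonical keyword; groups maps a group name (e.g. "Databases") to
    the member keywords set up by the knowledge manager. Each query
    term is canonicalised and, if it names a group, replaced by the
    group's members as OR-ed alternatives."""
    expanded = []
    for t in terms:
        canon = synonyms.get(t.lower(), t.lower())
        expanded.append(sorted(groups.get(canon, {canon})))
    return expanded  # one list of alternatives per original query term
```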
Figure 6.9. An example of editing grouping names.
This shows a snapshot of editing the group name “Databases”. The left-hand side of the screen is
the browser for groupings. The knowledge manager can also edit synonyms and abbreviations
using the links Edit Synonym and Edit Abbreviation, respectively.
6.3.1.3. Document Retrieval and Browsing 72
The main search mechanism is based on browsing a concept lattice of FCA. Browsing is
based on showing a Web page with hyperlinks. A user can interact with the system
starting from the top vertex of the lattice and exploring the relationships of the concepts
(topics), without any particular query being provided. Figure 6.10 shows the top-level
concepts of the lattice.
concepts of the lattice.
72 The browsing structure here may differ from the structure of the on-line system as the system
evolves. As well, with limited screen size, some branches of the lattice are omitted in the example
figures.
Figure 6.10. A snapshot of browsing the top-level concepts.
A text box for entering topics is shown. A complete list of concepts is also shown. The concepts
(topics) are hyperlinks to that concept node in the lattice. The numbers in brackets indicate the
number of researchers at each node.
Navigating the lattice, users can select terms from supported topics or enter terms into a
text box. This means that the user can specify a query by entering any textwords in a
conventional information retrieval fashion or by selecting a term among those already
used for annotating documents. A set of words can be entered separated by commas
(“,”), which are interpreted as the Boolean AND operator. Stopwords are first
eliminated and the remaining query is stemmed using the stemming classes. If the
entered term is a keyword, the system identifies the most relevant portion of the lattice
for the query and moves to this node, displaying only the direct neighbours of the node.
Figure 6.11 shows the search result when the user selected the hyperlink “Artificial
Intelligence (39)” from Figure 6.10 or entered the query “artificial intelligence”.
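The query pre-processing just described can be sketched as follows. This is illustrative only: the stopword list is a small sample, and a naive suffix stripper stands in for the system's stemming classes (a real implementation would use a Porter-style stemmer).

```python
STOPWORDS = {"the", "of", "a", "an", "and", "in", "for"}  # illustrative subset

def preprocess_query(query):
    """Split on commas (implicit Boolean AND), drop stopwords, and
    stem the remaining words."""
    def stem(word):
        # naive suffix stripping as a stand-in for real stemming
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    conjuncts = []
    for part in query.lower().split(","):
        words = [stem(w) for w in part.split() if w not in STOPWORDS]
        if words:
            conjuncts.append(" ".join(words))
    return conjuncts  # every conjunct must match (AND semantics)
```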
Figure 6.11. An example of a browsing structure.
Figure 6.11 shows the search result with the term “artificial intelligence”. The URLs for
these researchers can be accessed via the folders on the left. The researchers for the
current node are also listed at the bottom of the screen (shown partly). The “Nested”
button gives a Conceptual Scale view as appropriate. The taxonomies available are at
the top of the main screen. Users can extend the search result based on one of these
taxonomies. The system also displays the topics which subsume the user’s query if they
exist. In this instance, the topic is “Distributed Artificial Intelligence”.
The user can start navigation from the node by clicking a hyperlink among the sub-
concepts or entering a new topic again. Note that the term of the current concept is
omitted from each sub-concept to conserve display space. That is, the sub-concept
Agent (4) is the abbreviated form of Artificial intelligence, Agent (4). If we suppose that
Data Mining (7) is selected, then the content of the screen will be changed as shown in
Figure 6.12. All direct parent and child concepts of the selected concept are displayed.
Figure 6.12. An example of the main features of the lattice browsing interface.
Figure 6.12 presents the main features of the lattice-browsing interface that shows all
direct parent and child nodes of the current concept. To facilitate the user’s
understanding of parent and child concepts, we use different colours (red for parents,
green for the current concepts and blue for the child concepts).
Users who search for Data Mining under Artificial Intelligence find that there are only 7
researchers in this area. However, this node has 2 parents and so the lattice view makes
it obvious that there are in fact 17 researchers in the School who do research in Data
Mining. The user can navigate the parent concepts to search for more general documents
or navigate the child concepts to get more specific documents. If the user selects a
parent concept “Data Mining (17)” , s/he can observe the lattice from the “data mining”
point of view.
Figure 6.13. An example of a textword search.
If the entered term does not exist in the concepts of the lattice, a typical textword search
is carried out. Documents will be retrieved which contain these textwords in their
contents and then a sub-lattice will be constructed with the retrieved documents and
their keywords. Figure 6.13 shows an example of this textword search. Navigation is
still via the same lattice display, but only accesses the sub-lattice. In this case, the
system sends a log file to an engineer so s/he can decide whether more appropriate
research topics should be included for the pages. The user can return to the full lattice at
any stage via the link Artificial Intelligence in the above screen.
More importantly, a partially ordered hierarchical display is also available. This display
is generated using the Conceptual Scale extension to FCA. In essence this gives a view
of a lattice formed from objects that have the specified attribute-value pairs. Figure 6.14
shows an example of the nested structure of the concept “Artificial Intelligence”. We
build a concept lattice using the result pages with their topics as an outer structure and
scale up other attributes into an inner nested structure. The nested structure is
constructed dynamically and associated with the current concept of the outer structure.
In other words, the nested attribute values are extracted from the result pages.
Figure 6.14. An example of the nested structure of a concept.
A nested pop-up menu appears when the user clicks on the “nested” icon in front of
the current node. If the user clicks on one of the menu items, the results will be changed
according to the selection. For instance, suppose the user selects the menu items
“Position” → “Academic Staff” → “Professor”. The result will then change as
shown in Figure 6.15. Numbers in brackets indicate the number of documents
corresponding to the attribute value. For a more detailed discussion of conceptual
scaling for the proposed system refer to Section 5.5 in the previous chapter.
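The effect of such a selection can be sketched as a filter over the result pages' extracted attribute values. This is an illustrative sketch: the attribute names and the flattening of the menu path into attribute/value pairs are hypothetical.

```python
def nested_filter(result_pages, selection):
    """result_pages maps a page to its extracted attribute values
    (e.g. {"Position": "Professor"}); selection is the nested menu
    choice flattened to attribute -> required value. Returns the pages
    matching every selected value, i.e. the inner scale applied to the
    outer concept's extent."""
    return sorted(
        page for page, attrs in result_pages.items()
        if all(attrs.get(a) == v for a, v in selection.items())
    )
```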
As well, a knowledge engineer can arrange related terms by accessing a tool which
allows him or her to set up hierarchical groupings of related terms under a common
name. Then, when a user's query is related to the grouping(s), the grouping name is
included in the nested structure on the fly. An example can be seen in Figure 6.19 in the
following section.
Figure 6.15. The search result with the selection of nested items.
This shows the result of the selection “Position” → “Academic Staff” → “Professor” in
Figure 6.14. The user can read that there are four researchers whose research topic is
“artificial intelligence” and whose position is professor.
The user can also extend search results by using one of the imported taxonomies
available. In this case the ACM, ASIS&T, Open Directory Project (DMOZ) and the
local UNSW taxonomy are available at the top of the search screen. Using one of these
will recreate the lattice, assuming that any document annotated with a term from the
imported taxonomies is also annotated with all the ancestors of that term up the hierarchy. At
present the taxonomies are manually imported from the relevant Web pages. As XML
representation standards for ontologies become better established, importing a
taxonomy and using it to give a different lattice view will only require entering a URL.
Figure 6.16. An example of the search result extended by a taxonomy.
Figure 6.16 shows the result from Figure 6.11 as extended by the ASIS&T taxonomy.
Note that the number of AI researchers changes from 39 (Figure 6.11) to 45
(Figure 6.16). The taxonomy includes the term “artificial intelligence”, so documents
are retrieved that are annotated not only with the term “artificial intelligence” but also
with its child terms in the taxonomy. One can browse this lattice or, alternatively,
return to the lattice without a taxonomy at any stage via the link NONE on the screen.
6.3.2. Domain of Proceedings Papers
Our first implementation of the proposed approach with FCA was a system that gives
access to the papers of the on-line Banff Knowledge Acquisition Proceedings. Since
1996, all papers for these proceedings have been published on the Web (KAW96,
KAW98 and KAW99; http://ksi.cpsc.ucalgary.ca:80/KAW/). The system is accessed at
http://pokey.cse.unsw.edu.au/servlets/Search. We have previously described the system
implemented on this domain (Kim and Compton 2000; 2001a).
All papers were annotated by a knowledge engineer. The focus of this implementation
was to explore some possibilities for a browsing mechanism based on Formal Concept
Analysis. However, anyone with access to the WWW can set up and change annotations
for any page on the World Wide Web. That is, there is no security for this domain.
One difference from the system for the domain of research interests is that hierarchical
conceptual clustering of the documents associated with a user's query is supported, as
shown on the left side of Figure 6.17. This is an automatic document-clustering
feature similar to general clustering search engines such as Vivisimo and WiseNut.
However, clusters here result from relationships between objects (documents) and
clusters (organised terms - keywords) based on FCA.
To construct the clustering structure, the system formulates a sub-concept lattice using
the retrieved documents and their keywords based on FCA. Then, for each formal
concept at the first level of the lattice, a hierarchical clustering is built using their child
concepts for three levels of the lattice. This means that the concept lattice is converted
into a hierarchical structure. The clusters are dynamically constructed from the search
results of the user query. We have tried to provide diverse interaction modes to assist
users with different interaction preferences and different needs.
Lattice browsing and other search features such as a textword search, nested
hierarchical browsing are also supported as in the system of research interests.
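Unfolding the sub-lattice into the displayed hierarchy can be sketched as follows. This is an illustrative sketch: `children` stands for the child relation of the sub-lattice built from the search results, and the depth cut-off mirrors the three levels used in the proceedings interface.

```python
def cluster_tree(children, root, depth=3):
    """children maps a concept label to its child concepts in the
    sub-lattice. The lattice is unfolded into a tree, duplicating any
    concept reachable from several parents, and cut off after `depth`
    levels."""
    if depth == 0:
        return {}
    return {c: cluster_tree(children, c, depth - 1)
            for c in children.get(root, [])}
```

Because a lattice node can have several parents, the same concept may appear under more than one cluster, which is exactly how a lattice differs from a strict hierarchy.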
Figure 6.17. An example of a search result and a hierarchical clustering.
The left-hand side of Figure 6.17 shows a hierarchical clustering structure for the query
“ripple down rules”. The search result of the query is displayed on the right-hand side of the
screen. Sub-clusters of each item will appear in pull-down and pop-up menus when an
item is clicked. Users can obtain more specific documents by selecting one of these sub-
clusters. Using the link Lattice Browsing at the top of the screen, users can browse the
concept lattice to find documents in the same way as the domain of research interests in
the previous section. Figure 6.18 shows the browsing scheme based on a concept lattice
for this domain.
Figure 6.18. An example of navigating the concept lattice.
This shows the lattice-based browsing interface for the Banff proceedings domain. The features
are the same as in Figure 6.11.
Figure 6.19 shows the nested menu structures for the grouping “ripple down rules”. The
nested menu items for “Authors”, “Publication Years” and “Proceeding Titles” were
obtained from predefined domain attributes. But the item “RDR” was obtained from the
grouping of related values together under a grouping name that a knowledge engineer
had previously set up. This is an example of conceptual scaling for a one-valued context
presented in Section 5.5.2.
Figure 6.19. An example of a nested structure for a grouping.
6.4. Chapter Summary
The purpose of the implementations described was to demonstrate and evaluate the
proposed approach through a prototype with case studies. The system we have
developed was aimed at multiple users being able to add and amend document
annotations whenever they chose. The users were also assisted in finding appropriate
annotations and the lattice was immediately updated. The end users could find
documents both by browsing a lattice-based conceptual structure and by conventional
Boolean query. Conceptual scaling for the lattice structure was also supported to allow
users to find more specific results from the interrelationship between specified attribute
value pairs and the keywords of documents.
We made the system available on the School Web site and recorded all users’ activities
- both for searching and for annotating their home pages. We also provided Web evaluation
questionnaires on users' preferences for the lattice-based browsing mechanism and on
the efficiency of the annotation mechanisms. The next chapter will present the
experimental results on the proposed system based on these activities.
Certainly, a better response would be expected to such a system, where users could
change their research topic annotations and immediately see the impact of the change
on the resulting clusters of researchers. However, the objective of the implementation
was to observe whether the proposed mechanism could be applied to a document
management system for specialised domains.
Chapter 7
Experimental Evaluation
This chapter presents our experiment in using the system that we proposed in Chapter 5.
It was used to find staff and student home pages based on their research interests in the
School of Computer Science and Engineering, University of New South Wales
(UNSW). For the experiment, the system was made available on the School Web site
and recorded all users’ activities both for searching, and adding and changing the
annotation of their home pages. Evaluation forms were also set up for both lattice-based
browsing and annotating which we invited users to fill in.
To date 80 annotated home pages are registered in the system. About 300 search
activities were performed both by internal and external users. More use of the system
may be required to further evaluate usability. However, we believe that the data we
have gathered for this experiment is enough to determine whether the proposed system
is a useful alternative for document management and retrieval for a specialised domain.
We have previously presented a preliminary evaluation of the system (Kim and
Compton 2002a; 2002b).
Section 7.1 gives an overview of the experiment. Section 7.2 presents the experimental
results. Firstly, we present the results on whether the annotation mechanisms gave users
useful assistance in annotating their home pages so that the search performance of the
system was improved. These come from an analysis of the users’ annotation activities
that we logged, as well as an analysis of the questionnaire data. Secondly, the necessity
of document management systems that evolve, rather than systems that only use a priori
ontologies (or taxonomies), is discussed in relation to the experiments. Thirdly, we
present the results on whether the browsing structure evolved into a reasonable
consensus when users freely annotated their documents.
7.1. Experimental Design
Our first implementation of the proposed approach with FCA was a system for the
papers of the on-line Banff Knowledge Acquisition Proceedings as presented in Section
6.3.2. With the domain of papers, gathering user statistics would have been extremely
difficult because we do not have control of the Banff server. We chose the
annotation of researchers' home pages as an evaluation domain because it was
anticipated that researchers would be motivated to assist prospective research students
and other collaborators to find them. Home pages generally describe research interests
and the system would help students in finding interesting research areas. In addition, as
there are always students looking for supervisors it is anticipated that in time there
would also be sufficient browsing, and sufficient prospective students would be willing
to fill out the Web evaluation questionnaire.
It was also felt that the researchers might be more interested in using the system if they
could immediately see where they fitted in the existing lattice, and so they may be more
motivated to make changes. To this end, the starting lattice was populated by
automatically annotating researchers’ home pages (37 academics) with terms specified
as their research areas in the School’s research topic index. The problem with this
approach of course is that we cannot then see how a lattice would evolve starting from
scratch. Once the system was set up, it was opened up for use by staff, research fellows
and Ph.D. students. In this case their home pages were not initially annotated.
Previously the School had used simple research topic indices and permuted indices73.
These require the School office to make periodic requests for lists of topics and
individuals respond independently. The School now has links through to the lattice-
based browser from a number of different pages74. However, as this is an experimental
project the various School research indices have been continued and the links to the
lattice search have not been particularly highlighted.
73 http://www.cse.unsw.edu.au/school/research/curresearch.html and
http://www.cse.unsw.edu.au/school/research/research2.html.
74 http://www.cse.unsw.edu.au/search.html and http://www.cse.unsw.edu.au/school/research/index.html.
7.2. Experimental Results
Presently 80 annotated home pages are registered in the system with an average of 8
research topics (ranging from 2 to 27). The lattice contains 471 nodes with an average
of 2 parents per node (ranging from 1 to 10) and path lengths from 2 to 7 edges.
Table 7.1 shows the number of home pages annotated. Of the 37 academics who had
their home pages automatically annotated, 16 refined their research topics after
deployment. The remaining 21 researchers were either happy with their automatic
annotations or ignored the experiment. After the system was made available on the
Web, another 43 research staff and students annotated their own pages. Consequently,
59 staff and students actively participated in the annotation of home pages. Of interest,
almost half of the staff and students who started out with the system later changed their
annotations. This result alone suggests the need for evolutionary, user-based annotation.
Table 7.1. Number of pages annotated.

                                                 Pages automatically   Pages annotated by
                                                 annotated before      research staff
                                                 deployment            or students         Total
  Number of annotated pages                             37                    43             80
  Number of pages for which the initial
  annotations were later changed                        16                    19             35
7.2.1. Annotation Mechanisms
This section presents results on the annotation mechanisms for incremental
development; the main difference between the system we have implemented and
previous work in this area is its emphasis on incremental development and evolution for
specialised domains. The results come from an analysis of the users' annotation
activities that we logged and of the questionnaire data.
7.2.1.1. Users’ Annotation Activities
The results of the users’ annotation activities will be presented based on each phase of
the annotation process described in Chapter 5. Table 7.2 summarises those phases.
Table 7.2. Task for each phase of the annotation process.

  Phase 1  Select topics from the list or add new topics. The list includes topics used
           by other researchers (and can be viewed by researchers) and topics from
           external taxonomies75.
  Phase 2  Select topics from terms suggested by the taxonomies that are relevant to
           the topics assigned in Phase 1.
  Phase 3  Select topics from terms that co-occur in the lattice with the topics
           assigned in Phase 1.
  Phase 4  Add topics to differentiate home pages.
  Phase 5  Add new terms.
  Phase 6  Log users' queries.
Table 7.3. Number of terms added at each phase for 59 home pages.

  Phase                                           Number of assigned terms   Percentage
  Phase 1: Reused or newly added terms                     468                  79%
  Phase 2: Terms imported from taxonomies                   19                  3.2%
  Phase 3: Co-occurring terms in the lattice                99                 16.8%
  Phase 4: Terms added to distinguish pages                  2                  0.4%
  Phase 5: Terms added as new terms                          2                  0.4%
  Phase 6: Terms added from users' queries                   1                  0.2%
  Total                                                    591                 100%
Table 7.3 shows the number of terms added at each phase for the 59 pages annotated
through the active participation of researchers. There were 99 annotation activities
performed by researchers: 43 new annotation cases and 56 changes to existing
annotations (including 16 cases where the pages had been populated prior to use). Note
that the 56 changes applied to 35 home pages, so some pages were changed more than
once. The number of terms assigned at each phase was recalculated for the home pages
that had been changed, by referring to the annotation histories; the total number of
assigned terms for the 59 pages at Phases 1, 2 and 3 was then computed.
75 Note that two taxonomies have been imported from commonly available Web sites (i.e., ACM and
ASIS&T) and adjusted for our purposes by pruning. We have also developed a hybrid taxonomy called
UNSW by combining a number of taxonomies considered relevant to the research topic areas of the
School of Computer Science and Engineering (CSE).
The results indicate that 19 terms (3.2%) were discovered from the imported
taxonomies and that 99 terms (16.8%) were suggested co-occurring terms. Two terms
were discovered through the mechanism for distinguishing related documents, another
two through the new-term mechanism, and one from the logged user query data. Thus
123 (21%) of the 591 terms were provided by the supplementary support mechanisms.
Phase 1, 2 and 3

As presented in Chapter 5, when users start to annotate their home page, the system
displays all the topics used by others as well as all those contained in the taxonomies
(Phase 1). After the initial assignment, the user can view other terms imported from the
taxonomies (Phase 2); these are the taxonomy parents of the terms the user entered in
Phase 1. In Phase 3, the user can also view terms that co-occur in the lattice with the
terms provided in Phase 1. The user can then annotate his/her home page with any of
these terms if desired.
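As an illustration only (this is not the thesis implementation, and all data structures and names below are hypothetical), the suggestion logic of Phases 2 and 3 can be sketched as follows: Phase 2 walks up an imported taxonomy collecting ancestors of the chosen topics, while Phase 3 collects terms that appear alongside the chosen topics on other annotated pages.

```python
def taxonomy_parents(term, parent_of):
    """Phase 2 sketch: collect all ancestors of `term` in a child -> parent map."""
    parents = set()
    while term in parent_of:
        term = parent_of[term]
        parents.add(term)
    return parents

def co_occurring_terms(chosen, annotations):
    """Phase 3 sketch: terms that co-occur with the chosen topics on other pages."""
    suggested = set()
    for page_terms in annotations.values():
        if page_terms & chosen:               # page shares at least one chosen topic
            suggested |= page_terms - chosen  # suggest its remaining topics
    return suggested

parent_of = {"machine learning": "artificial intelligence"}
annotations = {
    "pageA": {"machine learning", "data mining"},
    "pageB": {"databases", "middleware"},
}
chosen = {"machine learning"}
print(taxonomy_parents("machine learning", parent_of))  # {'artificial intelligence'}
print(co_occurring_terms(chosen, annotations))          # {'data mining'}
```

The point of the sketch is that both suggestion sources are computed from existing annotations and imported structures, so they grow as the system is used.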
Phase 4: Adding Topics to Differentiate Home Pages

This annotation method is initiated when one user annotates his/her home page with the
same topics as another user. This feature is imported from the RDR technique, which
differentiates a new case from the stored cases. The system retrieves a document
annotated with the same terms and suggests that the user may wish to differentiate
his/her home page from that document. Only two such cases occurred. This mechanism
might have been more important had there been no start-up annotations; recall that 37
academics' home pages were annotated automatically before the system was made
available to users. In the two cases that did occur, the researchers immediately decided
they wanted to distinguish themselves. The mechanism should be more significant in a
dense domain (i.e., one containing many similar documents): in document management
for an individual using the proposed approach76, we observed that the need to
differentiate related documents arose quite often.
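The differentiation check itself is simple to state. The sketch below (an illustrative assumption, not the system's code; all names are hypothetical) finds already-stored pages whose annotation set exactly matches a new page's, which is the condition that triggers the Phase 4 prompt.

```python
def pages_needing_differentiation(new_page, new_terms, annotations):
    """Return stored pages annotated with exactly the same term set as the
    new page, so the annotator can be asked to add a distinguishing topic."""
    return [page for page, terms in annotations.items()
            if page != new_page and terms == new_terms]

annotations = {"drW": {"formal concept analysis", "knowledge acquisition"}}
clashes = pages_needing_differentiation(
    "studentK", {"formal concept analysis", "knowledge acquisition"}, annotations)
print(clashes)  # ['drW']
```

When the returned list is non-empty, the system would present those pages and invite the annotator to add a topic that separates them.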
76 http://pokey.cse.unsw.edu.au/servlets/DMR?kb=mihyek/bookmark&userId=mihyek.
Phase 5: Adding New Terms

The additional annotation method provided is relevant when a new term entered for a
new document may also apply to other documents already in the system. This situation
occurred in 240 cases, but only 25 of them were followed up. A knowledge engineer
examined the cases suggested by the system (all 240) and interviewed the annotators of
25 of the pages to decide whether the new term was relevant to their home pages. As a
result, two researchers added the suggested term to their research topics. We did not
pay close attention to this mechanism: it proved too costly to go through every case,
and it was also difficult for the knowledge engineer to determine whether a new term
was relevant to the suggested pages. However, our aim was to consider as many factors
as possible that might be useful for discovering terms relevant to the annotated
documents.
Phase 6: Logging Users' Queries

A further mechanism involved referring cases to a knowledge engineer when a query
did not include any of the topics used by annotators and a textword search was invoked.
A log of such transactions was sent to the knowledge engineer. This phase is triggered
by users' search activities rather than their annotation activities, but it is nevertheless
one of the mechanisms for the incremental development of the system.
There were 161 such cases. For five of them ("agent oriented systems", "compiler
design", "mobile commerce", "logistics" and "compilers") the knowledge engineer
interviewed researchers whose home pages contained the terms. As a consequence, the
term "compiler" was added as a research topic to a home page. The other 156 terms
were regarded as not relevant. Among those, 59 terms were related to researchers'
names (first name, last name or both). This suggests that it may be useful to index
documents with the names of researchers; however, in this application there were
adequate mechanisms for finding researchers by name. There were also five
abbreviations: "ai", "db", "b2b", "ICMS" and "cs". It was clear that handling
abbreviations would enhance the retrieval process. As a consequence, the (rather
obvious) abbreviation synonyms shown in Table 7.4 were added to the system using the
ontology editor presented in Chapter 6.
Table 7.4. Examples of abbreviation classes registered to the system.

  Abbreviation              Full word
  AI                        Artificial Intelligence
  B2B                       Business to Business
  DB                        Databases
  DBMS                      Database Management Systems
  E Business, E-Business    Electronic Business
  E Commerce, E-Commerce    Electronic Commerce
  FCA                       Formal Concept Analysis
  HCI                       Human Computer Interaction
  KA                        Knowledge Acquisition
  KBS                       Knowledge Based Systems
  MCRDR                     Multiple Classification Ripple-Down Rules
  NRDR                      Nested Ripple-Down Rules
  OS                        Operating Systems
  RDR                       Ripple-Down Rules
  SE                        Software Engineering
  WWW                       World Wide Web
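Query-time expansion of such abbreviation classes can be sketched as below. The dictionary reproduces a few entries from Table 7.4; the expansion function itself is an illustrative assumption about how the registered synonyms might be applied, not the system's actual ontology-editor mechanism.

```python
# A few abbreviation classes from Table 7.4 (lower-cased for matching).
ABBREVIATIONS = {
    "ai": "artificial intelligence",
    "db": "databases",
    "fca": "formal concept analysis",
    "rdr": "ripple-down rules",
}

def expand_query(query):
    """Replace each abbreviation token in a query with its registered full form."""
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in query.lower().split())

print(expand_query("ai planning"))  # 'artificial intelligence planning'
print(expand_query("FCA"))          # 'formal concept analysis'
```

With such a table in place, queries like "ai" or "db" resolve to the annotated research topics rather than falling through to a textword search.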
The remaining 92 terms appeared to be tests of the system's retrieval mechanisms
rather than attempts to identify research areas. They included terms such as "delphi",
"singapore", "help" and "topic". Some users seemed to regard the system as a general
search engine for finding information within the School Web sites; this seemed to be
confirmed by terms such as "summer project", "java", "unix primer", "jdk", "telnet",
"linux" and "scholarship".
Other observations from the users’ annotation activities
Another observation from the annotation activities was that 88.1% of the annotators (52
of the 59 who carried out annotation) promptly confirmed the effect of their annotations
by browsing the lattice. Twenty-one annotators (40.4%) changed their assigned terms
after viewing the browsing structure, and overall 35 home pages underwent further
changes. It is worth noting that we gave no detailed explanation of the annotation
procedures in advance: it was simply advertised that the system was available and that
it would enable users to annotate their home pages with a set of topics. An interesting
observed behaviour was that research students examined the annotated research topics
of their supervisors or of collaborators in the same research group and then selected
some of these topics. The selected topics were usually more general, describing the
research group overall.

The reusability of terms should be noted. The 59 researchers who actively annotated
their home pages used 591 terms; 550 (93%) of these were terms already supported by
the system, while 41 (7%) were newly entered. This is a very high level of reuse and
suggests that the support available to assist annotation must have been useful.
In summary, the tools available to the annotators provided a good level of useful
assistance. Considering that the users received no training or even advice on how the
system worked and what support was available, we feel this evaluation provides strong
support for the value of the approach. The tools available to the supervising knowledge
engineer were less useful, but did provide some help.
7.2.1.2. Survey: Questionnaire on the Annotation Mechanisms
This section presents the evaluation of the annotation mechanisms from an on-line
Web-based survey. Slaughter et al. (1994) examined the effectiveness of on-line
questionnaires and reported that on-line surveys were as good as paper-and-pencil
surveys. Harper et al. (1997) and Kuter and Yilmaz (2001) also addressed the
characteristics of Web-based questionnaires, and Perlman (1997) and Rho (2001)
developed examples of them. Following these studies, the questionnaire on the
annotation mechanisms was designed (Figure 7.1). It was implemented using standard
HTML forms that let users click on radio buttons and enter comments into text areas,
and was linked to the search pages of the system. Thirty-seven questionnaires were
completed (63% of the 59 participants).
Figure 7.1. Questionnaire used for the annotation mechanisms.
Table 7.5 shows the questionnaire results for Q1, "How was the annotation mechanism
to use?". Users expressed their opinions on a 5-point Likert scale for each question.
The majority (95%) of the respondents characterised the annotation mechanisms as
easy to use, selecting 4 or 5 on the scale; no one rated 1 or 2 for this question. The
mean score was 4.43.
Regarding the helpfulness of the annotation mechanisms, 75.5% of the respondents
considered the mechanisms helpful; only 3% indicated that they were not really
helpful. The mean score for this question was 4.03.

These results are consistent with the logs of the users' annotation activities presented
in the previous section. It can therefore be concluded that the annotation mechanisms
were easy to use and helpful in defining annotators' research topics.
One respondent, who rated Question 1, part 2 as 2 (i.e., unhelpful), commented that the
huge list of supported topics was daunting, and suggested that a restricted list of
keywords, automatically extracted from the home page, should be presented to users.
This issue will be discussed again with the results of Question 2.
Table 7.5. The questionnaire results on the annotation mechanisms.

  Q1. How was the annotation mechanism to use?

  (1) Overall, it was easy to annotate my research topics (1 = difficult, 5 = easy)
        1: 0 (0%)    2: 0 (0%)    3: 2 (5%)      4: 17 (46%)    5: 18 (49%)
  (2) Overall, it was helpful in defining my research topics (1 = unhelpful, 5 = helpful)
        1: 0 (0%)    2: 1 (3%)    3: 8 (21.5%)   4: 17 (46%)    5: 11 (29.5%)

  Note: mean (1) = 4.43, mean (2) = 4.03.
Table 7.6 presents the questionnaire results on the supported topics. The aim of these
questions was to observe whether the topics provided were appropriate and helpful for
users' annotation, and whether they were at the right level of generality. We also
examined whether simply listing the topics used by other researchers in one long list
was adequate or whether a more efficient method is needed.
Table 7.6. The questionnaire results on the research topics supported.

  Q2. Listed research topics

  (1) Are there too many topics on the list? (1 = too few, 5 = too many)
        1: 0 (0%)    2: 1 (2.5%)    3: 15 (40.5%)   4: 18 (49%)     5: 3 (8%)
  (2) Are they at the right level of generality? (1 = too specialised, 5 = too general)
        1: 0 (0%)    2: 2 (5.5%)    3: 27 (73%)     4: 7 (19%)      5: 1 (2.5%)
  (3) Are they appropriate? (1 = inappropriate, 5 = appropriate)
        1: 0 (0%)    2: 0 (0%)      3: 7 (19%)      4: 25 (67.5%)   5: 5 (13.5%)
  (4) Were they helpful for annotating your research topics? (1 = unhelpful, 5 = helpful)
        1: 0 (0%)    2: 0 (0%)      3: 2 (5.5%)     4: 22 (59.5%)   5: 13 (35%)

  Note: mean (1) = 3.62, (2) = 3.19, (3) = 3.95, (4) = 4.3.
Twenty-one participants (57%) indicated that there were too many topics on the list,
with another 40.5% giving a neutral response. One participant rated this question 2
(i.e., too few), presumably feeling that few of the topics covered his/her particular
research areas. The mean score was 3.62, to the "many" side of neutral. As for the
generality of the research topics, 73% of the respondents gave a neutral response.
Regarding the appropriateness of the terms, 81% indicated that the topics were
appropriate, and 94.5% of the respondents characterised the listed topics as helpful for
annotating their research topics.
Table 7.7 shows the cross-distribution of responses about the number of topics on the
list and their generality. Of those who regarded the number of listed topics as large
(i.e., rated 4 or 5), 66.5% gave a neutral response on the level of generality and 24%
thought the topics too general; only 9.5% considered them too specialised. This
suggests that the length of the topic list is due to the wide range of research areas in the
School rather than to a high level of specialisation.
The natural tendency is for each researcher to have a slightly different research profile:
more general topics can be shared, while more specific terms are brought in by each
researcher to distinguish themselves. The logged annotation activities indicated that all
users examined the list of topics before adding new topics that did not exist in the list,
and the newly entered terms were usually more specific and oriented to the annotator.
A possible factor affecting this question is that annotators can have different opinions
depending on previous annotations. A researcher who annotates his/her home page
after a number of collaborators have already annotated theirs may find the supported
topics quite specialised, or at an appropriate level of generality; an earlier annotator
might feel the topics provided are too general, as the only relevant topics available are
general ones.
Table 7.7. Cross-distribution between the number of topics on the list and their generality.

  Rows: (2) level of generality (1 = too specialised, 5 = too general).
  Columns: (1) number of topics on the list (1 = too few, 5 = too many).

  (2)\(1)      1          2           3            4           5          N(2)
  1         0 (0%)     0 (0%)      0 (0%)       0 (0%)      0 (0%)     0 (0%)
  2         0 (0%)     0 (0%)      0 (0%)       2 (11%)     0 (0%)     2 (5.5%)
  3         0 (0%)     1 (100%)   12 (80%)     12 (67%)     2 (67%)   27 (73%)
  4         0 (0%)     0 (0%)      3 (20%)      4 (22%)     0 (0%)     7 (19%)
  5         0 (0%)     0 (0%)      0 (0%)       0 (0%)      1 (33%)    1 (2.5%)
  N(1)      0 (0%)     1 (2.5%)   15 (40.5%)   18 (49%)     3 (8%)    37 (100%)
Table 7.8 shows the cross-distribution of responses about the number of topics on the
list and their appropriateness. Of those who indicated that the number of listed topics
was large (i.e., rated 4 or 5), 81% thought the topics appropriate and 19% gave a
neutral response. Of those who gave a neutral response on the number of listed topics,
87% indicated the topics were appropriate. This suggests that the terms were not seen
as inappropriate even when the number of listed topics was seen as large.
Table 7.8. Cross-distribution between the number of topics on the list and their appropriateness.

  Rows: (3) are the listed topics appropriate? (1 = inappropriate, 5 = appropriate).
  Columns: (1) number of topics on the list (1 = too few, 5 = too many).

  (3)\(1)      1          2           3            4            5          N(3)
  1         0 (0%)     0 (0%)      0 (0%)       0 (0%)       0 (0%)     0 (0%)
  2         0 (0%)     0 (0%)      0 (0%)       0 (0%)       0 (0%)     0 (0%)
  3         0 (0%)     1 (100%)    2 (13%)      3 (16.5%)    1 (33%)    7 (19%)
  4         0 (0%)     0 (0%)      9 (60%)     14 (78%)      2 (67%)   25 (67.5%)
  5         0 (0%)     0 (0%)      4 (27%)      1 (5.5%)     0 (0%)     5 (13.5%)
  N(1)      0 (0%)     1 (2.5%)   15 (40.5%)   18 (49%)      3 (8%)    37 (100%)
Table 7.9 shows the cross-distribution of responses for the appropriateness and
helpfulness of the listed topics. Those who regarded the listed topics as appropriate
also indicated that they were helpful. Of those who gave a neutral response for
appropriateness, 71% thought the listed topics helpful (i.e., rated 4 or 5). The
respondents generally rated helpfulness higher than appropriateness.
Table 7.9. Cross-distribution between appropriateness and helpfulness of the listed topics.

  Rows: (4) were the listed topics helpful? (1 = unhelpful, 5 = helpful).
  Columns: (3) are the listed topics appropriate? (1 = inappropriate, 5 = appropriate).

  (4)\(3)      1          2           3           4            5           N(4)
  1         0 (0%)     0 (0%)      0 (0%)      0 (0%)       0 (0%)      0 (0%)
  2         0 (0%)     0 (0%)      0 (0%)      0 (0%)       0 (0%)      0 (0%)
  3         0 (0%)     0 (0%)      2 (29%)     0 (0%)       0 (0%)      2 (5.5%)
  4         0 (0%)     0 (0%)      4 (57%)    15 (60%)      3 (60%)    22 (59.5%)
  5         0 (0%)     0 (0%)      1 (14%)    10 (40%)      2 (40%)    13 (35%)
  N(3)      0 (0%)     0 (0%)      7 (19%)    25 (67.5%)    5 (13.5%)  37 (100%)
Therefore, it can be said that the listed terms were appropriate, at a reasonable level of
generality, and helpful for annotating users' research topics, even though there were too
many topics on the list to go through.

The list contained around 225 topics, and going through every one of them to choose
one's research topics can certainly be time-consuming. To mitigate this, the system
supported a function for selecting topics from other researchers' topics. In other
words, an annotator can choose other already-annotated researchers and then select
topics from lists based on those researchers, who may be the annotator's collaborators
or supervisors with whom interests are shared. With this function the annotator can, to
some extent, moderate the number of topics to be considered; this may account for the
40.5% of respondents who rated the topic list as 3 (i.e., reasonable). Although selecting
from other researchers may limit the choices available, further phases of the annotation
process show other related topics.
However, as one respondent noted in the comments for Question 1, a more efficient
mechanism for considering topics needs to be explored. It would be better to support a
mechanism that first displays terms extracted from the annotator's home page; that is,
when researchers annotate their home pages, terms could be extracted from those pages
and suggested first. The extraction of terms relevant to a page could be done using
machine learning techniques.
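A minimal baseline for this suggested improvement, well short of the machine learning techniques mentioned, would simply intersect the words of the annotator's home page with the system's topic vocabulary and present the matching topics first. The sketch below is purely illustrative; the function name and matching rule are assumptions, not part of the thesis system.

```python
import re

def topics_found_on_page(page_text, topic_vocabulary):
    """Return vocabulary topics whose every word occurs in the page text."""
    words = set(re.findall(r"[a-z]+", page_text.lower()))
    return [t for t in topic_vocabulary
            if all(w in words for w in t.lower().split())]

vocab = ["machine learning", "databases", "formal concept analysis"]
page = "My research applies machine learning to concept lattices."
print(topics_found_on_page(page, vocab))  # ['machine learning']
```

A shortlist produced this way could be shown before the full 225-topic list, with the long list still available as a fallback.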
In summary, not only were the annotation mechanisms easy to use and helpful, but the
terms available in the system were also helpful in defining annotators' research topics.
However, a more efficient tool is required to shorten the list of available topics, rather
than presenting one long list of all topics available.
7.2.2. Ontology Evolution
One goal of this thesis is to explore the possibilities of document annotation systems
that do not commit to a priori ontologies. The aim is to develop techniques for assisting
users in annotating documents as an ontology evolves. Instead of defining the ontology
at the outset, we would like the system to assist users in extending the developing
ontology so that it improves over time.
Recall that we imported two taxonomies, ACM and ASIS&T, and developed a
taxonomy called UNSW by combining the research areas listed at the School Web sites
with a number of taxonomies considered relevant to the School's research areas.
One of the key components of the proposed approach is to show annotators all the
parents of any of their terms that occur in the imported taxonomies. Users then
determine how relevant the proposed terms are to their documents and select any
combination of superclass terms to add. A key aspect of the annotation is that terms are
suggested and users freely select them without considering any hierarchy among the
terms.
The critical advantage of this is that terms that seem too general, even if only part of
the way up the hierarchy, can be omitted. The user does not have to consider whether
the terms are too general; in fact the parent-child relations are not indicated, and a
simple list of terms is shown. The result is a new taxonomy made up of the parts of
other taxonomies that users perceive as most useful, along with other terms they add.
We believe this may provide a very simple but powerful way of validating and
improving on the ontological standards that are being established.
Table 7.10 shows the use of the imported terms. Of the 207 terms suggested for the 59
cases (researchers), 19 were used for annotation. The annotators were interviewed to
investigate why the other suggested terms were not selected. The most common
response was that the proposed topics, even though applicable, were too general to be
useful in specifying their research areas.
Table 7.10. The percentage of the selected terms among the relevant taxonomy terms.

                                      Number of terms   Percentage
  Total suggested taxonomy terms            207              -
  Selected terms                             19             9.2%
  Non-selected terms                        188            90.8%
It might be assumed that the imported terms would be appropriate to use, particularly
since the taxonomies imported (i.e., ACM and ASIS&T) would seem well suited to a
school of Computer Science and Engineering (CSE). If these taxonomies represented
an adequate a priori taxonomy of the research areas of the School, the percentage of
selected terms should therefore be high.
However, only 19 of the 207 suggested terms were used for annotation. Recall that the
207 terms suggested (for 59 cases) are all the parent terms occurring in the taxonomies
for the terms assigned by the users at the first stage, and that these 59 cases actively
annotated their home pages. Some taxonomy terms can already be selected when users
annotate their topics at the first stage, since taxonomy terms are also offered there;
thus, if we calculated the ratio over all taxonomy terms used in annotation, the
percentage of selected taxonomy terms would certainly increase. The important point,
however, is that most of the general terms suggested were not selected.
The relevance of the terms suggested from the taxonomies, and the consequent
retrieval of documents (researchers), are detailed in Table 7.11. Here the Open
Directory Project, regarded as one of the world's biggest human-edited taxonomies,
has also been included, although it was not available during annotation.
Recall that the lattice shows all the researchers who use a particular term, and that this
number increases when terms are imported from taxonomies and pages are considered
implicitly annotated by any taxonomy terms that are parents of terms selected by the
researcher. These would be the retrieval results if the researchers were obliged to
conform to that ontology (policy).
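This implicit-annotation retrieval can be sketched as follows: a query term retrieves pages annotated with that term or with any of its descendants in an imported taxonomy. The sketch is an illustrative assumption (names and data are invented), not the thesis implementation.

```python
def descendants(term, children_of):
    """All terms below `term` in a taxonomy given as a parent -> children map."""
    found, stack = set(), [term]
    while stack:
        for child in children_of.get(stack.pop(), []):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found

def retrieve(term, annotations, children_of=None):
    """Pages annotated with `term` or, when a taxonomy is given, any descendant."""
    match = {term} | (descendants(term, children_of) if children_of else set())
    return {page for page, terms in annotations.items() if terms & match}

annotations = {"p1": {"data mining"}, "p2": {"databases"}, "p3": {"logic"}}
odp = {"databases": ["data mining"]}   # toy fragment of an ODP-like hierarchy
print(sorted(retrieve("databases", annotations)))       # ['p2'] (lattice only)
print(sorted(retrieve("databases", annotations, odp)))  # ['p1', 'p2'] (expanded)
```

The gap between the two calls is exactly the difference between the "Lattice Only" column of Table 7.11 and the taxonomy columns.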
Table 7.11 shows the number of researcher home pages retrieved using the various
terms in the left column, with and without the imported taxonomies. The first retrieval
column shows the number of pages retrieved using only the researchers' own
annotations of their home pages, as shown in the lattice. The remaining columns show
retrieval when it is assumed that any parent terms in the various taxonomies also apply
to pages annotated with their child terms. A hyphen ("-") indicates that the term on the
left-hand side is not contained in the relevant taxonomy. Some terms in Table 7.11
have no children in the respective taxonomies, so the numbers of pages retrieved are
unchanged; these terms are marked with an asterisk ("*") to show their existence in the
corresponding taxonomies. Figure 7.2 shows some partial hierarchies of the taxonomies
for the terms used in Table 7.11.
Table 7.11. Document retrieval using various taxonomies.

                              Number of researcher home pages retrieved
  Terms                      Lattice   ACM        ASIS&T     Open       UNSW
                             Only      Taxonomy   Taxonomy   Directory  Taxonomy
  Artificial Intelligence      39        50         45         56         58
  Knowledge Engineering         3         -         32          -         30
  Knowledge Representation     18        18         20         22         22
  Knowledge Management          4         -          -         25         25
  Knowledge Discovery           7         -          -         24         24
  Machine Learning             24         -         24         28         28
  Learning                      5        21          -          -          -
  Information Processing        1         -         11          -          -
  Information Retrieval        10        11         10         11         11
  Internet                      4         4          6          6          6
  Databases                    11        11         12         22         13
  Computer Programming          1        11          9          1         11
  Programming Languages         4         4          4          4          4
  Knowledge Acquisition*       19        19         19          -         19
  Spatial Representation*       3         -          3          -          3
  Data Mining*                 17        17          -         17         17
  World Wide Web*               5         -          5          5          5
In Table 7.11 we can observe not only that the ACM and ASIS&T taxonomies and the
Open Directory have very different ideas of what constitutes "Knowledge
Engineering", but also that the ACM and ASIS&T taxonomies organise "Machine
Learning" and "Learning" differently. The ACM taxonomy and the Open Directory do
not use the term "Knowledge Engineering" at all, and the terms "Knowledge
Management" and "Knowledge Discovery" are not used in the ACM and ASIS&T
taxonomies. However, there is obviously a high degree of consistency for terms such as
"Information Retrieval", "Programming Languages" and "Knowledge Representation".
The clustering of the term "Databases" is highly consistent between the ACM and
ASIS&T taxonomies, but the Open Directory classifies more terms under "Databases",
causing a large difference in the number of retrieved pages compared with the other
taxonomies; "Data Mining" is a representative example of these sub-terms.
Figure 7.2. An example of a different view on the hierarchies of terms.
These phenomena suggest not random variations but specific and relatively consensual
decisions about the value of the various terms available. There can be a commitment to
use some particular taxonomy, but there is no single best structure. It therefore seems
highly advantageous to allow users in various communities very flexible access to such
ontological resources, so that the most appropriate use for the community can emerge.
Next, we examine whether retrieval performance can be enhanced when a user's query
exploits the taxonomies, since this is closely interrelated with the issue of how
knowledge structures should evolve over time. Traditional information retrieval has
often incorporated pre-defined classification systems, thesauri or taxonomies, and
lattice-based models for information retrieval have also been combined with domain
thesauri, showing improvements in retrieval efficiency (Carpineto and Romano 1996a;
Cole and Eklund 1996a).
[Figure 7.2 content, partial hierarchies:]

ACM
  Artificial Intelligence
    Knowledge Representation: Modal Logic, Predicate Logic, …, Knowledge Representation Languages
    Learning: Concept Learning, …, Knowledge Acquisition

ASIS&T
  Knowledge Engineering: Knowledge Acquisition; Knowledge Representation: Spatial Representation
  Artificial Intelligence: Machine Learning, Expert Systems

Open Directory Project
  Artificial Intelligence: Data Mining; Machine Learning: Case Based Reasoning;
    Knowledge Representation: Ontologies: Semantic Web
  Knowledge Management: Knowledge Discovery: Data Mining, Text Mining, Knowledge Retrieval, …
  Databases: Data Mining, Middleware, Object-Oriented, Relational, …
However, the problem is that inheritance in taxonomy hierarchies is often not transitive
for instances (objects). When a taxonomy or thesaurus is constructed, the thesaural
entries judged most appropriate for representing documents are selected and organised
into hierarchies. Hence, the inheritance relations in the hierarchies do not always carry
over when the taxonomy is instantiated with objects.
In other words, the problem with a fixed hierarchical structure is that there may be no
"right place" for a document. Consider the term "Data Mining", which is clustered
under both "Artificial Intelligence" and "Databases" in the Open Directory Project
(Figure 7.2). Sometimes it is not clear where to place a document that is about "Data
Mining" but also about both "Artificial Intelligence" and "Databases". Alternatively, a
document about "Data Mining" may belong with neither "Databases" nor "Artificial
Intelligence" but, say, with graph theory; under a fixed taxonomy or thesaurus such a
document will be stored inappropriately.
We examined retrieval performance on queries relevant to the taxonomical terms in
Table 7.11. McGuinness (2000) noted that authors often become highly literate in the
domains they work in. Believing, therefore, that authors (annotators) are the most
appropriate agents to assign concepts to their documents, we assumed that a search on
a topic adopted by researchers in the lattice achieves full precision77. On this
assumption, we computed retrieval performance for the queries in Table 7.11. The
results indicated that average retrieval performance with the taxonomies was lower in
precision than lattice retrieval (an average decrease in precision of 0.35; see Appendix
1). Of course, this covers only search terms that exist in the taxonomies, so the overall
average decrease in precision may be smaller. Nevertheless, the result suggests that it
should not be assumed that reasoning along the hierarchies of a taxonomy always
enhances retrieval performance; careful consideration of the domain involved is
required.
77 Precision is the ratio of relevant documents retrieved for a given query to the total number of
documents retrieved.
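The precision comparison can be made concrete with a small sketch, under the thesis's assumption that the annotator-assigned lattice results constitute the relevant set (precision 1.0). The numbers below are invented for illustration and are not the Appendix 1 data.

```python
def precision(retrieved, relevant):
    """Relevant documents retrieved over total documents retrieved."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

lattice_result = {"p1", "p2", "p3"}              # taken as the relevant set
taxonomy_result = lattice_result | {"p4", "p5"}  # parent-term expansion adds pages
print(precision(lattice_result, lattice_result))   # 1.0
print(precision(taxonomy_result, lattice_result))  # 0.6
```

Because taxonomy expansion only ever adds pages, it can lower precision in exactly this way whenever the added pages are not relevant to the query.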
In summary, users’ selection of the terms suggested from the imported taxonomies has
been examined. Additionally, retrieval performance for the terms in the taxonomies was
examined. We believe that the experimental results confirm our view of how a
knowledge structure of concepts (or ontology) for a domain should be evolved with
emphasis on the significance of context. However, as standards for representing
ontologies take hold, these small community systems will be able to very flexibly
import ontologies and make selective use of their resources.
7.2.3. Lattice-based Browsing
A key difference between the proposed approach and the general information retrieval
approach is in the method of clustering documents for browsing. This is based on a
lattice model. As indicated earlier, users themselves annotate their home pages in
whichever way they like, assisted by the supporting annotation mechanisms. Formal
Concept Analysis then generates a conceptual hierarchy for browsing by finding all
possible formal concepts which reflect a certain relationship between the annotated
terms and home pages. The structure is based on a lattice scheme which forms a multi-
parent relationship. The system then updates the concept lattice whenever a new home
page is added with a set of topics, or the topics of existing pages are refined.
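The construction step can be sketched as follows. This is a minimal, brute-force illustration of how formal concepts arise from an annotation table; the pages and topics are invented, and the actual system maintains the lattice incrementally rather than by subset enumeration:

```python
# A minimal sketch of how Formal Concept Analysis derives browsing nodes
# from annotations. Pages and topics here are hypothetical, not thesis data.

from itertools import combinations

# Formal context: each home page (object) is annotated with research topics.
context = {
    "page1": {"Artificial Intelligence", "Knowledge Representation"},
    "page2": {"Artificial Intelligence", "Data Mining"},
    "page3": {"Databases", "Data Mining"},
}

def extent(topics):
    """All pages annotated with every topic in the given set."""
    return {p for p, ts in context.items() if topics <= ts}

def intent(pages):
    """All topics shared by every page in the given set."""
    tss = [context[p] for p in pages]
    return set.intersection(*tss) if tss else set()

# A formal concept is a pair (pages, topics) with extent(topics) == pages
# and intent(pages) == topics. Enumerate concepts from all page subsets:
concepts = set()
pages_all = list(context)
for r in range(len(pages_all) + 1):
    for subset in combinations(pages_all, r):
        ts = intent(set(subset))
        concepts.add((frozenset(extent(ts)), frozenset(ts)))

# Larger extents (more general concepts) first, as at the top of the lattice:
for ps, ts in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(ps), sorted(ts))
```

Note that a concept such as ({page1, page2}, {Artificial Intelligence}) has two child concepts here, so the same page can be reached along multiple paths, which is the multi-parent property exploited below.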
A key question is whether the browsing structure can evolve into a reasonable
consensus when multiple users freely annotate documents. The contrast with other
ontology work is that the consensus will emerge rather than being imposed by some
groups who have decided what it should be.
However, an issue here is how the emerged structure (or ontology) can be validated. We
felt this was an extremely difficult task because there is no best structure for a particular
domain, and no guidelines are apparent in the literature for evaluating the efficiency of
an ontology or taxonomy. This is also an important issue for the ontology
community. As a consequence, we designed questionnaires and conducted a
survey to evaluate the efficiency of lattice-based browsing from the users’ point of
view. If users find the search performance of the system efficient, it may be concluded
that the evolved structure is well organised as a consensus, and vice versa. The survey
results will be presented in Section 7.2.3.2.
Before presenting the survey results, we show what the browsing structure
constructed by multiple users looks like, and what advantages a lattice-based
structure has over a hierarchical structure.
7.2.3.1. Browsing Structure
Figure 7.3(a), (b), and (c) show examples of the browsing structure presented in a flat
form. Recall that the 80 home pages of academic and research students have been
registered with an average of 8 research topics. The concept lattice contains 471 nodes
with an average of 2 parents per node and path lengths ranging from 2 to 7 edges. The
lattice is continuously evolving as incremental changes are made. The positive survey
results on the annotation mechanisms suggest that the browsing structure is organised
into a reasonable consensus.
=================================================================== Example 1: Root (80) Agent (6) Algorithms (6) Algorithm Design (3) Artificial Intelligence (39) Belief (4) Clustering (3) Compilation (2) Compiler Construction (2) Compiler Technology (2) Computational Algebra (2) Computational Geometry (4) Computer Architecture (2) Computer Graphics (3) Data Mining (17) Data Structures (3) Databases (11) Database Applications (8) Distributed Computing (5)
Distributed Systems (5) Electronic Commerce (12) Formal methods (4) Functional Programming (4) Human Computer Interaction (5) Image Processing (5) Information Extraction (2) Information Retrieval (10) Internet (4) Knowledge Acquisition (19) Knowledge-Based Systems (9) Knowledge Discovery (7) Knowledge Management (4) Knowledge Representation (18) Logics (9) Logic Programming (8) Machine Learning (24) Natural Language Processing (6)
Network Management (2) Neural Networks (7) Object oriented Design (2) Ontologies (5) Parallel computing (5) Pattern Recognition (4) Personalisation (3) Program Analysis (2) Programming Languages (4) Robotics (12) Semantic Web (6) Software Engineering (7) Spatial Reasoning (3) Text Mining (6) Web Services (6) Workflows (6) World Wide Web (4) XML(6)
=================================================================== Figure 7.3(a): Examples of the browsing structure that evolved.
This shows the top-level concepts of the lattice constructed by FCA. Numbers in parentheses
indicate the number of objects which satisfy the term.
=================================================================== Example 2: Root (80) => Artificial Intelligence (39) Agent Theory (4) Agent (4) Cognitive Modelling (5) Cognitive Robotics (5) Combinatorial Algorithms (2) Data Mining (7) Electronic Commerce (3) Image Processing (3) Information Retrieval (5)
Knowledge Acquisition (16) Knowledge-Based Systems (8) Knowledge Discovery (6) Knowledge Representation (14) Learning (5) Machine Learning (20) Mobile agent (3) Natural Language Processing (5) Neural Networks (5)
Ontologies (4) Pattern Recognition (3) Philosophy (5) Planning (2) Quantum computing (1) Robotics (10) Spatial Reasoning (2) Spatial Representation (3) Text Mining (4)
Example 3: Root (80) => Artificial Intelligence (39) => Knowledge Representation (14) Parent Topics: Artificial Intelligence (39) Knowledge Representation (18) Sub Topics: Agent Theory (3) Agent (3) Belief Revision (7) Causal Reasoning (4) Knowledge Acquisition (6)
Knowledge Discovery (3) Logics (7) Machine Learning (5) Multi-agent systems (3) Nonmonotonic reasoning (7)
Ontologies (2) Robotics (2) Theory Revision (4)
Example 4: Root (80) => Knowledge Representation (18) Agent (4) Artificial Intelligence (14) Internet (2) Knowledge Acquisition (7)
Logic Programming (7) Machine Learning (7) Ontologies (3) Semantic Web (3)
Knowledge Management, Text Mining (3)
Example 5: Root (80) => Knowledge Representation (18) => Artificial Intelligence (14) Parent Topics: Artificial Intelligence (39) Knowledge Representation (18) Sub Topics: Agent Theory (3) Agent (3) Belief Revision (7) Causal Reasoning (4) Knowledge Acquisition (6)
Knowledge Discovery (3) Logics (7) Machine Learning (5) Multi-agent systems (3) Nonmonotonic reasoning (7)
Ontologies (2) Robotics (2) Theory Revision (4) Cognitive Modelling, Fuzzy Concepts (2)
=================================================================== Figure 7.3(b): Examples of the browsing structure that evolved.
The term “Knowledge Representation” is categorised under the term “Artificial
Intelligence” (Example 3), and the term “Artificial Intelligence” can also be organised
under “Knowledge Representation” (Example 5). These structures are based on a
lattice which forms multi-parent relationships, as seen in Examples 3 and 5. Figure
7.3(c) shows similar examples.
=================================================================== Example 6: Root (80) => Artificial Intelligence (39) => Data Mining (7) Parent Topics: Artificial Intelligence (39) Data Mining (17) Sub Topics: Database Applications (2) Learning (3)
Machine Learning (5) Robotics (2)
Example 7: Root (80) => Data Mining (17) Agent (2) Algorithms (2) Artificial Intelligence (7) Clustering (2) Data Structures (2) Databases (7)
Database Applications (6) Electronic Commerce (6) Information Retrieval (4) Machine Learning (6) Mobile agent (2) XML (4)
Example 8: Root (80) => Data Mining (17) => Databases (12) Parent Topics: Data Mining (17) Databases (12) Sub Topics: Computational Geometry (2) Database Applications (5)
Electronic Commerce (4)
Example 9: Root (80) => Databases (12) Data Mining (7) Database Applications (7) Electronic Commerce (5) Information Retrieval (5) Knowledge Discovery (3)
Machine Learning (3) Semantic Web (4) Web Services (5) XML (4) Knowledge Representation (3)
=================================================================== Figure 7.3(c): Examples of the browsing structure that evolved.
The main difference between lattice-based browsing and a standard browsing scheme is
in the structure of the hierarchy. In a standard browsing scheme, browsing is
usually organised in a hierarchical tree structure, with more general concepts at
the top, so there is only one path from the root to a given cluster. The lattice allows
multiple paths. Rather than supporting only one hierarchy, it is better to support all
practicable structures which reflect the possible inter-relationships within and between
objects and their attributes in the system, as shown in the examples of Figure 7.3(b)
and (c).
For example, the term “Knowledge Representation” is generally categorised under the
term “Artificial Intelligence”. However, the structure can also be organised from the
“Knowledge Representation” point of view. In other words, the term “Artificial
Intelligence” can be organised under the term “Knowledge Representation”, as shown in
Example 5 of Figure 7.3(b).
Of course, in a hierarchical approach it is also possible to organise one term into a
number of clusters. However, to keep consistency, the relationships between these
clusters must be specified and maintained manually by human experts, which is not an
easy task, and the problem is exacerbated as the knowledge base grows. From this
point of view, the concept lattice has advantages over the hierarchical approach:
FCA formulates all possible relationships between terms automatically as the
knowledge base is updated, while maintaining its consistency.
A more critical advantage of lattice browsing is that it allows one to reach a group of
documents via one path and then, rather than going back up the same hierarchy and
guessing another starting point, move to one of the other parents of the present
node as a way of navigating across the domain.
For example, suppose that a user finds “Data Mining” under “Artificial Intelligence”,
noticing that there are 7 researchers in this area as shown in Example 6 of Figure 7.3(c).
This node has 2 parents and so the lattice view makes it obvious that there are in fact 17
researchers in the School who do research in “Data Mining” as in Example 6 of Figure
7.3(c). If the user goes up to this node, the user then finds that there are nodes with
“Data Mining” and “Databases” (see Example 7 and Example 8). The user can then
navigate down to these nodes populated by researchers whose more generic interest is
databases. These researchers tend to focus on data mining with database applications or
database techniques such as association rules, while the AI data-miners tend to use
techniques developed in machine learning. There are also other research areas and
researchers associated with the term “Data Mining” besides these two groups.
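This style of lateral movement relies on the system being able to compute a node's immediate parents. One simple (hypothetical) way to do this is by extent inclusion; the concept names and page sets below are invented stand-ins for nodes like those in Figure 7.3(c):

```python
# Sketch: find the immediate parents of a lattice node by extent inclusion.
# The concepts (extents over 8 invented pages a..h) are illustrative only.

concepts = {
    "root": set("abcdefgh"),
    "AI": set("abcde"),
    "DataMining": set("defg"),
    "AI+DataMining": set("de"),
    "Databases": set("fgh"),
    "DataMining+Databases": set("fg"),
}

def parents(name):
    """Immediate superconcepts: strictly larger extents with none in between."""
    ext = concepts[name]
    uppers = {n: e for n, e in concepts.items() if ext < e}
    return [n for n, e in uppers.items()
            if not any(e2 < e for e2 in uppers.values() if e2 != e)]

# From "AI+DataMining" a user can move up to either parent, then down into
# "DataMining+Databases" without restarting from the root:
print(sorted(parents("AI+DataMining")))         # ['AI', 'DataMining']
print(sorted(parents("DataMining+Databases")))  # ['DataMining', 'Databases']
```

The multi-parent list is exactly what makes the lateral navigation described above possible: each parent is an alternative route back into a different region of the domain.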
According to our observations of the log of users’ search behaviour, as expected, most
users navigated across the lattice, alternately traversing up different parent concepts
and down different child concepts. Generally, users started browsing from a very
general term. They then selected a more specific term of interest that co-occurs with the
general term. If a branch centred on the specific term existed in the lattice, they then
navigated the branch of the specific term.
For example, suppose that a user starts navigation from the term “Databases” and
selects a more specific term “Semantic Web”, a sub-concept of the term “Databases” in
the lattice. The user then looks at the search result that displays researchers who are
doing research on Databases and the Semantic Web. Then the user usually browses the
concept “Semantic Web” as the concept “Databases, Semantic Web” has two parents -
“Databases” and “Semantic Web” in the lattice.
From this point of view, the lattice-browsing scheme clearly has advantages over the
hierarchical approach where a user simply goes back to the top and starts again. In fact,
the hierarchical tree structure, in which each cluster has exactly one parent, is embedded
in this lattice structure. Furthermore, as there is a range of views on what an optimal
taxonomy might be, use of a lattice approach avoids having to commit to any one
taxonomy. The actual preferred usage of terms emerges rather than being prescribed.
7.2.3.2. Survey: Questionnaire on Lattice-based Browsing
This section presents the evaluation data from the on-line Web-based survey which was
carried out on lattice-based browsing. Figures 7.4 and 7.5 show the survey questions.
The questionnaire was implemented using standard HTML forms and JavaScript that let
users click on radio buttons, check boxes and enter text and comments into text areas.
The implementation style is the same as for the questions on the annotation
mechanisms.
Figure 7.4. The first and second questions used in the survey of lattice-based browsing.
Figure 7.5. The third and fourth questions used in the survey of lattice-based browsing.
Purpose of the survey
The objective of this survey was to evaluate the efficiency of lattice-based browsing
from the users’ point of view. In addition, the survey aimed at revealing user
preferences for search methods in domain-specific document retrieval.
Methods
The questionnaire was made available when the system was deployed on the School
Web site. There were links to the questionnaire on the browsing pages of the system.
E-mails were sent to the researchers in the School inviting them to use the system and to
complete the survey. To obtain feedback from outside users, the link to the questionnaire
in the browsing pages was highlighted. The data was collected by a CGI program.
The questionnaire contained 16 questions in four parts. The first part of the
questionnaire identified the purpose of using the system. The second part aimed at
investigating the retrieval performance of the system. The third part aimed at identifying
user preferences for search methods for a specialised domain (Boolean search,
hierarchical browsing and lattice-based browsing). The last part measured user
satisfaction with the system performance and the user interface. Most questions used a
five-point Likert scale to measure users’ views; the other questions used a check-box
format to allow multiple answers.
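The mean ratings quoted in the table notes below (e.g. 3.76) can be reproduced from the response counts. A small sketch, using the counts reported for the "number of steps" item of Table 7.14:

```python
# Mean rating on a five-point Likert scale from response counts at points 1..5.

def likert_mean(counts):
    """counts[i] is the number of responses at scale point i+1."""
    total = sum(counts)
    return sum(point * c for point, c in enumerate(counts, start=1)) / total

# Counts for "Number of steps to get your result" (Table 7.14): 0, 3, 7, 24, 4.
print(round(likert_mean([0, 3, 7, 24, 4]), 2))  # 3.76, matching the table note
```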
Results
There were 40 questionnaires filled in. Table 7.12 shows the respondents’ information.
Most of the respondents were researchers and current or prospective research students at
UNSW (i.e., (1) + (2) + (4) + (5) in Table 7.12). Only one respondent was an outside
user from industry, but all respondents were affiliated with information technology.
Table 7.12. The respondents’ information.
(1) A current research student in information technology: 19
(2) A prospective research student: 2
(3) An industry person: 1
(4) From CSE or EE at UNSW: 17
(5) From elsewhere at UNSW: 1
Total: 40
Table 7.13 shows the main purpose for using the system. Thirty-two of the respondents
(80%) were looking at the lattice-based browsing mechanism. As well, 65% were
looking for specific research areas and 55% were browsing the School of Computer
Science and Engineering for study, research and collaboration opportunities. Note that
the respondents were allowed to choose multiple items for this question.
Table 7.13. The purpose of the use of the system.
Q1. Would you describe your reasons for using this system? (1 = disagree … 5 = agree)
(1) Looking for a specific research area: 1: 3 (7.5%), 2: 4 (10%), 3: 7 (17.5%), 4: 17 (42.5%), 5: 9 (22.5%)
(2) Browsing CSE for study, research and collaboration opportunities: 1: 4 (10%), 2: 7 (17.5%), 3: 7 (17.5%), 4: 9 (22.5%), 5: 13 (32.5%)
(3) Trying to get an overall impression of CSE research capability: 1: 1 (2.5%), 2: 14 (35%), 3: 11 (27.5%), 4: 8 (20%), 5: 6 (15%)
(4) Having a look at a lattice-based browsing mechanism: 1: 0 (0%), 2: 4 (10%), 3: 4 (10%), 4: 18 (45%), 5: 14 (35%)
(5) Other: N/A (No Answer)
Table 7.14 summarises the questionnaire results on the retrieval performance of the system.
Thirty-eight participants (95% of the respondents) replied that they succeeded in finding
what they were looking for, whereas two participants (5%) replied that they failed.
Six individuals among the respondents who chose “YES” also
answered sub-question 3, “What were the reasons (for failure)?”, as shown in Table 7.15.
This means that these six respondents (15%) may have found some relevant information,
but not all they expected.
Table 7.14. The questionnaire results on retrieval performance.
Q2. Did you find what you were looking for? (Yes: 38, No: 2)
If Yes (Responses: 38)
1. What did you find? (No. of cases / Percentage)
(1) An individual researcher and his/her research areas: 33 (87%)
(2) A broader group of researchers: 17 (45%)
(3) Some interesting cross-disciplinary areas: 13 (34%)
2. How did you find it?
(1) I used mainly (1 = search terms … 5 = browsing): 1: 1 (2.5%), 2: 4 (10.5%), 3: 8 (21%), 4: 17 (45%), 5: 8 (21%)
(2) Number of steps to get your result (1 = many steps … 5 = few steps): 1: 0 (0%), 2: 3 (8%), 3: 7 (18.5%), 4: 24 (63%), 5: 4 (10.5%)
If No (Responses: 8; Yes: 6, No: 2)
3. What were the reasons? (No. of cases)
(1) I had a pretty thorough search so I think the area is not covered: 4
(2) The keywords available for browsing were not appropriate: 3
(3) The browsing is too unstructured and I got lost: 2
(4) I was unfamiliar with how to use this system: 3
Note: mean ratings 2.(1) = 3.68, 2.(2) = 3.76.
Table 7.15. A cross table of respondents and the reasons they failed in retrieval.
Reasons: (1) Not covered; (2) Keywords not appropriate; (3) Browsing unstructured; (4) Unfamiliar. A ✓ marks each reason a respondent selected.
Respondent1 (Find: No): ✓ ✓
Respondent2 (Find: No): ✓ ✓
Respondent3 (Find: Yes): ✓
Respondent4 (Find: Yes): ✓
Respondent5 (Find: Yes): ✓
Respondent6 (Find: Yes): ✓
Respondent7 (Find: Yes): ✓ ✓ ✓
Respondent8 (Find: Yes): ✓
Looking more closely at the reasons for failure in sub-question 3 (Table 7.15), four
respondents attributed their failure to being “unfamiliar with how to use the
system” and “looking for an uncovered research area” (respondents 3, 4, 6 and 8). On the
other hand, another four respondents (10%) experienced a failure in finding documents due
to “inappropriate keywords available for browsing” or “unstructured browsing”
(respondents 1, 2, 5 and 7). These results seem to reveal a need to develop a
mechanism to refine a concept lattice in a more structured way for browsing. However,
as observed above, 95% of the respondents replied that they succeeded in their retrieval.
It can therefore be concluded that the search performance of the system is
reasonably efficient from the users’ point of view.
The majority of the respondents (87%)78 indicated that they found individual
researchers and their research areas, as shown in Figure 7.6. Seventeen participants
(45%) indicated that they found a broader group of researchers, and thirteen respondents
(34%) found some interesting cross-disciplinary areas. Note that the respondents were
allowed to choose multiple items for this question. These results indicate that the
concept lattice not only forms clusters that locate documents at their proper position, but
also formulates interesting inter-relationships among the documents and their concepts
(for example, a group of researchers and cross-disciplinary areas). This finding seems to
suggest that related documents were found by browsing laterally across the lattice, and
demonstrates the power of lattice-based browsing.
Figure 7.6. The questionnaire results on “What did you find?”.
78 Thirty-three of 38 who answered “YES” for Q2.
[Bar chart: percentage of respondents who found an individual researcher and his/her research areas; a broader group of researchers; some interesting cross-disciplinary areas.]
For the first part of sub-question 2 (How did you find it? - “search terms”
to “browsing” in Table 7.14), 66% of the respondents indicated that they mainly used
browsing (i.e., rated as 4 or 5 on the 5-point scale), whereas 13% mainly used search
terms. Another 21% used both. From the log of users’ search activities, we observed
that some respondents who used the topic list replied that they used search terms in
finding documents.
According to the log of users’ searches, there were a number of typical patterns of
user behaviour when looking for documents with the system:
(1) Selecting one topic among the topics listed first and browsing the lattice starting
from the node found by the selected topic or from the root of the lattice (iteratively
and repeatedly) - 47%.
(2) Entering a search term in the Boolean query interface first and browsing the lattice
starting from the node found by the search term (iteratively and repeatedly) - 2%.
(3) A combination of (1) and (2) (i.e., alternately selecting a topic or entering a search
term, and browsing the lattice) - 51%.
Most instances of search behaviour pattern (1) occurred during the annotation process,
when researchers would be checking their own research topics (i.e., viewing the
positioning of their topics in the lattice). Most genuine instances of search involved
the combination pattern (3). Searching was an iterative process - formulating a query,
browsing the lattice looking for search results, and changing query terms. Users did
not simply enter a single term and look only at its search result.
For the second part of sub-question 2 (Number of steps to get your result - “many steps”
to “few steps” in Table 7.14), twenty-eight participants (73.5% of the respondents)
replied that few steps were taken to obtain a result. Seven participants (18.5%) rated this
question as 3. Only 8% replied that many steps were taken to get a result. Thus, it can
be said that the number of steps taken to find a search result was reasonable.
Table 7.16 shows the cross-distribution of responses for the search methods used
and the number of steps taken to get a result. Most respondents who rated 2, 3, 4 or 5
for the question “(1) I used mainly ‘search terms’ - ‘browsing’” indicated that they took
few steps when looking for documents. These results seem to show that both
search methods provided in the system, Boolean query and lattice-based browsing,
were efficient. Of those who used mostly browsing (i.e., rated as 5), 75% replied that
they took few steps in finding documents. Of those who used mainly browsing (i.e.,
rated as 4), 82% regarded the steps taken as few. We can therefore conclude that the
browsing search performance was quite efficient; however, there is no evidence that
browsing is more efficient than using search terms (i.e., Boolean query)79.
Table 7.16. Cross-distribution between the search methods used and the number of steps taken.
Rows: (2) Number of steps (1 = many steps … 5 = few steps). Columns: (1) I used mainly (1 = search terms … 5 = browsing).
Steps 1: 0 (0%), 0 (0%), 0 (0%), 0 (0%), 0 (0%); N = 0 (0%)
Steps 2: 1 (100%), 0 (0%), 1 (12.5%), 1 (6%), 0 (0%); N = 3 (8%)
Steps 3: 0 (0%), 1 (25%), 2 (25%), 2 (12%), 2 (25%); N = 7 (18.5%)
Steps 4: 0 (0%), 2 (50%), 4 (50%), 13 (76%), 5 (62.5%); N = 24 (63%)
Steps 5: 0 (0%), 1 (25%), 1 (12.5%), 1 (6%), 1 (12.5%); N = 4 (10.5%)
N(1): 1 (2.5%), 4 (10.5%), 8 (21%), 17 (45%), 8 (21%); Total: 38 (100%)
Table 7.17 presents the questionnaire results on user opinion of search methods for
domain-specific document retrieval. Twenty-five of the respondents (65%) considered
Boolean queries and hierarchical browsing helpful for searching a specialised
domain. For lattice-based browsing, 90% of the respondents regarded it as helpful. Note
that no one rated lattice-based browsing as 1 or 2. The calculated chi-square value for
Table 7.17 is statistically significant (χ² = 13.95 at 4 degrees of freedom)80, indicating
that there is a relationship between the search methods and helpfulness. As a consequence,
it can be said that lattice-based browsing was regarded as a more helpful search method
for domain-specific document retrieval than Boolean query and hierarchical browsing.
79 Godin et al. (1993), and Carpineto and Romano (1995; 1996b) evaluated search performance by
comparing these two methods. Our experiments have not focussed on this comparison.
80 This is greater than the critical chi-square values at the 95 percent confidence level (9.488) and the 99
percent confidence level (13.277). A more detailed chi-square matrix is given in Appendix 2.
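The critical values quoted in footnote 80 can be checked directly. For an even number of degrees of freedom 2k, the chi-square upper-tail probability has a closed form, so a short sketch (not from the thesis) suffices:

```python
# Upper-tail probability of the chi-square distribution for even df = 2k:
# P(X > x) = exp(-x/2) * sum_{i<k} (x/2)^i / i!

import math

def chi2_sf(x, df):
    """Chi-square survival function, valid for even degrees of freedom."""
    k = df // 2
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(k))

print(round(chi2_sf(9.488, 4), 3))   # ~0.05: the 95% critical value
print(round(chi2_sf(13.277, 4), 3))  # ~0.01: the 99% critical value
print(chi2_sf(13.95, 4) < 0.01)      # the observed statistic is significant
```

Since the observed 13.95 exceeds both critical values, the association between search method and perceived helpfulness is significant even at the 99 percent level.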
Many users (at least 65%) also thought that Boolean queries and hierarchical browsing
would be helpful for searching such a domain, as shown in Table 7.17. However, it is
not clear whether users needed or wanted to use both methods, or whether they were
simply used to searching with search terms.
Table 7.17. User opinion on search methods for domain-specific document retrieval.
Q3. Please give your opinion for searching this sort of a domain (1 = unhelpful … 5 = helpful)
(1) Entering search terms - Boolean query: 1: 0 (0%), 2: 5 (12.5%), 3: 9 (22.5%), 4: 19 (47.5%), 5: 7 (17.5%)
(2) Hierarchical browsing - tree structure: 1: 0 (0%), 2: 1 (2.5%), 3: 13 (32.5%), 4: 19 (47.5%), 5: 7 (17.5%)
(3) Lattice-based browsing - network structure: 1: 0 (0%), 2: 0 (0%), 3: 4 (10%), 4: 25 (62.5%), 5: 11 (27.5%)
Note: mean ratings (1) = 3.7, (2) = 3.8, (3) = 4.18.
Table 7.18 shows the cross-distribution between lattice-based browsing and hierarchical
browsing choices. Of those who selected lattice-based browsing as a helpful search
method in a specialised domain (i.e., rated it as 4 or 5), 69% indicated that
hierarchical browsing would also be helpful. Of those who responded that they mainly
used lattice browsing in finding documents (sub-question 2 of Q2 in Table 7.14), 64%
considered that hierarchical browsing would also be a helpful search method.
Table 7.19 shows the cross-distribution between lattice-based browsing and Boolean
query choices. Of those who selected lattice-based browsing as helpful (i.e., rated as 4
or 5), 67% thought that Boolean query would also be helpful.
Therefore, we conclude that more diverse search interfaces and methods, combined with
lattice-based browsing, are probably appropriate to meet users’ different preferences.
Table 7.18. Cross-distribution between lattice-based and hierarchical browsing choices.
Q3. Please give your opinion for searching this sort of a domain.
Rows: (2) Hierarchical browsing (1 = unhelpful … 5 = helpful). Columns: (3) Lattice-based browsing (1 = unhelpful … 5 = helpful).
Hierarchical 1: 0 (0%), 0 (0%), 0 (0%), 0 (0%), 0 (0%); N = 0 (0%)
Hierarchical 2: 0 (0%), 0 (0%), 0 (0%), 0 (0%), 1 (9%); N = 1 (2.5%)
Hierarchical 3: 0 (0%), 0 (0%), 3 (75%), 8 (32%), 2 (18%); N = 13 (32.5%)
Hierarchical 4: 0 (0%), 0 (0%), 1 (25%), 17 (68%), 1 (9%); N = 19 (47.5%)
Hierarchical 5: 0 (0%), 0 (0%), 0 (0%), 0 (0%), 7 (64%); N = 7 (17.5%)
N(3): 0 (0%), 0 (0%), 4 (10%), 25 (62.5%), 11 (27.5%); Total: 40 (100%)
Table 7.19. Cross-distribution between lattice-based browsing and Boolean query choices.
Rows: (1) Boolean query (1 = unhelpful … 5 = helpful). Columns: (3) Lattice-based browsing (1 = unhelpful … 5 = helpful).
Boolean 1: 0 (0%), 0 (0%), 0 (0%), 0 (0%), 0 (0%); N = 0 (0%)
Boolean 2: 0 (0%), 0 (0%), 1 (25%), 4 (16%), 0 (0%); N = 5 (12.5%)
Boolean 3: 0 (0%), 0 (0%), 1 (25%), 8 (32%), 0 (0%); N = 9 (22.5%)
Boolean 4: 0 (0%), 0 (0%), 0 (0%), 13 (52%), 6 (54.5%); N = 19 (47.5%)
Boolean 5: 0 (0%), 0 (0%), 2 (50%), 0 (0%), 5 (45.5%); N = 7 (17.5%)
N(3): 0 (0%), 0 (0%), 4 (10%), 25 (62.5%), 11 (27.5%); Total: 40 (100%)
Table 7.20 shows user opinion on the system performance and the user interface of the
system. Many respondents indicated that they felt the system was fast enough. We
installed an Apache server for Windows NT 4.0 on a personal computer with 64 MB of
RAM and a 200 MHz Pentium processor81, and expected that the system performance
would be slow when connected to the Internet. Sometimes the system performance was
indeed so slow that arrangements were made to install the system on a higher-capacity
computer; surprisingly, however, the results indicate that many respondents were
satisfied with the system performance.
81 As indicated earlier, the system was developed with Java, JavaScript and Java Servlets (Java CGI),
supported by Netscape 4.0 and Internet Explorer 5.0 or higher.
Table 7.20. The questionnaire results on the system performance and the user interface.
Q4. About the system performance and user interface
(1) I feel the system is fast enough (1 = too slow … 5 = fast enough): N/A: 0 (0%), 1: 1 (2.5%), 2: 0 (0%), 3: 12 (30%), 4: 15 (37.5%), 5: 12 (30%)
(2) I feel the user interface is OK (1 = bad … 5 = good): N/A: 0 (0%), 1: 0 (0%), 2: 0 (0%), 3: 8 (20%), 4: 22 (55%), 5: 10 (25%)
(3) The help functions are adequate (1 = bad … 5 = good): N/A: 5 (12.5%), 1: 0 (0%), 2: 1 (2.5%), 3: 13 (32.5%), 4: 16 (40%), 5: 5 (12.5%)
Note: N/A = No Answer; mean ratings (1) = 3.93, (2) = 4.05, (3) = 3.71.
It was important that interface comments were positive. The user interface for lattice-
based browsing under FCA is usually based on the lattice graph itself. However, our
focus was on a Web-based interface using a hypertext representation of the links to a
node in a lattice, without a graphical display of the overall lattice. Only a single node
and its immediate parents and children are displayed; the children and parents are
hypertext links. Even though graphical views of the whole lattice can give interesting
perspectives on the whole domain, they are probably of little interest to someone who
only wants to find a document, and there is an advantage in simplicity.
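The single-node presentation described above can be sketched as follows. The function, URL scheme and node data here are all hypothetical; they only illustrate rendering one node with its parents and children as hypertext links:

```python
# Sketch: render one lattice node as hypertext, showing only its immediate
# parents and children as links (no graphical view of the whole lattice).

from urllib.parse import quote

def render_node(topic, count, parents, children):
    """Return an HTML fragment for a single lattice node."""
    link = lambda t, n: f'<a href="/browse?topic={quote(t)}">{t} ({n})</a>'
    return "\n".join([
        f"<h2>{topic} ({count})</h2>",
        "<p>Parent Topics: " + " ".join(link(t, n) for t, n in parents) + "</p>",
        "<p>Sub Topics: " + " ".join(link(t, n) for t, n in children) + "</p>",
    ])

# Node counts loosely mirror Example 6 of Figure 7.3(c):
html = render_node("Data Mining", 7,
                   parents=[("Artificial Intelligence", 39), ("Data Mining", 17)],
                   children=[("Machine Learning", 5), ("Robotics", 2)])
print(html)
```

Each parent and child link simply re-renders the target node, which is what keeps the interaction close to ordinary Web browsing.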
According to a survey on Internet search engine usage (Chen et al. 2000), the features
ranked as most important were ease of use, accuracy, reliability and speed, with “ease of
use” ranking relatively high compared to the other features. We believe that the hypertext
representation is easy to use and a fairly natural simplification of a lattice for Web users,
and the above results appear to support our belief. It seems to provide the type of
interaction that users are comfortable with on the Web.
Regarding the help function, 52.5% of the respondents indicated that the help function
was good, with another 32.5% giving a neutral response. Only 2.5% replied that the help
function was bad. In addition, 5 participants (12.5% of the respondents) did not answer
this question. They may have thought that no help function was available. One
respondent suggested that some explanation of what lattice-based browsing is in the
opening page would be helpful. Thus, it can be useful to include a page which gives a
more detailed explanation of lattice-based browsing (such as what lattice-based
browsing is, how it works and so on) so that users can gain the best use of it.
In fact, there were no visible help functions given; only a brief explanation of lattice-
based browsing was provided at a link, “About the system”, in the search pages. Hence, it
might be said that only those 5 participants who did not respond (12.5% of the
respondents) gave the right answer. However, the respondents who answered this
question may have felt no need for help functions. They may have thought that the user
interface provided was quite comprehensible and natural to use. Note that no one
indicated that the user interface was bad in the second question of Q4, as shown in Table
7.20. Most respondents replied that the user interface was satisfactory.
In summary, the survey respondents indicated that the search performance of the system
was efficient. As well, the number of steps needed to get a search result was small, or at
least reasonable. The majority of the respondents replied that they found individual
researchers and their research areas. Some indicated that they also found a broader
group of researchers and some interesting cross-disciplinary areas. This seems to
support the hypothesis that the concept lattice not only formulates clusters which locate
documents at their proper position, but also formulates additional inter-relationships
among the documents, establishing research groups and cross-disciplinary areas.
Many respondents indicated that they were satisfied with the system performance as
well as the user interface. The user interface appears to have been fairly comprehensible
and natural to use for Web users, because many respondents considered the help
functions adequate even though no concrete help functions were given.
Lattice-based browsing was considered a more helpful search method for domain-
specific document retrieval than Boolean query or hierarchical browsing. Nevertheless,
many respondents considered that not only lattice-based browsing, but also Boolean
query and hierarchical browsing, would be helpful for searching such a domain. This
seems to suggest that more flexible combination techniques are desirable to meet the
different needs of users.
On the other hand, there seems to be a requirement to refine a concept lattice in a more
structured way with more appropriate terms, as some of the respondents indicated that
they had experienced failure in finding documents due to inappropriate keywords
available for browsing or unstructured browsing. However, this does not negate the
statistical conclusion that lattice-based browsing is more helpful than Boolean queries
and hierarchical browsing. Our conclusion is that lattice-based browsing is a step
forward but does not fully overcome all the problems in searching for documents.
7.3. Chapter Summary
This chapter presented the experimental results on the proposed system for the domain
of research interests in the School of Computer Science and Engineering, UNSW. The
annotation mechanisms available to annotators provided a good level of assistance. The
most interesting result suggested that although an established external taxonomy could
be useful in suggesting annotation terms, small communities appeared to have little
interest in adhering to “standard” hierarchical structures. This result confirms one of our
motivations for this thesis – it is advantageous to allow users in small communities very
flexible access to established ontological resources, so the most appropriate use for the
community can emerge.
A browsing structure that evolved in an ad hoc fashion provided good efficiency in
search performance. In addition, lattice-based browsing was considered a more helpful
method than Boolean query or hierarchical browsing for searching a specialised
domain, and many users were satisfied with the system performance as well as the
browsing interface. The experimental results seem to support the hypothesis that
lattice-based browsing is more powerful than the hierarchical approach: where a
pathway traversed through the lattice found only a small number of documents, other
related documents were reachable from other research areas. The lattice also showed
users alternative pathways to a node, not yet navigated, which might lead to documents
of interest.
Chapter 8
Discussion and Conclusion
This chapter presents a summary of the thesis. We then conclude with a discussion of
possible future directions of the research presented in this thesis.
8.1. Motivation
Most work on document management and retrieval intended for Web-based documents
focuses on either improved search engines or ontology development. The assumption
behind the ontology approach seems to be that since communities do communicate, there
must be a consensus about terms, so the main task is to identify and formalise this
consensus. This will then result in a standard ontology, and anyone wishing to
communicate in a domain will be keen to use the standard ontology and reap the benefit
that documents using it will be much more readily retrieved and used by others.
We have no dispute with this, except for the lack of focus on how the consensus
represented by an ontology emerges and evolves. For example, the classic paper by
Shaw (1988) on the use of terms in geology showed that, left to their own devices,
geologists described geology in quite different terms, and that they disagreed about and
misunderstood the sets of terms they had independently developed. However, she
concluded that despite this apparent confusion, the geologists had little trouble
understanding each other when working together. This suggests that attempting to
reach a consensus that everyone agrees with, and then works within, will be difficult. In
this regard, it appears that the working group on an upper-level ontology has effectively
broken up in disarray, unable to agree on an ontology (Gangemi et al. 2002). In fact,
some ontology researchers such as Gangemi et al. see their
goal as developing formalisms to assist in designing proper ontologies, but expect these
to be disposable, their only value being in how much they are used.
Our approach has been rather to look for tools and techniques by which a de-facto
consensus might emerge and might evolve further. The main application of this is for
small groups and communities. Any such tools would need excellent browsing,
particularly to find related but unexpected concepts, and facilities to assist users to re-use
terms used by others without constraining them.
The system developed to meet these goals uses Formal Concept Analysis (FCA) to
support flexible browsing and has a number of mechanisms to encourage others to re-
use terms. In particular, it encourages users to select terms from external ontologies,
where such exist.
8.2. Summary of Results
We implemented this system and carried out an evaluation, using it to assist
prospective students and others to find research supervisors and collaborators in the
School of Computer Science and Engineering, UNSW. We logged users’ actions in
browsing and annotating, and some users filled in a Web questionnaire.
8.2.1. Annotation Mechanisms
A number of knowledge acquisition techniques and tools were developed to suggest
possible annotations. The survey results indicate that the various annotation tools
assisted the users in defining their research topics so that the lattice-based browsing
structure that evolved in an ad hoc fashion was organised into a reasonable consensus
with good efficiency in retrieval performance. In particular, it should be noted that there
was no training or help provided for the users; the aim was to have a self-explanatory
system. However, we do not have results that demonstrate that the users were more
effective in annotating their pages with these mechanisms than without.
8.2.2. Lattice-based Browsing
It was clear from the results that lattice-based browsing has an advantage over a
hierarchical approach. If one fails to find the appropriate document, one can ascend
towards the top of the lattice by another pathway. In other words, lattice browsing
allows one to reach a group of documents via one path, but then, rather than going back
up the hierarchy and guessing another starting point, one can go to one of the other
parents of the present node as a way of navigating across the domain. The critical
problem with hierarchical browsing is that a user who does not find the required
document will not be sure what to do next: the user has already made his or her best
guesses at the various decision points. Lattice-based browsing shows the user alternative
pathways to a node, not yet navigated, which may lead to documents of interest.
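The navigational difference can be sketched with a toy lattice fragment. The node names and parent lists below are hypothetical, invented for illustration, not taken from the system:

```python
# Hypothetical fragment of a concept lattice, stored as node -> parent nodes.
# Unlike a tree, a node may have several parents (more general concepts).
PARENTS = {
    "ai, ml, ir": ["ai, ml", "ml, ir"],
    "ai, ml": ["ai", "ml"],
    "ml, ir": ["ml", "ir"],
    "ai": ["top"],
    "ml": ["top"],
    "ir": ["top"],
    "top": [],
}

def alternative_parents(node, came_from):
    """Parents of the current node other than the one just traversed down from."""
    return [p for p in PARENTS[node] if p != came_from]

# A user who reached "ai, ml, ir" by descending through "ai, ml" can continue
# sideways via "ml, ir" instead of backtracking to the top and guessing again.
print(alternative_parents("ai, ml, ir", "ai, ml"))  # ['ml, ir']
```

In a strict hierarchy every node has exactly one parent, so `alternative_parents` would always return an empty list; the multiple-parent structure is precisely what supports sideways navigation.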
The survey results and examples shown support this hypothesis and demonstrate that
lattice browsing can help the user find both what they are looking for and also
interesting related documents.
8.2.3. Web-based System
Another emphasis of our approach was a Web-based system using a hypertext
representation of the links to a node in a lattice without a graphical display of the overall
lattice. We focused on simplicity and familiarity for Web users. The survey results
indicate that the Web implementation we used provides a fairly easy environment for
users who are familiar with the Web. In contrast, although graphical views of the whole
lattice may give an interesting perspective on the overall domain, they are probably of
little use to someone who is interested in finding a document.
8.2.4. Imported Ontologies
The one area where the way users behaved produced a strong result was the use of
imported ontologies. We were able to demonstrate that users did use external
ontologies, but did so very selectively. The results suggest that external ontologies
are of value as a resource, but that in small communities and specialised domains people
prefer to pick and choose what is of value from them. It seems likely that within a
small community, even a quite diverse community, selective use will be made of a more
global ontology, and this usage pattern can itself become a useful ontology for other
groups.
It should be noted that in this particular application one would have expected the
taxonomies used to have a reasonable match with the group annotating documents. So
the usage of terms probably says something about the relevance of these terms to the
task of identifying the research interests of an individual. For example, the term
“knowledge engineering”, although seemingly a useful concept, probably gives
prospective students and collaborators little idea of the particular style of research carried
out, since some very different areas of research fall under this term. Hence,
although it was suggested by the external ontologies, it was little used. On the other
hand, “knowledge representation” covers a much more coherent style of research.
8.3. Expectations for Other Domains
A major issue is whether the experience we have had in this domain will apply to other
domains; that is, will the complexity of the lattice become too great in other domains?
In a lattice structure, all possible document subsets can in principle produce an
exponential number of lattice nodes. However, Godin et al. (1986) showed that the size
of a lattice is linearly bounded by the number of documents when the number of terms
per document has an upper bound, which is usually the case in practical applications;
i.e., |H*| ≤ 2^k n, where |H*| is the number of formal concepts, n is the number of
documents and k is the mean number of terms per document. In practice, |H*| / n is
much smaller than the upper bound 2^k. In fact, their experiments in several domains
showed that in every application, |H*| / n ≤ k. For example, in one of the experiments,
3042 documents with an average of 11.1 terms per document produced 23471 lattice
nodes; the average number of lattice nodes per document, 7.7 (23471/3042), is much
less than the upper bound 2^11.1. In our experiment, a document is associated with 5.9
concept nodes (471/80), with an average of 7.97 terms
for a homepage. Godin et al. (1986) also reported that the average search time with the
hierarchical method (3.90 min) was slightly better than with the lattice method (3.95
min), but the difference is not significant. Hence, we anticipate that the complexity of
the lattice structure will not be a significant problem. In addition, one can surmise that
since single inheritance hierarchies work reasonably well, without massive repetition of
concepts in different parts of the hierarchies, a lattice is unlikely to have a very high
degree of interconnection. That is, although the complexity is potentially great, there is
no evidence that the world is such that this will occur.
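Godin et al.'s bound can be checked directly on a toy formal context by brute-force enumeration of all formal concepts. The documents and terms below are invented for illustration:

```python
from itertools import combinations

# Invented toy context: documents (objects) annotated with terms (attributes).
CONTEXT = {
    "doc1": {"ai", "ml"},
    "doc2": {"ai", "kr"},
    "doc3": {"ml", "ir"},
    "doc4": {"ai", "ml", "ir"},
}
ALL_TERMS = set().union(*CONTEXT.values())

def intent(docs):
    """Terms shared by every document in the set (all terms for the empty set)."""
    common = set(ALL_TERMS)
    for d in docs:
        common &= CONTEXT[d]
    return frozenset(common)

def extent(terms):
    """Documents carrying every term in the set."""
    return frozenset(d for d, t in CONTEXT.items() if terms <= t)

# Every formal concept arises as (extent(I), I) where I is the intent of some
# document subset, so brute-force enumeration over subsets suffices here.
concepts = set()
for r in range(len(CONTEXT) + 1):
    for subset in combinations(CONTEXT, r):
        i = intent(subset)
        concepts.add((extent(i), i))

n = len(CONTEXT)                               # number of documents
k = sum(len(t) for t in CONTEXT.values()) / n  # mean terms per document
print(len(concepts), len(concepts) / n, k)     # |H*|, |H*|/n, k
```

For this tiny context the enumeration yields 8 concepts, so |H*|/n = 2.0, below the mean k = 2.25 and far below the worst-case bound 2^k per document, matching Godin et al.'s observation in miniature.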
However, in the limit, it is possible that the lattice will become too complex.
Nevertheless, we do not expect this to occur, as we believe that any co-operative
effort in lattice building will converge towards a system that is reasonably useable.
Since the annotator is highly likely to use the lattice first, and is also prompted with the
terms already in use, annotations are likely to converge on a reasonable size. There is a
tension for users between adding enough annotation terms to distinguish a document
from all other documents, and leaving the lattice sufficiently useable that the document
can be found.
The use of the ontology editor and the role of some sort of lattice manager (knowledge
engineer) may also be significant. If the initial concepts added are too fine-grained, the
ontology editor is likely to be used to produce conceptual scales that group the refined
concepts into a broader concept. A further extension would be for the manager to
hide very refined concepts; however, we have not yet included this facility.
As well, there are further possibilities of developing separate sub-lattices if the
community takes on a domain that is too big for a single lattice. Again this could be
achieved using conceptual scales.
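At its simplest, a conceptual scale of this kind is just a mapping from fine-grained terms to a broader group. The terms and group name below are hypothetical:

```python
# Hypothetical conceptual scale: fold fine-grained annotation terms into a
# broader concept so the browsing lattice stays a manageable size.
SCALE = {
    "neural networks": "machine learning",
    "decision trees": "machine learning",
    "case-based reasoning": "machine learning",
}

def rescale(annotations):
    """Replace each fine-grained term by its group where the scale defines one."""
    return {SCALE.get(term, term) for term in annotations}

print(rescale({"neural networks", "decision trees", "databases"}))
# the two fine-grained ML terms collapse into the single broader node
```

Applying such a scale before lattice construction merges the nodes generated by the fine-grained terms, which is how a lattice manager could rein in an over-refined structure, or carve out a sub-lattice for part of a larger domain.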
8.4. Future Work
Although we believe FCA is a useful way of supporting the flexible open management
of documents, there are a number of areas of further development that need to be
explored.
8.4.1. Ontologies
The most interesting result from our study was in the use of the imported taxonomies.
However, there are a number of areas requiring further research for improvement of our
approach regarding ontologies.
Importing Standard Format Ontologies: First of all, it will be essential to deal with
Web-based ontology representation languages such as XML, OIL, DAML or
DAML+OIL instead of the proprietary text formats used currently. Annotation
mechanisms which commit to ontologies mostly use one of these representation
languages. The main aim of using these languages is to facilitate the sharing of
information between communities (or agents) as well as between individuals within
the communities. A Web Ontology Working Group has also been organised to construct
a standard Web Ontology Language (OWL) for the Semantic Web.
Note that the FCA community has founded the Tockit project82 “ to write a framework
for Conceptual Knowledge Processing” (http://tockit.sourceforge.net/tockit/index.html,
2003). The aims of this project include defining an XML standard for Formal Concept
Analysis and ontology-guided document retrieval.
We will move to a system whereby the user can simply specify any URL from where an
ontology in a standard format can be imported and used in the system.
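As a rough sketch of what such an import might look like, the snippet below pulls candidate annotation terms out of a small XML-serialised taxonomy. The element and attribute names are invented for illustration; a real OIL or DAML+OIL file would need namespace-aware RDF handling:

```python
import xml.etree.ElementTree as ET

# Stand-in for an ontology file downloaded from a user-supplied URL.
ONTOLOGY_XML = """
<ontology>
  <class name="Artificial Intelligence">
    <class name="Machine Learning"/>
    <class name="Knowledge Representation"/>
  </class>
</ontology>
"""

def candidate_terms(xml_text):
    """Collect every class name, in document order, as a suggested annotation term."""
    root = ET.fromstring(xml_text)
    return [elem.get("name") for elem in root.iter("class")]

print(candidate_terms(ONTOLOGY_XML))
```

The nesting of classes also carries the hierarchy information needed to suggest broader or narrower terms, not just a flat term list.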
Importing Conceptual Scales: Another area requiring further attention is related to
importing the ontological structure of a domain into conceptual scales. We have adapted
conceptual scaling of FCA to scale up the browsing structure of the system with
ontological information where readily available such as author, person, academic
position, research group and so on. These correspond to the type of more structured
ontology information used in the system such as KA2. We included such information
for interest in this study. However, ideally we would wish to derive conceptual scales from
imported ontology. The use of these scales could be automated if the document was
appropriately marked up according to the ontology. This would give us a system that
82 http://tockit.sourceforge.net/ (2003).
was flexible and open, but also had the type of ontological commitment represented by
the KA2 initiative. It will be interesting to examine the trade-offs in allowing such
requirements to emerge rather than anticipating them and also the relative costs in
marking up documents rather than providing information to a server.
Ontology Editing: We have implemented a tool which allows a knowledge engineer (or
user) to identify abbreviations, synonyms or groupings. The group hierarchy is then
used for conceptual scales. The tool is a fairly simple and standard editor, but it allows
only a single inheritance hierarchy in each grouping, so an extension to the tool
may be required to handle more complex ontologies. Note that there are a
number of well-established ontology editors in the literature such as Protégé83, OilEd84
and OntoEdit85.
As a possible alternative for handling more complex ontologies, a user could be allowed
to group ontological attributes (i.e., groupings) and also to build a hierarchy of groups
of groupings. The groupings can be named. This would be a hierarchical representation
of an “ontology”, similar to the way a browsing ontology is presented in the KA2
initiative. These “ontologies” would be constructed on the fly, but stored for future use
if required. The user would be free to select any one of these ontologies to interact with
the system, and use this interaction to move to a particular sub-lattice. In this area it
may be possible to use some form of machine learning to select likely candidate nodes
for grouping together.
Multiple Ontology View: We have imported a number of taxonomies (i.e., ACM,
ASIS&T and Open Directory Project) from commonly available Web sites to suggest
possible annotations. Another aim of importing ontologies is to give a different lattice
view based on one of these taxonomies. This is to assist people who would be expected
to have a more superficial knowledge of the terms used for document annotation based
on a certain taxonomy. As ontology representation standards become better established,
83 http://protege.stanford.edu/ (2003).
84 http://oiled.man.ac.uk/ (2003).
85 http://www.ontoknowledge.org/tools/ontoedit.shtml (2003).
importing taxonomies and using them to give a different lattice view should only
require entering a URL. As well, different users may develop different ontological
views using conceptual scales. These would also be able to be selected by other users.
Exporting Ontologies: The various ontological views that are developed would also be
exported using standard ontology formats. The current system keeps the annotation
separated from the actual documents. If desired, documents can be copied with their
annotation added according to standard mark-up languages and exported for use with
other software.
8.4.2. Annotation Support
The system supports a number of annotation tools to assist users to easily annotate their
documents and to be able to reuse terms used by others, and those imported from
taxonomies. However, it would also be useful to use machine learning or natural
language processing techniques to identify “key terms and phrases” and/or “ontological
attributes” from the annotated document. These suggestions could be integrated with
the tools that the current system uses. Note that some studies (Aussenac-Gilles et al.
2000; Maedche and Staab 2000; Handschuh et al. 2002) have investigated automated or
semi-automated ways of discovering the appropriate ontology for a document.
Another issue that we anticipate is the handling of documents which consist of several
sections that are about different topics. In the current system, each document is
represented as a bag of words (i.e., a set of keywords) which is simply accumulated
across the different sections.
8.4.3. Integration with Other Techniques
As described earlier, we have integrated a number of information retrieval techniques
into lattice browsing. Firstly, a Boolean query interface is combined with the FCA
browsing interface with some normalisation techniques such as eliminating stopwords,
stemming and expanding user queries based on synonyms and abbreviations. Secondly,
a textword search is supported when the system fails to get a result from the lattice and
is used to show a sub-lattice. However, it is also likely that there are a number of areas
where our approach may be integrated with general Web search engines and their
techniques.
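The query-normalisation steps named above can be sketched as follows. The stopword list, synonym table and suffix-stripping rule are simplified stand-ins for the system's actual resources; a real deployment would use a proper stemming algorithm such as Porter's:

```python
STOPWORDS = {"the", "of", "in", "and", "for", "a", "an"}
# Abbreviations/synonyms expanded before matching lattice terms (illustrative).
SYNONYMS = {"ir": "information retrieval", "kr": "knowledge representation"}

def crude_stem(word):
    """Naive suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalise(query):
    terms = []
    for word in query.lower().split():
        if word in STOPWORDS:
            continue                      # eliminate stopwords
        word = SYNONYMS.get(word, word)   # expand abbreviations/synonyms
        terms.append(crude_stem(word))    # stem
    return terms

print(normalise("the Learning of Databases"))  # ['learn', 'databas']
```

The normalised terms can then be matched against the lattice's annotation vocabulary; stemming both sides means "learning" and "learn" land on the same node.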
A simple change would be to replace the textword search engine with a standard engine
such as Google. However, this may require all the referred documents to be copied to a
single site. One might then be able to set up mechanisms to move seamlessly from a
search of annotated documents to a search of the whole Web.
8.4.4. Security and Extension
Anyone can access and browse the lattice to find information within the system.
However, only staff and research students of the School of Computer Science and
Engineering, UNSW, can annotate documents with research topics. The only
documents the system provides access to are the home pages of these staff and students.
We use the local Unix account at the School to authenticate users for the annotation.
This also provides a default home page address of the users.
However, different applications (e.g., in the domain of the Banff Knowledge Acquisition
Proceedings papers) will require different types of authentication. There is a need to set
up a study in which there are no controls at all: anyone with access to the Web can set
up and change annotations on any page on the WWW. We imagine this may be of
considerable interest to people using open discussion lists in various communities.
Further work would be related to the extension of the approach across other related
groups which potential collaborators belong to, or across communities which share the
same domain knowledge beyond a small department. Integration strategies of browsing
across the disparate communities with appropriate security will be an interesting issue.
8.5. Conclusion
We have presented a Web-based incremental document management and retrieval
system for small communities in specialised domains based on the concept lattice of
Formal Concept Analysis. The incremental approach we have used has a similar
motivation to both Ripple-Down Rules and Repertory Grids. FCA facilitates browsing,
and users adding documents seem to enjoy seeing how their documents fit into the
lattice and making sure they are appropriately positioned. The users have a much more
flexible role, adding terms as the need arises.
The browsing structure, which evolved in an ad hoc fashion, converged on a reasonable
consensus and provided good efficiency in retrieval performance. In general, lattice-
based browsing was considered by users a more helpful search method than Boolean
queries and hierarchical browsing for searching a specialised domain. The experimental
results also supported the hypothesis that lattice-based browsing is more powerful than a
hierarchical approach. Users were satisfied with the system performance and the Web
interface for lattice-based browsing. An interesting result suggested that although an
established external taxonomy could be useful in proposing annotation terms, small
communities appeared to have little interest in adhering to standard taxonomies and
users appeared to be very selective in their use of terms proposed.
However, this does not mean that FCA solves all of the problems in managing and
retrieving documents for specialised domains. There are a number of areas requiring
further research particularly relating to ontologies and automated annotation support.
We conclude that the concept lattice of Formal Concept Analysis, supported by
annotation tools, is a useful way of providing the flexible, open document management
and retrieval required by individuals and small communities in specialised
domains. It seems likely that this approach can be readily integrated with other
developments such as further improvements in search engines and the use of
semantically marked-up documents. This would result in a seamless linking of general
search to ad hoc ontologies through to established standard ontologies.
We also suggest that in the near term, as standards for representing ontologies take hold,
these small community systems will play an important role in helping to develop
standard ontologies. However, rather than being locked into conforming to the
standard, users will be free to use all, small fragments, or none of the ontology as best
suits their purpose; that is, these communities will be able to very flexibly import
ontologies and make selective use of ontology resources. Their selective use and the
extra terms they add will provide useful feedback on how the external ontologies could
be evolved. A new ontology will emerge as the result, and this itself may become a new
standard ontology.
Appendix
A.1 Retrieval Performance on the Queries in Table 7.11

Number of researcher home pages retrieved (Ret) and precision (Prec), for the lattice
alone and for the lattice combined with each imported taxonomy; "-" indicates the
term was not available from that taxonomy.

Terms                     | Lattice only | ACM Taxonomy | ASIS&T Taxonomy | Open Directory Project | UNSW Taxonomy
                          | Ret   Prec   | Ret   Prec   | Ret   Prec      | Ret   Prec             | Ret   Prec
Artificial intelligence   | 39    1.00   | 50    0.78   | 45    0.87      | 56    0.70             | 58    0.67
Knowledge engineering     | 3     1.00   | -     -      | 32    0.09      | -     -                | 30    0.10
Knowledge representation  | 18    1.00   | 18    1.00   | 20    0.90      | 22    0.82             | 22    0.82
Knowledge management      | 4     1.00   | -     -      | -     -         | 25    0.16             | 25    0.16
Knowledge discovery       | 7     1.00   | -     -      | -     -         | 24    0.29             | 24    0.29
Machine learning          | 24    1.00   | -     -      | 24    1.00      | 28    0.86             | 28    0.86
Learning                  | 5     1.00   | 21    0.24   | -     -         | -     -                | -     -
Information processing    | 1     1.00   | -     -      | 11    0.09      | -     -                | -     -
Information retrieval     | 10    1.00   | 11    0.91   | 10    1.00      | 11    0.91             | 11    0.91
Internet                  | 4     1.00   | 4     1.00   | 6     0.67      | 6     0.67             | 6     0.67
Databases                 | 11    1.00   | 11    1.00   | 12    0.92      | 22    0.50             | 13    0.85
Computer programming      | 1     1.00   | 11    0.09   | 9     0.11      | 1     1.00             | 11    0.09
Programming languages     | 4     1.00   | 4     1.00   | 4     1.00      | 4     1.00             | 4     1.00
Average precision         |       1.00   |       0.75   |       0.58      |       0.69             |       0.58

An average decrease = 1 - (0.75 + 0.58 + 0.69 + 0.58) / 4 = 0.35.
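The table's figures can be reproduced mechanically: each precision value is consistent with the number of pages retrieved by the lattice alone divided by the number retrieved once the taxonomy's terms are added, and the headline figure is one minus the mean of the four per-taxonomy average precisions. The check below uses only numbers from the table:

```python
def precision(lattice_hits, combined_hits):
    # Pages retrieved by the lattice alone, as a fraction of those retrieved
    # when the taxonomy's extra terms are included.
    return round(lattice_hits / combined_hits, 2)

# "Artificial intelligence" row: 39 lattice-only pages; 50 with ACM terms added.
print(precision(39, 50))   # 0.78, as in the table

averages = [0.75, 0.58, 0.69, 0.58]   # ACM, ASIS&T, ODP, UNSW columns
average_decrease = 1 - sum(averages) / len(averages)
print(round(average_decrease, 2))     # 0.35
```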
A.2 Chi-Square (χ²) Matrix for Table 7.17
Search method ratings (1 = unhelpful, 5 = helpful); fo = frequency obtained,
% = percentage of column, fe = frequency expected.

Rating | Boolean query     | Hierarchical browsing | Lattice-based browsing | Total
       | fo    %     fe    | fo    %     fe        | fo    %     fe         | fo    %
1-2    | 5     12.5  2.0   | 1     2.5   2.0       | 0     0.0   2.0        | 6     5.0
3      | 9     22.5  8.7   | 13    32.5  8.7       | 4     10.0  8.7        | 26    22.0
4-5    | 26    65.0  29.3  | 26    65.0  29.3      | 36    90.0  29.3       | 88    73.0
Total  | 40    100.0 40.0  | 40    100.0 40.0      | 40    100.0 40.0       | 120   100.0
Note that the chi-square statistic is χ² = Σ (fo − fe)² / fe, where fo is the frequency
obtained and fe is the frequency expected in each cell under the assumption of no difference.
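The statistic for Table 7.17's data can be recomputed from the observed frequencies; here the expected frequencies are derived from the row and column totals rather than the rounded fe values printed in the table:

```python
# Observed helpfulness ratings (fo) from Table 7.17:
# rows = rating bands; columns = Boolean query, hierarchical, lattice-based.
OBSERVED = {
    "1-2": [5, 1, 0],
    "3": [9, 13, 4],
    "4-5": [26, 26, 36],
}

grand_total = sum(sum(row) for row in OBSERVED.values())  # 120 responses
col_totals = [sum(row[c] for row in OBSERVED.values()) for c in range(3)]

chi_square = 0.0
for row in OBSERVED.values():
    row_total = sum(row)
    for c, fo in enumerate(row):
        fe = row_total * col_totals[c] / grand_total  # expected frequency
        chi_square += (fo - fe) ** 2 / fe

print(round(chi_square, 2))  # 13.97
```

With df = (3 − 1)(3 − 1) = 4, this exceeds the .05 critical value of 9.488 listed in A.3, so the difference between the ratings of the three search methods is significant at that level.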
A.3 Critical Values of the Chi-Square Distribution used in Chapter 7

df |   .05  |   .01
 1 |  3.841 |  6.635
 2 |  5.991 |  9.210
 3 |  7.815 | 11.345
 4 |  9.488 | 13.277
 5 | 11.070 | 15.086
 6 | 12.592 | 16.812
 7 | 14.067 | 18.475
 8 | 15.507 | 20.090
 9 | 16.919 | 21.666
10 | 18.307 | 23.209
11 | 19.675 | 24.725
12 | 21.026 | 26.217
13 | 22.362 | 27.688
14 | 23.685 | 29.141
15 | 24.996 | 30.578
16 | 26.296 | 32.000
17 | 27.587 | 33.409
18 | 28.869 | 34.805
19 | 30.144 | 36.191
20 | 31.410 | 37.566
Bibliography
1. Anick, P. G. (1993). Integrating Natural Language Processing and Information Retrieval in a Troubleshooting Help Desk, IEEE Expert, December 1993, 9-17.
2. Aussenac-Gilles, N., Biebow, B. and Szulman, S. (2000). Revisiting Ontology Design: A Methodology Based on Corpus Analysis, 12th European Conference on Knowledge Acquisition and Knowledge Management (EKAW 2000), Springer, 172-188.
3. Barletta, R. (1993a). Case-based Reasoning and Information Retrieval: Opportunities for Technology Sharing, IEEE Expert, December 1993, 2-8.
4. Barletta, R. (1993b). Building a Case-based Help Desk Application, IEEE Expert, December 1993, 18-26.
5. Benjamins, V. R. and Fensel, D. (1998). Community is Knowledge! in (KA)², Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW’98), Banff, SRDG Publications, University of Calgary, KM-2, 1-18.
6. Benjamins, V. R., Fensel, D., Decker, S. and Perez, A. G. (1999). (KA)²: Building Ontologies for the Internet: a Mid-term Report, International Journal of Human Computer Studies, 51(3): 687-712.
7. Berners-Lee, T., Hendler, J. and Lassila, O. (2001). The Semantic Web, Scientific American, 284(5): 34-43.
8. Beydoun, G. (2000). Incremental Knowledge Acquisition for Search Control Heuristics, Ph.D. Thesis, School of Computer Science and Engineering, University of New South Wales, Australia.
9. Beydoun, G. and Hoffmann, A. (1997). Acquisition of Search Knowledge, Proceedings of 10th European Workshop on Knowledge Acquisition (EKAW’97), Springer, 1-16.
10. Beydoun, G. and Hoffmann, A. (1998a). Building Problem Solvers Based on Search Control Knowledge, Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW’98), Banff, SRDG Publications, University of Calgary, SHARE-1, 1-16.
11. Beydoun, G. and Hoffmann, A. (1998b). Simultaneous Modelling and Knowledge Acquisition using NRDR, 5th Pacific Rim Conference on Artificial Intelligence (PRICAI98), Singapore, Springer, 83-95.
12. Beydoun, G. and Hoffmann, A. (1999). Hierarchical Incremental Knowledge Acquisition, 12th Banff Knowledge Acquisition, Modelling and Management (KAW’99), Banff, Canada, SRDG Publication, University of Calgary, 7.2.1-20.
13. Bordat, J. P. (1986). Calcul pratique du Treillis de Galois d’une Correspondance, Mathematiques et Sciences Humaines, 96: 31-47.
14. Brüggemann, R., Schwaiger, J., and Negele, R. D. (1995). Applying Hasse Diagram Technique for the Evaluation of Toxicological Fish Tests, Chemosphere, 30(9): 1767-1780.
15. Brüggemann, R., Voigt, K., and Steinberg, C. (1997). Application of Formal Concept Analysis to Evaluate Environmental Databases, Chemosphere, 35(3): 479-486.
16. Brüggemann, R., Zelles, L., Bai, Q. Y., and Hartmann, A. (1995). Use of Hasse Diagram Technique for Evaluation of Phospholipid Fatty Acids Distribution in Selected Soils, Chemosphere, 30(7): 1209-1228.
17. Burke, R. D., Hammond, K. J., Kulyukin, V., Lytinen, S. L., Tomuro, N. and Schoenberg, S. (1997). Question Answering from Frequently Asked Question Files: Experiences with the FAQ FINDER System, AI Magazine, 18(2): 57-66.
18. Carpineto, C. and Romano, G. (1995). ULYSSES: A Lattice-based Multiple Interaction Strategy Retrieval Interface, Proceedings of EWHCI ’95, Moscow, Russia, 91-104.
19. Carpineto, C. and Romano, G. (1996a). A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval, Machine Learning, 24(2): 95-122.
20. Carpineto, C. and Romano, G. (1996b). Information retrieval through hybrid navigation of lattice representations, International Journal of Human-Computer Studies, 45:553-578.
21. Carpineto, C. and Romano, G. (1998). Effective reformulation of boolean queries with concept lattices, Proceedings of the Third International Conference on Flexible Query Answering Systems, Roskilde, DK, 83-94.
22. Charikar, M., Chekuri, C., Feder, T. and Motwani, R. (1997). Incremental Clustering and Dynamic Information Retrieval, Proceedings of the 29th Symposium on Theory of Computing, 626-635.
23. Chein, M. (1969). Algorithme de recherche des sous-matrices premieres d’une matrice, Bulletin Math. Soc. Sci. Math. R.S. Roumanie, 13:21-25.
24. Chen, M., Busco, J. D., Garrett, K. and Sinha, A. (2000). Search Engine Usage. At: http://www.sims.berkeley.edu/~sinha/teaching/Infosys271_2000/SearchEngin/.
25. Clancey, W. J. (1993a). Situated Action: A Neuropsychological Interpretation Response to Vera and Simon, Cognitive Science, 17(1): 87-116.
26. Clancey, W. J. (1993b). The Knowledge Level Reinterpreted: Modeling socio-technical systems, International Journal of Intelligent Systems, 8(1): 33-49.
27. Clancey, W. J. (1997). Situated Cognition: On Human Knowledge and Computer Representation, Cambridge University Press, USA.
28. Cole, R. and Eklund, P. (1996a). Application of Formal Concept Analysis to Information Retrieval using a Hierarchically Structured Thesaurus, International Conference on Conceptual Graphs (ICCS’96), Sydney, University of New South Wales, 1-12.
29. Cole, R. and Eklund, P. (1996b). Text Retrieval for Medical Discharge Summaries using SNOMED and Formal Concept Analysis, Australian Document Computing Symposium, 50-58.
30. Cole, R. and Eklund, P. (2001). Browsing Semi-structured Web texts using Formal Concept Analysis, Conceptual Structures: Broadening the Base, Proceedings of the 9th International Conference on Conceptual Structures (ICCS 2001), Stanford, Springer, 290-303.
31. Cole, R., Eklund, P. and Stumme, G. (2000). CEM - A Program for Visualization and Discovery in Email, Proceedings of the Fourth European on Principles and Practice of Knowledge Discovery in Databases (PKDD’00), Springer, 367-374.
32. Cole, R. and Stumme, G. (2000). CEM - A Conceptual Email Manager, Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Springer, 438-452.
33. Compton, P., Edwards, G., Kang, B., Lazarus, L., Malor, R., Menzies, T., Preston, P., Srinivasan, A. and Sammut, C. (1991). Ripple Down Rules: Possibilities and Limitations, 6th Banff AAAI Knowledge Acquisition for Knowledge Based Systems Workshop, Banff, Canada.
34. Compton, P., Horn, K., Quinlan, J. R., Lazarus, L. and Ho, K. (1989). Maintaining an Expert System, In J. R. Quinlan (Ed.), Applications of Expert Systems, London, Addison Wesley, 366-385.
35. Compton, P. and Jansen, R. (1990). A Philosophical Basis for Knowledge Acquisition, Knowledge Acquisition, 2:241-257.
36. Compton, P., Preston, P. and Kang, B. (1995). The Use of Simulated Experts in Evaluating Knowledge Acquisition. Proceedings of the 9th AAAI-Sponsored Banff Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada, University of Calgary, 12:1-18.
37. Compton, P., Ramadan, Z., Preston, P., Le-Gia, T., Chellen, V. and Mullholland, M. (1998). A Trade-off Between Domain Knowledge and Problem-Solving Method Power, 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’98), Banff, SRDG Publications, University of Calgary, SHARE-17, 1-19.
38. Compton, P. and Richard, D. (1999). Extending Ripple Down Rules, Proceedings of the 4th Australian Knowledge Acquisition Workshop (AKAW 99), University of New South Wales, Sydney, 87-101.
39. Conklin, J. (1987). Hypertext: an Introduction and Survey, IEEE Computer, 20:17-41.
40. Croft, W. B. (1978). Organizing and Searching Large Files of Documents, Ph.D. Thesis, University of Cambridge, UK.
41. Cutting, D. R., Karger, D. R. and Pedersen, J. O. (1992). Scatter/Gather: A Cluster-Based Approach to Browsing Large Document Collections, Proceedings of the 15th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’92), 318-329.
42. Cutting, D. R., Karger, D. R. and Pedersen, J. O. (1993). Constant Interaction-time Scatter/Gather Browsing of Very Large Document Collections, Proceedings of the 16th Annual International ACM SIGIR Conference, 126-135.
43. Davis, R., Shrobe, H. and Szolovits, P. (1993). What is a knowledge representation?, AI Magazine, Spring 1993, 17-33.
44. Ding, Y., Fensel, D., Klein, M. and Omelayenko, B. (2002). The Semantic Web: Yet Another Hip?, Data and Knowledge Engineering, 41(3): 205-227.
45. Dowling, C. E. (1993). On the Irredundant Generation of Knowledge Spaces, J. Math. Psych., 37(1): 49-62.
46. Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis, John Wiley and Sons, NY.
47. Edwards, G., Compton, P., Malor, R., Srinivasan, A. and Lazarus, L. (1993). PEIRS: a Pathologist Maintained Expert System for the Interpretation of Chemical Pathology Reports, Pathology, 25:27-34.
48. Eklund, P., Groh, B., Stumme, G. and Wille, R. (2000). A Contextual-Logic Extension of TOSCANA, Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Springer, 453-467.
49. Erdmann, E. (1998). Formal Concept Analysis to Learn from the Sisyphus-III Material, Eleventh Workshop on Knowledge Acquisition, Modeling and Management (KAW’98), Banff, SRDG Publications, University of Calgary, SIS-2, 1-14.
50. Faloutsos, C. and Oard, D. W. (1995). A Survey of Information Retrieval and Filtering Methods, Technical Report CS-TR-3514, Department of Computer Science, University of Maryland, College Park.
At: http://www.enee.umd.edu/medlab/filter/papers/survey.ps.
51. Farquhar, A., Fikes, R. and Rice, J. (1997). The Ontolingua server: a tool for collaborative ontology construction, International Journal of Human-Computer Studies, 46(6): 707-727.
52. Fensel, D. and Musen, M. A. (2001). The Semantic Web: A Brain for Humankind, Guest Editions’ Introduction, IEEE Intelligent Systems, 16(2): 24-25.
53. Frakes, W. and Baeza-Yates, R. (1992). Information Retrieval Data Structure and Algorithms, Prentice-Hall.
54. Furnas, G. W. (1986). Generalized fisheye views, Proceedings of the Human Factors in Computing Systems, North Holland, 16-23.
55. Furnas, G. W., Landauer, T. K., Gomez, L. M. and Dumais, S. T. (1983). Statistical semantics: analysis of the potential performance of key-word information systems, Bell System Technical Journal, 62:1753-1806.
56. Gaines, B. (1993). Modeling as Framework for Knowledge Acquisition Methodologies and Tools, International Journal of Intelligent Systems, 8:155-168.
57. Gaines, B. and Shaw, M. L. G. (1989). Comparing the Conceptual Systems of Experts, The 11th International Joint Conference on Artificial Intelligence: 633-638.
58. Gaines, B. and Shaw, M. L. G. (1990). Cognitive and Logical Foundation of Knowledge Acquisition, The 5th Knowledge Acquisition for Knowledge Based Systems Workshop, Banff, 9:1-25.
59. Gangemi, A., Guarino, N., Masolo, C., Oltramari, A. and Schneider, L. (2002). Sweetening Ontologies with DOLCE, 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (EKAW 2002), Sigüenza, Spain, Springer, 166-181.
60. Ganter, B. (1984). Two Basic Algorithms in Concept Analysis, FB4-Preprint No. 831, TH Darmstadt.
61. Ganter, B. and Kuznetsov, S. (1998). Stepwise Construction of the Dedekind-MacNeille Completion, Conceptual Structures: Theory, Tools and Applications, Proceedings of the 6th International Conference on Conceptual Structures (ICCS’98), Montpellier, Springer, 295-302.
62. Ganter, B. and Reuter, K. (1991). Finding All Closed Sets: A General Approach, Order, 8:283-290.
63. Ganter, B. and Wille, R. (1989). Conceptual Scaling, In: F. Roberts (ed.): Application of Combinatorics and Graph Theory to the Biological and Social Sciences, Springer, 139-167.
64. Ganter, B. and Wille, R. (1999). Formal Concept Analysis: Mathematical Foundations, Springer, Heidelberg.
65. Genesereth, M. R. and Nilsson, N. J. (1987). Logical Foundation of Artificial Intelligence, Morgan Kaufmann, Los Altos, California.
66. Godin, R. and Missaoui, R. (1994). An Incremental Concept Formation Approach for Learning from Databases, Theoretical Computer Science, 133(2): 387-419.
67. Godin, R., Missaoui, R. and Alaoui, H. (1995). Incremental concept formation algorithms based on Galois (concept) lattices, Computational Intelligence, 11(2): 246-267.
68. Godin, R., Missaoui, R. and April, A. (1993). Experimental Comparison of Navigation in a Galois Lattice with Conventional Information Retrieval Methods, International Journal of Man-Machine Studies, 38:747-767.
69. Godin, R., Saunders, E. and Gecsei, J. (1986). Lattice model of Browsable Data Spaces, Information Science, 40:89-116.
70. Groh, B. and Eklund, P. (1999). Algorithms for Creating Relational Power Context Families from Conceptual Graphs, Conceptual Structures: Standards and Practices, Proceedings of the 7th International Conference on Conceptual Structures (ICCS’99), Springer, 389-400.
71. Groh, G., Strahringer, S. and Wille, R. (1998). TOSCANA-Systems Based on Thesauri, Conceptual Structures: Theory, Tools and Applications, Proceedings of the 6th International Conference on Conceptual Structures (ICCS’98), Springer, 127-138.
72. Grosskopf, A. and Harras, G. (1998). A TOSCANA-system for speech-act verbs, FB4-Preprint, TU Darmstadt.
73. Gruber, T. (1993). A Translation Approach to Portable Ontology Specifications, Knowledge Acquisition, 5(2):199-220.
74. Gruber, T. (1995). Toward Principles for the Design of Ontologies Used for Knowledge Sharing, International Journal of Human and Computer Studies, 43(5/6): 907-928.
75. Guarino, N. (1995). Formal Ontology, Conceptual Analysis and Knowledge Representation, International Journal of Human and Computer Studies, 43(5/6): 625-640.
76. Guarino, N. (1997). Understanding, Building, and Using Ontologies: A Commentary to “Using Explicit Ontologies in KBS Development”, by van Heijst, Schreiber, and Wielinga, International Journal of Human and Computer Studies, 46:293-310.
77. Guarino, N. (1998). Formal Ontology in Information Systems, Proceedings of Formal Ontology and Information Systems (FOIS’98), Trento, Italy, IOS Press, Amsterdam, 3-15.
78. Guarino, N. and Welty, C. (2000). Ontological Analysis of Taxonomic Relations, Proceedings of ER-2000: The 19th International Conference on Conceptual Modeling, Springer, 210-224.
79. Handschuh, S., Staab, S. and Ciravegna, F. (2002). S-CREAM - Semi-automatic CREAtion of Metadata, 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (EKAW 2002), Sigüenza, Spain, Springer, 358-372.
80. Harper, B., Slaughter, L. and Norman, K. (1997). Questionnaire Administration via the WWW: A Validation and Reliability Study for a User Satisfaction Questionnaire, Proceedings of WebNet 97: International Conference on the WWW. At: http://lap.umd.edu/quis/publications/harper1997.pdf.
81. He, J. (1998). Search Engines on the Internet, Experimental Techniques, 22(1): 34-38.
82. Hearst, M. and Pedersen, J. (1996). Reexamining the Cluster Hypothesis: Scatter/Gather on Retrieval Results, Proceedings of the 19th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’96), 76-84.
83. Heflin, J. (2001). Towards the Semantic Web: Knowledge Representation in a Dynamic, Distributed Environment, Ph.D. Thesis, University of Maryland, College Park.
84. Hereth, J., Stumme, G., Wille, R. and Wille, U. (2000). Conceptual Knowledge Discovery and Data Analysis, Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Springer, 421-437.
85. Horrocks, I. (2002). DAML+OIL: a Reason-able Web Ontology Language, In Proceedings of the Eighth Conference on Extending Database Technology (EDBT 2002), Prague, Springer, 2-13.
86. Kang, B. H., Compton, P. and Preston, P. (1995). Multiple Classification Ripple Down Rules: Evaluation and Possibilities, Proceedings of the 9th AAAI-Sponsored Banff Knowledge Acquisition for Knowledge-Based Systems Workshop, Banff, Canada, University of Calgary, 17:1-20.
87. Kang, B., Compton, P. and Preston, P. (1998). Simulated Expert Evaluation of Multiple Classification Ripple Down Rules, 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’98), Banff, SRDG Publications, University of Calgary, EVAL-4, 1-19.
88. Kang, B. H., Yoshida, K., Motoda, H. and Compton, P. (1997). Help Desk System with Intelligent Interface, Applied Artificial Intelligence, 11:611-631.
89. Katz, B. (1997). From Sentence Processing to Information Access on the World Wide Web, AAAI Spring Symposium on Natural Language Processing for the World Wide Web, Stanford University, 77-94.
90. Kim, M. (1999). Incremental Development of a Web Based Help Desk System, Project report for a course work master's degree, School of Computer Science and Engineering, University of New South Wales, Australia.
91. Kim, M. and Compton, P. (2000). Developing a Domain-Specific Information Retrieval Mechanism, Proceedings of the 6th Pacific Knowledge Acquisition Workshop (PKAW 2000), Sydney Australia, 189-206.
92. Kim, M. and Compton, P. (2001a). A Web-based Browsing Mechanism Based on the Conceptual Structures, Conceptual Structures: Extracting and Representing Semantics, Contributions to the Proceedings of the 9th International Conference on Conceptual Structures (ICCS 2001), Stanford University, 47-60.
93. Kim, M. and Compton, P. (2001b). Incremental Development of Domain-Specific Document Retrieval Systems, First International Conference of Knowledge Capture (K-CAP 2001): Workshop on Knowledge Markup and Semantic Annotation, Victoria, Canada, 69-77.
94. Kim, M. and Compton, P. (2001c). Formal Concept Analysis for Domain-Specific Document Retrieval Systems, AI 2001: Advances in Artificial Intelligence: 14th Australian Joint Conference on Artificial Intelligence (AI’01), Springer, 237-248.
95. Kim, M. and Compton, P. (2002a). Web-Based Document Management for Specialised Domains, EKAW 2002: Workshop on Knowledge Management through Corporate Semantic Webs, Sigüenza, Spain, 37-51.
96. Kim, M. and Compton, P. (2002b). Web-Based Document Management for Specialised Domains: a Preliminary Evaluation, 13th International Conference on Knowledge Engineering and Knowledge Management: Ontologies and the Semantic Web (EKAW 2002), Sigüenza, Spain, Springer, 43-48.
97. Kim, M., Compton, P. and Kang, B. H. (1999). Incremental Development of a Web Based Help Desk System, Proceedings of the 4th Australian Knowledge Acquisition Workshop (AKAW 99), University of NSW, Sydney, 13-29.
98. Kogut, P. and Holmes, W. (2001). AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages, First International Conference on Knowledge Capture (K-CAP 2001): Workshop on Knowledge Markup and Semantic Annotation, Victoria, Canada, 111-113.
99. Kollewe, W., Skorsky, M., Vogt, F. and Wille, R. (1994). TOSCANA - ein Werkzeug zur begrifflichen Analyse und Erkundung von Daten, Begriffliche Wissensverarbeitung: Grundfragen und Aufgaben, 267-288.
100. Kowalski, G. (1997). Information Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers.
101. Kuter, U. and Yilmaz, C. (2001). Survey Methods: Questionnaires and Interviews, Department of Computer Science, University of Maryland, College Park, USA. At: http://www.otal.umd.edu/hci-rm/survey.html.
102. Kuznetsov, S. O. (1993). A Fast Algorithm for Computing All Intersections of Objects in a Finite Semi-lattice, Automatic Documentation and Mathematical Linguistics, 27(5): 11-21.
103. Kuznetsov, S. O. and Ob’edkov, S. A. (2001). Comparing Performance of Algorithms for Generating Concept Lattices, International Workshop on Concept Lattices-Based Theory, Methods and Tools for Knowledge Discovery in Databases (CLKDD’01) in ICCS 2001, Stanford University, Eds. Nguifo, E. M. et al., 35-47.
104. Landauer, T. K., Dumais, S. T., Gomez, L. M. and Furnas, G. W. (1982). Human Factors in Data Access, Bell System Technical Journal, 61:2487-2509.
105. Laresgoiti, I., Anjewierden, A., Bernaras, A., Corera, J., Schreiber, A. Th., Wielinga, B. J. (1996). Ontologies as Vehicles for Reuse: a mini-experiment, 10th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’96), Banff, Canada, 30:1-21.
106. Lenat, D. B. (1995). CYC: A Large-Scale Investment in Knowledge Infrastructure, Communications of the ACM, 38(11): 33-38.
107. Lenat, D. B. and Guha, R. V. (1990). Building Large Knowledge-Based Systems: Representation and Inference in the CYC Project, Reading, Mass: Addison-Wesley.
108. Leouski, A. V. and Croft, W. B. (1996). An Evaluation of Techniques for Clustering Search Results, Technical Report IR-76, Department of Computer Science, University of Massachusetts. At: http://ciir.cs.umass.edu/pubfiles/ir-76.pdf.
109. Li, J., Pease, A. and Barbee, C. (2002). Experimenting with ASCS Semantic Search. At: http://reliant.teknowledge.com/DAML/DAMP.ps/.
110. Lin, X. (1997). Map Displays for Information Retrieval, Journal of the American Society of Information Science, 48:40-54.
111. Lindig, C. (1995). Concept-Based Component Retrieval, In: Working Notes of the IJCAI-95 Workshop: Formal Approaches to the Reuse of Plans, Proofs, and Programs, Montreal. At: http://www.cs.tu-bs.de/softech/papers/ijcai-lindig.html.
112. Lindig, C. (1999). Algorithmen zur Begriffsanalyse und ihre Anwendung bei Softwarebibliotheken, Dissertation, Technical University of Braunschweig, Germany. At: http://www.gaertner.de/~lindig/papers/diss/.
113. Lindig, C. and Snelting, G. (2000). Formale Begriffsanalyse in Software Engineering, In Stumme, G., Wille, R. (Eds.): Begriffliche Wissensverarbeitung: Methoden und Anwendungen, Springer, 151-175.
114. Maedche, A. and Staab, S. (2000). Mining Ontologies from Text, 12th European Conference on Knowledge Acquisition and Knowledge Management (EKAW 2000), Springer, 189-202.
115. Marchionini, G. and Shneiderman, B. (1988). Finding Facts vs. Browsing Knowledge in Hypertext Systems, IEEE Computer, 21:70-80.
116. Martinez-Bejar, R., Benjamins, R., Compton, P., Preston, P. and Martin-Rubio, F. (1998). A Formal Framework to build Domain Knowledge Ontologies for Ripple-Down Rules-based Systems, 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’98), Banff, Canada, SRDG Publications, SHARE 13, 1-20.
117. Martinez-Bejar, R., Shiraz, G. M. and Compton, P. (1998). Using Ripple Down Rules-based Systems for Acquiring Fuzzy Domain Knowledge, 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’98), Banff, Canada, SRDG Publications, KAT-2, 1-20.
118. Martinez-Bejar, M., Ibanez-Cruz, F. and Compton, P. (1999). A Reusable Framework for Incremental Knowledge Acquisition, Proceedings of the 4th Australian Knowledge Acquisition Workshop (AKAW 99), University of NSW, Sydney, 157-171.
119. McGuinness, D. L. (2000). Conceptual Modelling for Distributed Ontology Environments, Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Springer, 100-112.
120. Mineau, G., Stumme, G. and Wille, R. (1999). Conceptual Structures represented by Conceptual Graphs and Formal Concept Analysis, Proceedings of the 7th International Conference on Conceptual Structures (ICCS’99), Blacksburg, Springer, 423-441.
121. Musen, M. (1992). Dimensions of Knowledge Sharing and Reuse, Computers and Biomedical Research, 25:435-467.
122. Nikolai, R. (1999). Semi-Automatic Thesaurus Integration: Does it work?, FZI Karlsruhe, Preprint.
123. Norris, E. M. (1978). An Algorithm for Computing the Maximal Rectangles in a Binary Relation, Revue Roumaine de Mathématiques Pures et Appliquées, 23(2): 243-250.
124. Nourine, L. and Raynaud, O. (1999). A Fast Algorithm for Building Lattices, Information Processing Letters, 71:199-204.
125. Noy, N. F. and Hafner, C. (1997). The State of the Art in Ontology Design: A Survey and Comparative Review, AI Magazine, 18(3): 53-74.
126. Oddy, R. N. (1977). Information Retrieval Through Man-Machine Dialogue, Journal of Documentation, 33:1-14.
127. Peirce, Ch. S. (1931). Collected Papers of Charles Sanders Peirce, Harvard University Press, Cambridge.
128. Perlman, G. (1997). Web-Based User Interface Evaluation with Questionnaires. At: http://www.acm.org/~perlman/question.html.
129. Pirlein, T. and Studer, R. (1995). An Environment for Reusing Ontologies within a Knowledge Engineering Approach, International Journal of Human-Computer Studies, 43(5-6): 945-965.
130. Platt, N. (1998). Search Engines for Intranets. At: http://www.llrx.com/features/nina.htm/.
131. Priss, U. (2000a). Faceted Information Representation, In: Stumme, Gerd (ed.), Working with Conceptual Structures, Contributions to Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Shaker Verlag, Aachen, 84-94.
132. Priss, U. (2000b). Lattice-based Information Retrieval, Knowledge Organisation, 27(3): 132-142.
133. Rho, Y. and Gedeon, T. D. (2000). Academic Articles on the Web: Reading Patterns and Formats, International Journal of Human-Computer Interaction, Special Issue on Empirical Studies of WWW Usability, 12(2), 221-242.
134. Richards, D. (1998). The Reuse of Knowledge in Ripple Down Rule Knowledge Based Systems, Ph.D. Thesis, School of Computer Science and Engineering, University of New South Wales, Australia.
135. Richards, D. and Compton, P. (1997a). Knowledge Acquisition first, Modelling later, Proceedings of the 10th European Workshop on Knowledge Acquisition, Modelling and Management (EKAW’97), Springer, 237-252.
136. Richards, D. and Compton, P. (1997b). Combining Formal Concept Analysis and Ripple Down Rules to Support Reuse, Software Engineering and Knowledge Engineering (SEKE’97), Springer, 177-184.
137. Richards, D. and Compton, P. (1999). An Alternative Verification and Validation Technique for an Alternative Knowledge Representation and Acquisition Technique, Knowledge-Based Systems, 12:55-73.
138. Rock, T. and Wille, R. (2000). TOSCANA-System zur Literatursuche, In: G. Stumme and R. Wille (eds.): Begriffliche Wissensverarbeitung: Methoden und Anwendungen, Springer, 239-253.
139. Rousseau, G. K., Jamieson, B. A., Rogers, W. A., Mead, S. E. and Sit, R. A. (1998). Assessing the usability of on-line library systems, Behaviour and Information Technology, 17: 274-281.
140. Salton, G. and McGill, M. J. (1983). Introduction to Modern Information Retrieval, McGraw-Hill Book Company, New York, NY.
141. Schreiber, G., Wielinga, B., and Breuker, J. (1993). KADS: A Principled Approach to Knowledge-Based System Development, Academic Press, London.
142. Shaw, M. L. G. (1988). Validation in a Knowledge Acquisition System with Multiple Experts, Proceedings of the International Conference on Fifth Generation Computer Systems (FGCS 1988), Tokyo, Japan, Springer, 1259-1266.
143. Shiraz, G. M. and Sammut, C. (1997). Combining Knowledge Acquisition and Machine Learning to Control Dynamic Systems, Proceedings of 15th International Joint Conferences on Artificial Intelligence (IJCAI’97), Nagoya Japan, Morgan Kaufmann, 908-913.
144. Shiraz, G. M. and Sammut, C. (1998). Acquiring Control Knowledge from Examples Using Ripple-down Rules and Machine Learning, 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’98), Banff, Canada, SRDG Publications, KAT-5, 1-17.
145. Simoudis, E. and Miller, J. (1991). The Applicability of CBR to Help Desk Applications, Proceedings of the Case-Based Reasoning Workshop, Morgan Kaufmann, 25-36. At: http://online.loyno.edu/cisa494/papers/Simoudis.html.
146. Slaughter, L., Harper, B. and Norman, K. (1994). Assessing the Equivalence of the Paper and On-line Formats of the QUIS 5.5, Proceedings of the 2nd Annual Mid-Atlantic Human Factors Conference, Washington, 87-91.
147. Snelting, G. (1996). Reengineering of Configurations Based on Mathematical Concept Analysis, ACM - Transactions on Software Engineering and Methodology, 5(2): 146-189.
148. Snelting, G. (2000). Software Reengineering Based on Concept Lattices, Proceedings of the 4th European Conference on Software Maintenance and Reengineering (CSMR’00), IEEE Computer Society, 3-10.
149. Sowa, J. F. (2000). Knowledge Representation: Logical, Philosophical, and Computational Foundations, Brooks/Cole.
150. Spangenberg, N., Fischer, R., and Wolff, K. E. (1999). Towards a methodology for the exploration of “tacit structures of knowledge” to gain access to personal knowledge reserve of psychoanalysis: the example of psychoanalysis versus psychotherapy, In: N. Spangenberg, K. E. Wolff (eds.): Psychoanalytic research by means of formal concept analysis, Special des Sigmund-Freud-Instituts, Lit Verlag, Münster.
151. Spangenberg, N. and Wolff, K. E. (1991). Comparison between biplot analysis and formal concept analysis of repertory grids, In Classification, data analysis, and knowledge organization, Springer, 104-112.
152. Staab, S., Angele, J., Decker, S., Erdmann, M., Hotho, A., Maedche, A., Schnurr, H. P., Studer, R., Sure, Y. (2000). Semantic Community Web Portals, Proceedings of the 9th International World Wide Web Conference (WWW9), Amsterdam, The Netherlands, 474-491.
153. Strahringer, S. and Wille, R. (1993). Conceptual clustering via convex-ordinal structures, Information and classification, Springer, 85-98.
154. Studer, R., Benjamins, V. R. and Fensel, D. (1998). Knowledge Engineering: Principles and Methods, Data and Knowledge Engineering, 25(1-2): 161-197.
155. Stumme, G. (1998). Distributed Concept Exploration - A Knowledge Acquisition Tool in Formal Concept Analysis, In: O. Herzog, A. Günter (eds.): KI-98: Advances in Artificial Intelligence, Springer, 117-128.
156. Stumme, G. (1999). Hierarchies of Conceptual Scales, 12th Banff Knowledge Acquisition, Modelling and Management (KAW’99), Banff, Canada, SRDG Publication, University of Calgary, 5.5.1-18.
157. Stumme, G., Taouil, R., Bastide, Y., Pasquier, N. and Lakhal, L. (2000). Fast Computation of Concept Lattices Using Data Mining Techniques, Proceedings of 7th International Workshop on Knowledge Representation Meets Databases (KRDB 2000), 129-139.
158. Stumme, G., Wille, R. and Wille, U. (1998). Conceptual Knowledge Discovery in Databases Using Formal Concept Analysis Methods, Principles of Data Mining and Knowledge Discovery, Proceedings of the 2nd European Symposium on PKDD’98, LNAI 1510, Springer, 450-458.
159. Sullivan, D. (2000). Search Engine Software for Your Web Site. At: http://searchenginewatch.internet.com/resources/software.html.
160. Suryanto, H. and Compton, P. (2000). Discovery of Class Relations in Exception Structured Knowledge Bases, Conceptual Structures: Logical, Linguistic, and Computational Issues, Proceedings of the 8th International Conference on Conceptual Structures (ICCS 2000), Darmstadt, Springer, 113-126.
161. Suryanto, H. and Compton, P. (2001). Discovery of Ontologies from Knowledge Bases, Proceedings of the First International Conference on Knowledge Capture (K-CAP 2001), The Association for Computing Machinery, New York, 171-178.
162. Suryanto, H., Richards, D. and Compton, P. (1999). The Automatic Compression of Multiple Classification Ripple Down Rule Knowledge Based Systems: Preliminary Experiments, Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems, Adelaide, South Australia, IEEE Service Centre, 203-206.
163. Thompson, R. H. and Croft, B. (1989). Support for Browsing in an Intelligent Text Retrieval System, International Journal of Man-Machine Studies, 30:639-668.
164. Turtle, H. and Croft, B. (1991). Evaluation of an Inference Network-Based Retrieval Model, ACM Transactions on Information Systems, 9:187-222.
165. Uschold, M. (2002). Where are the Semantics in the Semantic Web?, To appear in AI Magazine in 2002 (http://lsdis.cs.uga.edu/events/Uschold-talk.htm). At: http://lsdis.cs.uga.edu/SemWebCourse_files/WhereAreSemantics-AI-Mag-FinalSubmittedVersion2.pdf.
166. Valente, A. and Breuker, J. (1996). Towards Principled Core Ontologies, 10th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop (KAW’96), Banff, Canada, University of Calgary, 33:1-20.
167. Valtchev, P. and Missaoui, R. (2001). Building Concept (Galois) Lattices from Parts: Generalizing the Incremental Methods, Conceptual Structures: Broadening the Base, Proceedings of the 9th International Conference on Conceptual Structures (ICCS 2001), Springer, 290-303.
168. van Heijst, G., Schreiber, A. T. and Wielinga, B. J. (1997). Using Explicit Ontologies in KBS Development, International Journal of Human and Computer Studies, 46:183-292.
169. van Rijsbergen, C. J. (1979). Information Retrieval, Butterworths, London.
170. Vogt, F., Wachter, C. and Wille, R. (1991). Data analysis based on a conceptual file, In: H.-H. Bock und P. Ihm (Hrsg.), Classification, data analysis, and knowledge organization, Springer, 131-140.
171. Vogt, F. and Wille, R. (1995). TOSCANA - A graphical tool for analyzing and exploring data, In: R. Tamassia, I. G. Tollis (eds.): Graph Drawing ’94, LNCS 894, Springer, 226-233.
172. Wille, R. (1982). Restructuring lattice theory: an approach based on hierarchies of concepts, In: Ivan Rival (ed.), Ordered sets, Reidel, Dordrecht-Boston, 445-470.
173. Wille, R. (1992). Concept lattices and conceptual knowledge systems, Computers and Mathematics with Applications, 23:493-515.
174. Wille, R. (1997). Conceptual Graphs and Formal Concept Analysis, Conceptual Structures: Fulfilling Peirce’s Dream, Proceedings of the 5th International Conference on Conceptual Structures (ICCS’97), Springer, 290-303.
175. Wille, R. (2001). Why can Concept Lattice Support Knowledge Discovery in Databases?, International Workshop on Concept Lattices-Based Theory, Methods and Tools for Knowledge Discovery in Databases (CLKDD’01) in ICCS 2001, Stanford University, Eds. Nguifo, E. M. et al., 7-20.
176. Zamir, O. and Etzioni, O. (1998). Web document clustering: a feasibility demonstration, In Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’98), 46-54.
177. Zamir, O. and Etzioni, O. (1999). Grouper: A Dynamic Clustering Interface to Web Search Results, Computer Networks, 31(11-16): 1361-1374.