Automating Ontology Building:
Ontologies for the Semantic Web and
Knowledge Management
Christopher BREWSTER
Department of Computer Science,
University of Sheffield
www.dcs.shef.ac.uk/~kiffer
Outline
• The Need for Ontologies and Taxonomies
• Problems with Knowledge Acquisition
• Methodological Criteria
  – Coherence
  – Multiplicity
  – Ease of Computation
  – Labels
  – Data Sources
• Linking/associating terms
• Constructing Hierarchies
• Labelling Relations
• Conclusions
The Need
• Ontologies and Taxonomies are needed for:
  – The Semantic Web
    • Central component for ‘agent services’ over the Web
  – Knowledge acquisition for knowledge management
    • Minds of employees = intangible assets
    • Ontologies act as “index to memory of an organisation”
    • Many organisations have built or are building their own ontologies/taxonomies (e.g. BBC, British Council, Clifford Chance, etc.)
The Need (2)
  – Navigational aids, e.g. Yahoo, Northern Light, corporate intranets, …
  – Component in LT systems
  – etc.
  – BUT complex hand-built taxonomies/ontologies such as Mikrokosmos, Cyc, WordNet, etc. are not used in applications!
Problems: Knowledge Representation
• Widely held assumption: knowledge can be codified in an ontology
  – Ontology = “formal explicit specification of a shared conceptualisation” (Gruber)
  – Ontology = “a document or file that formally defines the relations among terms” (Berners-Lee)
Problems: Knowledge Representation (2)
Continuum: Ontologies → Taxonomies → Other semantic networks (e.g. Pathfinder Networks, Mindmaps)
Degree of formality, potential for inference: MORE → LESS
Differences lie in degree of logical rigour, formality and the potential for reasoning over the data structure
Problems: Ontology/ Taxonomy Construction
• Current focus: formal criteria e.g. “consistency, completeness & conciseness” (e.g. Gomez-Perez, Guarino)
• Idealised aspirations similar to those in lexicography
• Common assumption: users will willingly contribute to construction of a formal ontology (e.g. Stutt & Motta)
  – Reality: both librarians and companies know authors tag their texts inappropriately
Manual Labour
• All current ontologies/taxonomies are hand built
  – Yahoo, Northern Light (browsable taxonomy)
  – Gene Ontology
  – Company internal (e.g. Arthur Andersen, DaimlerChrysler)
• Computers cannot be relied on.
• Some are mergers of existing taxonomies
  – Company merger → ontology merger (e.g. GlaxoSmithKline)
Specific Issues (1)
1. High cost of human labour in initial development / editorial task
  – Category construction
  – Content association
2. Knowledge is in continuous flux: ‘out of date on day of publication’
3. Ontologies/Taxonomies need to be domain specific
  – General ontologies not very helpful without a lot of work
Specific Issues (2)
4. Ontologies/Taxonomies reflect a particular perspective on the world, e.g. categories like “business opportunity”
5. Categories are abstractions, derived from an analytic framework, e.g. ‘nouns’ or ‘business opportunity’
6. Ontologies = “shared conceptualisations”, but it is often very difficult for human beings to agree on categorising the world (e.g. problems with global ‘standards’)
Contradictions
• Problems 1-3 imply need for automated construction
• Problems 4-6 imply impossibility of such an approach.
• Ontology construction involves judicious integration of automated methods with manual validation
Data Sources
• Traditionally and currently:
  – Protocol analysis
  – Introspection
    • Both slow
    • Both subjective
    • Both very costly
• Future:
  – Automated Text/Corpus analysis
    • Information Extraction from texts
    • Automated ontology building must be based on texts, since we cannot enter people’s minds
• Further in Future
  – Integration with generated dialogue ….
Methodological Criteria
• Set of criteria to:
  – Guide choice and development of tools and algorithms
  – Contribute to evaluation of ontology construction methodologies by going beyond idealised abstract criteria
1. Coherence
• The algorithm(s) must produce output coherent for the user
• Coherence = appears to the user as reflecting common sense, i.e. the ‘shared conceptualisation’ of Gruber
  – Linguistic coherence vs. encyclopaedic coherence
    • Tennis problem in WordNet
• Very difficult to evaluate
  – No criteria for ‘degree of correctness’
  – Easy to spot algorithms which produce rubbish
  – Help!
2. Multiplicity
• Algorithm(s) must allow for multiple placement of the same term in the ontology
• Multiplicity ≠ semantic ambiguity
  – ‘cat’ ISA ‘mammal’
  – ‘cat’ ISA ‘pet’
  – Classic problem in librarianship
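One way to meet the multiplicity criterion is to store ISA links so that a term may have several parents. The sketch below is only an illustration: the `Taxonomy` class, its method names and the extra `animal` node are hypothetical and do not come from any system described in these slides.

```python
from collections import defaultdict


class Taxonomy:
    """Toy ISA store in which one term may have several parents."""

    def __init__(self):
        # Maps each term to the set of its direct parents.
        self.parents = defaultdict(set)

    def add_isa(self, child, parent):
        self.parents[child].add(parent)

    def ancestors(self, term):
        """Return every term reachable by following ISA links upwards."""
        seen = set()
        stack = [term]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


tax = Taxonomy()
tax.add_isa("cat", "mammal")
tax.add_isa("cat", "pet")          # same sense of 'cat', second placement
tax.add_isa("mammal", "animal")    # hypothetical extra node for illustration
```

Storing parents as a set makes the structure a directed graph rather than a tree, which is what allows ‘cat’ to sit under both ‘mammal’ and ‘pet’ without being treated as two senses.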
3. Ease of Computation
• Algorithms must not have excessive complexity and consequent computational processing cost.
  – Ontologies must be kept current
  – Feedback to editors must be acceptable
• Certain algorithms have very high complexity (e.g. Brown et al. 1992, where it is O(V^5), with V = number of types in the corpus).
4. Lone Labels
• The algorithm must generate nodes with simple labels consisting of only one term.
  – Complex labels are not user friendly
  – Some approaches (e.g. Scatter/Gather) generate complex labels
  – This does not preclude synonyms acting as alternative labels
  – A bag of words is not acceptable
5. Data Source
• The algorithm(s) must use texts (corpora) as primary data sources, AND allow the extension of existing ontologies.
  – Written texts are the most appropriate data sources (quantity, quality, accessibility)
  – ‘Seed’ ontologies or existing complex data structures need to be taken into account, e.g. the company’s own ‘top-level’
Techniques
• Linking terms
  – Methods to associate one word/term with another
• Organising terms
  – Methods to organise terms into a structure, e.g. a hierarchy
• Labelling term relations
  – Methods to label the relationship between terms
Linking/associating terms
• Objective: Given term a, list a set of associated terms {b1, b2, b3, … bn}
• Many, many techniques: correspondence analysis, distributional analysis, using MI in a window, etc.
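As one concrete instance of "MI in a window", the sketch below scores word pairs by pointwise mutual information over a sliding window and then lists the highest-scoring associates of a term. It is a minimal illustration under assumed choices: the function names, the window size and the way probabilities are estimated are all introduced here, not taken from the slides.

```python
import math
from collections import Counter


def pmi_in_window(tokens, window=2):
    """Pointwise mutual information for word pairs co-occurring within `window` tokens."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                # Store each pair in sorted order so (a, b) and (b, a) are pooled.
                pair_counts[tuple(sorted((left, right)))] += 1
    n_words = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_words
        p_b = word_counts[b] / n_words
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def associated_terms(term, scores, top_n=5):
    """Given term a, list the terms b1..bn with the highest PMI relative to a."""
    ranked = sorted(
        ((b if a == term else a, s) for (a, b), s in scores.items() if term in (a, b)),
        key=lambda item: item[1],
        reverse=True,
    )
    return [other for other, _ in ranked[:top_n]]
```

Raw PMI is known to favour rare words, so in practice a frequency cut-off or a significance test would probably be needed before the output is shown to an editor.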
Linking/associating terms: Associated Words
• Idea of Mike Scott (among others):
  – Input: corpus A + reference corpus B
  – Key words = unusually frequent words in corpus A in comparison with reference corpus B
  – Key-key words = words which are key in more than one text; the more texts, the more key.
  – Associated words of w_i = key words which co-occur in the same texts
• 2 factors: (i) comparison with reference corpus, and (ii) cross-text frequency
• Results can be very good (e.g. when using an encyclopaedia) but poor when using random texts.
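The three steps above (key words, key-key words, associated words) can be sketched as follows. The slides do not say which statistic defines "unusually frequent"; this sketch assumes a log-likelihood score with a cut-off of 3.84, and all function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Log-likelihood score for one word in a text versus the reference corpus."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    score = 0.0
    if freq_a:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def key_words(text_tokens, reference_counts, reference_total, threshold=3.84):
    """Words unusually frequent in one text compared with the reference corpus."""
    counts = Counter(text_tokens)
    total = len(text_tokens)
    keys = set()
    for word, freq in counts.items():
        ref_freq = reference_counts.get(word, 0)
        over_used = freq / total > ref_freq / reference_total
        if over_used and log_likelihood(freq, total, ref_freq, reference_total) >= threshold:
            keys.add(word)
    return keys


def key_key_words(texts, reference_tokens, min_texts=2):
    """Return (words key in at least `min_texts` texts with their text counts, key words per text)."""
    ref_counts = Counter(reference_tokens)
    ref_total = len(reference_tokens)
    per_text = [key_words(t, ref_counts, ref_total) for t in texts]
    text_counts = Counter(word for keys in per_text for word in keys)
    return {w: n for w, n in text_counts.items() if n >= min_texts}, per_text


def associates(word, per_text_keys):
    """Count the key words that co-occur with `word` in texts where `word` is itself key."""
    linked = Counter()
    for keys in per_text_keys:
        if word in keys:
            linked.update(keys - {word})
    return linked
```

The two factors named on the slide map onto `key_words` (comparison with the reference corpus) and the per-text counting in `key_key_words` (cross-text frequency).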
Linking/associating terms: Collocational Similarity
• Idea of Hays (among others):
  – Similarity between terms is measured by the number of identical words in a window (citation), plus the number of identical words in identical positions (distance)
  – He argues it is very effective in identifying similarity of meaning (95%+ accuracy)
  – but it works only 30% of the time (i.e. only 30% of citations show similarity to another citation)
Linking/associating terms: Syntactic Similarity
• Grefenstette (1994):
  – Texts are shallow parsed and, for each term, the words in specific syntactic relations are collected as ‘attributes’.
  – The set of attributes for each term is compared with the set for every other term using the Jaccard measure.
  – Example result:
    tissue cell | growth cancer liver tumor | resistance disease lens
  – but this also produces antonyms of the term as output.
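The comparison step can be illustrated with a plain set-based Jaccard coefficient. This is a simplified sketch: the attribute sets below are invented for illustration, and the unweighted form of the measure is an assumption, since the slides do not say whether attributes are weighted.

```python
def jaccard(attrs_a, attrs_b):
    """Jaccard coefficient: shared attributes divided by all attributes of either term."""
    union = attrs_a | attrs_b
    return len(attrs_a & attrs_b) / len(union) if union else 0.0


def most_similar(term, attributes, top_n=3):
    """Rank the other terms by Jaccard similarity of their attribute sets to `term`."""
    scored = [
        (other, jaccard(attributes[term], attrs))
        for other, attrs in attributes.items()
        if other != term
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical (relation, word) attributes, as a shallow parser might supply them.
attributes = {
    "tissue": {("adj", "normal"), ("obj-of", "invade"), ("nmod", "liver")},
    "cell": {("adj", "normal"), ("obj-of", "invade"), ("nmod", "cancer")},
    "lens": {("adj", "thick")},
}
```

Because the measure only sees shared syntactic contexts, words that occur in the same contexts with opposite meanings score just as highly as near-synonyms, which is consistent with the antonym problem noted above.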
Constructing Hierarchies
• Objective: Construct trees or Directed Acyclic Graphs from the terms in the vocabulary of the texts.
  – Relations may or may not be specified between nodes
  – Major problem is obtaining (candidate) labels for a specific cluster
Constructing Hierarchies (2)
• Again many methods exist:
  – Brown et al. (1992) merge classes based on MI
    • Very high computational cost
    • No labels on nodes
    • No possibility of integrating with an existing data structure
    • Example cluster: day, year, week, month, quarter, half
Constructing Hierarchies (3)
• McMahon and Smith (1996) combine top-down with bottom-up cluster formation.
  – Lower computational cost
  – Still no labels
• Scatter/Gather developed at Xerox
  – Strictly speaking only for documents, not terms
  – Computationally tractable
  – Generates labels BUT consisting of many terms
Constructing Hierarchies (4)
• Sanderson & Croft
  – Document-based lexical subsumption
  – Generates single term labels
  – Could allow use of existing hierarchy/taxonomy
• Problem of coherence
  [Diagram: hierarchy rock → igneous rock → basalt, compared with document counts rock (800 files), basalt (129 files), igneous rock (29 files)]
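A document-based subsumption test can be sketched as below: term x is taken to subsume term y when nearly every document containing y also contains x, but not the reverse. The 0.8 threshold, the function names and the document IDs are assumptions made for illustration; the IDs are invented and only mimic the ordering of the file counts above.

```python
def subsumes(docs_x, docs_y, threshold=0.8):
    """True if x subsumes y: P(x|y) >= threshold and P(y|x) < 1 over document sets."""
    if not docs_x or not docs_y:
        return False
    both = len(docs_x & docs_y)
    return both / len(docs_y) >= threshold and both / len(docs_x) < 1.0


def subsumption_pairs(term_docs, threshold=0.8):
    """List every (parent, child) pair licensed by the subsumption test."""
    return [
        (x, y)
        for x in term_docs
        for y in term_docs
        if x != y and subsumes(term_docs[x], term_docs[y], threshold)
    ]


# Hypothetical document IDs: 'basalt' occurs in more documents than 'igneous rock'.
term_docs = {
    "rock": set(range(10)),
    "igneous rock": {0, 1},
    "basalt": {0, 1, 2, 3},
}
pairs = subsumption_pairs(term_docs)
```

With these invented counts the test places ‘igneous rock’ under ‘basalt’, the reverse of the geological hierarchy, which is one way a frequency-driven method can fail the coherence criterion.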
Labelling Relations
• Most difficult challenge because:
  – There is no set of commonly accepted relations (cf. parts of speech)
  – There is no known correlation between a relation (e.g. ‘meronym’) and specific patterns in texts.
  – It is an open question whether there is sufficient lexico-syntactic encoding in texts to make the establishment of relations between concepts ‘extractable’ from texts.
• Few methods exist ….
Labelling Relations: Synonyms & Substitutability
• Identification of synonyms by ‘substitutability’ tests (Church et al. 1994)
  – Uses t-test to determine the significance of the overlap between the syntactic objects of different verbs
  – Result is a table of candidate substitutes of a given verb
  – BUT, the result is “not always one that fits nicely into a familiar category such as synonymy, antonymy, and hyponymy” (ibid.)
• Hays (1997): similar work using collocational similarity
Labelling Relations: Hyponyms and ‘Lexico-syntactic patterns’
• Identification of hyponyms by ‘lexico-syntactic patterns’ (Hearst 1992), e.g.
  such NP as {NP,}* {(or|and)} NP
  e.g.: …works by such authors as Herrick, Goldsmith, and Shakespeare.
  – Considerable manual effort involved in identifying patterns (also language specific)
  – Further developed by Morin (1999), but no evaluation
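The ‘such NP as …’ pattern can be approximated with a regular expression. This is a rough sketch: each NP is assumed to be a single word, which a real system would replace with a noun-phrase chunker, and the names `SUCH_AS` and `hyponyms_from_such_as` are hypothetical.

```python
import re

# 'such NP as NP, NP, ... (and|or) NP', with every NP simplified to one word.
SUCH_AS = re.compile(
    r"such (\w+) as (\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)

# Separators inside the matched list: commas and the final conjunction.
LIST_SEPARATOR = re.compile(r",\s*(?:and |or )?|\s+(?:and|or)\s+")


def hyponyms_from_such_as(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such NP as ...' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        items = LIST_SEPARATOR.split(match.group(2))
        pairs.extend((item, hypernym) for item in items if item)
    return pairs


example = "works by such authors as Herrick, Goldsmith, and Shakespeare."
found = hyponyms_from_such_as(example)
```

Even this one pattern needs hand-tuning (the lookahead that stops ‘and’ being read as a list item, for example), which gives a sense of the manual effort noted above.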
Labelling Relations: Adaptiva – Ontology building as IE
[Diagram: Lexico-Syntactic Patterns, Pairs of Terms and Labelled Relations, linked by ‘User Validation’ and ‘Pattern Extraction using Amilcare’ steps]
Based on Ciravegna’s ‘Lazy-NLP’ concepts and the Amilcare adaptive IE engine.
User & System Characteristics: A co-operative model of user/system interaction in the context of Knowledge Management
• Characteristics of the user
  – Non-specialist
  – Can select a seed ontology
  – Can validate sentences as exemplars
  – Can label a relation exemplified
• Characteristics of the system
  – Can process text at high speed
  – Can identify regularities
  – Can cluster patterns
  – Can establish that a relationship exists between term x and term y
Adaptiva
• Input:
  – A corpus
  – A seed ontology, or some pairs of terms
  – A relation chosen or labelled by the user
• Output:
  – A set of pairs of terms associated with labelled lexico-syntactic patterns
  – An extended ontology
• Key concept is an effective User Interface to allow user validation/training of the system
The Adaptiva System
[Diagram components: Ontology; Pair of Terms; Corpus; Amilcare / a rule learning system (1. Training using examples, 2. Retrieval of further unlabelled examples); Examples with proposed classification; User’s Feedback; Labelled set of examples]
Adaptiva Interface
Building on Labelled Relations
• Most existing methods inadequate
• Need to combine methods to compensate for different weaknesses
  – Effective pre-processing
  – Candidate associations from statistical methods
  – Term recognition
• Use existing ontologies (e.g. Gene Ontology) to provide candidate data for machine learning
Background and Foreground Knowledge in Dynamic Ontology Construction
Overview
• Texts as Knowledge Maintenance
• Ontologies and Texts
• Implicit and Explicit Knowledge
• A Methodology
• External resources: Potentials and Limitations
• Conclusion
A shared conceptualisation
• An ontology is “a formal, explicit specification of a shared conceptualisation, used to help programs and humans to share knowledge” (Gruber 1993)
• Shared! i.e. concepts held in common by the participants/ community of practice
• Therefore the ontology is the background knowledge assumed by the writer/reader of a text.
Text and Knowledge
• We want to generate ontologies from text
• But if an ontology = shared/background knowledge, then a writer assumes the ontology to generate the text
[Diagram: Ontology and Texts]
Knowledge Maintenance
• A text is an act of knowledge maintenance:
  – Reinforcing assumptions of background knowledge
  – Altering links, associations and instantiations of existing concepts
  – Adding new concepts to the domain

Transport factors in the karyopherin-β (also called importin-β) family mediate the movement of macromolecules in nuclear–cytoplasmic transport pathways. Karyopherin-β2 (transportin) binds a cognate import substrate and targets it to the nuclear pore complex. …. Here we present the 3.0 Å structure of the karyopherin-β2–Ran GppNHp complex where GppNHp is a non-hydrolysable GTP analogue.
Background Knowledge
• If ontology = background knowledge, and background knowledge is implicit, then the text(s) will not express the domain ontology
• Especially true of scientific papers
• Less true of introductory textbooks, manuals, glossaries etc.
• We expect to find specification of the ontological knowledge at the borders of a domain
Explicit vs. Implicit
• Explicit ontological knowledge is found in lexico-syntactic phrases (Hearst 1992):
  • such NP as {NP,}* {(or|and)} NP, e.g.: …works by such authors as Herrick, Goldsmith, and Shakespeare.
  • NP, a NP that, e.g. isolation and characterisation of pbp, a protein that interacts ….
  • NP and other NPs, e.g. … malignant melanomas and other cancer cell types …
• Implicit is not machine-readable: e.g. ‘death is a biological process’ implied by Britannica article on death
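The ‘NP, a NP that’ and ‘NP and other NPs’ phrases can be approximated with regular expressions, as in the rough sketch below. Noun-phrase boundaries are guessed from word counts and a plural ending, which is an assumption made only for this illustration, and the pattern and function names are hypothetical.

```python
import re

# 'NP, a NP that': the first NP is simplified to the single word before the comma.
APPOSITIVE = re.compile(r"\b(\w+), an? ((?:\w+ )*?\w+) that\b")

# 'NP and other NPs': up to two words before 'and other', then a phrase ending in a plural.
AND_OTHER = re.compile(r"\b((?:\w+ )?\w+) and other ((?:\w+ )*?\w+s)\b")


def explicit_isa_pairs(sentence):
    """Return (hyponym, hypernym) pairs stated explicitly by either pattern."""
    pairs = []
    for pattern in (APPOSITIVE, AND_OTHER):
        for match in pattern.finditer(sentence):
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Applied to the example phrases on this slide, the sketch pairs ‘pbp’ with ‘protein’ and ‘malignant melanomas’ with ‘cancer cell types’. An implicit statement such as ‘death is a biological process’ leaves no such surface cue, so no pattern of this kind can recover it.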
An Approach
• Assumption 1: No matter how large the corpus, a major part of the domain ontology will not be specified
• Assumption 2: texts do specify explicitly ontological relations between terms
• Therefore: Go beyond the corpus! Seek external sources to compensate for the deficiencies of the corpus
Using external sources
[Diagram components: Domain Corpus; Ontology Learner; External sources; Low level ontology; Mid and high level ontology]
External Sources
• Encyclopaedias
• Textbooks and manuals
• Glossaries (manually identified)
• Google glossaries (i.e. automatically identified)
• The Internet
• There are pros and cons for each of these potential sources
Some data
• Using the Gene Ontology
• Using the subset of Nature Corpus
• 10 pairs of terms:
1. histolysis isa tissue death
2. flocculation isa cell communication
3. vasoconstriction isa circulation
4. holin isa autolysin
5. aminopeptidase isa peptidase
6. death isa biological process
7. metallochaperone isa chaperone
8. hydrolase isa enzyme
9. ligase isa enzyme
10. conotoxin isa neurotoxin
| Terms | Frequency (first term) | Frequency (second term) | Common environments |
|---|---|---|---|
| histolysis isa tissue death | 0 | 0 | 0 |
| flocculation isa cell communication | 0 | 8 | 0 |
| vasoconstriction isa circulation | 0 | 42 | 0 |
| holin isa autolysin | 0 | 8 | 0 |
| aminopeptidase isa peptidase | 3 | 14 | 0 |
| death isa biological process | 654 | 12 | 0 |
| metallochaperone isa chaperone | 0 | 50 | 0 |
| hydrolase isa enzyme | 9 | 672 | 2 |
| ligase isa enzyme | 92 | 672 | 2 |
| conotoxin isa neurotoxin | 1 | 3 | 0 |
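Counts of this kind could be gathered with a sketch like the one below. It assumes that a ‘common environment’ means a sentence in which both terms occur, which the slides do not define; the function name, the optional plural ‘s’ and the sample sentences are likewise assumptions made for illustration.

```python
import re


def term_statistics(sentences, term_a, term_b):
    """Return (frequency of term_a, frequency of term_b, sentences containing both)."""

    def pattern(term):
        # Whole-word match, case-insensitive, allowing a simple plural 's'.
        return re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)

    pat_a, pat_b = pattern(term_a), pattern(term_b)
    freq_a = sum(len(pat_a.findall(s)) for s in sentences)
    freq_b = sum(len(pat_b.findall(s)) for s in sentences)
    common = sum(1 for s in sentences if pat_a.search(s) and pat_b.search(s))
    return freq_a, freq_b, common


# Hypothetical sentences, not taken from the Nature corpus.
sentences = [
    "DNA ligase is an enzyme.",
    "Enzymes catalyse reactions.",
    "The ligase was purified.",
]
stats = term_statistics(sentences, "ligase", "enzyme")
```

A shared sentence is only a candidate context: it still has to be read, or matched against a lexico-syntactic pattern, before it counts as evidence for the ISA relation.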
Example results

Terms: histolysis isa tissue death

| Textual source | Result |
|---|---|
| Original corpus | 0 |
| Google citations | 3, all from the ‘Gene Ontology’ |
| Google glossary | 0 |
| Encyclopaedia (Britannica) | 1, under ‘lepidopteran’: “tissues of the larva undergo considerable histolysis (breaking down)” |
| Dictionary (Britannica) | histolysis: “the breakdown of bodily tissues” |

Terms: ligase isa enzyme

| Textual source | Result |
|---|---|
| Original corpus | 2 |
| Google citations | 31k, many contexts where this is derivable: “DNA ligase: Enzyme involved in the replication and repair”; but also: “ligase is a single polypeptide”, “ligase is a 600 kDa multisubunit protein” |
| Google glossary | 9 |
| Encyclopaedia (Britannica) | 1 article with definition: “also called Synthetase any one of a class of about 50 enzymes that ….” |
| Dictionary (Britannica) | ligase: “an enzyme that catalyzes the linking together of two molecules” |
Overall results

| Textual source | Number of contexts/articles found | Clear specification of ontological relation |
|---|---|---|
| Original corpus | 0 – 2 | 0/10 |
| Google citations | 3 – 31,000 | 6/10 |
| Encyclopaedia | 7/10 | 2/10 |
| Dictionary | 7/10 | 3/10 |

• Using the internet has the highest success rate – but still very poor
Problems
• Ambiguity: Many terms are defined for entirely different domains or contexts.
• Perspective: If the ontology has a particular perspective on the world, then the internet may not reflect that, i.e. the internet citations may ‘dumb down’.
• Data sparsity: Zipf’s law implies certain limits
• Ontological coarseness: “vasoconstriction IS-A circulation” cannot be found
Conclusions
• A domain ontology reflects background knowledge
• This is implicit, i.e. never explicitly stated
• No corpus will ever provide sufficient citations to construct the corresponding ontology successfully
• External sources need to be accessed
Research Issues
• What external sources exist? What specialised sources exist?
• How does one (automatically ?) identify them? A web services application?
• How can we determine what ‘knowledge’ is absent from the corpus and decide to search elsewhere?
• Can we use external sources to vote on the ontological statement to be derived?
• What do we trust? The domain corpus, the internet, our human intuition? Why?