OWL, Ontologies & Text
Challenges from the cultural-heritage domain
Guus Schreiber
Free University Amsterdam
2
Overview
Ontologies in general (brief)
W3C work on ontologies and ontology engineering for the Semantic Web (brief)
Use cases involving ontologies & text (& other media)
– Based on cultural-heritage projects we are involved in
3
Acknowledgements
MultimediaN E-Culture Project:
– Alia Amin, Mark van Assem, Victor de Boer, Lynda Hardman, Michiel Hildebrand, Laura Hollink, Zhisheng Huang, Janneke van Kersen, Marco de Niet, Borys Omelayenko, Jacco van Ossenbruggen, Ronny Siebes, Jos Taekema, Jan Wielemaker, Bob Wielinga
CHOICE Project @ Sound & Vision
– Hennie Brugman, Luit Gazendam, Veronique Malaise, Johan Oomen, Mettina Veenstra
MuNCH project @ Sound & Vision
– Laura Hollink, Bouke Hunning, Michiel van Liempt, Johan Oomen, Maarten de Rijke, Arnold Smeulders, Cees Snoek, Marcel Worring
4
Semantics for the Web: some challenges
Machine-processable representation of semantic information
Defining semantics in an OPEN environment
– Adding semantics to other people’s semantics
– Ability for everyone to contribute
Ability to define mappings between semantic representations
– There is no uniform way to classify the world!
5
The notion of ontology (as currently used in computer science)
The Semantic Web needs sets of shared concepts
These sets of concepts are called “ontologies”
It is hard and time-consuming to develop ontologies
Therefore, Semantic Web developers are looking for existing ontologies, vocabularies, taxonomies
6
Ontologies and data models
The main difference from data models is not the content but the purpose (an ontology generalizes over applications)
You cannot see the difference by just looking at the syntax!
A conceptual model written in an ontology language is not necessarily an ontology!
7
Example “ontologies” for SW applications
Domain-specific vocabularies
– Medicine: UMLS, SNOMED, Galen
– Art history: AAT, ULAN
– Geography: TGN
Generic ontologies
– Top-level categories (reminiscent of Aristotelian categories)
– Lexical vocabularies: WordNet
– Units and dimensions, time ontology
– Currencies, country codes, …
8
Good and bad ontologies?!
Good ontologies are used
Good ontologies represent some form of consensus in a community
Good ontologies are maintained
Good ontologies do not need to be complex
Good ontologies may contain “mistakes”
9
RDF/OWL language constructs
classes and individuals
subclasses
properties
subproperties
domain/range of properties
XML Schema datatypes
equality, inequality
inverse, transitive, symmetric, functional properties
property constraints: cardinality, allValuesFrom, someValuesFrom
conjunction, disjunction, negation of classes
hasValue, enumerated type
10
RDF/OWL family of languages
OWL Full is a vocabulary extension of RDF.
The RDF restrictions in OWL DL are there for good technical reasons.
Time will have to prove whether there is a place for OWL Lite or some other OWL subset.
RDF/OWL: one can view it as an historical artefact that these are not grouped under the same acronym.
11
Is RDF/OWL just another datamodelling/KR language?
Key differences:
– All classes/properties/individuals have a URI as identifier
– RDF/XML exchange syntax enables interoperability
XML features
– UTF-8 character set
– Support for multilinguality
– Use of XML Schema datatypes: numeric, date, time, etc.
For the rest: RDF/OWL is a state-of-the-art concept language
12
Semantic Web Best Practices and Deployment Working Group
Objective: support for semantic-web application developers
Focus on “low hanging fruit”
Publishing key ontologies/vocabularies
Development guidelines, ontology-design patterns, repositories, links to related techniques, …
13
Ontology engineering patterns
Best practices for frequently occurring modeling problems
WG documents outline alternatives with pros and cons
Notes:
– Classes as values
– N-ary relations
– Specification of value sets
– Part-of
14
Metamodelling
OWL DL requires strict separation of classes and instances
But on the Semantic Web my instances may be your classes!
Metamodelling features are especially required in vocabulary/ontology mapping and/or interpretation
Cf. Protégé metamodelling facilities
OWL 1.1 (not standardized) allows limited metamodelling within OWL DL scope
15
Example: WordNet
Class(LexicalConcept)
Class(Noun subClassOf(LexicalConcept))
Property(hyponymOf domain(LexicalConcept) range(LexicalConcept))
Individual(1000768 type(LexicalConcept) wordForm(Human))
Problem: how to use the hyponym hierarchy as a subclass hierarchy?
16
RDF solution: use metamodelling
subClassOf(LexicalConcept Class)
subPropertyOf(hyponymOf subClassOf)
subPropertyOf(wordForm rdfs:label)
Corresponds to our intuition that WordNet model is a metamodel
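The effect of these three statements can be sketched with a small RDFS-style inference loop in Python. This is only an illustration under assumptions: triples are plain tuples, only three RDFS entailment rules are applied (subproperty inheritance, subclass transitivity, and type inheritance), the `wn:` prefix is made up, and the second and third synsets with their identifiers and word forms are invented for the example.

```python
# Minimal sketch: a graph is a set of (subject, predicate, object) tuples.
SUBPROP = "rdfs:subPropertyOf"
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

triples = {
    # The three metamodelling statements from the slide.
    ("wn:LexicalConcept", SUBCLASS, "rdfs:Class"),
    ("wn:hyponymOf", SUBPROP, SUBCLASS),
    ("wn:wordForm", SUBPROP, "rdfs:label"),
    # Synset 1000768 is from the slide; the other two are hypothetical.
    ("wn:1000768", TYPE, "wn:LexicalConcept"),
    ("wn:2000001", TYPE, "wn:LexicalConcept"),
    ("wn:3000001", TYPE, "wn:LexicalConcept"),
    ("wn:1000768", "wn:wordForm", "Human"),
    ("wn:2000001", "wn:wordForm", "Child"),
    ("wn:3000001", "wn:wordForm", "Toddler"),
    ("wn:2000001", "wn:hyponymOf", "wn:1000768"),
    ("wn:3000001", "wn:hyponymOf", "wn:2000001"),
}

def rdfs_closure(graph):
    """Apply three RDFS entailment rules until no new triple appears."""
    closed = set(graph)
    while True:
        new = set()
        for s, p, o in closed:
            if p == SUBPROP:
                # Subproperty rule: (x s y) and (s subPropertyOf o) gives (x o y).
                new |= {(x, o, y) for x, q, y in closed if q == s}
            elif p == SUBCLASS:
                # Subclass transitivity: (s sub o) and (o sub z) gives (s sub z).
                new |= {(s, SUBCLASS, z) for y, q, z in closed
                        if q == SUBCLASS and y == o}
                # Type inheritance: (x type s) and (s sub o) gives (x type o).
                new |= {(x, TYPE, o) for x, q, y in closed
                        if q == TYPE and y == s}
        if new <= closed:
            return closed
        closed |= new

inferred = rdfs_closure(triples)
```

After the closure, every hyponym link also appears as a subclass link, every word form as an `rdfs:label`, and every synset is itself typed as a class, which is the sense in which the WordNet model acts as a metamodel.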
17
Thesauri and ontologies
Semantic Web Challenge showed that thesauri are important resources for SW applications
Typically weak semantic structure
Approach in W3C Semantic Web Best Practices WG:
– Phase 1: “as-is” conversion
– Phase 2: additional ontological interpretations/extensions
18
New W3C work: Semantic Web Deployment Working Group
Mission to help in vocabulary deployment
Chartered to standardize SKOS
Pattern for RDF/OWL representation of (ISO-compliant) thesauri
Guidelines for adding semantics to existing vocabularies
MultimediaN Pilot E-Culture
20
Hypothesis
Semantic Web technology is particularly useful in knowledge-rich domains
or formulated differently
If we cannot show added value in knowledge-rich domains, then it may have no value at all
21
[Diagram: digital heritage collections with semantic annotation and semantic search, surrounded by baseline, enhanced and new features]
Natural-language processing: automatic annotation (text strings → concepts)
Distributed: cultuurwijzer.nl collections, OAI-based access
Reasoning support: time/space reasoning
Web interface: support for web collections
Presentation facilities: semantic presentation, device-specific
Interoperability: XML/RDF/OWL
Scalability: > 10,000,000 triples
Ontologies: WordNet, AAT, TGN, ULAN, Dutch labels
Search strategies: sibling search, semantic distance
Dublin Core: specializations, dumb-down
22
Use of thesauri
RDF/OWL data models of Getty thesauri
– Issues: scope, preserving structure
WordNet: W3C SWBPD work
http://www.w3.org/TR/wordnet-rdf/
Multilingualism
– Dutch version of AAT
Existing collection metadata are parsed to find matches in thesauri (e.g. creator name => ULAN entry)
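A minimal Python sketch of such a metadata-to-thesaurus match is given here. It makes several assumptions: the thesaurus excerpt and its `ulan:example-…` identifiers are placeholders rather than real ULAN entries, and matching is reduced to exact lookup on a normalized name, whereas a real pipeline would also have to deal with name variants and ambiguity.

```python
import unicodedata

def normalize_name(raw):
    """Turn 'SURNAME, Given' or 'Given Surname' into a lowercase, accent-free key."""
    text = unicodedata.normalize("NFKD", raw)
    # Drop combining marks so that 'André' and 'Andre' compare equal.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if "," in text:
        surname, given = (part.strip() for part in text.split(",", 1))
        text = f"{given} {surname}"
    return " ".join(text.lower().split())

# Hypothetical thesaurus excerpt: preferred labels mapped to placeholder identifiers.
THESAURUS_LABELS = {
    "Henri Matisse": "ulan:example-matisse",
    "André Derain": "ulan:example-derain",
}
INDEX = {normalize_name(label): uri for label, uri in THESAURUS_LABELS.items()}

def match_creator(raw):
    """Return the thesaurus identifier for a creator string, or None if unmatched."""
    return INDEX.get(normalize_name(raw))
```

For example, `match_creator("MATISSE, Henri")` looks up the key `henri matisse`, and an unknown creator string yields `None`, so it can be flagged for manual annotation.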
23
24
WordNet synsets, senses and words have URIs
25
On-line demo
http://e-culture.multimedian.nl
26
Use case: Asian chairs
User has found an image of an Asian chair
Annotation: ex:image vra:stylePeriod aat:Guangxu .
How can we find images of Asian chairs from the same historical period?
27
AAT info on Guangxu
28
29
Observations
Many queries require time/space knowledge, either absolute or abstracted
For the chair image we can establish
– Country = China (link Chinese => China)
– Period = 1644-1911 (from Qing description)
Technology requirements:
– Thesauri relating time/space concepts
– NLP for unstructured descriptions
– Time/space reasoning techniques
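The period part of this reasoning can be sketched in Python as a walk up a thesaurus hierarchy. The sketch rests on assumptions: the broader-term table is an illustrative excerpt and not copied from AAT, the concept names other than `aat:Guangxu` are invented identifiers, and `ex:image2` and `ex:image3` are hypothetical images added so that the query has something to return.

```python
# Illustrative thesaurus excerpt (assumed structure, not copied from AAT).
BROADER = {
    "aat:Guangxu": "aat:Qing",
    "aat:Qianlong": "aat:Qing",
    "aat:Qing": "aat:ChineseDynasticPeriods",
    "aat:Ming": "aat:ChineseDynasticPeriods",
}

# Year range taken from the Qing description mentioned on the slide.
PERIOD_YEARS = {"aat:Qing": (1644, 1911)}

# image -> vra:stylePeriod value; the second and third images are hypothetical.
ANNOTATIONS = {
    "ex:image": "aat:Guangxu",
    "ex:image2": "aat:Qianlong",
    "ex:image3": "aat:Ming",
}

def ancestors(concept):
    """Yield the concept itself, then every broader concept above it."""
    while concept is not None:
        yield concept
        concept = BROADER.get(concept)

def dated_period(concept):
    """Return the nearest concept (self or broader) that carries a year range."""
    for candidate in ancestors(concept):
        if candidate in PERIOD_YEARS:
            return candidate
    return None

def same_period_images(image):
    """List other images whose style period resolves to the same dated period."""
    target = dated_period(ANNOTATIONS[image])
    if target is None:
        return []
    return sorted(
        other for other, style in ANNOTATIONS.items()
        if other != image and dated_period(style) == target
    )
```

With this toy data, the chair image annotated with `aat:Guangxu` resolves to the Qing period and is matched with the image annotated with another Qing-era style, while the Ming image is excluded because no year range can be established for it.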
30
Use case: existing annotations
MATISSE, Henri
Le bonheur de vivre (The Joy of Life)
1905-1906
Oil on canvas, 69 1/8 x 94 7/8 in. (175 x 241 cm)
Barnes Foundation, Merion, PA
31
Textual annotation mapped to thesauri terms
32
Use case: how can we find this other Fauve painting?
DERAIN, Andre
The Turning Road, L'Estaque, 1906
Oil on canvas, 51 x 76 3/4 in. (129.5 x 195 cm)
Museum of Fine Arts, Houston, Texas
33
Issues w.r.t. the use case
Parse annotation to find matches with thesauri terms
– E.g. match artists to ULAN individuals
Artist-style links
– AAT contains styles; ULAN contains artists, but there is no link
• Learn link from corpora
• Derive it from other annotations
– Domain-specific rules/reasoning needed
• see example in SWRL doc
• Painters may have painted in multiple styles
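One way to read "derive it from other annotations" is to collect, per artist, the styles that already co-occur with that artist in annotated works. The Python sketch below assumes exactly that reading; the records, work names and identifiers are hypothetical, and the `min_support` threshold is an illustrative knob, not something taken from the project.

```python
from collections import Counter, defaultdict

# Hypothetical annotation records: (work, creator, style).
RECORDS = [
    ("work1", "ulan:Matisse", "aat:Fauve"),
    ("work2", "ulan:Matisse", "aat:Fauve"),
    ("work3", "ulan:Derain", "aat:Fauve"),
    ("work4", "ulan:Derain", "aat:Fauve"),
    ("work5", "ulan:Derain", "aat:Cubist"),
]

def derive_style_links(records, min_support=1):
    """Map each creator to the set of styles seen at least min_support times."""
    counts = defaultdict(Counter)
    for _work, creator, style in records:
        counts[creator][style] += 1
    return {
        creator: {style for style, n in styles.items() if n >= min_support}
        for creator, styles in counts.items()
    }

def artists_sharing_style(links, artist):
    """Return other artists that share at least one derived style with artist."""
    styles = links.get(artist, set())
    return sorted(a for a, s in links.items() if a != artist and s & styles)
```

Keeping a set of styles per artist, rather than a single value, reflects the point that painters may have painted in multiple styles; raising `min_support` trades recall for fewer spurious links.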
34
Use case: extracting additional knowledge from scope notes
35
Use case: semantics for query expansion (Hollink)
36
Issues
Many thesauri do not have a rich semantic structure like WordNet
Need for learning additional semantic relations between thesaurus concepts
Result: “ontologizing thesauri”
NLP is a crucial technique
37
Use case: supporting annotation of broadcasts
Current situation: mainly manual
Not feasible for large-scale digital archiving
Context documents for programs can be identified
Can we generate candidate annotations?
Example from CHOICE project
38
Issues
Broadcast archives have their own annotation template
– Typically specialization of Dublin Core
In-house thesaurus is usually available, but may be of limited use
– Consider including other (public) thesauri
Multi-linguality is prominent issue
Our experience: key role for user studies
– Dramatic changes of the existing business process of the archive
39
Use case: concept detectors in video (Snoek et al)
40
Challenges
Extremely tough problem
Example data: TRECVID 2005
Approach: combine content-based image retrieval with NLP and ontologies
Issue (among many others): context-specificity of TRECVID thesaurus
LSCOM lexicon: 229 - Weather
41
LSCOM lexicon: 110 – Female Anchor
Composite concept
Alignment needed for semantic search, e.g. with WordNet
42
Main observation of this talk
A combination of many different techniques is needed to be able to cope with the complexity of multimedia semantics
– NLP, segmentation, CBIR, visual feature detectors, visual ontologies, publicly available thesauri, thesauri mappings, dedicated reasoning techniques (time, space, default), personalization, presentation generation
Multi-disciplinary approach is a must
– And methods that combine text and ontologies are a key (but not the only) element of such an approach