Eric Sieverts
Media, information & communication, Amsterdam University of Applied Sciences /
Section Innovation & Development, University Library Utrecht
A pair of shoes in the thesaurus: reflexions on human and computer indexing
Society of Indexers Conference 2010: The challenging future of indexing, 30 September 2010, Middelburg
the holy grail for search systems:
let people find what they search
agenda
• searching in the world of Google
• what's wrong with Google (and the like)
• metadata and indexing
• indexing and knowledge organization
• knowledge organization and the semantic web
Eric Sieverts | [email protected] | [email protected] | http://www.library.uu.nl/medew/it/eric | Middelburg 30-9-2010
searching in the world of Google
Google appears to be "the measure of all things" in search:
– with Google "everything can be found"
but isn't there a paradox?
– if Google (or Yahoo! or Bing) contains everything (> 500,000,000,000 items), can "it" still be found?
>> anticipation of the user's intentions & peerless ranking algorithms become increasingly important
the basic search-and-find paradigm
[diagram: searcher / query, match, documents]
validity of the paradigm for free-text matching?
(paraphrasing a Dutch poetry title "Lees maar er staat niet wat er staat")
"just read;
it does not mean what you're reading"
• How does Google know what you mean?
• How does Google know what a document means?
filename: thesaurus.jpg
is this meant to be representative of the ease of use of thesauri?
to what query is this Google's answer?
Want to know something about "hallenkerken" (Dutch for "hall churches") through Google Books?
Google's first hit is a book about building thesauri, containing the word in a single example of broader and narrower terms
searching in the world of Google
The new Google Instant tries to predict user intent (the holy grail for search engine developers)
after typing 1 or 2 letters it already presents results for the statistically most probable (longer) words
but is Google really guessing right?
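A minimal sketch of such prefix-based prediction, assuming a tiny hypothetical query log (`QUERY_LOG`) and a hypothetical helper `predict`; how Google Instant actually works is not described here, this only illustrates "statistically most probable words for a typed prefix":

```python
from collections import Counter

# Hypothetical query log; a real engine would rely on vastly larger statistics.
QUERY_LOG = [
    "thesaurus", "thesaurus", "theatre", "theatre", "theatre",
    "theory", "indexing",
]

def predict(prefix, log=QUERY_LOG, k=3):
    """Return up to k logged queries that start with prefix, most frequent first."""
    counts = Counter(q for q in log if q.startswith(prefix))
    return [query for query, _ in counts.most_common(k)]

print(predict("th"))
```

With this log, typing "th" suggests "theatre" first, simply because it was searched most often, whether or not that is what this particular searcher meant.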
classical situation with controlled human indexing
searcher must enter the "term(s)" that have been used to characterize the subject
indexer must assign "correct" terms to characterize the document
in principle a perfect match is possible
however:
not user-friendly: searcher has to come up with the correct terms
expensive: indexers must analyze the document in order to assign the correct terms
classical situation with controlled human indexing
search in the world of Google
searcher just types some words (or often only one single word)
search system contains (all) the words from the documents themselves
often you don't find all you need - still satisfied?
why still user satisfaction?
despite recall and precision problems:
• search system looks attractively simple
• searcher always finds something (in 500 billion web pages)
• smart relevance ranking, providing some relevant items among the first 10 for most (simple) questions, for the majority of users, very often even at #1 already
and: who cares about lousy recall & precision (in the Google world)?
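For reference, a small worked sketch of the two measures: precision is the share of retrieved items that are relevant, recall the share of relevant items that were retrieved. The document identifiers and the helper name `precision_recall` are made up for illustration:

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query from two collections of ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result list: 4 documents retrieved, 5 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7", "d8", "d9"])
print(p, r)  # 2 of 4 retrieved are relevant; 2 of 5 relevant were retrieved
```

A searcher who only looks at the first few hits notices neither the three missed documents nor, if ranking is good, the two irrelevant ones.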
language technology at searcher side
original simple query expanded & disambiguated
statistics generate additional terms to refine queries
search system contains just the words from the documents themselves
improved queries will result in better answers?
language technology for better "query"
"word stemming" and "fuzzy search": automatically search for more word forms >> better recall
semantic network (or ontology) contains semantic relations between words: query expanded with semantically related terms >> better recall
for different meanings of a word, a semantic network (or ontology) contains relations with different words >> disambiguation >> better precision
no scientific evidence yet about how much improvement
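A rough sketch of the first two techniques, assuming a deliberately crude stemmer and a tiny hand-made table of related terms (`RELATED`); both are illustrative stand-ins, not a real stemming algorithm or semantic network:

```python
# Hypothetical mini "semantic network": each base term maps to related terms.
RELATED = {
    "church": ["chapel", "cathedral"],
    "shoe": ["boot", "footwear"],
}

def stem(word):
    """Very crude English stemmer: only strips plural endings."""
    word = word.lower()
    if word.endswith(("ches", "shes", "xes", "sses")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def expand_query(query):
    """Stem every query word and add its semantically related terms."""
    terms = []
    for word in query.split():
        base = stem(word)
        for term in [base] + RELATED.get(base, []):
            if term not in terms:
                terms.append(term)
    return terms

print(expand_query("hall churches"))
```

The expanded query matches more documents (better recall), but each added term can also pull in irrelevant ones, which is why the net improvement is hard to quantify.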
language technology for better "query"
• statistical analysis of the search result generates characteristic terms, from which users can choose to refine their query
• such words can also be derived from a synonym list, thesaurus, semantic network et cetera
mostly >> better precision
language technology at the document side
search with "correct" or "important" terms
language technology enriches document with "correct" term (from thesaurus) or derives characteristic terms from the text
in principle a perfect match is possible
automatic classification or enrichment
1. deriving specific terms from the document itself
on the basis of word lists and text analysis, specific types of terms (e.g. names of persons, places, products, parties, companies, etc.) can be recognized and marked as such
2. adding characteristics to classify a document
after training, a system can analyze documents and classify them with terms from a thesaurus or with classes from a taxonomy
despite some limitations it's getting better all the time
even for less tangible tasks such as sentiment analysis
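Point 1 can be sketched with plain word lists. The lists in `GAZETTEER` and the function `mark_entities` are hypothetical; real systems combine such lists with much more text analysis:

```python
import re

# Hypothetical word lists (gazetteers); the entries are illustrative only.
GAZETTEER = {
    "person": ["Rodin", "Balzac"],
    "place": ["Middelburg", "Utrecht", "Amsterdam"],
}

def mark_entities(text):
    """Return (name, type, position) for every word-list entry found in text."""
    found = []
    for entity_type, names in GAZETTEER.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, text):
                found.append((m.group(0), entity_type, m.start()))
    # Report entities in the order in which they occur in the text.
    return sorted(found, key=lambda item: item[2])

text = "The statue of Balzac by Rodin was discussed in Middelburg."
for name, entity_type, _ in mark_entities(text):
    print(name, "->", entity_type)
```

A pure list lookup cannot tell a person named "Utrecht" from the city; that is where text analysis of the surrounding words comes in.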
The Calais Web Service automatically creates rich semantic metadata
Named Entities, Facts, Events
geographical recognition in Google Books
training a system
[diagram labels: thesaurus, training documents, analysis module, "fingerprints", training module, enrichment of thesaurus]
Joop van Gent, Irion
classification with system
[diagram labels: enriched thesaurus, new documents, analysis module, "fingerprints", classification module, enriched documents]
Joop van Gent, Irion
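The two diagrams can be approximated in a few lines, under the assumption that a "fingerprint" is simply a bag of word counts and that classification picks the thesaurus term with the most similar fingerprint (cosine similarity). How the system shown actually computes fingerprints is not stated; all names and training texts below are hypothetical:

```python
import math
from collections import Counter

def fingerprint(text):
    """Assumed fingerprint: word counts of the lower-cased text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count fingerprints."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train(examples):
    """Enrich each thesaurus term with the fingerprints of its training documents."""
    model = {}
    for term, text in examples:
        model.setdefault(term, Counter()).update(fingerprint(text))
    return model

def classify(model, text):
    """Assign the thesaurus term whose fingerprint is most similar to the text."""
    fp = fingerprint(text)
    return max(model, key=lambda term: cosine(model[term], fp))

TRAINING = [
    ("chess", "checkmate with bishop and knight in the endgame"),
    ("chess", "the knight moves before the bishop in this opening"),
    ("equestrianism", "the horse and rider jump the fence in dressage"),
]
model = train(TRAINING)
print(classify(model, "endgame tips checkmate with bishop and knight"))
```

The surrounding words (bishop, checkmate) decide the class here, not the ambiguous word on its own.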
endgame tips: checkmate with bishop and knight (in Dutch: "horse")
chess
equestrianism
knowledge organization systems
metadata: more than keywords or thesauri?
knowledge organization systems
knowledge organization systems (KOS) can be more than just metadata models or tools for subject indexing
4 types of KOS:
• categorization systems (like classifications and taxonomies)
• metadata models (like MARC or Dublin Core)
• relational models (like thesauri, semantic networks, ontologies)
• term lists (like authority files)
more about ontologies in a moment
knowledge organization systems
4 types of functions for KOS:
• description and labeling (e.g. subject indexing with a thesaurus)
• definition (e.g. specification of the meaning of concepts in a thesaurus or ontology)
• translation (e.g. concordance between systems for interoperability)
• navigation (through the systematic structure of a taxonomy or classification, or the hierarchy of concepts in a thesaurus or ontology)
some of these play a role in the semantic web
ontologies
• "knowledge representation" in which knowledge about (a small part of) the world is stored
• mostly not directly used for subject indexing
• allows more complete and complex representations of reality than a thesaurus
• with many possible types of relations between concepts
• with fixed roles and properties of these concepts
• often for limited domains ("wine ontology")
• sometimes broader in so-called "core ontologies"
for example: CIDOC-CRM (conceptual reference model) for concepts, relations and properties in the field of cultural heritage
relations between some concepts in a simple "wine ontology"
example of the relations between concepts about the statue of Balzac by Rodin [in CIDOC-CRM]
semantic web
ontologies
"ontologies" in relation to the semantic web
• in a more general sense:
general name for all kinds of systems for subject indexing (thesauri, classifications, taxonomies, name authority lists, ...)
• essential requirement:
ontology must be available in a form that can be read, interpreted and processed by a computer program
→ needs notations and formal languages to describe them
ontology notation for semantic web
RDF (resource description framework): standard to describe relations between an object and its metadata
OWL (web ontology language): standard for computer-readable description of ontologies
RDFS (RDF Schema): standard for description of a KOS in RDF
SKOS (simple knowledge organization system): standard for describing KOSs and relations between them in RDF
resource description framework
• RDF (commonly serialized as XML) describes the relation between a resource (or object), its metadata and the metadata standards used
• resources should have a URI to refer to them
• RDF uses "namespaces" to refer to computer-readable descriptions of the standards (link via URL)
• RDF is meant to (re)use and to combine existing semantic systems
• properties (metadata) are registered in so-called triples: subject <predicate> object
(which we could perhaps also write: thing <property> value)
• RDF triples are used in "linked data"
rdf triples
subject <predicate> object
doc1 <has author> auth1
auth1 <has name> john smith
auth1 <has affiliation> home inc.
auth1 <has email> [email protected]
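The triples above can be held as plain tuples to see how a program follows them from one resource to the next. This is only a sketch: the identifiers are the informal ones from the slide rather than real URIs, the helper names `match` and `author_names` are made up, and the e-mail triple is left out:

```python
# The slide's triples as (subject, predicate, object) tuples.
TRIPLES = [
    ("doc1", "has author", "auth1"),
    ("auth1", "has name", "john smith"),
    ("auth1", "has affiliation", "home inc."),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that agree with the given parts (None = wildcard)."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

def author_names(triples, doc):
    """Follow doc -> author -> name, i.e. combine two triples."""
    names = []
    for _, _, author in match(triples, subject=doc, predicate="has author"):
        for _, _, name in match(triples, subject=author, predicate="has name"):
            names.append(name)
    return names

print(author_names(TRIPLES, "doc1"))
```

Because the object of one triple (auth1) is the subject of others, separate statements link up into a network.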
graphical representation of simple network of 4 RDF-triples
SKOS representation of thesaurus term & relations can be described in RDF
Term: Economic cooperation
Used For: Economic co-operation
Broader terms: Economic policy
Narrower terms: Economic integration, European economic cooperation, European industrial cooperation, Industrial cooperation
Related terms: Interdependence
Scope Note: Includes cooperative measures in banking, trade, industry etc., between and among countries.
SKOS representation in RDF
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:scopeNote>Includes cooperative measures in banking, trade, industry etc., between and among countries.</skos:scopeNote>
    <skos:broader>
      <skos:Concept>
        <skos:prefLabel>Economic policy</skos:prefLabel>
      </skos:Concept>
    </skos:broader>
    <skos:related>
      <skos:Concept>
        <skos:prefLabel>Interdependence</skos:prefLabel>
      </skos:Concept>
    </skos:related>
    <skos:narrower>
      <skos:Concept>
        <skos:prefLabel>Economic integration</skos:prefLabel>
      </skos:Concept>
    </skos:narrower>
    <!-- ...more narrower terms omitted ... -->
  </skos:Concept>
</rdf:RDF>
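To illustrate "computer readable": a short sketch that reads a shortened copy of this SKOS record with Python's standard XML parser and gets the thesaurus relations back. The helper `labels` is hypothetical, and a dedicated RDF library would normally be used instead of raw XML parsing:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the SKOS record shown above.
SKOS_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:broader>
      <skos:Concept><skos:prefLabel>Economic policy</skos:prefLabel></skos:Concept>
    </skos:broader>
    <skos:narrower>
      <skos:Concept><skos:prefLabel>Economic integration</skos:prefLabel></skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>"""

# Prefix-to-namespace mapping used in the search paths below.
NS = {"skos": "http://www.w3.org/2004/02/skos/core#"}

root = ET.fromstring(SKOS_XML)
concept = root.find("skos:Concept", NS)  # the top-level concept

def labels(relation):
    """Preferred labels of the concepts nested under skos:<relation>."""
    path = f"skos:{relation}/skos:Concept/skos:prefLabel"
    return [el.text for el in concept.findall(path, NS)]

term = concept.find("skos:prefLabel", NS).text
print(term, "| BT:", labels("broader"), "| NT:", labels("narrower"))
```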
RDF and "linked data"
a lot of buzz recently about "linked (open) data"
• it's just RDF-triples
• so it's computer readable
• it's on the internet
• so it's open
• it's meant to be re-used
• so it's an important ingredient for the semantic web
• it's standardized
• so it can be re-used
• everybody can (and has to) contribute data
• so it is also somewhat messy
the "linked data cloud" - September 2010 - 24 billion RDF triples online
[some of the datasets in the diagram:]
viaf: virtual international authority file
dbpedia: data from Wikipedia
last.fm: artists
geonames: 6.2 M toponyms
BBC: wildlife finder
LCSH
Reuters: openCalais
IMDB
topic maps
XML-based information systems
• that can be considered as ontologies
• that need no additional notations and/or standards to make them computer-readable
• that combine knowledge representations and the indexed information in a single self-contained, interlinked system
• suited to make local knowledge accessible
topic maps
consist of:
• concepts (= topics)
• that are characterized with
– "names" (can be any word, or even several words, to describe them) (names are topics themselves as well!)
– "types" (describing to what class of concepts it belongs) (types are topics themselves as well!)
– "associations" (specified types of relations between topics) (associations are also topics, thus having types!)
– "occurrences" (information items "about" the concept-topic) (occurrences are also topics, thus having types!)
• all of this described in XML
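A simplified sketch of these building blocks as Python data structures rather than XML. The class and field names, the occurrence URL and the small opera data set are assumptions for illustration; unlike a full topic map, names, associations and occurrences are not themselves modelled as topics here:

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    id: str
    names: list = field(default_factory=list)        # several names allowed
    type: str = ""                                   # id of another topic
    occurrences: list = field(default_factory=list)  # information items "about" it

@dataclass
class Association:
    type: str       # id of the topic that describes the relation type
    members: tuple  # ids of the topics that are related

# Hypothetical miniature topic map.
TOPICS = {
    "composer": Topic("composer", ["composer"]),
    "opera": Topic("opera", ["opera"]),
    "puccini": Topic("puccini", ["Puccini"], type="composer"),
    "tosca": Topic("tosca", ["Tosca"], type="opera",
                   occurrences=["http://example.org/tosca-libretto"]),
    "madama-butterfly": Topic("madama-butterfly",
                              ["Madama Butterfly", "Madame Butterfly"],
                              type="opera"),
}
ASSOCIATIONS = [
    Association("composed", ("puccini", "tosca")),
    Association("composed", ("puccini", "madama-butterfly")),
]

def related(associations, topic_id, assoc_type):
    """Ids of topics linked to topic_id by an association of the given type."""
    return [
        other
        for a in associations
        if a.type == assoc_type and topic_id in a.members
        for other in a.members
        if other != topic_id
    ]

print(related(ASSOCIATIONS, "puccini", "composed"))
```

Note that the type of "puccini" is just the id of another topic ("composer"), which is what "types are topics themselves" amounts to.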
simple example of opera topic map, adapted from Pepper
[diagram labels:]
topic types: composer, opera, city, country
topics: verdi, puccini, lucca, italy / italia / italië / italien, tosca, madame-butterfly / madama-butterfly, roma / rome
association types: situated in, influenced, composed, location for, place of birth
occurrences
© Antony Pitts, Kal Ahmed, MusicDNA
topic map application
Royal Academy of Music in London developed a model to describe "everything" around music, from work/composition to experience of a particular performance
conceptually similar to the relational FRBR model in the library world
© Antony Pitts, Kal Ahmed, MusicDNA
semantic web
• ultimate application of interoperability
• using combination of methods and standards for storing, structuring, filling, formalizing, describing and interpreting metadata
– RDF(S)
– ontologies (as well as thesauri, taxonomies, semantic networks, …)
– formal languages (like SKOS and OWL)
– annotation of resources/objects (= subject indexing)
• so that computers will be able to interpret meaning and to combine knowledge from separate systems
© Guus Schreiber UvA / VU
rdf annotation of web resource
iconclass annotation
© Guus Schreiber UvA / VU
"species ontology"
the semantic web (and interoperability) still require a lot of subject indexing, but with smart systems that:
• (help to) index dumb documents
• can infer meaning
• can match heterogeneous metadata
• can improve dumb searches
even a monkey may find correct information, even information he didn't know he was looking for