
Eric Sieverts

Media, information & communication
Amsterdam University of Applied Sciences /

Section Innovation & Development
University Library Utrecht

A pair of shoes in the thesaurus: reflexions on human and computer indexing

Society of Indexers Conference 2010: The challenging future of indexing
30 September 2010, Middelburg

the holy grail for search systems:

let people find what they search

agenda

• searching in the world of Google

• what's wrong with Google (and the like)

• metadata and indexing

• indexing and knowledge organization

• knowledge organization and the semantic web

Eric Sieverts | [email protected] | [email protected] | http://www.library.uu.nl/medew/it/eric | Middelburg 30-9-2010

searching in the world of Google

Google appears to be "the measure of all things" in search:

– with Google "everything can be found"



but isn't there a paradox?

– if Google (or Yahoo! or Bing) contains everything (> 500,000,000,000 items), can "it" still be found?

>> anticipation of user's intentions & peerless ranking algorithms become increasingly important


the basic search-and-find paradigm

search, search, search, search, search, ...

[diagram: searcher / query -- match -- documents]


search, search, search, ...

validity for free-text matching?

match


(paraphrasing a Dutch poetry title "Lees maar er staat niet wat er staat")

"just read;

it does not mean what you're reading"

• How does Google know what you mean?

• How does Google know what a document means?

filename: thesaurus.jpg

is this meant to be representative of the ease of use of thesauri?

to what query is this Google's answer?

Want to know something about "hallenkerken" (Dutch for "hall churches") through Google Books?

Google's first hit is a book about building thesauri, containing the word in a single example of broader and narrower terms


The new Google Instant tries to predict user intent (the holy grail for search engine developers)

after typing 1 or 2 letters it already presents results for statistically most probable (longer) words

but is Google really guessing right?
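The idea of presenting results for the statistically most probable longer word can be sketched as a frequency-ranked prefix lookup. This is only an illustration and not Google's actual method: the `QUERY_LOG` table, its counts and the `predict` function are all invented for the example.

```python
from collections import Counter

# Hypothetical query log; the words and counts are invented for illustration.
QUERY_LOG = Counter({"thesaurus": 50, "theatre": 120, "thermometer": 30, "index": 80})

def predict(prefix: str, top_n: int = 2) -> list[str]:
    """Return the most frequent logged queries that start with the typed prefix."""
    prefix = prefix.lower()
    matches = [(count, query) for query, count in QUERY_LOG.items() if query.startswith(prefix)]
    # Most frequent first; ties are broken alphabetically for stable output.
    matches.sort(key=lambda pair: (-pair[0], pair[1]))
    return [query for _, query in matches[:top_n]]

print(predict("th"))  # the two most frequent words starting with "th"
```

With this toy log, someone who wants "thesaurus" is first offered "theatre", which is exactly the risk the slide points at: the most probable word is not necessarily the intended one.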


match

classical situation with controlled human indexing

searcher must enter the "term(s)" that have been used to characterize the subject

indexer must assign “correct” terms to characterize the document

in principle a perfect match is possible



classical situation with controlled human indexing

however

not user-friendly: searcher has to come up with the correct terms

expensive: indexers must analyze the document in order to assign the correct terms

match

search in the world of Google

searcher just types some words (or often only one single word)

search system contains (all) the words from the documents themselves

often you don't find all you need - still satisfied?

match



why still user satisfaction?

despite recall and precision problems:

• search system looks attractively simple

• searcher always finds something (in 500 billion web pages)

• smart relevance ranking, providing some relevant items among the first 10 for most (simple) questions, for the majority of users, very often even at #1 already

and: who cares about lousy recall & precision (in the Google world)?


language technology at searcher side

original simple query expanded & disambiguated

statistics generate additional terms to refine queries

search system contains just the words from the documents themselves

improved queries will result in better answers?

match



language technology for better "query"

"word stemming" and "fuzzy search" : automatically search for more wordforms >> better recall

semantic network (or ontology) contains semantic relations between words : query expanded with semantically related terms >> better recall

for different meanings of a word, a semantic network (or ontology) contains relations with different words >> disambiguation >> better precision

no scientific evidence yet about how much improvement
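A toy sketch of stemming plus query expansion, to make the recall mechanisms above concrete. Everything in it is an assumption for illustration: the suffix-stripping `stem` function is a crude stand-in for a real stemmer, and the small `RELATED` table stands in for a semantic network or ontology.

```python
# Hypothetical mini "semantic network": term -> semantically related terms.
RELATED = {
    "shoe": ["boot", "sandal", "footwear"],
    "index": ["thesaurus", "catalogue"],
}

def stem(word: str) -> str:
    """Crude suffix stripping (a stand-in for a real stemming algorithm)."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def expand_query(query: str) -> list[str]:
    """Return the stemmed query terms plus their related terms, without duplicates."""
    expanded: list[str] = []
    for word in query.split():
        base = stem(word)
        for term in [base, *RELATED.get(base, [])]:
            if term not in expanded:
                expanded.append(term)
    return expanded

print(expand_query("shoes indexing"))
```

Searching with every term in the expanded list retrieves documents that never contain the literal query words, which is where the gain in recall comes from. Disambiguation would need the reverse step: using the related terms to decide which meaning of an ambiguous word is intended.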


language technology for better "query"

• statistical analysis of the search result generates characteristic terms, from which users can choose to refine their query

• such words can also be derived from a synonym list, thesaurus, semantic network et cetera

mostly >> better precision
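One simple way to derive such refinement terms is to count in how many retrieved documents each word occurs. The sketch below assumes that approach; the stopword list, the sample results and the function name `suggest_terms` are invented for the example, and real systems use more refined statistics.

```python
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def suggest_terms(results: list[str], query: str, top_n: int = 3) -> list[str]:
    """Suggest refinement terms: the words that occur in the most result texts."""
    query_words = set(query.lower().split())
    doc_freq = Counter()
    for text in results:
        words = set(text.lower().split())  # count each word once per document
        doc_freq.update(words - STOPWORDS - query_words)
    # Highest document frequency first; ties are broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]

results = [
    "knight and bishop endgame in chess",
    "knight moves in chess openings",
    "medieval knight armour",
]
print(suggest_terms(results, "knight"))
```

Choosing "chess" from the suggestions narrows an ambiguous query like "knight" to one meaning, which is why this kind of refinement mostly improves precision.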


language technology at the document

search with "correct" or “important” terms

language technology enriches document with "correct" term (from thesaurus) or derives characteristic terms from the text

in principle a perfect match is possible

match




automatic classification or enrichment

1. deriving specific terms from the document itself
on the basis of word lists and text analysis, specific types of terms (e.g. names of persons, places, products, parties, companies, etc.) can be recognized and marked as such

2. adding characteristics to classify a document
after training it, a system can analyze documents and classify them with terms from a thesaurus or with classes from a taxonomy

despite some limitations it's getting better all the time

even for less tangible tasks such as sentiment analysis


The Calais Web Service automatically creates rich semantic metadata

Named Entities

Facts

Events

geographical recognition in Google Books

training a system

[diagram: thesaurus + training documents -> analysis module -> “finger-prints” -> training module -> enrichment of thesaurus]

Joop van Gent, Irion

classification with system

[diagram: enriched thesaurus + new documents -> analysis module -> “finger-prints” -> classification module -> enriched documents]

Joop van Gent, Irion
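The train-then-classify pipeline can be sketched in a few lines. The slides do not say what a “finger-print” contains, so this sketch assumes it is a plain word-frequency vector and that classification picks the thesaurus term with the highest cosine similarity. That is an illustrative choice, not a description of the Irion system, and the training sentences are invented.

```python
import math
from collections import Counter

def fingerprint(text: str) -> Counter:
    """Analysis module (assumed): a simple word-frequency vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(examples: dict[str, list[str]]) -> dict[str, Counter]:
    """Training module: merge the fingerprints of each term's training documents."""
    model = {}
    for term, docs in examples.items():
        merged = Counter()
        for doc in docs:
            merged.update(fingerprint(doc))
        model[term] = merged
    return model

def classify(model: dict[str, Counter], text: str) -> str:
    """Classification module: the term whose fingerprint is most similar to the text."""
    fp = fingerprint(text)
    return max(model, key=lambda term: cosine(model[term], fp))

model = train({
    "chess": ["checkmate with king and rook", "opening moves for bishop and pawn"],
    "equestrianism": ["saddle and stable for your horse", "dressage rider and horse"],
})
print(classify(model, "endgame tips: checkmate with bishop and knight"))
```

A word-based fingerprint like this is easily misled when one word belongs to two domains, such as the Dutch word for the chess knight, which is also the word for "horse".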

endgame tips: checkmate with bishop and knight (in Dutch: "horse")

chess

equestrianism

knowledge organization systems

metadata: more than keywords or thesauri?

knowledge organization systems

knowledge organization systems can be more than just metadata models or tools for subject indexing

4 types of KOS:

• categorization systems (like classifications and taxonomies)

• metadata models (like MARC or Dublin Core)

• relational models (like thesauri, semantic networks, ontologies)

• term lists (like authority files)

more about ontologies in a moment


knowledge organization systems

4 types of functions for KOS:

• description and labeling (e.g. subject indexing with a thesaurus)

• definition (e.g. specification of the meaning of concepts in a thesaurus or ontology)

• translation (e.g. concordance between systems for interoperability)

• navigation (through the systematic structure of a taxonomy or classification, or the hierarchy of concepts in a thesaurus or ontology)

some of these play a role in the semantic web


ontologies

• “knowledge representation” in which knowledge about (a small part of) the world is stored

• mostly not directly used for subject indexing

• allows more complete and complex representations of reality than a thesaurus

• with many possible types of relations between concepts

• with fixed roles and properties of these concepts

• often for limited domains (“wine ontology”)

• sometimes broader in so-called “core ontologies”

for example: CIDOC-CRM (conceptual reference model) for concepts, relations and properties in the field of cultural heritage


relations between some concepts in a simple "wine ontology"

example of the relations between concepts about the statue of Balzac by Rodin [in CIDOC-CRM]

semantic web

ontologies

“ontologies” in relation to the semantic web

• in a more general sense:

general name for all kinds of subject indexing systems (thesauri, classifications, taxonomies, name authority lists, ...)

• essential requirement:

ontology must be available in a form that can be read, interpreted and processed by a computer program

→ needs notations and formal languages to describe them


ontology notation for semantic web

RDF: resource description framework
standard to describe relations between an object and its metadata

OWL: web ontology language
standard for computer-readable description of ontologies

RDFS: RDF-schema
standard for description of a KOS in RDF

SKOS: simple knowledge organization system
standard for describing KOSs and relations between them in RDF


resource description framework

• RDF is commonly written in XML (RDF/XML) to describe the relation between a resource (or object), its metadata and the metadata standards used

• resources should have a URI to refer to them

• RDF uses “namespaces” to refer to computer-readable descriptions of the standards (link via URL)

• RDF is meant to (re)use and to combine existing semantic systems

• properties (metadata) are registered in so-called triples: subject <predicate> object

(which we could perhaps also write: thing <property> value)

• RDF triples are used in "linked data"

rdf triples

subject <predicate> object

doc1 <has author> auth1

auth1 <has name> john smith

auth1 <has affiliation> home inc.

auth1 <has email> [email protected]


graphical representation of a simple network of 4 RDF triples
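The four triples form a small graph that a program can walk. A minimal sketch, assuming plain Python tuples instead of a real RDF library and plain strings instead of URIs; the helper names `objects` and `author_names` are invented for the example.

```python
# The four example triples as (subject, predicate, object) tuples.
# Real RDF would identify subjects and predicates by URI; strings keep the sketch short.
triples = [
    ("doc1", "has author", "auth1"),
    ("auth1", "has name", "john smith"),
    ("auth1", "has affiliation", "home inc."),
    ("auth1", "has email", "[email protected]"),  # address is redacted in the source
]

def objects(subject: str, predicate: str) -> list[str]:
    """All objects of triples with the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

def author_names(doc: str) -> list[str]:
    """Follow two linked triples: document -> author -> name."""
    return [name for author in objects(doc, "has author") for name in objects(author, "has name")]

print(author_names("doc1"))
```

Because the object of one triple ("auth1") is the subject of the others, the name is found by chaining triples rather than by looking in a fixed record field. The same chaining is what lets separately published data sets be combined.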

SKOS representation of thesaurus: term & relations can be described in RDF

Term: Economic cooperation
Used For: Economic co-operation
Broader terms: Economic policy
Narrower terms: Economic integration, European economic cooperation, European industrial cooperation, Industrial cooperation
Related terms: Interdependence
Scope Note: Includes cooperative measures in banking, trade, industry etc., between and among countries.


SKOS representation in RDF

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:scopeNote>Includes cooperative measures in banking, trade, industry etc., between and among countries.</skos:scopeNote>
    <skos:broader>
      <skos:Concept>
        <skos:prefLabel>Economic policy</skos:prefLabel>
      </skos:Concept>
    </skos:broader>
    <skos:related>
      <skos:Concept>
        <skos:prefLabel>Interdependence</skos:prefLabel>
      </skos:Concept>
    </skos:related>
    <skos:narrower>
      <skos:Concept>
        <skos:prefLabel>Economic integration</skos:prefLabel>
      </skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>
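To show what "computer readable" means in practice, the sketch below reads an abridged copy of this SKOS record with Python's standard `xml.etree.ElementTree` module. The abridgement (scope note and related term left out) and the helper name `labels` are choices made for the example; a real application would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the SKOS record (scope note and related term omitted).
SKOS_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept>
    <skos:prefLabel>Economic cooperation</skos:prefLabel>
    <skos:altLabel>Economic co-operation</skos:altLabel>
    <skos:broader>
      <skos:Concept><skos:prefLabel>Economic policy</skos:prefLabel></skos:Concept>
    </skos:broader>
    <skos:narrower>
      <skos:Concept><skos:prefLabel>Economic integration</skos:prefLabel></skos:Concept>
    </skos:narrower>
  </skos:Concept>
</rdf:RDF>"""

# Map the "skos" prefix to its namespace URI for use in search paths.
NS = {"skos": "http://www.w3.org/2004/02/skos/core#"}

root = ET.fromstring(SKOS_XML)
concept = root.find("skos:Concept", NS)  # the top-level concept

def labels(relation: str) -> list[str]:
    """Preferred labels of the concepts nested under the given SKOS relation."""
    path = f"skos:{relation}/skos:Concept/skos:prefLabel"
    return [element.text for element in concept.findall(path, NS)]

print(concept.find("skos:prefLabel", NS).text)
print("broader:", labels("broader"))
print("narrower:", labels("narrower"))
```

The program never needs to know what "broader" means in English. It only follows the element names defined in the SKOS namespace, which is the point of a standardized notation.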

RDF and "linked data"


a lot of buzz recently about "linked (open) data"

• it's just RDF-triples

• so it's computer readable

• it's on the internet

• so it's open

• it's meant to be re-used

• so it's an important ingredient for the semantic web

• it's standardized

• so it can be re-used

• everybody can (and has to) contribute data

• so it is also somewhat messy

the "linked data cloud" - September 2010 - 24 billion RDF triples online

[diagram labels:]
viaf: virtual international authority file
dbpedia: data from Wikipedia
last.fm: artists
geonames: 6.2 M toponyms
BBC: wildlifefinder
LCSH
Reuters: openCalais
IMDB

topic maps

XML-based information systems

• that can be considered as ontologies

• that need no additional notations and/or standards to make them computer-readable

• that combine knowledge representations and the indexed information in a single self-contained, interlinked system

• suited to make local knowledge accessible


topic maps

consist of:

• concepts (=topics)

• that are being characterized with

– “names” (can be any word, even multiple words, to describe them) (names are topics themselves as well!)

– “types” (describing to what class of concepts it belongs) (types are topics themselves as well!)

– “associations” (specified types of relations between topics) (associations are also topics, thus having types!)

– “occurrences” (information-items “about” the concept-topic) (occurrences are also topics, thus having types!)

• all of this described in XML
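The four building blocks can be mirrored in a small data structure. The sketch below borrows topic names from the talk's opera example and simplifies heavily: names, types and association types are plain strings rather than topics in their own right, and the XML serialization is left out. The class and function names are invented for the illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    names: list[str]   # a topic may carry several names
    type: str          # topic type (itself a topic in a full topic map)
    occurrences: list[str] = field(default_factory=list)  # pointers to information items

@dataclass
class Association:
    type: str          # association type, e.g. "composed"
    source: str        # id of the first topic
    target: str        # id of the second topic

topics = {
    "puccini": Topic(["Puccini"], "composer"),
    "tosca": Topic(["Tosca"], "opera"),
    "lucca": Topic(["Lucca"], "city"),
    "italy": Topic(["Italy", "Italia", "Italië", "Italien"], "country"),
}

associations = [
    Association("composed", "puccini", "tosca"),
    Association("place of birth", "puccini", "lucca"),
    Association("situated in", "lucca", "italy"),
]

def related(topic_id: str, assoc_type: str) -> list[str]:
    """Ids of the topics reached from topic_id through associations of one type."""
    return [a.target for a in associations if a.source == topic_id and a.type == assoc_type]

birthplace = related("puccini", "place of birth")[0]
print(birthplace, "is situated in", related(birthplace, "situated in"))
```

Keeping several names on one topic is what lets a search for "Italia" or "Italien" land on the same concept, and typed associations are what make navigation from composer to city to country possible.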


simple example of opera topic map, adapted from Pepper

[diagram]

topics: verdi, puccini, lucca, italy / italia / italië / italien, tosca, madame-butterfly / madama-butterfly, roma / rome

topic types: composer, opera, city, country

association types: situated in, influenced, composed, location for, place of birth

occurrences

© Antony Pitts, Kal Ahmed, MusicDNA


topic map application

Royal Academy of Music in London developed a model to describe "everything" around music, from work/composition to experience of a particular performance

conceptually similar to the relational FRBR model in the library world

semantic web

• ultimate application of interoperability

• using combination of methods and standards for storing, structuring, filling, formalizing, describing and interpreting metadata

– RDF(S)

– ontologies (as well as thesauri, taxonomies, semantic networks, …)

– formal languages (like SKOS and OWL)

– annotation of resources/objects (=subject indexing)

• so that computers will be able to interpret meaning and to combine knowledge from separate systems


iconclass annotation

© Guus Schreiber UvA / VU

"species ontology"

search, search, search, search, search, ...

match

the semantic web (and interoperability) still require a lot of subject indexing, but with smart systems that:

• (help to) index dumb documents

• can infer meaning

• can match heterogeneous metadata

• can improve dumb searches

even a monkey may find correct information, even information he didn't know he was looking for

