
Thesaurus Building

Martin Doerr

Center for Cultural Informatics, Institute of Computer Science

Foundation for Research and Technology - Hellas

Athens, June 17, 2013


Thesauri

FORTH-ICS June 17, 2013

Overview

• Motivation and Definitions
• Words, Terms and Concepts
• Knowledge Organisation Systems
• Thesaurus structure
• Thesaurus construction
• Examples


Motivation

The simple idea: standardise expressions of classification for better communication.
• Results in encyclopedias and knowledge bases; touches language engineering and cognitive science.
• Becomes a major issue of electronic communication and information access.

Questions:
• Is this item well characterized by this term?
• Would every expert expect to find this object under this same term?
• If not, would such terms be variants of the same concept?
• Is there a unique answer to “what is this”?
• Does this database contain descriptions of things falling under this term?

Historically: Roget’s Thesaurus, to assist writers with better words…


Words, Terms, Concepts

Words
• Constituents of natural languages.
• Categorical meaning, in contrast to “proper names”.
• Multiple senses depend on context. (Example: “order”)

Term
• Constituent of expert language.
• A word with a specific (categorical) meaning, either defined in a scientific document or common to an expert group and discipline. (Example: “hepatitis A”)

Concept
• A class or set of items grouped together on the basis of some implicit or explicit criterion or rule.
• The criterion can be unconscious or even innate! (Example: “δημόσιος υπάλληλος”, Greek for “civil servant”.)

A concept is not a term and not a language element!


Functions of Terminology

Unambiguous scientific expression
• Use in expert discussions, expert opinions (diagnoses!) and scientific publication.
• Defined in disciplinary dictionaries.

Research
• Defined ad hoc to discriminate items in a research project (archaeology!).
• Infer function from form, provenance from form, etc.

Data search
• Find all items (publications, objects etc.) possibly relevant for my research question.

Unfortunately, each function needs a different approach!


From Words to Concepts

Terms are created by
• selecting or inventing a word, often a compound (“black-figure pottery”)
• fixing an expert group (“classical archaeologists”)
• fixing a scientific context (“antique Greek vases”)
• A term alone makes no sense (“registration”)

A concept is detected
• as the sense of a term, as one sense of a word, or as the use of words in a text
• by analyzing context-specific use (written definitions, interviews, dialogues).
• A concept may be created for the first time by expressing or writing definitions.

A concept is formally identified
• by assigning an identifier to a description (“definition”) sufficient to clarify its meaning and disambiguate it from other concepts.


From Words to Concepts

Understanding
• Comes from disambiguating the concepts (senses) behind words (terms) in a context.
• This can be unconscious, conscious by context analysis, or by asking for clarification (dialogue).

Databases and database records are contexts
• Therefore humans can understand a word in a data field.

Computers do not understand senses
• Therefore machines cannot relate (retrieve) records by common sense.
• Therefore senses must be identified to machines as entities and be related to terms.


Classification

Concepts are many, words are few
• LCSH: 500,000 concepts, only general subjects; millions in our mind; UMLS: over 5,000,000 concepts.
• Words: some 60,000 in our mind, some 400,000 in a language, some 30,000 in a typical dictionary.
• A typical thesaurus: 10,000 to 100,000 concepts
• One word may have some dozens of meanings (referred concepts)
• One concept may be referred to by several words or terms

Terms are noun phrases, composed of words

Concepts are used to classify things in texts and database records, by referring to words, terms or concept identifiers.


Purpose of Classification 1

Organise a Universe of Discourse by concepts
• for cognition and comprehension
• recognition of discriminant attributes, attribute distribution
• for generalisation of observation
• for inferences from evidence to cause

exclusive, avoiding “mixed forms”, prototypical, selective on reality

Communication of conceptualisation
• presentation of a domain of discourse
• help for exploration of a topic

descriptive, rich, detailed, fuzzy, “cautious”, incomplete


Purpose of Classification 2

Determination of items in an automated communication process
• widely agreed-on naming for kinds of objects we share in a cultural space, e.g. artefact, kris (Malayan)
• analogous or constructive classification of kinds of objects from outside our cultural space with terms from our space, e.g. knife, dagger = puukko (Finnish)
• information seeking by constraining attribute values, e.g. weapons, 18th century, south-east Asia

Surrogate role: poor, binary, standardised, comprehensive, recall-oriented rather than detailed.

For electronic communication, prescribe few, mandatory high-level terms; in data records, also refer to all good expert terms.

From here on, we only talk about this function.


Knowledge Organisation Systems

For electronic communication
• Organize terms, concepts and their relationships into digital (machine-readable) dictionaries for human comprehension, such that machines can make inferences humans would approve.

Such inferences are
• identity (get all cats by “cat”)
• generalization (get “cats” by “felines”)
• related terms (get Heraklion by “Candia”, get “bridge construction” by “bridges”, get Heraklion by “Crete”)

We call these KOS
• E.g., LCSH, AAT, GeoNames, term lists, ULAN…
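The three kinds of inference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the concept identifiers (`c:cat` and so on), the record identifiers and the tiny dictionaries are hypothetical and do not come from any real KOS.

```python
# Hypothetical miniature KOS; all identifiers and records are illustrative.
LABELS = {                      # term (lower-cased) -> concept identifier
    "cat": "c:cat",
    "cats": "c:cat",
    "felines": "c:feline",
    "heraklion": "c:heraklion",
    "candia": "c:heraklion",    # a second name for the same concept
    "crete": "c:crete",
}
NARROWER = {"c:feline": {"c:cat"}}          # generalization links
RELATED = {"c:crete": {"c:heraklion"}}      # associative links

RECORDS = {"r1": "c:cat", "r2": "c:feline", "r3": "c:heraklion"}


def expand(concept):
    """Return the concept plus everything reachable through narrower links."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(NARROWER.get(current, ()))
    return seen


def search(term, include_related=False):
    """Find records indexed with the term's concept or a narrower concept."""
    concept = LABELS.get(term.lower())
    if concept is None:
        return set()
    wanted = expand(concept)
    if include_related:
        wanted |= RELATED.get(concept, set())
    return {rec for rec, c in RECORDS.items() if c in wanted}
```

Here `search("Candia")` reaches the Heraklion record through identity of the concept, `search("felines")` also returns the cat record through generalization, and `search("Crete", include_related=True)` reaches Heraklion through a related-term link.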


Kinds of KOS

A dictionary is a listing of words and phrases giving information such as spelling, morphology and part of speech, senses, definitions, usage, origin, and equivalents in other languages (bi- or multilingual dictionary).

A controlled vocabulary is a limited list of terms to be used in a database field. Only an authority may add terms.

Authority files are lists of persons (authors) or places (also gazetteers) together with recommended (controlled) names.

A classification system is a structure that organizes concepts into a (mono)hierarchy in order to partition some material following a sequence of decision criteria.


Kinds of KOS

An ontology “is a logical theory accounting for the intended meaning of a formal vocabulary, i.e. its ontological commitment to a particular conceptualization of the world. The intended models of a logical language using such a vocabulary are constrained by its ontological commitment. An ontology indirectly reflects this commitment (and the underlying conceptualization) by approximating these intended models.”

We use “ontology” only to formally describe the meaning of information structures.

A thesaurus is a controlled vocabulary of categorical terms related to concepts, with semantic relationships between concepts.

A monolingual thesaurus has terms from one expert group or community.

A multilingual thesaurus relates terms and concepts from two or more expert groups or communities (see next slide).


Multilingual thesauri

Translated thesauri:
• Each concept is optimally interpreted in words of another language or of multiple languages, to allow speakers of those languages to understand it better.

Correlated thesauri:
• Multiple thesauri with terms and concepts from the respective groups, and a set of concept-based mappings between the different thesauri of that aggregate, in order to process queries across different terminologies.

Interlingua:
• Concepts are created by fusing each cluster of similar concepts from different social groups into a new concept.
• One term from each user group is attached to the new concept as the identifier to be used by this group.
• The interlingua provides the sharing of concepts between social groups, e.g. as a legal basis used by the European Commission, like the EBTI.
• Note that the interlingua may not contain any of the original concepts of any user group; it contains a set of compromises to remove interpretational differences.
• Its concepts may again be translated and correlated to other thesauri.


Multilingual Thesauri Merged

[Figure: English Heritage Thesaurus (English vocabulary) and Merimee Thesaurus (French vocabulary). Labels in the diagram: interlingua, linguistic translation (twice), interthesaurus correlations, “+/-” marks.]


Thesaurus Structure

Nodes and Links
• Nodes for concepts and terms
• Nodes are reference objects with accepted identity.
• Links for semantic relations concept-concept, concept-term.
• Links express opinions, constitute the thesaurus.

3 dimensions to specialize links
• By meaning. E.g. synonymy: who used this expression for that concept, when and in which context...
• By version. When introduced, when withdrawn.
• By opinion. E.g. who says that this concept is subordinate to that...

2 dominant standards: ISO 2788 / ISO 5964 and SKOS
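One way to picture a link that carries the three dimensions (meaning, version, opinion) is as a small record. The sketch below is an assumption for illustration only: the field names, the years, the editors and the second link are hypothetical and are not prescribed by ISO 2788, ISO 5964 or SKOS.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    source: str                       # node the link starts from
    relation: str                     # e.g. "BT", "RT", "UF"
    target: str                       # concept or term the link points to
    context: str = ""                 # meaning: who used it, in which context
    introduced: Optional[int] = None  # version: year the link was introduced
    withdrawn: Optional[int] = None   # version: year the link was withdrawn
    asserted_by: str = ""             # opinion: who says so


def valid_links(links: List[Link], year: int) -> List[Link]:
    """Links that had been introduced and not yet withdrawn in the given year."""
    return [
        link for link in links
        if (link.introduced is None or link.introduced <= year)
        and (link.withdrawn is None or year < link.withdrawn)
    ]


# Hypothetical data: years, editors and the second link are invented.
LINKS = [
    Link("fortifications", "BT", "single built works",
         introduced=1990, asserted_by="editor A"),
    Link("fortifications", "BT", "object genres",
         introduced=1990, withdrawn=2000, asserted_by="editor B"),
]
```

Keeping version and opinion on the link rather than on the node means a withdrawn or disputed relation does not change the identity of the concepts it connects.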


ISO 2788:1986

Standard about the methodology, entities and relationships of a thesaurus, but not the format

Entities:
• thesaurus
• preferred term
• non-preferred term
• compound term
• node label (facet indicator)
• facets

Does not yet clearly distinguish concepts and terms.

Getty Research Institute uses the term “descriptor” for representing concepts


SKOS

Simple Knowledge Organization System (SKOS)
• It provides a model for expressing the basic structure and content of multilingual concept schemes such as thesauri, classification schemes, taxonomies, subject-heading systems, or any other type of structured controlled vocabulary.
• It is the first widely accepted encoding format in RDF
• Introduces persistent concept identifiers

Tends to be abused for place names (gazetteers) and person lists (particulars)…


Thesaurus Concepts (SIS-TMS)

[Figure: generalisation (isA) diagram of the SIS-TMS thesaurus notions. Classes shown: ThesaurusNotion, ThesaurusExpression, ThesaurusConcept, Term, Preferred Term, Non-Preferred Term, UsedForTerm, AlternativeTerm, ObsoleteTerm, NodeLabel, Descriptor, ObsoleteDescriptor, HierarchyTerm (concept), TopTerm (concept).]


Logical Thesaurus Structure

[Figure: diagram with a semantic part and a functional part. Labels shown: ThesaurusNotionType, Hierarchy, Facet, TopTerm, HierarchyTerm, Descriptor; M1_Class, S_Class, Token; ObjectFacet, Object Genres, Single Built Works, <single built works>, fortifications; link labels “BT”, “belongs to” and “In”.]


Thesaurus Structure: Concept Record

Intrathesaurus relations (ISO 2788)

Hierarchical Relations (from Concept/Descriptor, to Concept/Descriptor)
• BT (Broader Term)
• BTP (Broader Term Partitive) = actually a kind of RT
• BTG (Broader Term Generic) = the actual BT (IsA)

Associative Relations (from Concept/Descriptor, to Concept/Descriptor)
• RT (Related Term) = “world of ontologies”

Equivalence Relations (from Concept/Descriptor, to Term/Language)
• ALT (Alternative Term)
• UF (Used For Term), often extended by group/language

Now all thesauri also use a concept identifier (possibly a Linked Open Data identifier).


Thesaurus Structure: Linking Concepts

Interthesaurus relations (ISO 5964):

• partial equivalence; in SKOS: broader equivalence (is subset of), narrower equivalence (is superset of)

• exact equivalence (same set as)

• inexact equivalence (overlaps with): good for FTR only

• single to multiple equivalence (future!)
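The set readings in parentheses (subset, superset, same set, overlap) can be made concrete with a small sketch. It assumes, purely for illustration, that the items falling under each concept are available as explicit Python sets; in practice such mappings are expert judgements, and the function name is hypothetical.

```python
def mapping_relation(source_items: set, target_items: set) -> str:
    """Classify a mapping from a source concept to a target concept
    by comparing the sets of items that fall under each of them."""
    if source_items == target_items:
        return "exact equivalence"       # same set as
    if source_items < target_items:
        return "broader equivalence"     # source is a subset of the target
    if source_items > target_items:
        return "narrower equivalence"    # source is a superset of the target
    if source_items & target_items:
        return "inexact equivalence"     # the two sets merely overlap
    return "no equivalence"


# Hypothetical item identifiers standing in for catalogued objects.
daggers = {"obj1", "obj2"}
knives = {"obj1", "obj2", "obj3"}
```

With these sets, `mapping_relation(daggers, knives)` reports a broader equivalence, because everything under the source concept also falls under the target concept.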


Thesaurus Structure

HIERARCHICAL RELATIONSHIPS. AAT Definition

Broader and narrower (parent/child) relationships between concepts. Hierarchical relationships are generally either whole/part or genus/species; in the AAT, most hierarchical relationships are genus/species (e.g., chalice is a type of drinking vessel). Relationships may be polyhierarchical, meaning that each child may be linked to multiple parents.

• Broader term (BT). Also called a broader context. A vocabulary record to which another record or multiple records are subordinate in a hierarchy. In thesauri, the relationship indicator for this type of term is BT. Variations on the notation include BTG (broader term generic), BTP (broader term partitive), BTI (broader term instance), BT1 (broader term level 1), BT2 (broader term level 2), etc.

• Narrower term (NT). Also called narrower context. A record to which another record or multiple records are superordinate in a hierarchy (for example, Brewster chair is a narrower term to armchair). In thesauri, the relationship indicator for this type of term is NT. Variations on the notation include NTG (narrower term generic), NTP (narrower term partitive), NTI (narrower term instance), NT1 (narrower term level 1), NT2 (narrower term level 2), etc.

Do not use BT1, BT2, BTI. NT must always be the inverse of BT. Do not use BT for BTP!
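The rule that NT must always be the inverse of BT is easy to enforce mechanically: store only BT and derive NT, or check a hand-maintained NT table against BT. A minimal sketch, using the two examples from the AAT definition above; the function names are hypothetical.

```python
from collections import defaultdict

# (narrower concept, broader concept) pairs, i.e. "narrower BT broader".
BT = [("Brewster chair", "armchair"), ("chalice", "drinking vessel")]


def derive_nt(bt_pairs):
    """Build the NT table as the exact inverse of the BT pairs."""
    nt = defaultdict(set)
    for narrower, broader in bt_pairs:
        nt[broader].add(narrower)
    return dict(nt)


def inverse_violations(bt_pairs, nt):
    """Return the (narrower, broader) pairs present on one side only."""
    bt_set = set(bt_pairs)
    nt_set = {(n, b) for b, narrower in nt.items() for n in narrower}
    return bt_set ^ nt_set
```

An empty result from `inverse_violations` means the two tables agree; any pair it returns is a BT without its NT, or the reverse.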


Thesaurus Structure

ASSOCIATIVE RELATIONSHIPS. AAT Definition

The relationships between concepts that are closely related conceptually, but the relationship is not hierarchical because it is not whole/part or genus/species.

• Related term (RT). A concept that is associatively (not hierarchically) linked to another concept in a thesaurus. In thesauri, the relationship indicator for this type of term is RT.

We encourage defining specializations of RT.


Thesaurus Structure

“Equivalence relationships”. AAT Definition

The relationships between synonymous terms or names that refer to the same concept, typically distinguishing preferred terms (descriptors) and non-preferred terms (variants, or ALTs and UFs).

• Alternate descriptor (ALT). A variant form of a descriptor available for use; usually a singular form or a different part of speech than the descriptor (for example, lithograph is an alternate descriptor for the plural descriptor, lithographs). The relationship indicator for this type of term is ALT.

• Used for term. Also called a UF. In thesaurus jargon, a term that is not a descriptor and not an alternate descriptor. If the thesaurus is being used as an authority, a used for term is not authorized for indexing. Used for terms typically comprise spelling or grammatical variants of the descriptor or have true synonymity with the descriptor.

These are now “labels” in SKOS, concept-to-string links.


Thesaurus Structure

Scope note (AAT Definition): A note that describes how the term should be used within the context of the AAT, and provides descriptive information about the concept or expands upon information recorded in other fields. The Scope Note in AAT is analogous to the Descriptive Note in ULAN and TGN.


Example Thesaurus Record

Carmine (lake)

Scope Note, SN:

A generic name for two closely related organic red lakes that are obtained from scale insects, cochineal and kermes. Neither pigment is permanent enough for use in fine art because they discolor in sunlight. They were replaced first by madder and alizarin, then later by synthetic organic red colors.

Broader Terms, BT:

colorant (material), lake (pigment)

Alternative Terms, ALT:

carmine lake

Related Terms, RT:

cochineal (colorant), kermes (colorant)

Used For, UF: carmine lake, carmin (lake), Karmesin lake, new red lake, Kugel lake, Parisian lake, Munich lake, Venetian lake, Karmin (Lack)
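To show how such a record could be carried over to SKOS-style statements, here is a minimal sketch that turns the fields into (subject, property, value) triples. The SKOS property names are real, but the concept identifier `ex:carmine-lake` is hypothetical, the mapping of both ALT and UF to `skos:altLabel` is only one possible choice, and a real conversion would use concept identifiers rather than term strings for BT and RT.

```python
# Field values are taken from the record above; the identifier is hypothetical.
RECORD = {
    "id": "ex:carmine-lake",
    "descriptor": "Carmine (lake)",
    "SN": "A generic name for two closely related organic red lakes "
          "that are obtained from scale insects, cochineal and kermes.",
    "BT": ["colorant (material)", "lake (pigment)"],
    "ALT": ["carmine lake"],
    "RT": ["cochineal (colorant)", "kermes (colorant)"],
    "UF": ["carmine lake", "carmin (lake)", "Karmesin lake", "new red lake",
           "Kugel lake", "Parisian lake", "Munich lake", "Venetian lake",
           "Karmin (Lack)"],
}

# One possible mapping from thesaurus fields to SKOS properties (an assumption).
FIELD_TO_SKOS = {
    "SN": "skos:scopeNote",
    "BT": "skos:broader",
    "ALT": "skos:altLabel",
    "RT": "skos:related",
    "UF": "skos:altLabel",
}


def to_triples(record):
    """Turn one thesaurus record into a duplicate-free list of triples."""
    subject = record["id"]
    triples = [(subject, "skos:prefLabel", record["descriptor"])]
    for field, prop in FIELD_TO_SKOS.items():
        values = record.get(field, [])
        if isinstance(values, str):       # the scope note is a single string
            values = [values]
        for value in values:
            triple = (subject, prop, value)
            if triple not in triples:     # "carmine lake" is both ALT and UF
                triples.append(triple)
    return triples
```

Because ALT and UF both become labels, the term “carmine lake”, which the record lists under both fields, ends up as a single label statement.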


AAT term record


Thesaurus Construction: Global knowledge and isolated sources

• Most thesauri are small, the agreement of a few experts, integrated into one local database, seen from a specific view, in one language.
• Some thesauri cover large “general” subjects, and fail in specialisation.
• Scientists and scholars share systems of global concepts.
• Thesauri should be organised by domains

Examples of different scope and scale:
• General purpose authorities, high-level: AAT, LCSH, RAMEAU, SWD
• Specialized vocabularies: Beasley, SHIC, ACM

Use CIDOC CRM for global concepts
• Relate your concepts to as many thesauri as possible via persistent identifiers.
• Make sure a concept keeps its identity after an update.


Thesaurus Construction

Distinguish use case:
• thesauri for keyword search in free text (not my talk today)
• thesauri to fill in database (metadata) fields

The process
• Define a purpose/function
• Engineer terms from existing vocabularies, dictionaries, interviews
• Engineer concepts from terms, term use, interviews
• Relate concepts and terms
• Write concept records

It is a collaborative problem
• Manage information for common reference, expressions of opinion, agreement, disagreement
• Think of long-term maintenance: only a curated KOS can be used.


Thesaurus Construction

Define a purpose, for example (from D. Soergel):
• A classification of diseases for diagnosis
• A classification of medical procedures for insurance billing
• A classification of medical outcomes to assist with treatment evaluation
• A classification of commodities for customs
• A classification of educational objectives for instructional development
• A classification of occupations for matching job applicants with job openings and for pay scale
• A classification of skills for employee task assignments

In cultural heritage, think of research questions or preservation functions


Engineering Terms

Words and terms depend on social group and context:
• Natural language, dialect, scientific language, slang
• Σπίνος - Fringilla coelebs - chaffinch, σκυλάκι - ορχεοειδές, …

Can be traditional, missing, phrases, “coined”, ad hoc
• γιαταγάνι, kalathoi, gilded chairs, the Web, let’s call it...

Appear in different grammatical forms or combination rules
• pre-coordinated: “rugs, Persian”, “Persian rugs”
• post-coordinated: “Persia + rug”

Use “coined terms” if necessary. Use “post-coordination” (software will do it).
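Post-coordination means the indexer assigns simple descriptors and the software combines them at query time. A minimal sketch of that combination as a set intersection; the inverted index and the record identifiers are hypothetical.

```python
# Hypothetical inverted index: descriptor -> records indexed with it.
INDEX = {
    "Persia": {"r1", "r2"},
    "rug": {"r2", "r3"},
}


def post_coordinate(*descriptors):
    """Return the records that carry every one of the given descriptors."""
    sets = [INDEX.get(d, set()) for d in descriptors]
    return set.intersection(*sets) if sets else set()
```

A query for “Persia + rug” then finds only the record indexed with both descriptors, without anyone having coined a pre-coordinated term such as “Persian rugs” in advance.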


Engineering Terms

Controlling Synonyms

Term                    Preferred synonym
Teenager                Adolescent
Teen                    Adolescent
Youth (young person)    Adolescent
Pubescent               Adolescent
Alcoholism              Alcohol dependence
Inheritance             Heredity
Afro-American           African American
Black                   African American
Echocardiography        Ultrasonic cardiography

Concept-term relationships (terminological structure)
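Synonym control of this kind amounts to a lookup from any entered term to its preferred synonym. A minimal sketch using the term pairs from this slide; the function name and the normalisation of case and whitespace are assumptions.

```python
# Term -> preferred synonym, as on the slide (keys lower-cased for lookup).
PREFERRED = {
    "teenager": "Adolescent",
    "teen": "Adolescent",
    "youth (young person)": "Adolescent",
    "pubescent": "Adolescent",
    "alcoholism": "Alcohol dependence",
    "inheritance": "Heredity",
    "afro-american": "African American",
    "black": "African American",
    "echocardiography": "Ultrasonic cardiography",
}


def preferred_term(term):
    """Map a term to its preferred synonym; unknown terms pass through."""
    key = " ".join(term.lower().split())   # ignore case and extra spaces
    return PREFERRED.get(key, term)
```

Indexers and searchers who both pass their input through `preferred_term` meet on the same descriptor, whichever synonym they started from.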


Stepwise reduction of a set of terms

[Figure: five numbered columns (1 to 5) showing a set of terms being reduced step by step. Column labels: morphological variants consolidated; spelling variants consolidated; synonyms consolidated; quasi-synonyms consolidated; descriptors for post-combination ISAR system. Example terms: Disease, Illness, Sickness, Ailment, ending in “Disease, illness”.]

Following the lines from right to left, the searcher finds in column 1 all the terms and spelling variants to use.


Engineering Terms

Disambiguating homonyms
• Administration 1 (management)
• Administration 2 (drugs)
• Läufer 1 (Sportler). English: runner (athlete)
• Läufer 2 (Teppich). English: long, narrow rug
• Läufer 3 (Schach). English: bishop (chess)
• Discharge 1 (from hospital or program). German: Entlassung
• Discharge 2 (from organization or employment). Preferred synonym: Dismissal. German: Entlassung
• Discharge 3 (medical symptom). German: Absonderung, Ausfluss
• Discharge 4 (into a river). German: Ausfluss
• Discharge 5 (electrical). German: Entladung (which also means unloading)
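Numbering and qualifying homonyms gives each sense its own identity, so that a bare word can be resolved once a qualifier is supplied. A minimal sketch with the “Läufer” example from this slide; the data layout and function names are hypothetical.

```python
# word -> list of (sense identifier, qualifier, English equivalent)
SENSES = {
    "Läufer": [
        ("Läufer 1", "Sportler", "runner (athlete)"),
        ("Läufer 2", "Teppich", "long, narrow rug"),
        ("Läufer 3", "Schach", "bishop (chess)"),
    ],
}


def is_homonym(word):
    """True if the word stands for more than one registered sense."""
    return len(SENSES.get(word, [])) > 1


def resolve(word, qualifier):
    """Return the sense identifier for a word and qualifier, or None."""
    for sense_id, sense_qualifier, _english in SENSES.get(word, []):
        if sense_qualifier == qualifier:
            return sense_id
    return None
```

A data entry form could use `is_homonym` to decide when to ask the user for a qualifier before storing the term.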


Classifying by Term: A case

E.g. searching for comparative studies

How do I spell it? Ushabti, ushabty, ushebti, shawabty?

Will it be written the same everywhere?

Should I call it “grave goods” (AAT), “burial figurines”, “dolls”, “afterlife helpers”, “personality surrogate”, “burial ritual”?

And what about “χαρώνειο, δανάκη”?

Should I call it “toll”, “cheap coin”, “afterlife helper”, “corpse equipment”, “burial gift”, “burial rites”?

Would “grave goods” be distinctive enough?


Using Classification for Querying

How to find the characteristic term itself?

How to discover related literature?
• Relevant abstractions are not standardized

How to make statistics even about the same item?
• The same items can be referred to in a thousand ways

How to do comparative studies by features?
• Implicit features are not declared; explicit features need systematic documentation


Thesauri and Classification: A Case of a Term

Analyzing a term: What is an ushebti, what a shawabty?
• What did it mean, and when?
• What was it made for?
• How was it made?
• Where was it used?

Ideas, concepts, rather than words

Multiple aspects of interest!


Concepts

A concept is a class or set of entities which are grouped together on the basis of some criterion or rule

1. Inner representation: the personal comprehension
• cannot be communicated

2. A set of entities characterised by explicit properties (rules)
• “objective”, allows reasoning about analogous objects from other cultures/domains
• BUT: how to characterise properties?
• find discriminative attributes (what is an elephant?)
• non-verbal characteristics (aquarelle etc.)
• often difficult, misleading, impossible


Concepts

3. The “words of mentalese”
• the common language of the human mind
• basis for communication in foreign languages
• completely unknown

4. A set of entities characterised by common agreement
• depends on a social group (must be noted!!)
• covers everything people can recognise and agree on (implicit mentalese)
• does not allow for reasoning about analogous objects
• also called “primitive concept”
• This is what we need most (possibly plus rules)


Thesauri and Classification: Concepts

Concepts are relative
• to scope: fuzzy bounds, e.g. knife, weapon, seat; outer bounds for retrieval, inner for science, if negated inner bounds for retrieval…
• to purpose: weapon, friend, stone building, school house, neoclassic building
• there are essential classes (related to reason for existence), construction-related, morphological, functional, contextual

Concepts are related
• by nature: coffin - container, coffin - funerary object, bathtub - container
• polyhierarchies of genus-species OR isA OR generalization OR subclass-superclass (also provides a notion of similarity)
• associative: bridge - bridge construction, house - roof
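A polyhierarchy is a directed graph in which a concept may have several broader concepts. The sketch below, using the coffin and bathtub examples from this slide, collects all broader concepts of a concept and treats shared broader concepts as a crude hint of similarity; the function names are hypothetical and this is not a standard similarity measure.

```python
# concept -> set of directly broader concepts (a concept may have several).
BROADER = {
    "coffin": {"container", "funerary object"},
    "bathtub": {"container"},
}


def ancestors(concept):
    """All concepts reachable from the given one by following broader links."""
    result, stack = set(), list(BROADER.get(concept, ()))
    while stack:
        current = stack.pop()
        if current not in result:
            result.add(current)
            stack.extend(BROADER.get(current, ()))
    return result


def shared_broader(a, b):
    """Broader concepts two concepts have in common."""
    return ancestors(a) & ancestors(b)
```

Here a coffin and a bathtub turn out to be similar only in being containers, while the funerary aspect belongs to the coffin alone.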


Engineering Concepts

Concepts can be
• natural and explicit: there is a term for it in some language
• natural implicit (hidden): there is no word
• English “parts & accessories”, “too” translated to Greek
• terms need to be invented (“coined term”)
• new: like “the Web”
• compounds: “blue rugs”, “19th century Persian rugs”, open problem

Natural concepts are the best, but often others are needed
• often contextually overloaded (sword, ushebti)
• typically need contextual redefinition to become precise (AAT “knife”)
• or need to be combined with other terms

Generic concepts in particular often lack a term!


Engineering Concepts

Quality problem: Is classification reliable?

Completeness, at least partial?
  Do things not classified by a term really not belong to that term?
  Can at least partial sets of data be identified that are completely classified with respect to term x?
  Can I find things that may belong to term x under term x?

Classification for retrieval must be “inclusive” and complete for a given collection.
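One possible reading of these questions is sketched below. It assumes, hypothetically, that each record stores both the terms assigned to it and the terms a cataloguer explicitly examined; the `Record` class and the `retrieve` function are illustrative names and not part of any system described in these slides.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    name: str
    assigned: set = field(default_factory=set)  # terms the cataloguer assigned
    examined: set = field(default_factory=set)  # terms explicitly checked


def retrieve(records, term):
    """Split records into definite hits and possible hits for a term."""
    definite = [r.name for r in records if term in r.assigned]
    # Not assigned, but never checked against the term: may still belong.
    possible = [
        r.name
        for r in records
        if term not in r.assigned and term not in r.examined
    ]
    return definite, possible


records = [
    Record("item 1", assigned={"weapon"}, examined={"weapon"}),
    Record("item 2", examined={"weapon"}),  # checked and rejected
    Record("item 3"),                       # never checked
]
definite, possible = retrieve(records, "weapon")
print(definite)
print(possible)
```

In this sketch an "inclusive" search would return both lists, and the subset of records that were examined for a term is the part of the collection that is completely classified with respect to it.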


Engineering Concepts

Objects in particular can be seen under different aspects.

E.g.: school house, all-wooden building, 18th century American style

Characteristic aspects: functional, morphological, constructive, contextual

Need to make the aspect explicit (open problem).

Interesting problem: repurposing resources for other aspects.


Concept Definition

By “Scope Note”: a statement that clarifies the meaning and usage of a term within the thesaurus
  Definition by properties, occurrence, similarities
  Definition of scope - limitations and distinctions
  Guidance of users to similar, overlapping, associated concepts
  Context of usage, purpose, view
  Origin and history of the term and concept
  Reference to literature (“literature warrant”)
  Examples

Often the scope note only reminds us of a meaning we already share, and restricts it. Examples are most helpful as a reminder!


Thesauri and Classification: Concept Definition

Assisted by example
  A particular instance (e.g. Mona Lisa for “painting”)
  Optically, by graphics, drawings, images of models

Assisted by semantic placement
  Generalizations / specializations
  Associations to other concepts = co-occurrence in certain contexts, producer-process-product relations etc.
  Synonyms, similar concepts, translations
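The ingredients of a concept definition listed above can be gathered in one record. The following is a minimal sketch, not a standard schema: the `Concept` class and its field names are hypothetical, the scope note wording and the broader and related terms are made up for illustration, and only the Mona Lisa example comes from the slide. The labels SN, BT, RT and UF are the abbreviations conventionally used in thesaurus displays (scope note, broader term, related term, used for).

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred_term: str
    scope_note: str = ""
    examples: list = field(default_factory=list)      # particular instances
    broader: list = field(default_factory=list)       # generalizations
    narrower: list = field(default_factory=list)      # specializations
    related: list = field(default_factory=list)       # associations
    synonyms: list = field(default_factory=list)
    translations: dict = field(default_factory=dict)  # language code -> term

    def summary(self):
        """Render the record as a short plain-text entry."""
        lines = [self.preferred_term.upper()]
        if self.scope_note:
            lines.append("  SN: " + self.scope_note)
        if self.examples:
            lines.append("  Examples: " + ", ".join(self.examples))
        if self.broader:
            lines.append("  BT: " + ", ".join(self.broader))
        if self.related:
            lines.append("  RT: " + ", ".join(self.related))
        if self.synonyms:
            lines.append("  UF: " + ", ".join(self.synonyms))
        return "\n".join(lines)


painting = Concept(
    preferred_term="painting",
    # Illustrative wording only, not taken from any published thesaurus.
    scope_note="Works made by applying pigment to a surface.",
    examples=["Mona Lisa"],
    broader=["visual work"],   # hypothetical broader term
    related=["painter"],       # hypothetical producer-product association
    translations={"de": "Gemälde"},
)
print(painting.summary())
```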


S. R. Ranganathan

Three cognitive “planes”: Idea plane - Verbal plane - Notational plane
  Confusing them hinders analysis and problem solution: missing terms for existing ideas (concepts are many, words are few) and notational limitations inhibit work on the idea plane.

The invention of the “facets”
  Priority of the idea plane (= concept, not term)
  Conceptual structures are multidimensional
  Shelving of books is no argument; a taxonomy is not an index.

Colon Classification is a system of library classification developed by S. R. Ranganathan between 1925 and 1965. It uses five primary categories, or facets, to further specify the sorting of a publication. Collectively, they are called PMEST.


Thesauri and Classification: A “Facet” can be...

Grammatical element of an indexing expression:

e.g. subdivision by period, geography, genre (MARC)

Fundamental category, major facet, basic facet:

Ranganathan: Personality, Matter, Energy, Space, Time

CIDOC CRM: Period, Physical Entity, Conceptual Object, Actor, Place, Time-Span, Type, Material, Language

AAT: Objects, Agents, Activities, Styles and Periods, Materials, Physical Attributes, Associated Concepts.

=> Used to form compound terms and descriptive expressions
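How fundamental facets combine into a compound term can be sketched as follows. The facet names are the AAT ones listed above; the function `build_expression`, the word order it applies, and the assignment of each word to a facet are assumptions made for illustration only.

```python
# AAT top-level facet names as listed on the slide.
AAT_FACETS = [
    "Objects", "Agents", "Activities", "Styles and Periods",
    "Materials", "Physical Attributes", "Associated Concepts",
]


def build_expression(assignments):
    """Check facet names and join the assigned terms into one expression."""
    unknown = set(assignments) - set(AAT_FACETS)
    if unknown:
        raise ValueError(f"unknown facet(s): {sorted(unknown)}")
    # Assumption: modifiers first, the Objects term last, as in an
    # English noun phrase.
    order = [f for f in AAT_FACETS if f != "Objects"] + ["Objects"]
    words = []
    for facet in order:
        words.extend(assignments.get(facet, []))
    return " ".join(words)


print(build_expression({
    "Styles and Periods": ["19th century", "Persian"],
    "Objects": ["rugs"],
}))
print(build_expression({
    "Physical Attributes": ["blue"],
    "Objects": ["rugs"],
}))
```

Keeping the facet of each component explicit allows a search for "rugs" or for "Persian" to match the compound without parsing the phrase.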


CIDOC CRM / AAT mapping

[Figure: diagram linking CIDOC CRM classes to AAT facets; the connecting lines are not recoverable from this transcript.]

CIDOC CRM classes shown: E1 CRM Entity, E2 Temporal Entity, E4 Period, E5 Event, E77 Persistent Item, E70 Thing, E18 Physical Thing, E22 Man-Made Object, E28 Conceptual Object, E73 Information Object, E41 Appellation, E55 Type, E57 Material

AAT facets and hierarchies shown:
  ACTIVITIES: Disciplines, Events, Functions, …
  AGENTS: Organizations, People
  MATERIALS: Materials
  OBJECTS: Components, Containers, Costume, …
  PHYSICAL ATTRIBUTES: Attributes & Properties, Color, …
  STYLES & PERIODS: Styles & Periods
  ASSOCIATED CONCEPTS: Associated Concepts


Thesauri and Classification: About Facets

Aspects of analysis, “minor facets”: what Ranganathan meant.

e.g. MDA archeological thesaurus:
  armour by construction: scale armour
  armour by form: cuirass
  armour by function: parade armour

A striking example of the explicit use of aspect: SHIC (Social History and Industrial Classification), a “pure”, homogeneous thesaurus of human activities used by British museums to classify artifacts!

Use: to clarify which kind of criterion discriminates a concept in its definition.
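The armour example can be expressed as data in which every narrower term records the criterion that distinguishes it. The sketch below is illustrative: `ARMOUR_TERMS` and `group_by_criterion` are hypothetical names, and the guide-term wording "X by criterion" simply follows the pattern on the slide.

```python
from collections import defaultdict

# (narrower term, criterion) pairs from the slide's armour example.
ARMOUR_TERMS = [
    ("scale armour", "construction"),
    ("cuirass", "form"),
    ("parade armour", "function"),
]


def group_by_criterion(broader_term, terms):
    """Build guide terms of the form '<broader term> by <criterion>'."""
    groups = defaultdict(list)
    for term, criterion in terms:
        groups[f"{broader_term} by {criterion}"].append(term)
    return dict(groups)


for guide_term, members in group_by_criterion("armour", ARMOUR_TERMS).items():
    print(guide_term, "->", members)
```

Storing the criterion with each link makes the aspect explicit and keeps siblings defined by different criteria apart.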


Thesauri and Classification: Minor Facets in the AAT

The “Objects” facet (1998 edition) contains:
  About 1640 facet indicators
  About 600 with explicit criteria (“by form” etc.)
  Using 150 criteria

Frequency of explicit criteria: form: 35%, function: 30%, placement: 15%, construction: 15%, social context: 5%…

Conclusion:
  Minor facets need not be idiosyncratic
  Facet criteria form hierarchies under fundamental categories


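Example of three overlapping facets

[Figure: diagram of three overlapping facets around “swords”; the connecting lines are not recoverable from this transcript. Legend: term specialization, criteria assignment.]

Terms shown: objects, weapons, cutting and thrusting weapons, swords, sword-like objects, foils (swords), fencing swords, wooden swords

Criteria shown: sword-like, fighting and hunting, cutting and thrusting, fencing, wooden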


Thesaurus structure

Uses of facet analysis and hierarchy

Help to organize the concept space and establish relationships
  Discover concepts, especially general concepts spanning several disciplines

Assist the user in analyzing and clarifying a search problem:
  Elicit the facets involved
  Present hierarchical structure within each facet

Facilitate the search for general concepts such as Inflammation or Dependence (which occurs in the context of medicine, psychology and social relations)

Hierarchic query term expansion

These functions are useful in both controlled vocabulary and free-text searching.
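Hierarchic query term expansion can be sketched as follows. The terms are borrowed from the CRISATEL pigment example later in the deck, but the narrower-term links in `NARROWER`, the sample `documents`, and the function names are assumptions made for this illustration. The sketch also assumes the hierarchy has no cycles.

```python
# Assumed narrower-term links (illustrative, not the published hierarchy).
NARROWER = {
    "colorant": ["pigment", "dye"],
    "pigment": ["blue pigment", "red pigment"],
    "blue pigment": ["ultramarine blue", "smalt"],
    "ultramarine blue": [
        "natural ultramarine blue",
        "artificial ultramarine blue",
    ],
}


def expand(term, narrower=NARROWER):
    """Return the term plus all of its narrower terms, depth first."""
    expanded = [term]
    for child in narrower.get(term, []):
        expanded.extend(expand(child, narrower))
    return expanded


def search(documents, term):
    """Return ids of documents indexed with the term or any narrower term."""
    wanted = set(expand(term))
    return [doc_id for doc_id, terms in documents.items() if wanted & set(terms)]


# Hypothetical index: document id -> assigned terms.
documents = {
    "d1": ["smalt"],
    "d2": ["dye"],
    "d3": ["oil painting"],
}
print(expand("blue pigment"))
print(search(documents, "blue pigment"))
print(search(documents, "colorant"))
```

A query for the general term thus also finds records indexed only with a specific term, which is what makes a search for general concepts practical.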


Thesaurus structure

Concept discovery through facet analysis and hierarchy building

Through facet analysis and hierarchy building, the lexicographer often discovers concepts that are needed in searching or that enhance the logic of the concept hierarchy; he then needs to create terms for these concepts.

Consider
  train station, bus station, harbor, airport
    Common semantic component: traffic station
  gin, whiskey, cherry brandy, tequila, etc.
    Common semantic component: distinct distilled spirits (counterpart of the already lexicalized neutral distilled spirits)
  transactional analysis, dream analysis, insight therapy, Gestalt therapy, reality therapy, cognitive therapy
    Umbrella concept for structuring the hierarchy and for retrieval: analytic psychotherapy (methods that seek to assist patients in a personality reconstruction through insight into their inner selves)
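Once such an umbrella concept has been found, adding it to the hierarchy is a small restructuring step. The sketch below shows one way to do it. The function `add_umbrella`, the parent term "facilities" and the sibling "warehouse" are hypothetical; only the four station terms and "traffic station" come from the slide.

```python
def add_umbrella(narrower, parent, umbrella, members):
    """Insert `umbrella` under `parent` and move `members` beneath it."""
    children = narrower.setdefault(parent, [])
    missing = [m for m in members if m not in children]
    if missing:
        raise ValueError(f"not direct children of {parent!r}: {missing}")
    # Keep the other children in place and append the new umbrella concept.
    narrower[parent] = [c for c in children if c not in members] + [umbrella]
    narrower[umbrella] = list(members)
    return narrower


# Hypothetical starting hierarchy: parent -> narrower terms.
hierarchy = {
    "facilities": [
        "train station", "bus station", "harbor", "airport", "warehouse",
    ],
}
add_umbrella(
    hierarchy,
    parent="facilities",
    umbrella="traffic station",
    members=["train station", "bus station", "harbor", "airport"],
)
print(hierarchy)
```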


Examples, AAT

Art and Architecture Thesaurus (AAT): Top-level facets (1)

Associated Concepts: contains abstract concepts and phenomena that relate to the study and execution of a wide range of human thought and activity, including architecture and art in all media, as well as related disciplines. Also covered here are theoretical and critical concerns, ideologies, attitudes, and social or cultural movements (e.g., beauty, balance, connoisseurship, metaphor, freedom, socialism).

Physical Attributes: This facet concerns the perceptible or measurable characteristics of materials and artifacts as well as features of materials and artifacts that are not separable as components. Included are characteristics such as size and shape, chemical properties of materials, qualities of texture and hardness, and features such as surface ornament and color (e.g., strapwork, borders, round, waterlogged, brittleness).

Styles and Periods: This facet provides commonly accepted terms for stylistic groupings and distinct chronological periods that are relevant to art, architecture, and the decorative arts (e.g., French, Louis XIV, Xia, Black-figure, Abstract Expressionist).

Agents: The Agents facet contains terms for designations of people, groups of people, and organizations identified by occupation or activity, by physical or mental characteristics, or by social role or condition (e.g., printmakers, landscape architects, corporations, religious orders). Animals and plants are also gradually being added to the Living Organisms hierarchy of this facet.


Examples, AAT

Art and Architecture Thesaurus (AAT): Top-level facets (2)

Activities: encompasses areas of endeavor, physical and mental actions, discrete occurrences, systematic sequences of actions, methods employed toward a certain end, and processes occurring in materials or objects. Activities may range from branches of learning and professional fields to specific life events, from mentally executed tasks to processes performed on or with materials and objects, from single physical actions to complex games.

Materials: deals with physical substances, whether naturally or synthetically derived. These range from specific materials to types of materials designated by their function, such as colorants, and from raw materials to those that have been formed or processed into products that are used in fabricating structures or objects.

Objects: It is the largest of all the AAT facets. It encompasses those discrete tangible or visible things that are inanimate and produced by human endeavor; that is, that are either fabricated or given form by human activity. These range, in physical form, from built works to images and written documents. They range in purpose from utilitarian to the aesthetic. Also included are landscape features that provide the context for the built environment.

Brand Names: A recently added facet that allows additions from the conservation community, particularly where a material or process does not have a generic name.


Examples, AAT

Art and Architecture Thesaurus (AAT): Top-level facets (3)


Examples, AAT


Examples, CRISATEL

[Figure: excerpt of the CRISATEL thesaurus, Materials facet; the connecting lines are not recoverable from this transcript, and the grouping below follows the term names.]

Top-level facets: Materials, Evidence of technique mark and trace, Diagnostic examination, Alteration, Intervention

Materials
  <materials by function>: painting material, framing material, conservation restoration material, supporting material, coating material
  <materials by composition>: organic material, inorganic material, compound material
  <materials by origin>: plant origin material, mineral origin material, synthetic material, animal origin material
  <materials by form before use>: solid material; liquid, paste or soluble material

Further terms shown: surface preparation material, inserted material, pasting material, paint layer material, binding media, colorant, pigment, dye


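Examples, CRISATEL

[Figure: excerpt of the CRISATEL thesaurus around “colorant”; the connecting lines are not recoverable from this transcript.]

Terms shown: colorant, pigment, dye

Pigment terms shown: black pigment, animal origin pigment, lake pigment, red pigment, yellow pigment, inorganic pigment, blue pigment, white pigment, brown pigment, violet pigment, inert pigment, mineral origin pigment, green pigment, synthetic pigment, organic pigment, plant origin pigment

Individual pigments shown: carmine, artificial ultramarine blue, natural ultramarine blue, ultramarine blue, smalt, indigo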


Examples, CRISATEL

[Figure: excerpt of the CRISATEL thesaurus, painting techniques; the connecting lines are not recoverable from this transcript. Legend: top-level facets, hierarchies.]

Top-level facets: Materials, Evidence of technique mark and trace, Diagnostic examination, Alteration, Intervention

Hierarchies: Painting technique, Framing technique, Coating technique, Support manufacturing technique, Trace or mark, Painting portion or component

Guide terms shown: <paint layer application by visual effect>, <painting techniques by method>, <painting techniques by binding media>, <painting techniques by binding media application method>

Painting technique terms shown: painting technique without binding media, wax painting, painting technique with application of mixed binding media and pigment, painting technique with application of binding media before pigment, painting technique with application of binding media after pigment, oil painting, tempera, watercolor, synthetic medium painting


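Examples, Polemon

[Figure: top-level facets and hierarchies of the Polemon thesaurus. Terms are translated from Greek, and the split between the two lists follows the order in the transcript.]

Top-level facets: Materials (Υλικά), Activities (Δραστηριότητες), Agents (Δράστες), Styles and Periods (Τεχνοτροπίες και περίοδοι), Physical Attributes (Φυσικά Χαρακτηριστικά), Movable Objects (Κινητά), Associated Concepts (Σχετιζόμενες Έννοιες), Place (Τόπος)

Hierarchies: Protection (Προστασία), Person - Specialty (Άτομο - Ειδικότητα), Organizations (Οργανισμοί), Materials (Υλικά), Monument Types (Είδη Μνημείου), Place Names (Τοπωνύμια), Place (Τόπος), Historical Periods (Ιστορικοί περίοδοι), Morphology, Style (Μορφολογία τεχνοτροπία)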
