Murtha Baca
Using Controlled Vocabularies to Enhance Access to Cultural Information
SLA Seattle
June 16, 2008
“Seek, and ye shall find:”
Using Controlled Vocabularies to Enhance Access
to Cultural Information
Murtha Baca
Head, Vocabulary Program
Getty Research Institute
SLA Seattle
June 2008
Controlled Vocabularies: an Overview
TYPOLOGY of DATA STANDARDS
Data structure standards (metadata element sets): MARC, EAD, Dublin Core, CDWA, VRA Core
Data content standards (cataloging rules): AACR (RDA), ISBD, CCO, DA:CS
Data value standards (vocabularies): LCSH, LCNAF, TGM, AAT, ULAN, TGN, MeSH
Data format standards (standards expressed in machine-readable form): MARC, MARCXML, EAD, CDWA Lite XML, Dublin Core Simple XML schema, DC Qualified XML schema, VRA Core XML schema
What are vocabularies?
• Maps to guide people to information
  – creating / filling
  – searching / researching
  – organizing / classifying / thinking
• Collections of terminology where relationships between terms are represented
• Data value standards (i.e. what is used to “fill” metadata elements/categories or “containers” of information)
“Knowledge bases” -- bodies of knowledge represented by language (glossaries, dictionaries, thesauri, word lists)
What are vocabularies?
Types of terms in vocabularies
personal names: Collate, Charles B.
geographic names: Campbeltown (Argyll and Bute, Scotland, UK)
object names: clack valve
corporate names: Cambrian Railways
iconographic subjects and themes: The Legend of John Henry
genre terms: political cartoons, fish stories
multilingual equivalents: flat car (English) = Schienenwagen (German) = platforma (atklata) (Latvian)
What is a controlled vocabulary?
A tool for consistency in the language used in the recording and retrieval of information
What is a controlled vocabulary?
An organized arrangement of words and phrases that are used to index content and/or to retrieve content through navigation or a search
Typically a vocabulary that includes preferred terms and has a limited scope or describes a specific domain
Types of Controlled Vocabularies
Controlled Lists
Simple lists of terms used to control terminology
In a well-constructed controlled list:
• Each term must be unique (no homographs).
• Terms should all be members of the same class.
• Terms should not overlap in meaning.
• Terms should be equal in granularity or specificity.
• Terms are arranged alphabetically or in another logical order.
Controlled Lists cont.
May include terms from other controlled vocabulary resources (especially standard published vocabularies)
For some elements or fields in a database, a controlled list may be sufficient to control terminology, particularly where the terminology for that field is limited and unlikely to have synonyms or ancillary information. (Example: artists’ roles in ULAN, place types in TGN).
Controlled list: A simple list of terms used to control terminology
manuscripts
miscellaneous
paintings
photographs
sculpture
site installation
texts
vessels
Example of a controlled pick list for Classification
Patricia Harpring, 2008 © J. Paul Getty Trust
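A controlled pick list like the Classification example can be enforced at data entry. The following is a minimal sketch, not a description of any Getty system: the function names are hypothetical, and the term list is one reading of the pick list shown on the slide. It checks two of the rules for a well-constructed list (unique terms, alphabetical order) and rejects field values that are not on the list.

```python
# Terms as read from the Classification pick list on the slide (an assumption
# about where the run-together words break).
CLASSIFICATION_TERMS = (
    "manuscripts", "miscellaneous", "paintings", "photographs",
    "sculpture", "site installation", "texts", "vessels",
)


def check_controlled_list(terms):
    """Return a list of rule violations found in a controlled list (empty if none)."""
    problems = []
    normalized = [t.strip().lower() for t in terms]
    if len(set(normalized)) != len(normalized):
        problems.append("duplicate terms (each term must be unique)")
    if normalized != sorted(normalized):
        problems.append("terms are not in alphabetical order")
    return problems


def validate_value(value, terms=CLASSIFICATION_TERMS):
    """Accept a field value only if it appears on the controlled list."""
    return value.strip().lower() in terms


if __name__ == "__main__":
    print(check_controlled_list(CLASSIFICATION_TERMS))
    print(validate_value("Paintings"), validate_value("oil paintings"))
```

The other rules (same class, no overlap in meaning, equal granularity) are editorial judgments and are not something a simple check like this can verify.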
A list comprising sets of terms that are considered equivalent
No preferred term
Generally used for search and retrieval, providing access to content that is represented in natural, uncontrolled language
Felis domesticus
Synonym ring list
Jean-Baptiste Perroneau, Portrait of Magdaleine Pinceloup, © J. Paul Getty Museum; Chat Noir, Theophile-Alexandre Steinlen, © Sta. Barbara Museum of Art. Egyptian Cat, © Metropolitan Museum. Cat and Kittens, © National Gallery of Art. Maneki Neko, Japanese, © private collection.
© J. Paul Getty Trust; Patricia Harpring 2008
domestic cat
cat
Felis catus
house cat
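One way a synonym ring can be applied at search time is sketched below. This is an illustration under stated assumptions: the function names and the sample captions are hypothetical, and only the ring itself comes from the slide. Because a ring has no preferred term, the search simply expands the user's term to every member of the ring and matches any of them.

```python
import re

# The synonym ring from the slide; no member is "preferred".
SYNONYM_RINGS = [
    {"cat", "domestic cat", "house cat", "Felis catus", "Felis domesticus"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return every term in the same ring as `term`, or just `term` if it is in no ring."""
    needle = term.casefold()
    for ring in rings:
        if needle in {t.casefold() for t in ring}:
            return sorted(ring, key=str.casefold)
    return [term]


def search(term, documents):
    """Return documents containing any equivalent of `term` as a whole word or phrase."""
    patterns = [
        re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        for variant in expand_query(term)
    ]
    return [doc for doc in documents if any(p.search(doc) for p in patterns)]


if __name__ == "__main__":
    # Hypothetical uncontrolled captions, for illustration only.
    captions = [
        "Portrait of a lady with her house cat",
        "Felis catus, bronze figure",
        "Dog and puppies",
    ]
    print(search("domestic cat", captions))
```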
Compilations, usually in alphabetical order, that combine separate concepts into a “string,” as in the Library of Congress Subject Headings (LCSH)
Commercial fishing -- Japanese competition
Salmon fisheries -- law and legislation -- California
Subject Headings
Pre-coordination of terminology is a characteristic of subject headings; subject headings typically combine several unique concepts together.
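Pre-coordination means the separate concepts are already joined in the heading string. A minimal sketch of pulling a heading apart again, so that each concept can also serve as an access point, might look like this. The function names are hypothetical, and the "--" delimiter is simply the one used in the examples on the slide.

```python
from collections import defaultdict


def split_heading(heading):
    """Split a pre-coordinated heading string into its component concepts."""
    return [part.strip() for part in heading.split("--") if part.strip()]


def build_component_index(headings):
    """Map each component concept (case-insensitively) to the headings that contain it."""
    index = defaultdict(list)
    for heading in headings:
        for part in split_heading(heading):
            index[part.casefold()].append(heading)
    return dict(index)


# The two example headings from the slide.
HEADINGS = [
    "Commercial fishing -- Japanese competition",
    "Salmon fisheries -- law and legislation -- California",
]

if __name__ == "__main__":
    print(split_heading(HEADINGS[1]))
    print(build_component_index(HEADINGS)["california"])
```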
Subject Headings cont.
Subject headings--Pictures.
Pictures--Computer network resources.
World Wide Web--Subject access.
Taxonomies/Classifications
Vocabularies that organize a body of knowledge for a defined domain into conceptual categories, e.g. Nomenclature for Museum Cataloging, ICONCLASS.
The Greek heroic legends
  Story of Hercules (Heracles)
    Labors of Hercules
      Hercules chokes the Nemean lion
      Hercules kills the Hydra of Lerna
      Hercules captures the Ceryneian hind
      Hercules captures the Cretan bull
http://www.iconclass.nl/
Compilations of terms representing single concepts. Thesauri explicitly express relationships among terms via a semantic structure.
<visual works by form>
  dioramas
  diptychs
  medals
    medallions (medals)
  polyptychs
  triptychs
Thesauri
Authority Files
• Compilations of authorized terms or headings used by a single information system, organization, or consortium for cataloging, indexing, and documentation.
• Main purpose is to regulate usage.
• Include synonyms (“See” references) and related or associated terms (“See also” references).
• Examples include Library of Congress Name Authority File (LCNAF), local authorities for names, subjects, etc.
• Authority files may take the form of thesauri, word lists, etc.; in other words, any kind of vocabulary can be used as an authority.
More on Thesauri
Thesauri
Terms in a thesaurus may have the following three types of relationships:
• Equivalence
• Hierarchical
• Associative
Thesaural Relationships
• Equivalence
  – synonyms, spelling variations, language variations
• Hierarchical
  – broader to narrower
    • whole/part
    • genus/species
• Associative
  – related concepts
Equivalence Relationship: Terms/names denote the same thing; a preferred name is used for displays
Bulgarini, Bartolomeo (Sienese painter, circa 1337-1378)
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master
example from ULAN
Equivalence Relationship
still lifes
still life
still-lifes
still lives
nature morte
natura morta
stilleven
Stilleben
vie coye
ontbijtje
banketje
Whole/Part Relationship: “children” or narrower terms are part of the parent or broader term
España (nation)
  Andalucía (region)
    Almería (province)
    Cádiz (province)
    Córdoba (province)
    Granada (province)
    Huelva (province)
    Málaga (province)
    Sevilla (province)
Genus/Species Relationship: “children” represent types of the “parent” or broader term
funerary sculpture
  brasses
  effigies
  gisants
  ...
  haniwa
  tomb slabs
  ushabti
Associative Relationship: terms are related conceptually, but not necessarily hierarchically
Descriptor: charterhouses
Hierarchy: Built Complexes and Districts
Scope note - Carthusian monasteries.
Alternate Forms of Speech {ALT}: charterhouse
Synonyms and spelling variants {UF}: certose, charter houses, chartreuses
Related concepts: Carthusian (Religions hierarchy)
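The three thesaural relationships illustrated above can be held in a very small data structure. The sketch below is an assumption-laden illustration, not the data model of the AAT or any other Getty vocabulary: the class and method names are hypothetical, and the sample records reuse only terms that appear on the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    preferred: str                                       # term used for display
    used_for: List[str] = field(default_factory=list)    # equivalence (UF)
    broader: Optional[str] = None                        # hierarchical (parent)
    related: List[str] = field(default_factory=list)     # associative


class Thesaurus:
    def __init__(self, concepts):
        self.concepts: Dict[str, Concept] = {c.preferred: c for c in concepts}
        # Every preferred term and variant points back to the preferred term.
        self._lookup = {}
        for c in concepts:
            for name in [c.preferred, *c.used_for]:
                self._lookup[name.casefold()] = c.preferred

    def resolve(self, term):
        """Equivalence: map any variant to its preferred term (None if unknown)."""
        return self._lookup.get(term.casefold())

    def narrower(self, preferred):
        """Hierarchical: list the children of a preferred term."""
        return sorted(c.preferred for c in self.concepts.values()
                      if c.broader == preferred)

    def related(self, preferred):
        """Associative: list conceptually related terms."""
        return list(self.concepts[preferred].related)


SAMPLE = Thesaurus([
    Concept("still lifes", used_for=["still life", "still-lifes", "still lives",
                                     "nature morte", "natura morta"]),
    Concept("funerary sculpture"),
    Concept("effigies", broader="funerary sculpture"),
    Concept("haniwa", broader="funerary sculpture"),
    Concept("tomb slabs", broader="funerary sculpture"),
    Concept("charterhouses", used_for=["certose", "charter houses", "chartreuses"],
            related=["Carthusian"]),
])

if __name__ == "__main__":
    print(SAMPLE.resolve("Nature Morte"))
    print(SAMPLE.narrower("funerary sculpture"))
    print(SAMPLE.related("charterhouses"))
```

Storing only the broader term on each record and deriving the narrower terms on demand keeps the two directions of the hierarchy from drifting out of sync.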
indexer thesaurus: A thesaurus designed to control terminology and guide indexers in the choice of terms. See also end-user thesaurus.
indexing: Also called human indexing. The process of evaluating information and designating indexing terms by using controlled vocabulary that will aid in finding and accessing the cultural work record. Refers to indexing done by human labor, not to the automatic parsing of data into a database index, which is used by a system to speed up search and retrieval.
indexer thesaurus
A thesaurus designed for direct access by searchers rather than for use by indexers. Instead of controlling the terminology, the purpose of an end-user thesaurus is to help searchers find useful terminology for improving, narrowing, and broadening their queries.
end-user thesaurus
A vocabulary constructed with the goal of being interoperable with an existing vocabulary, e.g. a specialty vocabulary such as a conservation thesaurus that is intended to be linked to the superstructure of a larger vocabulary, such as the AAT.
satellite vocabulary
Vocabularies provide intellectual “paths” that can improve access to information
Harlem Renaissance
Negro Renaissance
New Negro Movement
Renaissance, Harlem
Renaissance, Negro
Jacob Lawrence, Tombstones, 1942
Example from the AAT
Why do we need vocabularies?
• Because of national and regional differences: lorries vs. trucks, lifts vs. elevators, Tom Thumb golf courses vs. miniature golf courses
• Because of historical vs. contemporary names: Iran vs. Persia vs. Islamic Republic of Iran
• Because of political and social changes: KhoiKhoi vs. Hottentot
• Because of linguistic differences: Titian vs. Tiziano vs. Titien; pottery vs. keramik vs. céramique
• To disambiguate homographs: sinopia (pigment -- Materials hierarchy) vs. sinopia (preliminary drawing -- Visual Works hierarchy)
Why do we need vocabularies?
Thesaural relationships provide greater research/searching capabilities:
drawings
  <drawings by function>
    preliminary drawings
      underdrawings
        sinopie
Issues in vocabulary-enhanced searching
• User interfaces are problematic
• Optimally, controlled vocabularies should be used both on the “back end” and on the “front end” to be most effective
• Economics: consistent implementation of controlled vocabularies is time- and labor-intensive
• Vocabulary control is almost non-existent on the Web at present
Search “ARES” Against Getty Web site
“ARES” did not match any pages
Improve recall by ORing equivalent names (Ares, Mars)
“Ares OR Mars” now retrieves 37 pages
Search “ARES” Against Google (returns 1,250,000 pages; none of the first 6 pages are relevant)
Increase precision by ANDing the broader/parent term of ARES, “Major Gods”
“Ares AND Major Gods” now narrows to 506 hits (all of the first 7 pages are relevant)
Recall and Precision
Note that when searching “Ares” against the Getty site, it retrieves nothing. So we need to include synonyms/equivalents (OR “Mars”) to improve recall. When performing the same search against Google, however, it returns too many hits. So we need to combine the broader term (AND “Major Gods”) to improve precision. This illustrates how important it is for a retrieval system to be flexible and let the user decide how to refine the search according to specific situations.
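The effect of ORing equivalents and ANDing a broader term can be tried on a toy corpus. Everything in this sketch is an assumption made for illustration: the four sample documents, the function names, and the simple substring matching are invented here and do not reproduce the Getty or Google searches described above. Recall is the share of relevant documents that were retrieved; precision is the share of retrieved documents that are relevant.

```python
def matches(doc, phrase):
    """Case-insensitive substring match (a deliberately naive matcher)."""
    return phrase.casefold() in doc.casefold()


def search(docs, any_of, all_of=()):
    """Retrieve docs matching at least one `any_of` term and every `all_of` term."""
    return [d for d in docs
            if any(matches(d, t) for t in any_of)
            and all(matches(d, t) for t in all_of)]


def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(relevant) if relevant else 0.0


def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved) if retrieved else 0.0


# Invented sample documents; the first two are the "relevant" ones.
DOCS = [
    "Statue of Mars, Roman, Major Gods",
    "Ares and Aphrodite, Major Gods of Olympus",
    "ARES amateur radio emergency service",
    "Mars rover landing photographs",
]
RELEVANT = DOCS[:2]

if __name__ == "__main__":
    for label, hits in [
        ("Ares", search(DOCS, ["Ares"])),
        ("Ares OR Mars", search(DOCS, ["Ares", "Mars"])),
        ("(Ares OR Mars) AND Major Gods",
         search(DOCS, ["Ares", "Mars"], all_of=["Major Gods"])),
    ]:
        print(label, len(hits), recall(hits, RELEVANT), precision(hits, RELEVANT))
```

In this toy setup the OR step raises recall without helping precision, and the AND step then removes the irrelevant hits, which mirrors the two-step refinement described in the note.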
Examples of standards for data values:
• The Getty Vocabularies
• Library of Congress Name Authority File (LCNAF)
• Library of Congress Subject Headings (LCSH)
• ICONCLASS
The Getty Vocabularies
Compiled and maintained by the Getty Vocabulary Program
Union List of Artist Names® (ULAN) 117,600 ‘records’; 257,241 names
Art & Architecture Thesaurus® (AAT) 33,150 ‘records’; 128,075 terms
Getty Thesaurus of Geographic Names® (TGN) 911,300 ‘records’; 1,102,200 names
• Focus on the visual arts, architecture, & material culture
• Are compiled resources (not comprehensive)
• Grow through contributions
• May be licensed (vendors of collection management systems, others)
http://www.getty.edu/research/conducting_research/vocabularies/
Controlled vocabularies: Why bother?
Αγία Σοφία
Ayasofya
Church of the Holy Wisdom
Hagia Sophia
Haghia Sophia
Saint Sophia
Sancta Sophia
St. Sophia
Constantinople
Constantinopolis
Costantinopoli
Estambul
Istanbul
Konstantinopel
New Rome
Mikligard
Tsargrad
Tsarigrad
names from Getty Thesaurus of Geographic Names (TGN)
deposit slip/deposit ticket = paying-in slip
confirmation chit = receipt, deposit receipt
= cargo shorts
= board shorts
desk?
cartonnier?
chest?
cabinet?
dolls?
figurines?
statuettes?
idols?
carvings?
sculptures?
Giambologna?
Giovanni da Bologna?
Jean de Boulogne?
• Users may call the same artist by various names
• Items have been cataloged using different names for the same artist
• Published misspellings provide access points
NAMES:
O’Keeffe, Georgia
Georgia O’Keeffe
O’Keefe, Georgia
Stieglitz, Alfred, Mrs.
Georgia O'Keefe
Ram's Skull With Brown Leaves
Roswell Museum and Art Center
Roswell, New Mexico
from: http://www.roswellmuseum.org/
Common misspellings
Anonymous artist, later named
• former appellations
• name is now known
NAMES:
Bulgarini, Bartolomeo
Bartolomeo Bolgarini
Bulgarini da Siena, Bartolommeo
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master
The Crucifixion, mid 1300s, tempera on wood, The Hermitage (St. Petersburg, Russia)
image from http://sunserv.kfki.hu/~arthp/html/l/lorenzet/ugolino/index.html
Database issues
• repeating vs. non-repeating fields
• vocabulary-controlled vs. free-text fields (for indexing vs. display)
• “built-in thesauri”; vocabulary-assisted searching OR
• addition of broader terms, variants, at record level
If we use terms from a standard source such as LCSH or the AAT, why do we need our own “local” authority file(s)?
Why do we need local authorities?
• Local authorities can provide terms not found in published authorities, including non-expert and even “wrong” terms and names.
• An authority record can remind the cataloger/indexer/abstractor of policies regarding local usage of the term.
• An authority record can contain relevant/appropriate variant names for the term and identify the one that is preferred and used by the project or institution.
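A local authority record of the kind described here can be sketched as a small structure that carries a preferred name, its variants (including known misspellings), and a local usage note. This is an illustration only: the class, the function names, and the wording of the usage note are hypothetical, while the names themselves are taken from the O’Keeffe and Bulgarini slides.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AuthorityRecord:
    preferred: str                                      # form used by this institution
    variants: List[str] = field(default_factory=list)   # incl. "wrong" forms
    local_note: str = ""                                # reminder of local policy


def normalize(name):
    """Fold case, whitespace, and curly apostrophes so variants compare equal."""
    return " ".join(name.replace("\u2019", "'").split()).casefold()


def build_index(records):
    """Map every preferred name and variant to its authority record."""
    index = {}
    for record in records:
        for name in [record.preferred, *record.variants]:
            index[normalize(name)] = record
    return index


RECORDS = [
    AuthorityRecord(
        "O\u2019Keeffe, Georgia",
        ["Georgia O\u2019Keeffe", "O\u2019Keefe, Georgia", "Stieglitz, Alfred, Mrs."],
        local_note="Hypothetical policy note: use inverted form in indexes.",
    ),
    AuthorityRecord(
        "Bulgarini, Bartolomeo",
        ["Bartolomeo Bolgarini", "Bulgarini da Siena, Bartolommeo",
         "Lorenzetti, Ugolino", "Master of the Ovile Madonna", "Ovile Master"],
    ),
]
INDEX = build_index(RECORDS)


def lookup(name, index=INDEX):
    """Return the authority record for any known form of a name, else None."""
    return index.get(normalize(name))


if __name__ == "__main__":
    print(lookup("o'keefe, georgia").preferred)
    print(lookup("Ovile Master").preferred)
```

Keeping the misspelled and former names as variants is what lets a searcher who types the "wrong" form still land on the right record.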
What about social tagging and folksonomies?
In the context of the Web, the act of associating terms (called “tags”) with an information object (e.g. a Web page, an image, a streaming video clip), thus describing the item and enabling keyword-based classification and retrieval. Tags – a form of user-generated metadata – from communities of users can be aggregated and analyzed, providing useful information about the collection of objects with which the tags have been associated.
tagging
The decentralized practice and method by which individuals and groups create, manage, and share terms, names, etc. (called “tags”) to annotate and categorize digital resources in an online “social” environment. A folksonomy is the result of social tagging. Also referred to as collaborative tagging, social classification, social indexing, mob indexing, folk categorization.
social tagging
An orderly classification that explicitly expresses the relationships, usually hierarchical (e.g., genus/species, whole/part, class/instance), between and among the things being classified.
taxonomy
An assemblage of concepts, represented by terms and names (called “tags”), the result of social tagging. A folksonomy is not a taxonomy.
folksonomy
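The aggregation of tags mentioned in the definition of tagging can be sketched in a few lines. The sample taggings, identifiers, and function name below are invented for illustration; the sketch only shows the general idea of counting how many distinct users applied each tag to each object.

```python
from collections import Counter


def aggregate_tags(taggings):
    """Count, per object, how many distinct users applied each (case-folded) tag.

    `taggings` is an iterable of (user, object_id, tag) tuples.
    """
    counts = {}
    seen = set()
    for user, object_id, tag in taggings:
        key = (user, object_id, tag.strip().casefold())
        if key in seen:          # ignore the same user repeating the same tag
            continue
        seen.add(key)
        counts.setdefault(object_id, Counter())[key[2]] += 1
    return counts


# Hypothetical user-generated tags for one image.
TAGGINGS = [
    ("user-a", "img-1", "cat"),
    ("user-b", "img-1", "Cat"),
    ("user-b", "img-1", "kitten"),
    ("user-a", "img-1", "cat"),
]

if __name__ == "__main__":
    print(aggregate_tags(TAGGINGS)["img-1"].most_common())
```

The result is a flat tally with no relationships between tags, which is one way to see why a folksonomy is not a taxonomy.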
Vocabularies in the Corporate World
Disney Titles (preferred forms)
• One Hundred and One Dalmatians
• 101 Dalmatians
• 101 Dalmatians II: Patch’s London Adventure
• 101 Dalmatians, Disney’s
• 101 Dalmatians: Escape from De Vil Manor
• Sing Along Songs: Disney’s: 101 Dalmatians – Pongo & Perdita
• 101 Dalmatians Holiday Art
Disney Variants
One Hundred and One Dalmatians
One Hundred and One Dalmations
One Hundred and One Dalmatians (animated)
101 Dalmations (animated feature film)
101 Dalmations

101 Dalmatians
One Hundred and One Dalmatians (live action)
One Hundred and One Dalmations (live action)
One Hundred and One Dalmations (live-action feature)
101 Dalmations
What’s in a name? That which we call a rose
By any other name would smell as sweet.
Shakespeare, Romeo and Juliet, Act II, scene ii
Murtha Baca
Head, Vocabulary Program
Getty Research Institute
http://getty.edu/research/conducting_research/vocabularies/