murtha baca

Post on 12-May-2015

DESCRIPTION

Using Controlled Vocabularies to Enhance Access to Cultural Information

TRANSCRIPT

SLA Seattle

June 16, 2008

“Seek, and ye shall find:”

Using Controlled Vocabularies to Enhance Access to Cultural Information

Murtha Baca

Head, Vocabulary Program

Getty Research Institute

SLA Seattle

June 2008

Controlled Vocabularies: an Overview

TYPOLOGY of DATA STANDARDS

Data structure standards (metadata element sets): MARC, EAD, Dublin Core, CDWA, VRA Core

Data content standards (cataloging rules): AACR (RDA), ISBD, CCO, DA:CS

Data value standards (vocabularies): LCSH, LCNAF, TGM, AAT, ULAN, TGN, MeSH

Data format standards (standards expressed in machine-readable form): MARC, MARCXML, EAD, CDWA Lite XML, Dublin Core Simple XML schema, DC Qualified XML schema, VRA Core XML schema

What are vocabularies?

• Maps to guide people to information
  – creating / filling
  – searching / researching
  – organizing / classifying / thinking

• Collections of terminology where relationships between terms are represented

• Data value standards (i.e. what is used to “fill” metadata elements/categories or “containers” of information)

“Knowledge bases” -- bodies of knowledge represented by language (glossaries, dictionaries, thesauri, word lists)

What are vocabularies?

Types of terms in vocabularies

personal names: Collate, Charles B.

geographic names: Campbeltown (Argyll and Bute, Scotland, UK)

object names: clack valve

corporate names: Cambrian Railways

iconographic subjects and themes: The Legend of John Henry

genre terms: political cartoons, fish stories

multilingual equivalents: flat car (English) = Schienenwagen (German) = platforma (atklata) (Latvian)

What is a controlled vocabulary?

A tool for consistency in the language used in the recording and retrieval of information

An organized arrangement of words and phrases that are used to index content and/or to retrieve content through navigation or a search

Typically a vocabulary that includes preferred terms and has a limited scope or describes a specific domain

Types of Controlled Vocabularies

Controlled Lists

Simple lists of terms used to control terminology

In a well-constructed controlled list:

• Each term must be unique (no homographs).
• Terms should all be members of the same class.
• Terms should not overlap in meaning.
• Terms should be equal in granularity or specificity.
• Terms are arranged alphabetically or in another logical order.
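Two of these rules, uniqueness and ordering, can be checked mechanically; the others (same class, no overlap in meaning, equal granularity) call for human judgment. A minimal sketch of such a check follows. The function name `check_controlled_list` and the sample terms are illustrative assumptions, not part of any Getty tool.

```python
def check_controlled_list(terms):
    """Return a list of problems found in a simple controlled list.

    Only the mechanically testable rules are covered here:
    every term is unique, and terms are in alphabetical order.
    """
    problems = []
    seen = set()
    for term in terms:
        key = term.casefold()
        if key in seen:
            problems.append(f"duplicate term: {term}")
        seen.add(key)
    folded = [t.casefold() for t in terms]
    if folded != sorted(folded):
        problems.append("terms are not in alphabetical order")
    return problems


# Hypothetical pick list for a Classification field.
classification = ["manuscripts", "paintings", "photographs", "sculpture", "texts", "vessels"]
print(check_controlled_list(classification))
print(check_controlled_list(["vessels", "texts", "texts"]))
```

The first call reports no problems; the second reports both a duplicate and an ordering problem.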

Controlled Lists cont.

May include terms from other controlled vocabulary resources (especially standard published vocabularies)

For some elements or fields in a database, a controlled list may be sufficient to control terminology, particularly where the terminology for that field is limited and unlikely to have synonyms or ancillary information. (Example: artists’ roles in ULAN, place types in TGN).

Controlled list: A simple list of terms used to control terminology

manuscripts
miscellaneous
paintings
photographs
sculpture
site installation
texts
vessels

Example of a controlled pick list for Classification

Patricia Harpring, 2008 © J. Paul Getty Trust

A list comprising sets of terms that are considered equivalent

• No preferred term
• Generally used for search and retrieval, providing access to content that is represented in natural, uncontrolled language

Felis domesticus

Synonym ring list

Jean-Baptiste Perroneau, Portrait of Magdaleine Pinceloup, © J. Paul Getty Museum; Chat Noir, Theophile-Alexandre Steinlen, © Sta. Barbara Museum of Art. Egyptian Cat, © Metropolitan Museum. Cat and Kittens, © National Gallery of Art. Maneki Neko, Japanese, © private collection.

© J. Paul Getty Trust; Patricia Harpring 2008

domestic cat

cat

Felis catus

house cat
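One way a synonym ring could be used at search time is to expand the user's term into every member of its ring before matching. The sketch below assumes a naive substring match over short text descriptions; `SYNONYM_RINGS`, `expand_query`, `search`, and the sample documents are all hypothetical.

```python
# Each ring is a set of terms treated as equivalent; none is "preferred".
SYNONYM_RINGS = [
    {"cat", "domestic cat", "house cat", "Felis catus", "Felis domesticus"},
]


def expand_query(term, rings=SYNONYM_RINGS):
    """Return every term considered equivalent to `term`, including itself."""
    wanted = term.casefold()
    for ring in rings:
        if wanted in {member.casefold() for member in ring}:
            return sorted(ring, key=str.casefold)
    return [term]  # not in any ring: search on the term as typed


def search(term, documents):
    """Return documents that mention any member of the term's ring."""
    variants = [v.casefold() for v in expand_query(term)]
    return [doc for doc in documents if any(v in doc.casefold() for v in variants)]


# Invented sample descriptions, for illustration only.
docs = [
    "Egyptian bronze of Felis catus",
    "Portrait of a lady with a house cat",
    "Landscape with windmill",
]
print(search("domestic cat", docs))
```

A real system would match on whole words or indexed terms rather than substrings, since a substring test would also match "cat" inside "catalog".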

Compilations, usually in alphabetical order, that combine separate concepts into a “string,” as in the Library of Congress Subject Headings (LCSH)

Commercial fishing -- Japanese competition

Salmon fisheries -- law and legislation -- California

Subject Headings

Pre-coordination of terminology is a characteristic of subject headings; subject headings typically combine several unique concepts together.

Subject Headings cont.

Subject headings--Pictures.

Pictures--Computer network resources.

World Wide Web--Subject access.
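Because a pre-coordinated heading is a single string, a system that wants to search its component concepts separately has to take the string apart again. A minimal sketch, assuming subdivisions are always separated by a double hyphen; `split_heading` is a hypothetical helper, not an LCSH tool.

```python
def split_heading(heading):
    """Split a pre-coordinated heading into its main heading and subdivisions."""
    parts = [part.strip() for part in heading.split("--")]
    return {"main": parts[0], "subdivisions": parts[1:]}


parsed = split_heading("Salmon fisheries -- law and legislation -- California")
print(parsed["main"])
print(parsed["subdivisions"])
```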

Taxonomies/Classifications

Vocabularies that organize a body of knowledge for a defined domain into conceptual categories, e.g. Nomenclature for Museum Cataloging, ICONCLASS.

The Greek heroic legends
  Story of Hercules (Heracles)
    Labors of Hercules
      Hercules chokes the Nemean lion
      Hercules kills the Hydra of Lerna
      Hercules captures the Ceryneian hind
      Hercules captures the Cretan bull

http://www.iconclass.nl/

Compilations of terms representing single concepts. Thesauri explicitly express relationships among terms via a semantic structure.

<visual works by form>
  dioramas
  diptychs
  medals
    medallions (medals)
  polyptychs
  triptychs

Thesauri

Authority Files

• Compilations of authorized terms or headings used by a single information system, organization, or consortium for cataloging, indexing, and documentation.

• Main purpose is to regulate usage.

• Include synonyms (“See” references) and related or associated terms (“See also” references).

• Examples include Library of Congress Name Authority File (LCNAF), local authorities for names, subjects, etc.

• Authority files may take the form of thesauri, word lists, etc.—in other words, any kind of vocabulary can be used as an authority.
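The points above can be sketched as a small data structure in which every variant form leads back to one authorized heading. The class and field names (`AuthorityRecord`, `see_from`, `see_also`, `AuthorityFile`) are assumptions made for illustration; the sample names are the Bulgarini variants from ULAN that appear elsewhere in this presentation.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    heading: str                                   # authorized (preferred) form
    see_from: list = field(default_factory=list)   # variants: "See" references
    see_also: list = field(default_factory=list)   # related headings: "See also"


class AuthorityFile:
    def __init__(self, records):
        # Index every record under its heading and all of its variants.
        self._by_name = {}
        for record in records:
            for name in [record.heading, *record.see_from]:
                self._by_name[name.casefold()] = record

    def authorized(self, name):
        """Return the authorized heading for any known form, else None."""
        record = self._by_name.get(name.casefold())
        return record.heading if record else None


names = AuthorityFile([
    AuthorityRecord(
        heading="Bulgarini, Bartolomeo",
        see_from=["Lorenzetti, Ugolino", "Master of the Ovile Madonna", "Ovile Master"],
    ),
])
print(names.authorized("Ovile Master"))
```

A cataloger entering any of the variant names would be steered to the single authorized heading, which is how the file regulates usage.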

More on Thesauri

Thesauri

Terms in a thesaurus may have the following three types of relationships:

• Equivalence
• Hierarchical
• Associative

Thesaural Relationships

• Equivalence
  – synonyms, spelling variations, language variations

• Hierarchical
  – broader to narrower
    • whole/part
    • genus/species

• Associative
  – related concepts

Equivalence Relationship: Terms/names denote the same thing; a preferred name is used for displays

Bulgarini, Bartolomeo (Sienese painter, circa 1337-1378)
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master

example from ULAN

Equivalence Relationship

still lifes
still life
still-lifes
still lives
nature morte
natura morta
stilleven
Stilleben
vie coye
ontbijtje
banketje

Whole/Part Relationship: “children” or narrower terms are part of the parent or broader term

España .......................... (nation)
  Andalucía ..................... (region)
    Almería ..................... (province)
    Cádiz ....................... (province)
    Córdoba ..................... (province)
    Granada ..................... (province)
    Huelva ...................... (province)
    Málaga ...................... (province)
    Sevilla ..................... (province)

Genus/Species Relationship: “children” represent types of the “parent” or broader term

funerary sculpture
  brasses
  effigies
  gisants...
  haniwa
  tomb slabs
  ushabti

Associative Relationship: terms are related conceptually, but not necessarily hierarchically

Descriptor: charterhouses
Hierarchy: Built Complexes and Districts
Scope note - Carthusian monasteries.
Alternate Forms of Speech {ALT}: charterhouse
Synonyms and spelling variants {UF}: certose, charter houses, chartreuses
Related concepts: Carthusian (Religions hierarchy)
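The three thesaural relationships could be modeled roughly as one record type with a slot for each. This is a sketch under assumed names (`Concept`, `used_for`, `broader`, `related`, `path`), not the AAT or TGN data model; the sample values are taken from the examples in this presentation.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    preferred: str
    used_for: list = field(default_factory=list)   # equivalence: synonyms, variants
    broader: "Concept | None" = None                # hierarchical: parent concept
    related: list = field(default_factory=list)    # associative: related concepts

    def path(self):
        """Return preferred terms from the top of the hierarchy down to this concept."""
        chain, node = [], self
        while node is not None:
            chain.append(node.preferred)
            node = node.broader
        return list(reversed(chain))


# Hierarchical (whole/part): each place is part of its parent.
spain = Concept("España")
andalucia = Concept("Andalucía", broader=spain)
cadiz = Concept("Cádiz", broader=andalucia)
print(" > ".join(cadiz.path()))

# Equivalence and associative relationships on a single descriptor.
charterhouses = Concept(
    "charterhouses",
    used_for=["certose", "charter houses", "chartreuses"],
    related=["Carthusian"],
)
print(charterhouses.used_for, charterhouses.related)
```

Keeping only a pointer to the broader concept is enough to rebuild the full hierarchical path, while the equivalence and associative links stay as flat lists on the record.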

indexer thesaurus: A thesaurus designed to control terminology and guide indexers in the choice of terms. See also end-user thesaurus.

indexing: Also called human indexing. The process of evaluating information and designating indexing terms by using controlled vocabulary that will aid in finding and accessing the cultural work record. Refers to indexing done by human labor, not to the automatic parsing of data into a database index, which is used by a system to speed up search and retrieval.

indexer thesaurus

A thesaurus designed for direct access by searchers rather than for use by indexers. Instead of controlling the terminology, the purpose of an end-user thesaurus is to help searchers find useful terminology for improving, narrowing, and broadening their queries.

end-user thesaurus

A vocabulary constructed with the goal of being interoperable with an existing vocabulary, e.g. a specialty vocabulary such as a conservation thesaurus that is intended to be linked to the superstructure of a larger vocabulary, such as the AAT.

satellite vocabulary

Vocabularies provide intellectual “paths” that can improve access to information

Harlem Renaissance

Negro Renaissance

New Negro Movement

Renaissance, Harlem

Renaissance, Negro

Jacob Lawrence, Tombstones, 1942

Example from the AAT

Why do we need vocabularies?

• Because of national and regional differences: lorries vs. trucks, lifts vs. elevators, Tom Thumb golf courses vs. miniature golf courses

• Because of historical vs. contemporary names: Iran vs. Persia vs. Islamic Republic of Iran

• Because of political and social changes: KhoiKhoi vs. Hottentot

• Because of linguistic differences: Titian vs. Tiziano vs. Titien; pottery vs. keramik vs. céramique

• To disambiguate homographs: sinopia (pigment -- Materials hierarchy) vs. sinopia (preliminary drawing -- Visual Works hierarchy)

Why do we need vocabularies?

Thesaural relationships provide greater research/searching capabilities:

drawings
  <drawings by function>
    preliminary drawings
      underdrawings
        sinopie

Issues in vocabulary-enhanced searching

• User interfaces are problematic

• Optimally, controlled vocabularies should be used both on the “back end” and on the “front end” to be most effective

• Economics: consistent implementation of controlled vocabularies is time- and labor-intensive

• Vocabulary control is almost non-existent on the Web at present

Search “ARES” Against Getty Web site

“ARES” did not match any pages

Improve recall by ORing equivalent names (Ares, Mars)

“Ares OR Mars” now retrieves 37 pages

Search “ARES” Against Google (returns 1,250,000 pages; none of first 6 pages are relevant)

Increase precision by ANDing the broader/parent term of ARES, “Major Gods”

“Ares AND Major Gods” now narrows to 506 hits (all first 7 pages are relevant)

Recall and Precision

Note that when searching “Ares” against the Getty site, it retrieves nothing. So we need to include synonyms/equivalents (OR “Mars”) to improve recall. When performing the same search against Google, however, it returns too many hits. So we need to combine the broader term (AND “Major Gods”) to improve precision. This illustrates how important it is for a retrieval system to be flexible and let the user decide how to refine the search according to specific situations.
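The two measures can be made concrete with a toy example: recall is the share of relevant items that were retrieved, and precision is the share of retrieved items that are relevant. The sketch below uses four invented pages and a naive substring search; it does not reproduce the Getty or Google results quoted above, and all names in it are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


def search(pages, any_of, all_of=()):
    """Return ids of pages containing at least one `any_of` term (OR)
    and every `all_of` term (AND)."""
    hits = []
    for page_id, text in pages.items():
        text = text.casefold()
        if any(t.casefold() in text for t in any_of) and all(t.casefold() in text for t in all_of):
            hits.append(page_id)
    return hits


# Invented pages; only 1 and 2 are about the god.
pages = {
    1: "Mars and Venus, major gods of Rome",
    2: "Statue of Ares, one of the major gods",
    3: "ARES rocket program",
    4: "Mars rover photographs",
}
relevant = {1, 2}

for label, hits in [
    ("Ares", search(pages, ["Ares"])),
    ("Ares OR Mars", search(pages, ["Ares", "Mars"])),
    ("(Ares OR Mars) AND major gods", search(pages, ["Ares", "Mars"], ["major gods"])),
]:
    print(label, hits, "recall", recall(hits, relevant), "precision", precision(hits, relevant))
```

In this toy collection, ORing the equivalent name raises recall, and ANDing the broader term then restores precision, which mirrors the pattern in the two searches described above.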

Examples of standards for data values:

• The Getty Vocabularies
• Library of Congress Name Authority File (LCNAF)
• Library of Congress Subject Headings (LCSH)
• ICONCLASS

The Getty Vocabularies

Compiled and maintained by the Getty Vocabulary Program

Union List of Artist Names® (ULAN) 117,600 ‘records’; 257,241 names

Art & Architecture Thesaurus® (AAT) 33,150 ‘records’; 128,075 terms

Getty Thesaurus of Geographic Names® (TGN) 911,300 ‘records’; 1,102,200 names

• Focus on the visual arts, architecture, & material culture
• Are compiled resources (not comprehensive)
• Grow through contributions
• May be licensed (vendors of collection management systems, others)

http://www.getty.edu/research/conducting_research/vocabularies/

Controlled vocabularies: Why bother?

Αγία Σοφία

Ayasofya

Church of the Holy Wisdom

Hagia Sophia

Haghia Sophia

Saint Sophia

Sancta Sophia

St. Sophia

Constantinople

Constantinopolis

Costantinopoli

Estambul

Istanbul

Konstantinopel

New Rome

Mikligard

Tsargrad

Tsarigrad

names from Getty Thesaurus of Geographic Names (TGN)

deposit slip/deposit ticket = paying-in slip

confirmation chit = receipt, deposit receipt

= cargo shorts

= board shorts

desk?

cartonnier?

chest?

cabinet?

dolls?

figurines?

statuettes?

idols?

carvings?

sculptures?

Giambologna?

Giovanni da Bologna?

Jean de Boulogne?

• Users may call the same artist by various names

• Items have been cataloged using different names for the same artist

• Published misspellings provide access points

NAMES:
O’Keeffe, Georgia
Georgia O’Keeffe
O’Keefe, Georgia
Stieglitz, Alfred, Mrs.

Georgia O'Keefe, Ram's Skull With Brown Leaves, Roswell Museum and Art Center, Roswell, New Mexico
from: http://www.roswellmuseum.org/

Common misspellings

Anonymous artist, later named

• former appellations
• name is now known

NAMES:
Bulgarini, Bartolomeo
Bartolomeo Bolgarini
Bulgarini da Siena, Bartolommeo
Lorenzetti, Ugolino
Master of the Ovile Madonna
Ovile Master

The Crucifixion, mid 1300s, tempera on wood, The Hermitage (St. Petersburg, Russia)

image from http://sunserv.kfki.hu/~arthp/html/l/lorenzet/ugolino/index.html

Database issues

• repeating vs. non-repeating fields

• vocabulary-controlled vs. free-text fields (for indexing vs. display)

• “built-in thesauri”; vocabulary-assisted searching OR

• addition of broader terms, variants, at record level

If we use terms from a standard source such as LCSH or the AAT, why do we need our own “local” authority file(s)?

Why do we need local authorities?

Local authorities can provide terms not found in published authorities, including non-expert and even “wrong” terms and names.

An authority record can remind the cataloger/indexer/abstractor of policies regarding local usage of the term.

An authority record can contain relevant/appropriate variant names for the term and identify the one that is preferred and used by the project or institution.

What about social tagging and folksonomies?

In the context of the Web, the act of associating terms (called “tags”) with an information object (e.g. a Web page, an image, a streaming video clip), thus describing the item and enabling keyword-based classification and retrieval. Tags – a form of user-generated metadata – from communities of users can be aggregated and analyzed, providing useful information about the collection of objects with which the tags have been associated.

tagging

The decentralized practice and method by which individuals and groups create, manage, and share terms, names, etc. (called “tags”) to annotate and categorize digital resources in an online “social” environment. A folksonomy is the result of social tagging. Also referred to as collaborative tagging, social classification, social indexing, mob indexing, folk categorization.

social tagging

An orderly classification that explicitly expresses the relationships, usually hierarchical (e.g., genus/species, whole/part, class/instance), between and among the things being classified.

taxonomy

An assemblage of concepts, represented by terms and names (called “tags”), the result of social tagging. A folksonomy is not a taxonomy.

folksonomy
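As a rough illustration of how tags from a community of users can be aggregated, the sketch below counts how many distinct users applied each tag to each object. The function `aggregate_tags`, the user names, and the object identifier are invented for the example.

```python
from collections import Counter


def aggregate_tags(taggings):
    """Count how many distinct users applied each normalized tag to each object.

    `taggings` is an iterable of (user, object_id, tag) tuples.
    """
    counts = {}
    seen = set()
    for user, obj, tag in taggings:
        key = (user, obj, tag.strip().casefold())
        if key in seen:  # the same user repeating a tag counts once
            continue
        seen.add(key)
        counts.setdefault(obj, Counter())[key[2]] += 1
    return counts


# Invented sample data.
taggings = [
    ("ann", "img-1", "Cat"),
    ("bob", "img-1", "cat "),
    ("bob", "img-1", "kitty"),
    ("ann", "img-1", "cat"),
]
folksonomy = aggregate_tags(taggings)
print(folksonomy["img-1"].most_common())
```

The result is a flat tally: nothing in it records that "cat" and "kitty" are equivalent or how either relates to a broader concept, which is why a folksonomy is not a taxonomy.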

Vocabularies in the Corporate World

Disney Titles (preferred forms)

• One Hundred and One Dalmatians

• 101 Dalmatians

• 101 Dalmatians II: Patch’s London Adventure

• 101 Dalmatians, Disney’s

• 101 Dalmatians: Escape from De Vil Manor

• Sing Along Songs: Disney’s: 101 Dalmatians – Pongo & Perdita

• 101 Dalmatians Holiday Art

Disney Variants

One Hundred and One Dalmatians
One Hundred and One Dalmations
One Hundred and One Dalmatians (animated)
101 Dalmations (animated feature film)
101 Dalmations

101 Dalmatians
One Hundred and One Dalmatians (live action)
One Hundred and One Dalmations (live action)
One Hundred and One Dalmations (live-action feature)
101 Dalmations

What’s in a name? That which we call a rose

By any other name would smell as sweet.

Shakespeare, Romeo and Juliet, Act II, scene ii

Murtha Baca

Head, Vocabulary Program

Getty Research Institute

mbaca@getty.edu

http://getty.edu/research/conducting_research/vocabularies/
