tthornton code4lib

EAD without XSLT: a practical approach to archival finding aids

Trevor Thornton, Senior Applications Developer, NYPL Labs, The New York Public Library


Post on 09-Jul-2015

Page 1: Tthornton code4lib

EAD without XSLT: a practical approach to archival finding aids

Trevor Thornton

Senior Applications Developer, NYPL Labs

The New York Public Library

Page 2: Tthornton code4lib

Project goals

• Enable multiple presentations of the same data

• Support dynamic web applications

• Cross-collection search with component-level specificity in results, and faceting on common access points

Page 3: Tthornton code4lib

System overview

Ruby on Rails + MySQL + Solr

Key functionality:

Data Import

Search index

API

Page 4: Tthornton code4lib

Core models

Page 5: Tthornton code4lib

Collection model

Each collection:

• must have one description

• may have one or more components

• may be associated with one or more access terms

Page 6: Tthornton code4lib

Component model

Each component:

• must belong to one collection

• must have one description

• may have one parent component

• may have one or more child components

• may be associated with one or more access terms

Page 7: Tthornton code4lib

Component hierarchy attributes

• collection_id (id of root collection)

• parent_id (id of parent component)

• sib_seq (sibling sequence)

• level_num (numeric level within hierarchy)

• level_text (series, sub-series, file, etc.)

• has_children

• max_levels

• top_component_id

Computed after initial data import; provided as a convenience for finding aid UIs and to streamline formulation of API responses
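One way these derived attributes could be filled in after import is a single walk of each collection's component tree. The Ruby sketch below is only an illustration under stated assumptions: the Struct and the method name compute_hierarchy! are hypothetical, max_levels is interpreted as the number of levels at and below a component, and top_component_id is taken to be the id of the top-level ancestor (a top-level component points at itself). None of this is confirmed by the slides.

```ruby
# Hypothetical in-memory stand-in for a component row. The attribute names
# follow the slide; the Struct itself is an assumption.
Component = Struct.new(:id, :collection_id, :parent_id, :sib_seq,
                       :level_num, :has_children, :max_levels,
                       :top_component_id, keyword_init: true)

# Fills in the derived hierarchy attributes for the components of one
# collection. Assumes siblings appear in the list in document order, so
# sib_seq can be taken from their order of appearance.
def compute_hierarchy!(components)
  children = Hash.new { |hash, key| hash[key] = [] }
  components.each { |c| children[c.parent_id] << c }

  # Sets the attributes for `node` and its descendants and returns the
  # number of levels at and below `node`.
  walk = lambda do |node, level, top_id|
    node.level_num = level
    node.top_component_id = top_id
    kids = children[node.id]
    node.has_children = !kids.empty?
    depth = 1
    kids.each_with_index do |kid, i|
      kid.sib_seq = i + 1
      depth = [depth, 1 + walk.call(kid, level + 1, top_id)].max
    end
    node.max_levels = depth
    depth
  end

  # Top-level components are the ones without a parent.
  children[nil].each_with_index do |top, i|
    top.sib_seq = i + 1
    walk.call(top, 1, top.id)
  end
  components
end
```

Storing these values at import time trades a little redundancy for simpler queries: a UI can ask for "level 1 components ordered by sib_seq" without walking the tree.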

Page 8: Tthornton code4lib

Description model

Elements of description organized (roughly) based on ISAD(G):

• Descriptive identity: ISAD(G) 3.1

• Context: ISAD(G) 3.2.1 - 3.2.3

• Acquisition & processing: ISAD(G) 3.2.4, 3.3.2 - 3.3.3

• Content and structure: ISAD(G) 3.3.1, 3.3.4

• Access and use: ISAD(G) 3.4

• Related material: ISAD(G) 3.5

• Notes: ISAD(G) 3.6

Page 9: Tthornton code4lib

Description model: basic EAD mapping

Page 10: Tthornton code4lib

Description model: JSON format

{
  "unitid": [ { "value": "3283", "type": "local_mss" } ],
  "unittitle": [ { "value": "David Ames Wells papers" } ],
  "unitdate": [ { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" } ],
  "physdesc_extent": [
    { "value": ".5 linear feet", "unit": "linear feet" },
    { "value": "2 boxes", "unit": "containers" }
  ],
  "abstract": [ { "value": "David Ames Wells was an engineer, economist, textbook author, and advocate for lower tariff rates. This collection contains correspondence with Gordon L. Ford, Worthington C. Ford, and others; clippings; a manuscript draft of Protection: The Poor Man's Friend; and a lecture Wells delivered on free trade in 1882" } ],
  "prefercite": [ { "value": "<p>David Ames Wells papers, Manuscripts and Archives Division, The New York Public Library</p>" } ]
}
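Because every element in this format is a list of hashes with a "value" key plus optional attributes, a client can read it with one small helper. The Ruby sketch below uses an abbreviated copy of the JSON above; the helper name element_values is hypothetical and not part of the NYPL code.

```ruby
require 'json'

# Abbreviated copy of the description shown on the slide.
description = JSON.parse(<<~DESC)
  {
    "unitid": [ { "value": "3283", "type": "local_mss" } ],
    "unittitle": [ { "value": "David Ames Wells papers" } ],
    "unitdate": [ { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" } ],
    "physdesc_extent": [
      { "value": ".5 linear feet", "unit": "linear feet" },
      { "value": "2 boxes", "unit": "containers" }
    ]
  }
DESC

# Hypothetical helper: collects the "value" of each entry for an element,
# optionally keeping only entries whose attributes match `attrs`.
def element_values(description, name, **attrs)
  description.fetch(name, [])
             .select { |entry| attrs.all? { |key, value| entry[key.to_s] == value } }
             .map { |entry| entry['value'] }
end

title   = element_values(description, 'unittitle').first
extents = element_values(description, 'physdesc_extent').join(', ')
dates   = element_values(description, 'unitdate', type: 'inclusive')
```

Keeping every element repeatable, even ones that usually occur once, means display code never has to branch on "single value or list".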

Page 11: Tthornton code4lib

EAD as a guide for data storage

• EAD elements that allow only CDATA are stored as plain strings

• EAD elements that require content to be structured in <p> or other block elements are stored as HTML

• Rules established for converting EAD to HTML when necessary

• HTML conversion designed to support re-conversion back to EAD
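A reversible conversion can be as simple as a one-to-one element mapping that is inverted for the trip back. The Ruby sketch below is an assumption-heavy illustration: the mapping table is hypothetical (the slides do not list the actual rules), attributes such as render are ignored, and REXML from the standard distribution is swapped in for Nokogiri so the example has no gem dependencies.

```ruby
require 'rexml/document'

# Hypothetical EAD => HTML element-name mapping. A real rule set would be
# larger and would also need to carry attributes across.
EAD_TO_HTML = {
  'list'  => 'ul',
  'item'  => 'li',
  'emph'  => 'em',
  'title' => 'cite',
  'lb'    => 'br'
}.freeze

# Inverting a one-to-one mapping is what makes the conversion reversible.
HTML_TO_EAD = EAD_TO_HTML.invert.freeze

# Renames every descendant element of `element` according to `mapping`,
# leaving unmapped elements (such as <p>) untouched.
def rename_tree!(element, mapping)
  element.elements.each do |child|
    child.name = mapping.fetch(child.name, child.name)
    rename_tree!(child, mapping)
  end
end

# Wraps the fragment in a temporary root so mixed content parses as XML,
# renames the elements, then serializes the root's children again.
def convert_fragment(fragment, mapping)
  doc = REXML::Document.new("<root>#{fragment}</root>")
  rename_tree!(doc.root, mapping)
  doc.root.children.map(&:to_s).join
end

def ead_to_html(fragment)
  convert_fragment(fragment, EAD_TO_HTML)
end

def html_to_ead(fragment)
  convert_fragment(fragment, HTML_TO_EAD)
end
```

The round trip only holds while the mapping stays one-to-one; two EAD elements mapped to the same HTML element would need a class or data attribute to tell them apart on the way back.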

Page 12: Tthornton code4lib

Special handling for dates

• Dates are hard

o Inclusive dates and bulk dates

o Multiple date formats

o Ranges, lists and both

• Special data structure for dates:

o date_statement (original text)

o inclusive_start / inclusive_end

o bulk_start / bulk_end

o keydate (for ordering query response: earliest inclusive date or earliest bulk date when present)

o index_dates (for search faceting: every year included in range/list)
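The date structure could be derived from the unitdate entries of a description roughly as follows. This Ruby sketch makes several assumptions that the slides do not confirm: each entry carries a normal attribute of the form "YYYY/YYYY" or "YYYY", keydate prefers the earliest inclusive year and falls back to the earliest bulk year, and the Struct and method names are hypothetical.

```ruby
# Hypothetical struct mirroring the fields named on the slide.
DateInfo = Struct.new(:date_statement, :inclusive_start, :inclusive_end,
                      :bulk_start, :bulk_end, :keydate, :index_dates,
                      keyword_init: true)

# Turns a normalized "YYYY/YYYY" range (or a single "YYYY") into
# [start_year, end_year].
def year_range(normal)
  first, last = normal.split('/').map { |part| part[0, 4].to_i }
  [first, last || first]
end

# Builds the date structure from a list of unitdate hashes.
def build_date_info(unitdates)
  info = DateInfo.new(index_dates: [])
  statements = []
  unitdates.each do |unitdate|
    statements << unitdate['value']
    first, last = year_range(unitdate['normal'])
    if unitdate['type'] == 'bulk'
      info.bulk_start = first
      info.bulk_end = last
    else
      info.inclusive_start = first
      info.inclusive_end = last
    end
    # Every year covered by any range becomes a facet value.
    info.index_dates |= (first..last).to_a
  end
  info.date_statement = statements.join(', ')
  # Assumption: inclusive start wins, bulk start is the fallback.
  info.keydate = info.inclusive_start || info.bulk_start
  info.index_dates.sort!
  info
end
```

Expanding ranges into individual years costs index space, but it lets a year facet or a year filter match a collection whose statement only says "1847-1895".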

Page 13: Tthornton code4lib

Access Term model

Page 14: Tthornton code4lib

Refinement of Access Term/Access Term Association models

Page 15: Tthornton code4lib

Data import

• It’s messy business

• Bulk of work has focused on EAD; Nokogiri used extensively for parsing XML

• Basic process for EAD import:

1. Create collection record

2. Extract collection-level data, create/save description

3. Extract access terms, and for each:

   a. Save if it doesn’t already exist

   b. Save collection/term association

4. Extract top-level components, and for each:

   a. Create component record

   b. Extract component-level data, create/save description

   c. Extract/save access terms & associations

   d. Extract child components and repeat for each
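The import steps above amount to a recursive walk of the EAD document. The Ruby sketch below illustrates the shape of that walk only: it uses REXML instead of Nokogiri to stay free of gem dependencies, keeps records in an in-memory Store rather than MySQL, extracts just unittitle as the description, and assumes an EAD file without an XML namespace. All method and struct names are hypothetical.

```ruby
require 'rexml/document'

# EAD allows unnumbered <c> and numbered <c01>..<c12> components.
COMPONENT_TAGS = (['c'] + (1..12).map { |n| format('c%02d', n) }).freeze

# In-memory stand-ins for the database tables (hypothetical).
Store = Struct.new(:collections, :components, :terms, :associations)

# Steps 2 and 4b: extract descriptive data (only unittitle in this sketch).
def description_for(node)
  title = node.elements['did'].elements['unittitle']
  { 'unittitle' => [{ 'value' => title.text.to_s.strip }] }
end

# Steps 3 and 4c: save each access term once, then record the association.
def link_terms(store, record, node)
  node.elements.to_a('controlaccess/*').each do |el|
    key = [el.name, el.text.to_s.strip]
    store.terms[key] ||= { type: key[0], term: key[1] }
    store.associations << [record, store.terms[key]]
  end
end

# Step 4: create a record for each child component, then recurse (4d).
def import_components(store, collection, parent, node)
  node.elements.to_a.select { |el| COMPONENT_TAGS.include?(el.name) }.each do |el|
    component = { collection: collection, parent: parent,
                  level_text: el.attributes['level'],
                  description: description_for(el) }
    store.components << component
    link_terms(store, component, el)
    import_components(store, collection, component, el)
  end
end

def import_ead(store, xml)
  archdesc = REXML::Document.new(xml).root.elements['archdesc']
  collection = { description: description_for(archdesc) } # steps 1-2
  store.collections << collection
  link_terms(store, collection, archdesc)                 # step 3
  dsc = archdesc.elements['dsc']
  import_components(store, collection, nil, dsc) if dsc   # step 4
  collection
end
```

A caller would start with `Store.new([], [], {}, [])` and pass the EAD file contents to import_ead. Keying terms by element name and text is what keeps a term shared by many components from being saved more than once.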

Page 16: Tthornton code4lib

Integration with NYPL digital repository

• Fedora repository

+ custom metadata creation/digitization workflow system

+ API to query repository data

• All records in repository identified with UUID

• UUID of digital object associated with a given component is stored locally in archives data system

• Best case scenario: common identifiers appear in archival description and in Fedora

Page 17: Tthornton code4lib

Apache Solr

• Inter- and intra-collection search

• Collocation via faceting and filter queries

• Using RSolr to facilitate interaction with Solr (for both search and index)
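The difference between inter- and intra-collection search can come down to a single filter query. The Ruby sketch below only builds the request parameters and does not talk to Solr. The parameter names q, fq, facet, facet.field and rows are standard Solr parameters; the field names collection_id, subject_facet and name_facet, and the method name search_params, are hypothetical and depend on the index schema.

```ruby
require 'uri'

# Builds Solr request parameters for a keyword search. Passing a
# collection_id restricts the search to one collection; omitting it
# searches across all collections. Filter values are quoted as phrases;
# escaping of quotes inside values is left out of this sketch.
def search_params(query, collection_id: nil, filters: {})
  fq = filters.map { |field, value| %(#{field}:"#{value}") }
  fq << "collection_id:#{collection_id}" if collection_id
  {
    'q' => query,
    'fq' => fq,
    'facet' => 'true',
    'facet.field' => %w[subject_facet name_facet],
    'rows' => 20
  }
end

params = search_params('tariff', collection_id: 1,
                                 filters: { 'subject_facet' => 'Free trade' })

# Array values become repeated parameters (fq=...&fq=...).
query_string = URI.encode_www_form(params)
```

With RSolr, a hash like this would typically be passed as the params option of a get request against the select handler, so the application does not need to build the query string itself.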

Page 18: Tthornton code4lib

API

• API development is proceeding in step with finding aid development; available requests added as needed

• Basic requests:

o Collection-level data

o Components of a collection, or sub-components of a component

   o Includes all component-level descriptive data

   o Max. depth can be specified

o Digital assets associated with a component

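A component request with a maximum depth can be served from the flat component table using the stored parent_id and sib_seq values. The Ruby sketch below is an illustration only: the method name component_tree, the hash keys and the children key in the response are assumptions, not the actual NYPL API format.

```ruby
# Builds a nested response for the children of `parent_id` (nil means the
# top-level components of a collection), descending at most `max_depth`
# levels. `components` is a flat list of hashes with :id, :parent_id and
# :sib_seq keys.
def component_tree(components, parent_id: nil, max_depth: nil, depth: 1)
  return [] if max_depth && depth > max_depth

  siblings = components.select { |c| c[:parent_id] == parent_id }
  siblings.sort_by { |c| c[:sib_seq] }.map do |c|
    c.merge(children: component_tree(components, parent_id: c[:id],
                                                 max_depth: max_depth,
                                                 depth: depth + 1))
  end
end
```

Calling it with `parent_id: nil` returns the components of a collection, and calling it with a component id returns that component's sub-components, so one method covers both request types on the slide.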

Page 19: Tthornton code4lib

Finding aid prototype

Page 20: Tthornton code4lib

Finding aid prototype

Page 21: Tthornton code4lib

Front-end system overview

Page 22: Tthornton code4lib

Considerations for future development

• Separate API from data management?

o Data management app to handle all create/update/destroy operations, while API (Sinatra?) is read-only

o Open API to public? Security/load considerations…

• ArchivesSpace

o NYPL is considering it as a possible replacement for our existing ‘home-grown’ system

o How would this system integrate with ArchivesSpace API?

• Upcoming EAD revision

Page 23: Tthornton code4lib

some code to look at and/or borrow from:

github.com/nypl/archives_data_public

finding aid prototype:

archives.nypl.org

me:

[email protected]

NYPL Labs:

nypl.org/labs