Megan Dirickson, Kristin Law, Nora Winslow, INF 392K, Spring 2013
Archiving the Digital Records of the SAA-UT Student Chapter
Megan Dirickson, Kristin Law, Nora Winslow
INF 392K, Spring 2013
Overview
Previous Work
Determining Scope
Gathering & Assessing Records
Appraisal & Arrangement
  o Creating the DSpace Collections
  o Privacy
Processing
  o Descriptive Metadata Spreadsheet
  o Creation of the SIPs
  o Batch Ingest
  o Shell scripting
  o Batch Metadata Editing
Twitter
Future Work
Self-Archiving Guidelines
Previous Work
In 2011, Wendy Hagenmaier and Rachel Appel digitized SAA paper records for the Survey of Digitization class.
They digitized 221 objects. They set up a basic schema in DSpace, which we used as a jumping-off point.
Existing Schema
Community: School of Information Student Organizations
Sub-community: Society of American Archivists UT Chapter
Collections: Administrative Records, Archives Week, Correspondence, Events, Financial Records, Marketing, Meeting Minutes, Website
Our Goal
Archive all the existing born-digital records, especially the records from the past year.
But more importantly, set up a self-archiving workflow that would allow future SAA members to easily archive their own records in DSpace.
First Things First
We wanted to gain intellectual control over the materials. We asked: “What exists and where is it? What should be included for the future?”
Used Megan and Kristin’s expertise as previous officers
Rachel and Wendy’s previous documentation
Actually Getting the Records
We asked previous SAA board members to send us anything they had.
Gleaned materials from the SAA’s two websites: the general website and the Archives Week website
Type of Records
Images, documents, recordings, presentations and spreadsheets
Files that made up the websites, mostly HTML and CSS files
Twitter and Facebook accounts
Listserv emails
Narrowing the Scope
Over 600 discrete files
Experimented with archiving Twitter and Facebook, with mixed results
Looked into previous attempts to archive listserv emails
Facebook and the emails proved too complicated and time-consuming for the scope of this project.
Appraisal
Appraisal consisted mainly of weeding out duplicates, of which there were many.
Kristin managed the files that were sent to us from previous members.
Megan gleaned the general SAA website.
Nora worked with the Archives Week website.
Over 900 files
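The slides do not say how duplicates were spotted. As a minimal sketch, assuming duplicates are byte-identical copies saved under different names or folders, files can be grouped by checksum. The function names here are hypothetical and this is not the tooling the project used.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Group files under root by content; return groups with more than one file."""
    by_hash = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Each returned group lists paths with identical content, so one copy can be kept and the rest set aside. Files that differ only in format or minor edits would still need a manual review.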
Restructuring DSpace
Large number of files
Over 10 year time span
We wanted to maintain the arrangement, but the current structure was too restrictive.
  o We moved everything up a level, in order to create collections for each year
New Structure
Community
  o School of Information Student Organizations
Sub-community
  o Society of American Archivists: UT Student Chapter
Sub-sub-communities
  o Administrative Records
  o Archives Week
  o Correspondence
  o Events
  o Financial Records
  o Marketing
  o Meeting Minutes
  o Website and Social Media
Collections
  o Calendar year
Privacy
All Financial Records collections have been set to be private. These collections contain budgets, potential account information, and information about donations to Archives Week that the donors may wish to keep private. All financial documents from Archives Week planning have intentionally been included with Financial Records in order to keep the Archives Week collections open to the public.
The most recent years (2010-2013) of Administrative Records are currently closed. Sensitive documents in these collections include membership rosters (with email addresses) and mentorship program information. EIDs have been redacted from the 2010-2011 membership rosters.
Tips to ensure privacy
EIDs are not to be kept in the digital archive, and documents should be reviewed to be sure that they are not included.
Other sensitive information may be included in the archive, but kept in a private collection. All sensitive documents have been included in only Financial Records and Administrative Records, allowing the remaining collections to be open. Titles of private items will be viewable to the public, but the contents of the items will not be.
It is up to the discretion of the future board to determine when the closed collections may be made publicly available. The Treasurer is responsible for reviewing current and previously deposited records for privacy issues, as the Treasurer will be most cognizant of sensitive information contained in financial and membership records.
Processing—metadata gathering
Kept archival copy of records safe on a flash drive
Made other ‘processing’ copies for determining content and gathering metadata
Created spreadsheet for entering descriptive metadata
This is also when we determined intellectual arrangement of records and spotted duplicates
Creation of SIPs
Create extracted metadata XML file using National Library of New Zealand’s Metadata Extractor
Perl script to create dublin_core formatted XML from extracted XML, and create a new directory for each
Manually add original bitstream to each directory
Perl script to create ‘contents’ text file
Perl script to change directory names to item_001, item_002, etc.
This had to be done separately for each collection (about 30 collections)
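The project's Perl scripts are not reproduced in the slides. As a rough sketch of what one item directory in a DSpace Simple Archive Format package contains, the Python below builds an item_NNN directory with a dublin_core.xml file, the bitstream, and a contents file. The function name build_item and the field values in the usage note are hypothetical, and the sketch takes the metadata as given rather than reading Metadata Extractor output.

```python
import shutil
import xml.etree.ElementTree as ET
from pathlib import Path


def build_item(sip_root, index, bitstream, fields):
    """Create sip_root/item_NNN holding dublin_core.xml, the file, and 'contents'.

    fields maps (element, qualifier) pairs to values,
    e.g. {("title", "none"): "Meeting Minutes"}.
    """
    bitstream = Path(bitstream)
    item_dir = Path(sip_root) / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=True)

    # dublin_core.xml: one <dcvalue> per metadata field.
    root = ET.Element("dublin_core")
    for (element, qualifier), value in fields.items():
        dcvalue = ET.SubElement(root, "dcvalue", element=element, qualifier=qualifier)
        dcvalue.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )

    # Copy the original file in and list its name in 'contents'.
    shutil.copy2(bitstream, item_dir / bitstream.name)
    (item_dir / "contents").write_text(bitstream.name + "\n", encoding="utf-8")
    return item_dir
```

Calling build_item("sips/2012", 1, "minutes.pdf", {("title", "none"): "Meeting Minutes"}) would produce sips/2012/item_001. Looping over a collection's files with an incrementing index yields the item_001, item_002 naming described above.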
Batch Ingest
Staged SIPs on Vauxhall in a structure mirroring the DSpace structure, and wrote batch ingest command lines before meeting with Sam
Change in command line:
  o /opt/dspace/bin/dspace import org.dspace.itemimport.ItemImport --add [email protected] --collection=2081/29160 --
Problems with dublin_core files—junk!
Shell Scripting
Since we had so many collections, we bundled the command lines to execute using shell scripts
The idea was to save time... but...
  o The script didn’t leave time to check for errors before moving on to the next collection
Added: echo sleep 5
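The fix described above was a pause between collections. A different option, shown here only as a sketch and not something the project did, is to have the wrapper stop at the first command that exits with an error. The name run_in_sequence is hypothetical, and the command lists are supplied by the caller; no DSpace command is filled in here.

```python
import subprocess
import sys


def run_in_sequence(commands):
    """Run each command in order and stop at the first non-zero exit status.

    commands is a list of argument lists, one per collection.
    Returns the commands that completed successfully.
    """
    completed = []
    for command in commands:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # Report the failure and leave the remaining collections untouched.
            print(f"FAILED ({result.returncode}): {' '.join(command)}", file=sys.stderr)
            print(result.stderr, file=sys.stderr)
            break
        completed.append(command)
    return completed
```

Stopping on failure relies on the command reporting problems through its exit status. Where a tool logs errors but still exits cleanly, a pause and a manual look at the output remains the safer check.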
Batch Metadata Editing
Exported metadata from each sub-community:
  o id
  o collection
  o dc.contributor.author
  o dc.date.created
  o dc.date.issued
  o dc.identifier.uri
  o dc.language.iso
  o dc.publisher
  o dc.subject
  o dc.title
Merged with our descriptive metadata files by matching on ID numbers, and adding/changing Dublin Core fields and data:
  o id
  o collection
  o dc.contributor.author: SAA-UT
  o dc.date.created: changed from ingest date to date of creation/use of document
  o dc.date.issued
  o dc.identifier.uri
  o dc.language.iso
  o dc.publisher
  o dc.subject
  o dc.title.alternative: moved filename here
  o dc.contributor: if an individual author was known
  o dc.title: changed from filename to descriptive title
  o dc.coverage.spatial
  o dc.description
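The merge was done in spreadsheets. As an illustration only, the same matching on id could be sketched in Python: rows from the exported CSV are updated with values from the descriptive metadata, and the old filename title is moved to dc.title.alternative when a descriptive title replaces it. The function names and any sample values are hypothetical.

```python
import csv
import io


def merge_metadata(exported_rows, descriptive_rows):
    """Overlay descriptive metadata onto exported rows, matching on 'id'."""
    by_id = {row["id"]: row for row in descriptive_rows}
    merged = []
    for row in exported_rows:
        updated = dict(row)
        extra = by_id.get(row["id"], {})
        for field, value in extra.items():
            if field == "id" or not value:
                continue
            if field == "dc.title" and row.get("dc.title"):
                # Keep the original filename as the alternative title.
                updated["dc.title.alternative"] = row["dc.title"]
            updated[field] = value
        merged.append(updated)
    return merged


def write_csv(rows):
    """Serialize rows to CSV text, using every field seen as a column."""
    fieldnames = []
    for row in rows:
        for field in row:
            if field not in fieldnames:
                fieldnames.append(field)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Rows can be loaded with csv.DictReader from the exported and descriptive files. Rows with no match in the descriptive metadata pass through unchanged.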
Batch Metadata Editing
Once the spreadsheets were completely edited, we saved them as CSVs and met with Sam again to import the metadata
Each sub-community had to be imported individually (much faster than each collection!)
Command line: /opt/dspace/bin/dspace metadata-import -f /opt/batch_ingests/2081-29125.csv
Weird things happened with the ingest date…
Social Media
Twitter provides a simple means for downloading Tweets
We felt that the tweets, especially from 2012, were valuable records. The Archives Week lectures were live-tweeted, providing rich documentation for the events.
The DSpace bundle includes:
  o Zip file including CSV of tweets (with time/date stamps)
  o Screenshot for added visual context
Future Work
Follow workflow and continue archiving records!
Website—too complicated for a simple ingest
Listserv emails
Facebook
Continued digitization
Self-Archiving Guidelines/Workflow
Naming Conventions and Standards
Roles & Responsibilities
Basic workflow for importing items individually to DSpace, including adding descriptive metadata
Security/Access and Privacy Issues
Community and Collection structure; arrangement guidelines for consistency
Appraisal/Selection Policies and record priorities