preserving the h-net e-mail lists: a case study in trusted digital repository assessment lisa m....

42
Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt [email protected] http://www.h-net.org/archive/ MATRIX: The Center for Humane Arts, Letters & Social Sciences Online Michigan State University January 25, 2009

Upload: norah-howard

Post on 26-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital

Repository Assessment

Lisa M. [email protected]://www.h-net.org/archive/

MATRIX: The Center for Humane Arts, Letters & Social Sciences Online

Michigan State UniversityJanuary 25, 2009

Page 2: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preserving the H-Net E-Mail Lists

• H-Net Background

• Current Preservation Practices

• Use of the Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC)

• Preservation Improvement Plan

Page 3: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

H-Net: Humanities and Social Sciences Online

• International consortium of scholars and teachers

• Oldest collection of born-digital and content-moderated arts, humanities, and social science material on the Internet

• Valuable scholarly resource– More than 180 networks, or e-mail lists– More than 230 “private” lists

• More than 1 million e-mail messages• Hosted by MATRIX

Page 4: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

MATRIX

• Digital humanities research center• Devoted to the application of new

technologies in humanities and social science teaching and research

• Uses Internet technologies to improve education and increase the flow of information

Page 5: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

NHPRC Grant

• Conduct assessment of existing H-Net preservation policies and practices

• Apply OCLC/CRL TRAC checklist • Develop and implement an improved long-

term preservation plan• Useful to those managing large collections of

electronic records• Research semantic clustering search

techniques

Page 6: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Backup & Security

• 3 TB of data, including H-Net• Server rack kept in climate controlled,

physically secured room• Daily incremental backups, weekly full • Monthly full, “permanent” tape backups

Page 7: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Posting Messages

• H-Net runs on LISTSERV Software• Users must be list subscribers to post• Messages written in plain text• No attachments allowed on public lists• Editors approve and post messages• Editors can overwrite creation metadata

Page 8: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Posting Messages

Message Posting Process

Page 9: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Archiving of Lists

• Messages post from a few seconds up to several days after approval

• Messages kept in flat text files called “notebooks”

• Notebook includes messages posted during a weekly time period

Page 10: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Archiving of Lists

Time Period Day of Month

a 1-7

b 8-14

c 15-21

d 22-28

e 29-31

Ex. “h-africa.log0802a”

Page 11: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Archiving of Lists

• BRS Database– Newest notebook messages parsed and copied

every 24 hours– MD5 hashes created for each message– Available for full-text search

• MySQL Database Cache– Key metadata extracted, MD5 hashes created,

written to database cache– Enables more efficient browsing

Page 12: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Archiving of Lists

Message Metadata Stored in MySQL Database

Page 13: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

How H-Net Works:Message Retrieval

http://h-net.msu.edu/cgi-bin/logbrowse.pl?trx=vx&list=H-Albion&month=0808&week=b&msg=w8utW6nKNO1FuY19vSK2mo

&user=&pw=

Page 14: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Current Preservation Practices

Message Ingest, Storage, and Retrieval Processes

Page 15: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Current Preservation Practices

• Backup, but only local—and no true archiving• Significant property: message/notebook

content, stored in plain text formats– No need for XML encoding– Attachments require migration plan

• Authenticity– Informal check by author and/or editor on posting– Broken URL on message retrieval attempt– Cached metadata as PDI

• Reference, Content, Provenance Information• MD5 hashes for message discovery, not fixity• No Fixity Information for notebook files

Page 16: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• Standards for accreditation of archives called for in OAIS Reference Model (January 2002)

• Foundation in RLG/OCLC’s Trusted Digital Repositories: Attributes and Responsibilities (May 2002)– Definition: Repository that provides reliable, long-term access

to managed digital resources– Attributes: OAIS Compliance, Administrative Responsibility,

Organizational Viability, Financial Sustainability, Technological and Procedural Suitability, System Security, Procedural Accountability

– Recommendations on organizational and technological responsibilities

Page 17: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• NARA and RLG Task Force on Digital Repository Certification established 2003

• TRAC 1.0 published by CRL and OCLC, February 2007

Page 18: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Other Certification Methodologies for Trusted Digital Repositories

• Digital Repository Audit Method Based on Risk Assessment (DRAMBORA)http://www.repositoryaudit.eu/

• Network of Expertise in long-term STOrage and long-term availability of digital Resources in Germany (nestor)http://www.langzeitarchivierung.de/index.php

• ISO standard: Digital Repository Audit and Certification Working Group Wikihttp://wiki.digitalrepositoryauditandcertification.org/bin/view/Main/WebHome

Page 19: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• For certification by third party or self assessment– Looks at digital information management system– Understands threats to /risks within system– Constant monitoring, planning, maintenance– Conscious actions / strategy implementations

• Three sections– A. Organizational Infrastructure– B. Digital Object Management– C. Technologies, Technical Infrastructure,

& Security

Page 20: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

• Compare core audit criteria to local capabilities—“Gap Analysis,” illuminating areas requiring improvement

• Formulate strategies to narrow the gap and improve trustworthiness of repository

Page 21: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)• Applicability of criteria

– Varies per context of institution– Not all criteria will apply to all repositories

• Examples of documentation, other evidence– List of minimum required documents that satisfy multiple

requirements (Appendix 3)• Contingency and succession plans• Policies relating to legal permissions• Financial procedures• Procedures related to ingest• Preservation and storage/migration strategies• Policy for access• Disaster plans

Page 22: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Trustworthy Repositories Audit & Certification:

Criteria and Checklist (TRAC)

Page 23: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:A. Organizational Infrastructure

• Characteristics of the repository organization that affect performance, accountability, and sustainability– A1. Governance & Organizational Viability– A2. Organizational Structure & Staffing– A3. Procedural Accountability & Policy Framework– A4. Financial Sustainability– A5. Contracts, Licenses, & Liabilities

Page 24: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:A. Organizational Infrastructure

• A1. Governance & Organizational Viability– A1.2. “Repository has an appropriate, formal succession

plan, contingency plans, and/or escrow arrangements in place in case the repository ceases to operate or the governing or funding institution substantially changes in scope.”Evidence: succession plan(s); escrow plan(s); explicit and specific statement documenting the intent to ensure continuity of the repository and the steps taken and to be taken to ensure continuity; formal documents describing exit strategies and contingency plans; depositor agreements

– H-Net: No succession plan currently in place– Narrow the gap: Identify, negotiate with, and make

preliminary plans with potential successor; write policy that states intent and describes what’s needed in successor

Page 25: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:A. Organizational Infrastructure

• A3. Procedural accountability & policy framework– A3.3. “Repository maintains written policies that specify the

nature of any legal permissions required to preserve digital content over time, and repository can demonstrate that these permissions have been acquired when needed.”Evidence: deposit agreements; records schedule; digital preservation policies; records legislation and policies; service agreements

– H-Net: Policy on Copyright/Intellectual Property, Constitution, By-Laws; authors retain copyright, sending message constitutes granting permission for distribution and (implicit) preservation

– No gap! Restate policies to include explicit permission to preserve messages; link to copyright policies from digital preservation plan

Page 26: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:B. Digital Object Management

• Repository functions, processes, and procedures needed to ingest, manage, and provide access to digital objects for the long term– B1. Ingest: Acquisition of Content– B2. Ingest: Creation of the Archival Package– B3. Preservation Planning– B4. Archival Storage & Preservation/Maintenance

of AIPs– B5. Information Management– B6. Access Management

Page 27: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:B. Digital Object Management

• B2 Ingest: Creation of the Archival Package– B2.1 “Repository has an identifiable, written definition for

each AIP or class of information preserved by the repository.”

Evidence: Documentation identifying each class of AIP and describing how each is implemented within the repository. Implementations may, for example, involve some combination of files, databases, and/or documents

– H-Net: Not documented at time of analysis– Narrow the gap: Preservation strategy document,

Information Packages document, clear policies

Page 28: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:B. Digital Object Management

• B4 Archival Storage & Preservation/Maintenance of AIPs– B4.4 “Repository actively monitors integrity of archival

objects (I.e., AIPs).”

Evidence: Logs of fixity checks (e.g., checksums); documentation of how AIPs and Fixity Information are kept separate

– H-Net: No integrity monitoring at time of analysis– Narrow the gap: Use MD5 hashes to perform fixity checks,

keep logs of most current fixity checks, document fixity process

Page 29: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:C. Technologies, Technical Infrastructure, & Security

• Adequacy of technical infrastructure and ability to meet management and security demands of repository and its digital objects– C1. System Infrastructure– C2. Appropriate Technologies– C3. Security

Page 30: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:C. Technologies, Technical Infrastructure, & Security

• C1. System Infrastructure– C1.1 Repository functions on well-supported operating

systems and other core infrastructural software

Evidence: Software inventory; system documentation; support contracts; use of strongly community-supported software (e.g., Apache)

– H-Net: H-Net servers run on Debian distribution of Linux– No gap! Include in digital preservation policy documents

Page 31: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

TRAC:C. Technologies, Technical Infrastructure, & Security

• C3. Security– C3.4 Repository has suitable written disaster preparedness

and recovery plan(s), including at least one off-site backup of all preserved information together with an off-site copy of the recovery plan(s).Evidence: ISO 17799 certification; disaster and recovery plans; information about and proof of at least one off-site copy of preserved information; service continuity plan; documentation linking roles with activities; local geological, geographical, or meteorological data or threat assessments

– H-Net: No plan in place, no off-site backups at time of assessment

– Narrowing the gap: Develop disaster recovery plan, off-site storage plans; tap into greater MSU disaster recovery plan

Page 32: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

The TRAC Experience

• Tedious—84 criteria!• Required consultations with systems admin,

other MATRIX and H-Net staff• Conducted over course of a week• Thorough yet flexible, leaving room for

interpretation, lots of options for supporting documentation/evidence

Page 33: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

The TRAC Experience

• Good snapshot of current state of repository • Clarifies what’s needed to narrow the gap• Great internal audit tool• Useful for certification of a trusted digital

repository

Page 34: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Integrated Rule Oriented Data System (iRODS) and TRAC

• Data grid software system developed by the Data Intensive Cyber Environments (DICE) group (http://www.irods.org)

• Provides a unified view and seamless access to distributed digital objects across a wide area network

• Goal: Automate archival processes, data management tasks, verification of assessment criteria (TRAC)

• How: Map management policies to rules enforced by data management system

• DICE wishes to use H-Net as iRODS testbed

Page 35: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Digital Preservation Management Workshop and TRAC

• Workshop developed by Nancy McGovern, Digital Preservation Officer at ICPSR

• Combines organizational and technological perspectives for institutions to develop appropriate responses to challenges of digital preservation– Two foundation documents

• Trusted Digital Repositories: Attributes and Responsibilities• OAIS Reference Model

– Attributes of a Trusted Digital Repository

Page 36: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Digital Preservation Management Workshop and TRAC

• Action plans for policy, technology, and resource frameworks that map to TRAC– Organizational Infrastructure Section A– Technological Infrastructure Sections B, C– Resource Framework Section A

• Wealth of policy examples• Tools for determining costs• 2009 workshops:

http://www.icpsr.umich.edu/dpm/

Page 37: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preservation Improvement Plan:Backup & Archival Storage

Backup• Backup log

• Secure storage

• Second permanent backup tape stored offsite

• Reciprocal backup storage arrangement with ICPSR

Archival Storage• Annual copying to tape of H-Net data, databases,scripts

• Media refreshment every 5 years

• Copy to alternative storage repository

• Participation in distributed archival storage system, such as iRODS or LOCKSS

Page 38: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preservation Improvement Plan:Authenticity

Fixity: Individual Messages (SIPs/AIPs)• Shorten time window for generation of MD5 hashes• Create database of MD5 hashes for fixity checks• Validate message hashes on notebook completion

Fixity: Notebook Files (AICs)• Create SHA-2 message digests on completion of notebooks• Calculate SHA-2 message digests for existing notebooks• Create database of SHA-2 message digests for fixity checks• Validate notebook hashes on weekly basis

Page 39: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preservation Improvement Plan:Authenticity

Accurate Message Creation Metadata• Build list editing web interface for editors• H-Net Council decided not to implement

Restriction of Editors’ Administration Capabilities• Eliminate editors’ ability to retrieve and change notebooks• Restrict notebook modification rights to MATRIX postmasters

H-Net Tampering Risk?• Low—staff with root system account privileges are trusted employees• No action required

Page 40: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preservation Improvement Plan:Attachments

Browser Access for Private Lists• Provide constructed URLs, as with public lists• Provide download links to attachments

Migration Strategy• Conduct inventory of attachments on H-Net-related lists• Establish or leverage technology watch • Provide links to websites containing conversion tools

Page 41: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

Preservation Improvement Plan:Narrowing the TRAC Gap

• Technical improvements underway• Digital preservation policies to be written• Lather, rinse, repeat:

New TRAC assessment• Future: Use TRAC to design MSU

institutional repository

Page 42: Preserving the H-Net E-Mail Lists: A Case Study in Trusted Digital Repository Assessment Lisa M. Schmidt lisa.schmidt@matrix.msu.edu

References

• H-Net Archives Project, http://www.h-net.org/archive/• H-Net: Humanities and Social Sciences Online,

http://www.h-net.org• MATRIX: The Center for Humane Arts, Letters, and Social

Sciences Online, http://www.matrix.msu.edu• OAIS Reference Model,

http://public.ccsds.org/publications/archive/650x0b1.pdf• Trusted Digital Repositories: Attributes and Responsibilities,

http://www.oclc.org/programs/ourwork/past/trustedrep/repositories.pdf

• Trustworthy Repositories Audit & Certification: Criteria and Checklist, http://www.crl.edu/PDF/trac.pdf