beecher cni fall 2010 v4

Post on 17-Jan-2015

DESCRIPTION

This is a talk from the Coalition for Networked Information Fall 2010 Member Meeting (CNIfall2010). I talked about our project to use Fedora as archival storage for social science research data and documentation.

TRANSCRIPT

Preserving Social Science Research Data Using Fedora

Bryan Beecher

Inter-university Consortium for Political and Social Research (ICPSR)

CNI Fall 2010 Membership Meeting

ICPSR

• World’s largest social science research data archive
  – Lots of files (millions)
  – Small files (6TB total)
• Long track record of success
  – 50 yrs
  – Trust us
  – Enormous legacy burden

ICPSR

• Survey data are our core
  – Low volume of new content compared to natural sciences
  – We curate each item extensively (disclosure, quality, format, usability)
• Strong access orientation
  – Talk like an archive
  – Walk like an archive?

Walking the walk

• Good storage container for content and its metadata
• OAIS-compliant
• Generate SIPs and AIPs (and DIPs)
• But…

What should we do?

Where to begin?

Focus areas
• Preservation
• Going forward
• Reusable

Do not try to include
• Access
• Everything we have

A Solution

• Fedora objects
  – Container for stuff we ingest and preserve
• Fedora services
  – To generate AIPs and SIPs
• Tool to generate FOs from existing content and metadata

Ingest

• The Motivated Depositor
  – Eager to describe the research data in great detail
  – Uploads complete, machine-readable metadata

Ingest (continued)

• The Unmotivated Depositor
  – Uploads a variety of proprietary file formats for documentation and data
  – Leaves the baby on the doorstep

Ingest – Nov 2010 deposits

Ingest (continued)

• Typical deposit
  – Research data in one of the common stat packages (SAS, SPSS, etc.)
  – Technical documentation in a proprietary format (Word, PDF)
  – A proto-SIP in quasi-OAIS terms
  – Minimal level of metadata regarding how the survey was conducted

Ingest container – file level

• Vanilla Fedora Object
  – Will never know what sort of content format to expect
  – Use the RELS-EXT to connect related files
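As a rough illustration of connecting a file-level object to its deposit through RELS-EXT, the sketch below builds the RDF/XML with Python's standard library. The helper name `build_rels_ext` and the PIDs in the usage note are hypothetical. The `isPartOf` predicate comes from Fedora's relations-external ontology and matches the relationship shown in the talk's diagrams, but the predicates ICPSR actually uses may differ.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
REL_NS = "info:fedora/fedora-system:def/relations-external#"


def build_rels_ext(file_pid: str, deposit_pid: str) -> str:
    """Return RELS-EXT RDF/XML stating that file_pid isPartOf deposit_pid.

    Hypothetical helper: the PIDs and the choice of predicate are assumptions.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("rel", REL_NS)

    rdf = ET.Element(f"{{{RDF_NS}}}RDF")
    # The subject of the statement is the file-level object itself.
    description = ET.SubElement(
        rdf,
        f"{{{RDF_NS}}}Description",
        {f"{{{RDF_NS}}}about": f"info:fedora/{file_pid}"},
    )
    # The object of the statement is the deposit-level object.
    ET.SubElement(
        description,
        f"{{{REL_NS}}}isPartOf",
        {f"{{{RDF_NS}}}resource": f"info:fedora/{deposit_pid}"},
    )
    return ET.tostring(rdf, encoding="unicode")
```

Calling `build_rels_ext("icpsr:file-1", "icpsr:deposit-1")` (both PIDs made up) yields a single `rdf:Description` that could be stored as the RELS-EXT datastream of the file-level object.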

Ingest container – deposit

• Another plain Fedora Object
  – Points to all of the files stored in the file-level objects
  – Relatively little metadata stored for this level of object

Ingest container – example

Ingest and the OAIS PDI

• Reference – unique Fedora PID
• Fixity – Fedora-generated checksum
• Provenance – identity of depositor recorded in the DC Datastream
• Context – original file name captured in the content Datastream
• Access Rights – terms of deposit
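The five PDI categories above can be pictured as one small record per file. The sketch below is only an illustration of that shape: the class and function names are hypothetical, and the checksum is computed locally with SHA-256 as a stand-in, whereas in the talk Fedora generates the checksum itself and the algorithm is not stated.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PDI:
    """Hypothetical record holding the OAIS PDI categories for one file."""
    reference: str      # unique Fedora PID
    fixity: str         # checksum (Fedora-generated in the talk)
    provenance: str     # identity of the depositor (kept in the DC datastream)
    context: str        # original file name (kept in the content datastream)
    access_rights: str  # terms of deposit


def make_pdi(pid: str, content: bytes, depositor: str,
             original_name: str, terms: str) -> PDI:
    # Assumption: SHA-256 computed locally, purely to show where fixity fits.
    digest = hashlib.sha256(content).hexdigest()
    return PDI(
        reference=pid,
        fixity=f"SHA-256:{digest}",
        provenance=depositor,
        context=original_name,
        access_rights=terms,
    )
```

A call such as `make_pdi("icpsr:file-1", b"...", "depositor@example.org", "survey.sav", "standard terms")` uses made-up values throughout.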

Generating OAIS SIPs

• Original content
  – Normalized version too, if applicable
  – What’s normalization in this context?
• Preservation Description Information (PDI)
  – As described previously
• Delivered via SDef/SDep combo

Ingest – continued

• Data
  – Disclosure analysis
  – Recoding
• Documentation
  – Corrections
  – Clarifications
• Normalized formats

Ingest – finale

• Packaged into a “study”
  – Data, doc, questionnaire, user guide, etc.
  – Normalized formats for preservation
  – Convenient formats for access

Ingest – finale

[Diagram: Fedora objects and their datastreams]
• Object labeled "PID": REPORT (text/plain), objectProperties, DC, RELS-EXT, AUDIT
• icpsr:release-28748-file-3: QUESTIONNAIRE (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT
• icpsr:release-28748-file-1: STATA-DICT (text/plain), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT, DATA (text/plain), DDI (text/xml), SAS-SETUPS (text/plain), SPSS-SETUPS (text/plain), STATA-SETUPS (text/plain)
• icpsr:release-28748-file-2: CODEBOOK (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT

Generating OAIS AIPs

• For each object (file)
  – Everything from the SIP plus
    • Preservation events
    • Description of the transformation used
    • Preservation commitment
  – Its post-processed version
• Delivered via SDef/SDep combo

Example AIP

[Diagram: Fedora objects and their datastreams]
• Object labeled "PID": REPORT (text/plain), objectProperties, DC, RELS-EXT, AUDIT
• icpsr:release-28748-file-3: QUESTIONNAIRE (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT
• icpsr:release-28748-file-1: STATA-DICT (text/plain), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT, DATA (text/plain), DDI (text/xml), SAS-SETUPS (text/plain), SPSS-SETUPS (text/plain), STATA-SETUPS (text/plain)
• icpsr:release-28748-file-2: CODEBOOK (application/pdf), objectProperties, DC, RELS-EXT (isPartOf: release-15868), AUDIT
• Object labeled "PID": objectProperties, DC, RELS-EXT, AUDIT

Questions we faced

• Datastreams or relationships?
• What about our XML?
• AIPs or DIPs?
• How to build FOXML?

Datastreams / relationships?

[Diagram: two layouts compared]
• Two objects, each with its own PID, objectProperties, DC, RELS-EXT and AUDIT: one holds CONTENT X, the other holds CONTENT Y
• One object (PID, objectProperties, DC, RELS-EXT, AUDIT) holding both CONTENT Y and CONTENT X

Our XML

• DDI v2
  – Contains lots of the information one might expect to find in the DC
• Strategy
  – Duplicate it
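One way to picture the "duplicate it" strategy is a small mapping that copies selected DDI codebook fields into Dublin Core elements. The sketch below is an assumption-laden illustration, not the project's tool: the sample document, the mapping table and the function name are hypothetical, only two fields are mapped, and the sample omits the XML namespace that real DDI files normally declare.

```python
import xml.etree.ElementTree as ET

# Made-up, namespace-free DDI 2 style fragment used only for illustration.
SAMPLE_DDI = """<codeBook>
  <stdyDscr>
    <citation>
      <titlStmt><titl>Example Survey, 2010</titl></titlStmt>
      <rspStmt><AuthEnty>Doe, Jane</AuthEnty></rspStmt>
    </citation>
  </stdyDscr>
</codeBook>"""

# Hypothetical mapping: Dublin Core element -> path inside the DDI codebook.
DDI_TO_DC = {
    "title": "stdyDscr/citation/titlStmt/titl",
    "creator": "stdyDscr/citation/rspStmt/AuthEnty",
}


def ddi_to_dc(ddi_xml: str) -> dict:
    """Copy the mapped DDI fields into a dict of Dublin Core values."""
    root = ET.fromstring(ddi_xml)
    dc = {}
    for dc_name, path in DDI_TO_DC.items():
        # Keep every non-empty match, since DC elements are repeatable.
        values = [el.text.strip() for el in root.findall(path)
                  if el.text and el.text.strip()]
        if values:
            dc[dc_name] = values
    return dc
```

The DDI stays the richer record. The copy simply makes the same facts visible to tools that read only the DC datastream.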

AIPs or DIPs

• Lots of copies
• Destination
  – Archival Storage remote location
  – Repository for ingest

Building FOXML

• Source
  – Database
  – DDI XML
• Re-usable tool

Special Thanks

The Team
• Peggy Overcashier
• Nathan Adams
• Nancy McGovern
• Mary Vardigan

The Funder
• National Science Foundation Award 0958382
• INTEROP EAGER program
