Digital Preservation
Digitization Basics for Archives and Special Collections - Part 2: Store and Share
WiLSWorld 2015
UW Digital Collections Center
Steven Dast, Digital Asset Librarian
Jesse Henderson, Digital Services Librarian
Cat Phan, Digital Services Librarian
“. . . including practical steps you can take to preserve your digital content with limited resources.”
Characteristics of digital information
Strengths
● Easy to make and transmit perfect copies
● Machine-readable content and metadata facilitate automation
● Storage relatively inexpensive and becoming more so
Challenges
● Fragile, easily malleable
● Storage media not durable
● High density of storage
● Requires technology to render into human-readable form
  ○ Obsolescence
  ○ Early signs of loss may not be apparent
  ○ Loss generally extensive
Two primary stages in digital lifecycle
Creation stage
● Intense, focused action
● Maximize value of digital material
● Risk of errors
Preservation stage
● Long-term, sporadic action
● Minimize cost of maintenance
● Risk of failures
Strategies for digital preservation
Take advantage of our strengths:
● Make lots of copies in different places
● Automate file handling and management
Take steps to minimize challenges...
Strategies for both phases
Use broadly supported standard file formats that store uncompressed data:
TIFF for images
WAV for audio
● Mitigates obsolescence, data fragility
● Facilitates future bulk processing
Strategies for both phases
Work as consistently as possible; keep good records; document special cases
● Reduces cost of future preservation actions
Strategies for both phases
Use a file naming system that is simple and consistent, but flexible.
Remember: whatever system you choose is (almost) entirely for your convenience; to the computer they’re all just strings of characters.
Nevertheless, tool requirements (if and when they exist) override any other factors.
Strategies — File naming
Avoid spaces and special characters (/ \ : * ? ")
Use letters and numbers, underscore ( _ ), hyphen ( - ).
Dot ( . ) is okay, but has a special function: it separates the name from the file extension.
For broadest compatibility, use the 8.3 convention (up to eight characters for the name, three for the extension).
Don’t use capitalization for meaningful differences.
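The naming rules above can be checked automatically before files are archived. The following is a minimal sketch, not a UWDCC tool: the function names and the exact patterns are assumptions made for illustration, and the 8.3 check is reported as a separate note because it matters only when the broadest compatibility is needed.

```python
import re

# Assumed rule set drawn from the guidelines above: letters, digits,
# underscore and hyphen in the name, and a single dot before the extension.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9]+")
EIGHT_DOT_THREE = re.compile(r"[A-Za-z0-9_-]{1,8}\.[A-Za-z0-9]{1,3}")


def check_filename(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if not SAFE_NAME.fullmatch(name):
        problems.append("contains spaces, special characters, or extra dots")
    elif not EIGHT_DOT_THREE.fullmatch(name):
        problems.append("longer than 8.3")
    return problems


def case_collisions(names):
    """Group names that differ only by capitalization."""
    seen = {}
    for name in names:
        seen.setdefault(name.lower(), []).append(name)
    return [group for group in seen.values() if len(group) > 1]
```

Running `check_filename` over a directory listing before transfer, and `case_collisions` over the same listing, flags names that may behave differently on another operating system or file system.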
Strategies — File naming
Effects and side effects of file names
● Identity
● Order / sequencing (0122.tif 0123.tif 0124.tif)
● Collocation / grouping (ncb01.tif ncb02.tif ncb03.tif mca01.tif mca02.tif)
Strategies — File naming
Using meaningful file names can
● Facilitate error detection and recovery
  ○ Missing or misplaced files
● Aid ‘manual’ handling and checking of files
  ○ Name order = natural order
  ○ Name reflects content of file in some way
● Increase maintenance and correction costs
  ○ Insertion or deletion of files in a sequence
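Sequential names make missing files easy to detect by machine as well as by eye. The sketch below assumes each name ends in a run of digits before the extension (as in 0122.tif or ncb03.tif); the function name is hypothetical, and it should be run on one group of files at a time, since it ignores any alphabetic prefix.

```python
import re
from pathlib import Path


def missing_in_sequence(names):
    """Return the sequence numbers absent from a run of numbered files.

    Assumes each name ends in digits before the extension,
    e.g. 0122.tif or ncb03.tif (a convention assumed for this sketch).
    """
    numbers = set()
    for name in names:
        match = re.search(r"(\d+)$", Path(name).stem)
        if match:
            numbers.add(int(match.group(1)))
    if not numbers:
        return []
    # Every number between the lowest and highest seen should be present.
    return [n for n in range(min(numbers), max(numbers) + 1) if n not in numbers]
```

A gap reported here means either a missing file or an intentional deletion, which is exactly the maintenance cost that meaningful sequences carry.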
Strategies — File naming
Also use directories to help organize files
● Same naming conventions apply (avoid . )
● Same naming benefits and cautions
● Nesting directories allows for richer hierarchical relationships, but may foil some automation options
● Limit to 500-1000 files per directory when feasible
Strategies — File naming
UWDCC naming system for books:
One directory per volume, with flexible four-digit sequential filenames. Directories may be grouped for multi-volume monographs, by series, by project, or several of the above.
UWMad/Yearbooks/Yrbk1972/0001.tif
Strategies — File naming
UWDCC naming system for photographs:
Short alpha prefix with a flexible serial number; ad hoc system of separation into directories, usually based on serial number.
UWArchives/uwar02/uwar02345.tif
Strategies — File naming
Bottom line:
If you have technical requirements for file names, follow them.
Beyond that, choose a system that maximizes human utility, keeping in mind the balance between encoded meaning and requirements for maintenance.
Strategies for creation phase
Create high-quality digital surrogates sufficient to meet current and anticipated needs
● Encourages future investment in the material
Strategies for creation phase
Create backups of current work and maintain fall-back positions in case corrections are needed
● Reduces cost of errors
● Mitigates fragility and malleability of data
Strategies for creation phase
Check your work at major transitions, not just for quality issues, but also for completeness and accuracy
● Increases value of the collection
● Facilitates future processing
Strategies for preservation phase
Choose storage media that best match your resources and requirements.
● Make multiple copies so that you can react to failure
● If possible, mitigate technological risk by storing files on different types of media
● Mitigate risk of physical disasters by storing media in multiple locations
Strategies — Storage media
Technology                  Size             Stability                                      Cost
Flash storage               4 – 256 GB       5-20 years or less                             $0.50/GB
Hard drive (magnetic disk)  1 TB – ?         25-30 years, prone to mechanical failure       $0.05/GB +++
Magnetic tape               400 GB – 2.5 TB  25-30 years                                    $0.01–0.50/GB
CD-R                        630–700 MB       100–200 years for high-quality media (MAM-A)   $2.50/disc = $3.50/GB
DVD-R/+R                    4.7 GB           100–200 years (?) for high-quality media       $2.50–4.00/disc = $0.50-0.85/GB
The Cloud                   1 – 30 TB        ?                                              $0.002–0.10/GB monthly!
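The per-gigabyte figures for optical media follow from dividing the disc price by its capacity, while the cloud figure recurs every month. The sketch below works through that arithmetic; the function names are hypothetical, and the collection size and time span used in the note that follows are assumptions chosen only for illustration.

```python
def cost_per_gb(price_per_disc, capacity_gb):
    """One-time media cost per gigabyte."""
    return price_per_disc / capacity_gb


def cloud_cost(size_gb, rate_per_gb_month, years):
    """Total cost of a recurring monthly storage rate over several years."""
    return size_gb * rate_per_gb_month * 12 * years
```

A $2.50 CD-R holding 0.7 GB works out to about $3.57/GB, in line with the rounded table figure, and a $2.50 DVD holding 4.7 GB to about $0.53/GB. By contrast, an assumed 1,000 GB collection stored at $0.01/GB monthly costs $1,200 over ten years, which is why the monthly rate deserves attention.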
Strategies — Storage media
Over its history, UWDCC has used
● JAZ disks
● Duplicate CD-R
● Duplicate data tapes
● Hard drives with duplicate data tapes
We currently have ~18 TB of archived data
Strategies — Storage media
Recommended options for getting started
● CD-R or DVD-R/+R
  ○ Use the good stuff: MAM-A Gold Archive media
  ○ Always make duplicates
  ○ Consider supplementing with cloud storage
● Graduate to hard drives
  ○ Active RAID-enabled disks much safer than stand-alone hardware sitting on a shelf
● Add tape when technology staff can support
Strategies — Storage media
Avoid
● Flash drives: too unstable
● Reliance on the Cloud as your only archive
Strategies for preservation phase
Anytime you move data to a new medium or a new physical device, verify!
(Now that you’re no longer actively working with the files, it’s easy for a bad transfer to go unnoticed.)
If the new media/device can be write-protected, do so.
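One way to build verification into the transfer itself is to compare the copy against the source immediately and then remove write permission. This is a minimal sketch under stated assumptions: `copy_and_verify` is a hypothetical helper, `destination` is assumed to be a full file path, and the read-only file permission is only a software stand-in for real write protection on the medium.

```python
import filecmp
import os
import shutil
import stat


def copy_and_verify(source, destination):
    """Copy a file, confirm the copy is byte-identical, then mark it read-only."""
    shutil.copy2(source, destination)
    # shallow=False forces a comparison of contents, not just size and timestamps.
    if not filecmp.cmp(source, destination, shallow=False):
        raise IOError(f"Copy of {source} does not match the original")
    # Remove write permission from the new copy.
    os.chmod(destination, stat.S_IREAD)
    return destination
```

A byte-for-byte comparison works only while both copies are at hand; for checks made later, after the working copy is gone, stored checksums are needed.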
Strategies for preservation phase
Create checksums for each file that you archive
● Use now to verify files on transfer
● Use later to detect data degradation
● Also useful to determine whether files are actually the same or not
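The three uses above can be covered by one small manifest: record a checksum per file at archive time, then recompute and compare later. The sketch below is an illustration, not the UWDCC procedure; the function names are hypothetical and SHA-256 is an assumed choice of algorithm.

```python
import hashlib
from pathlib import Path


def file_checksum(path, algorithm="sha256"):
    """Hash a file in chunks so large masters need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to directory) to its checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_to_manifest(directory, manifest):
    """Return (missing, changed) file lists for a directory checked against a manifest."""
    current = build_manifest(directory)
    missing = [name for name in manifest if name not in current]
    changed = [name for name in manifest
               if name in current and current[name] != manifest[name]]
    return missing, changed
```

Build the manifest from the local files, save it alongside the collection metadata, and run `compare_to_manifest` against the archival copy after transfer and again at intervals. Two files with the same checksum can be treated as the same file for practical purposes.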
Strategies for preservation phase
Keep track (metadata!) of where your files are archived
● Material that can’t be located has not been preserved
● Will help to prioritize future preservation actions
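Location metadata can be as simple as one row per archived copy, which also makes it possible to spot material that needs attention first. The sketch below assumes a hypothetical inventory of (file name, media label, storage location) rows; the field layout, the function name, and the threshold of two copies in two locations are assumptions made for illustration.

```python
from collections import defaultdict


def under_protected(inventory, minimum_copies=2):
    """List files with too few copies or with all copies in one location.

    Each inventory row is (file name, media label, storage location);
    this layout is an assumption made for the sketch.
    """
    copies = defaultdict(int)
    locations = defaultdict(set)
    for name, media_label, location in inventory:
        copies[name] += 1
        locations[name].add(location)
    return sorted(
        name for name in copies
        if copies[name] < minimum_copies or len(locations[name]) < 2
    )
```

In practice the rows would come from the same spreadsheet or database that holds the rest of the collection metadata.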
UWDCC workflow
1. Metadata first: checklist for subsequent work
2-5. Working files organized under three directories: original, inprocess, final
  Initial scan to ‘original’ - never edited
  Copy to ‘inprocess’ - cleaned up for access
  Finished version to ‘final’ - metadata check
6. Distribution files created from ‘final’ masters
UWDCC workflow
7. ‘Click-through’ all images in test mode
7a. Once all is correct: public release!
8a. Recheck files against metadata
8b. Create checksums for local files
8c. Transfer files to archival media
8d. Verify checksums for transferred files
9. Now safe to delete working copies
UWDCC Tools
● Microsoft Excel or FileMaker Pro for metadata entry (sometimes Access)
● Variety of scanners chosen to maximize flexibility
● Manufacturer’s software / VueScan
● GoldenThread (ISA) for evaluating scanner quality
● Adobe Photoshop for image editing
● AppleScript for custom automation of various workflow tasks
● Built-in Unix functions for checksums, file-handling
Other tool options
Image editing:
GIMP (Windows, Mac, Linux)
Paint.net (Windows)
Automation:
VBScript, JScript, VBA (Windows)
Python (Windows, Mac, Linux)
Checksum and verification:
Fastsum, Checksum (corz.org) (Windows)
Summary
Both Phases
★ Use broadly supported standard file formats (TIFF, WAV)
★ Develop consistent workflow, document special cases
★ File naming - follow technical rules; design it for humans
  ○ Balance between using filename for meaning and keeping it easy to maintain
Creation Phase
★ Start with high-quality scans of source documents
★ Make backups of current work, maintain fall-back positions
★ Check work at major transitions
Preservation Phase
★ Storage media
  ○ Start: CD-R or DVD-R/+R, maybe supplement with Cloud
  ○ Step up: hard drives
  ○ Add tape if can support
  (Avoid flash drives and Cloud as sole archive)
★ Verify anytime you move things
★ Write-protect if you can
★ Create checksums
★ Metadata: Know what you have, where it is, and what you can do with it
Selected references and reading
General DP
http://digitalpowrr.niu.edu/wp-content/uploads/2014/05/Overwhelmed-to-action.rinehart_prudhomme_huot_2014.pdf
http://commons.lib.niu.edu/handle/10843/13610
http://files.eric.ed.gov/fulltext/ED426715.pdf
https://en.wikipedia.org/wiki/Digital_preservation
Filenaming
http://www.jiscdigitalmedia.ac.uk/guide/choosing-a-file-name
Storage media
http://www.nps.gov/museum/publications/conserveogram/22-05.pdf
Selected tools and resources
Scanning
http://www.hamrick.com (VueScan)
http://www.imagescienceassociates.com/ (GoldenThread)
Image editing
http://www.gimp.org
http://www.getpaint.net/index.html
Archival CDs and DVDs
http://www.mam-a-store.com
Scripting
http://www.pctools.com/guides/article/id/2/page/1/
https://www.python.org
http://macosxautomation.com/applescript/firsttutorial/index.html
Checksum tools
http://www.fastsum.com
http://corz.org/windows/software/checksum/
UWDC & Digital as Preservation
The UWDCC recently launched a pilot project in collaboration with our Preservation Department to develop standards and guidelines for utilizing digitization as a preservation medium at UW-Madison.
This presentation focuses primarily on workflow, and only on changes we can implement, and have implemented, in our current environment for preservation-level projects.
Detail from page 2 of ‘The modern priscilla’ Vol. XXXVI, No. V (July, 1922). The Dovie Horvitz Collection.
UWDC & Digital as Preservation
Equipment
Type                            Hardware                                                                        Software
High Speed scanning             Panasonic KV-S3065C High Speed Color Scanner                                    Reliable Throughput Image Viewer (RTIV)
Flatbed scanning                Epson Expression 10000XL (includes one with Epson A3 Transparency adapter)      Epson Scan Utility
                                Epson Expression 11000XL
Overhead Reprographic scanning  BetterLight Super 6K-HS Digital Scanning Back                                   ViewFinder camera control software
Slide scanning                  Nikon Super COOLSCAN 5000 ED film scanner                                       VueScan scanner software
Digital photography
UWDC & Digital as Preservation
The basics:
● What is Preservation? - Extending the useful life of our stuff.
● Why do we do it? Protect, Represent, Transcend.
Do something with those berries before they spoil! Pickle something! In essence, preservation is extending the useful life of our stuff.
Don’t let those veggies just turn into compost. Protect! Secure the value and usefulness of our resources.
Taste the summer sunshine in your veggies when you eat them out of season. Represent! We want our digital formats to be an authentic representation of the original.
Pickles and jam exist only when cucumbers and berries are transformed into something new. Transcend! Preserve originals to take advantage of and/or discover new uses.
UWDC & Digital as Preservation
Prep:
1. Identify
What do we have that needs preserving? Where did it come from?
2. Evaluate & Assess
Make sure our equipment and ingredients are up to the preservation process. Figure out how much we can handle at one time.
3. Select
Condition: Does one thing spoil faster than another? High use: Which items circulate the most? Scarcity: What are others not preserving?
4. Review your recipe
Consult the cookbooks (in our case FADGI) and make sure you’ve read through your recipe. Have everything you need before you start.
Steps 1 & 3 handled by our Preservation Department.
Steps 2 & 4 done by UWDCC.
UWDC & Digital as Preservation
What did this look like at UWDC?
● Researched current literature - focus on FADGI.
● Established baseline, optimum performance data for hardware - GoldenThread
UWDC & Digital as Preservation
FADGI = whoa…
Lots to digest! Our takeaways:
Evaluate and Assess our digitization environment & tweak our recipe
● Quantifying Scanner Performance
● Targets and software to use for this: GoldenThread
● Color Management
Appendix A: Digitizing for Preservation Reformatting of Photographs
Compare characteristics of preservation vs. production master files.
UWDC & Digital as Preservation
Using GoldenThread
● Flatbeds and Epson Scan software - customizing the color balance settings per scanner
● BetterLights and ViewFinder software - custom tone curves per set-up, per scanner
UWDC & Digital as Preservation
Using targets and software to determine performance
3s: +/- 6 aim points
4s: +/- 3 aim points
UWDC & Digital as Preservation
Established baseline, optimum performance.
Establish maintenance schedule.
UWDC & Digital as Preservation
Monthly:
Check BetterLight and Flatbed performance against baseline performance with GoldenThread
Quarterly:
Calibrate monitors on reformatting supervisors’ computers
Zig-Align BetterLights (or more frequently if needed)
Biannually:
Calibrate scanning station monitors
Calibrate and characterize BetterLights (create new baseline tone curves in the software)
Calibrate and characterize Flatbeds (update histogram settings)
UWDC & Digital as Preservation
Access recipe:
● 300 dpi
● 24-bit color (or grayscale on our high speed scanner)
● Flatbed, BetterLight or high speed scanner
● Custom tone curves on BL software per set-up
● Custom histograms on Flatbeds
● Cropping borders based on project
● “Cooked” masters archived
Original object itself is the preservation master (you intend to hold onto it) and digital surrogates are for access.
UWDC & Digital as Preservation
Preservation recipe:
● 400 dpi
● 24-bit color
● BetterLight only (for now)
● Custom tone curves per project/issue
● Object target captured per page/scan
● Device target per project/issue/day
● Always crop outside the pages
● “Raw” and “Cooked” masters archived
Digital version expected to be the preservation master in the absence of the original object, therefore highest possible fidelity is desired.