The “M” word: What works and what doesn’t in file format migration

SAA Research Forum
August 10, 2010
files.archivists.org/...SAA-ResearchForum2010.pdf

State Library of North Carolina ~ Digital Information Management Program
http://digital.ncdcr.gov ~ http://webarchives.ncdcr.gov ~ digital.info@ncdcr.gov


Page 1

The “M” word: What works and what doesn’t in file format migration

SAA Research Forum
August 10, 2010

Page 2

The State Library of North Carolina's Digital Information Management Program

Archive-It, CONTENTdm, local storage/OCLC’s Digital Archive

Page 3

Our Approach

• Look at what others were doing
• Look at what we had
• Identify how others were transforming the type of files that we had
• Determine the transformation path and tool
• Perform the transformation
• Evaluate the results

Image source: http://www.intersema.ch/company/strategy/p...

Page 4

What did we have?

• A/V
– MOV
– MP3
• Images
– GIF
– JPG
– TIFF
– PSD
– AI
• Text
– DOC
– DOCX
– PDF
– PPT
– PUB
– RTF
– TXT
– XLS
• Web
– CSS
– HTM
– ARC
– WARC

Image source: http://www.entertainmentearth.com/prodinfo.asp?number=...

Page 5

What was our transformation process?

• MP3 to WAV (ffmpeg)
• GIF, JPG, PSD formats to TIFF (PLANETS/ImageMagick)
• AI to SVG (Inkscape)
• DOC & DOCX formats to ODT (Xena)
• RTF to PDF/A (ConvertDoc)
• MOV to AVI (ffmpeg)
• PPT to ODP (Xena)
• CSS to TXT (Xena)
• PUB to PDF/A (Zamzar)
• XLS to ODF (Xena)

Image: (CC) Larry D. Moore
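As an illustration only, the migration paths on this slide can be written down as a lookup table. The sketch below is in Python, which the slides do not mention. `MIGRATION_PATHS`, `plan_migration`, and `ffmpeg_command` are hypothetical names, and the file extensions are assumed from the format names. The ffmpeg call is only the basic `ffmpeg -i input output` form, since the slides do not record the options actually used.

```python
from pathlib import PurePath

# Source extension -> (target extension, tool), as listed on the slide.
# Extensions are assumptions derived from the format names.
MIGRATION_PATHS = {
    ".mp3": (".wav", "ffmpeg"),
    ".gif": (".tif", "PLANETS/ImageMagick"),
    ".jpg": (".tif", "PLANETS/ImageMagick"),
    ".psd": (".tif", "PLANETS/ImageMagick"),
    ".ai": (".svg", "Inkscape"),
    ".doc": (".odt", "Xena"),
    ".docx": (".odt", "Xena"),
    ".rtf": (".pdf", "ConvertDoc"),  # PDF/A target
    ".mov": (".avi", "ffmpeg"),
    ".ppt": (".odp", "Xena"),
    ".css": (".txt", "Xena"),
    ".pub": (".pdf", "Zamzar"),      # PDF/A target
    ".xls": (".odf", "Xena"),        # written as ODF on the slide
}


def plan_migration(filename: str) -> tuple[str, str]:
    """Return (target filename, tool) for a source file, per the table above."""
    path = PurePath(filename)
    target_ext, tool = MIGRATION_PATHS[path.suffix.lower()]
    return str(path.with_suffix(target_ext)), tool


def ffmpeg_command(source: str, target: str) -> list[str]:
    """Build the simplest ffmpeg invocation: one input file, one output file."""
    return ["ffmpeg", "-i", source, target]
```

A file whose extension is not in the table raises `KeyError`, which makes unplanned formats visible instead of silently skipping them.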


Page 6

What did we look for?

• In the tools
– Free, open source, documented/supported
– Easy to use (preferably with GUI…not command line)
– Versatile (transforms multiple formats, single or batch, etc.)
– Reporting
• In the transformations
– Minimal loss of content/minimal degradation of audio/video quality
– Retention of basic data structure (e.g., text structure of .css)
– Retention of metadata
– Retention of interactive functions (e.g., layers in images, formulas/macros in spreadsheets, hyperlinks)
– Retention of security
– Retention of embedded content (e.g., audio, images)

Image source: http://www.talentgate.ro/career_dev_process/index/item/evaluate
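One way to apply the transformation criteria consistently is to record a yes/no result per criterion for each migrated file. The Python sketch below is an assumption about how such a record might look, not the program's actual evaluation form. The class and field names are hypothetical and simply mirror the six criteria on the slide.

```python
from dataclasses import dataclass, fields


@dataclass
class TransformationCheck:
    """One migrated file scored against the slide's transformation criteria."""
    content_retained: bool
    data_structure_retained: bool
    metadata_retained: bool
    interactive_functions_retained: bool
    security_retained: bool
    embedded_content_retained: bool

    def failures(self) -> list[str]:
        """Names of the criteria that were not met, in declaration order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Example: a conversion that kept everything except the data structure.
check = TransformationCheck(True, False, True, True, True, True)
print(check.failures())
```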

Page 7

Drum Roll Please…

Image: Walt Disney Studios

Page 8

Findings: Audio/Video Files

Couldn’t find a free/open‐source .mov to .mj2 conversion tool so…
• Converting .mov to .avi resulted in significant degradation of the quality of the video
• Converting .mov to .mpeg resulted in a similar degradation (as we expected)

Converting .mp3 to .wav resulted in no noticeable loss of content or quality
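The slides do not say which ffmpeg options were used. As a hedged sketch, and not a diagnosis of the degradation reported here, the Python snippet below contrasts two ways of building an ffmpeg command. The first lets ffmpeg re-encode for the target container. The second asks it to copy the existing streams with `-c copy`. The function names are hypothetical, and whether the streams in a given .mov can be copied into .avi depends on the codecs in that file.

```python
import shlex


def reencode_command(source: str, target: str) -> list[str]:
    """Plain conversion: ffmpeg picks encoders based on the target container."""
    return ["ffmpeg", "-i", source, target]


def stream_copy_command(source: str, target: str) -> list[str]:
    """Ask ffmpeg to copy the audio/video streams as-is into the new container."""
    return ["ffmpeg", "-i", source, "-c", "copy", target]


# Print the commands for review instead of running them.
print(shlex.join(reencode_command("clip.mov", "clip.avi")))
print(shlex.join(stream_copy_command("clip.mov", "clip.avi")))
```

Printing the command first keeps a record of exactly what was run, which also serves the "Reporting" criterion for tools.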

Page 9

Example: Video Degradation

.mov (original)

Page 10

Example: Video Degradation

.avi (transformation)

Page 11

Findings: Image Files

None of the conversions resulted in any noticeable loss of content, however…

– Converting .ai to .svg caused some font formatting and coloring to be a bit off
– Converting .gif to .png resulted in degraded image quality
– Converting .jpg to .tif resulted in some metadata loss

Interestingly, converting .psd to .tif resulted in marginal image quality improvement
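Metadata loss of the kind seen in the .jpg to .tif conversion can be listed field by field once the metadata of both files is available as key/value pairs. The Python sketch below assumes some extraction tool has already produced those pairs (the slides do not name one). The function name, field names, and values are hypothetical.

```python
def lost_metadata(before: dict[str, str], after: dict[str, str]) -> dict[str, str]:
    """Return the original fields that are missing or changed after migration."""
    return {key: value for key, value in before.items() if after.get(key) != value}


# Hypothetical metadata for an original file and its migrated copy.
original = {"Make": "ExampleCam", "DateTimeOriginal": "2010:08:10 09:00:00"}
migrated = {"Make": "ExampleCam"}
print(lost_metadata(original, migrated))
```

Fields that exist only in the migrated file are ignored, since the question here is what the original lost.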

Page 12

Example: Font Change

.ai (original)

.svg (transformation)

Page 13

Findings: Text Files

Again, none of the conversions resulted in any noticeable loss of content, and…
– Converting .css to .txt resulted in no noticeable loss of data structure
– In general, converting to open document formats resulted in links and embedded content remaining intact, as well as interactive functionality like formulas and macros

However…
– Converting .doc to .odt resulted in table and tab structures being lost
– Converting .docx to .odt resulted in some formatting (like bullets) being altered
– Converting .ppt to .odp resulted in the loss of page numbers and changes to fonts and footers
– Converting .pub to .pdf/a resulted in the loss of the ability to edit the file
– Converting .rtf to .pdf/a and .xls to .odf resulted in the loss of some metadata
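A structural loss such as the missing tables in the .doc to .odt conversion can be spot-checked without opening the file in an office suite. An .odt file is a ZIP package whose `content.xml` holds the document body, with tables stored as `table:table` elements. The Python sketch below counts those elements using only the standard library. The function name is hypothetical, and the number of tables expected from the original .doc has to come from the reviewer.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# OpenDocument namespace for table elements.
TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"


def count_odt_tables(odt_bytes: bytes) -> int:
    """Count <table:table> elements in the content.xml of an ODT package."""
    with zipfile.ZipFile(io.BytesIO(odt_bytes)) as package:
        root = ET.fromstring(package.read("content.xml"))
    return sum(1 for _ in root.iter(f"{{{TABLE_NS}}}table"))
```

For a file on disk, `count_odt_tables(open(path, "rb").read())` returns the count, and a result of 0 for a document known to contain tables would flag the kind of loss described here.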


Page 14

Example: Wonky Footer

.odp (transformation)

.ppt (original)

Page 15

What we found

• In general, the migrations went much better than we had expected ‐ the loss we experienced was loss that we could tolerate
• Loss of content doesn’t seem to be an issue for these more “modern” file types
• Need to be comfortable using a command line to invoke migration actions with the open source tools available today

Image source: http://physics.wku.edu/~bonham/Talks/FacultyWorkshop/index.php?part=Motivation&item=Dump

Page 16

What can we learn from you?

• Do you know any free, open source tools to convert:
– .mov to .mj2
– .tif (compressed) to .tif (uncompressed)
– .arc to .warc
• How does your process differ from ours? Do you have any suggestions for us?
• If you have transformed older files, how did your results differ from ours? Are there issues we might not expect that we should be prepared to face?

Image source: http://garrmdc.blogspot.com/2007/06/knowledge-management.html

Page 17

[email protected]

http://digital.ncdcr.gov ~ http://webarchives.ncdcr.gov ~ digital.info@ncdcr.gov
State Library of North Carolina ~ Digital Information Management Program