WAX: A candle in the darkness A digital to digital project Wendy Gogel, Andrea Goethals Harvard University Library, Office for Information Systems May 1, 2009


Page 1

WAX: A candle in the darkness

A digital to digital project

Wendy Gogel, Andrea Goethals

Harvard University Library, Office for Information Systems

May 1, 2009

Page 2

Today’s Journey

• The Darkness – The Web: Introducing the challenge of web archiving

• The Candle – WAX: HUL’s Web Archive Collection Service

• The Light – The Collections: Demonstrating the results

Page 3

The Darkness: The Web

Page 4

The Challenges of Web Archiving

• A fleeting record – here today, gone tomorrow
• Government Documents
• Public Debate
• Culture
• Personal expression
• University Output

Page 5

Harvard Magazine May/June 2009

Page 6

Curator Activities

• Selection
• Acquisition
• Rights management
• Quality assurance
• Arrangement
• Storage
• Description and indexing for discovery (cataloguing, searching, browsing)
• Presentations and exhibitions
• Preservation

Page 7

IP and Other Legal Risks

• Copyright infringement
• State tort liability
• Civil damages, resulting from invasion of privacy, sensitive personal data, commercial content, defamatory content
• Statutory content restrictions
• Foreign Laws

Page 8

Preservation Challenges

• We were not there at creation
• Viruses more likely
• Formats misidentify themselves
• A lot of formats are invalid (especially HTML)
• It’s a moving target – what should we preserve?
• Evolving born digital formats
• Proliferation of formats
• Partial capture
• Complex behaviors and styles
• Complex delivery to maintain
• Hyperlinked resources
• Multiple renderers will continue to evolve

Page 9

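2006/07 Alternatives

Wayback (IA)
• Selection: No
• Crawling: Yes
• Management (QA and Metadata): No
• Storage: Yes
• Preservation: Partial - Replicated storage – Not Harvard owned
• Discovery and Display: No full text searching

Contract IA
• Selection: Yes
• Crawling: Yes
• Management (QA and Metadata): No, handle in-house
• Storage: No, handle in-house
• Preservation: No, handle in-house
• Discovery and Display: No, handle in-house

Archive It! (IA)
• Selection: Yes
• Crawling: Yes
• Management (QA and Metadata): Minimal, has since improved
• Storage: Yes
• Preservation: Partial - Replicated storage
• Discovery and Display: Minimal, has since improved
• Notes: 2008 costs: $16,000/yr; $2,000/yr Harvard copy

Customize IIPC Tools (WAX)*
• Selection: Yes
• Crawling: Yes
• Management (QA and Metadata): Yes
• Storage: Yes
• Preservation: More than others
• Discovery and Display: Yes

* Additional benefit of integration with HUL central services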

Page 14

The Candle: WAX

Page 15

HUL’s Web Archiving Project

• 2.5 year pilot project funded by LDI

• Key Goals
  1. Gain experience in domain
  2. Explore legal terrain
  3. Investigate sustainability of a Harvard web archiving service
    • quantify technical, human, and $ requirements
    • aim for operational efficiencies

Page 16

Project Players

1. Curators and Collection Managers
  • Harvard University Archives
  • Schlesinger Library on the History of Women in America
  • Edwin O. Reischauer Institute of Japanese Studies

2. Legal Counsel – Office of General Counsel (OGC)

3. Technologists – OIS

Page 17

What Did We Build? WAX

Page 18

What Did We Build? WAX

Page 19

What Did We Build? WAX

Page 20

What Did We Build? WAX

Page 21

Third Party Software

• International Internet Preservation Consortium (IIPC) tools (www.netpreserve.org)
  • Heritrix
  • HCC
  • NutchWAX
  • Wayback
• JBoss
• Oracle
• Struts
• Tomcat
• Quartz job scheduler

Page 22

The Web is vast and interconnected.

How do you specify the part you want to capture?

Or “training a web crawler”…

Page 23

How to Train a Web Crawler

1. Tell it where to start
  • “Seed URI”
2. Tell it what to collect and where to stop
  • “Scope”
3. Tell it when and how often
  • “Schedule”
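The three instructions above map naturally onto a small record. The following is a minimal sketch, not WAX's actual data model: the class name `HarvestProfile`, its field names, and the schedule values are all assumptions made for illustration. Only the three scope names come from the slides.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Scope names as listed in the slides; schedule values are assumed.
VALID_SCOPES = {"directory", "host", "host+1"}
VALID_SCHEDULES = {"once", "daily", "weekly", "monthly"}  # hypothetical


@dataclass(frozen=True)
class HarvestProfile:
    """Hypothetical record of what a curator tells the crawler."""
    seed_uri: str   # where to start
    scope: str      # what to collect and where to stop
    schedule: str   # when and how often

    def __post_init__(self):
        parts = urlparse(self.seed_uri)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"seed must be an absolute http(s) URI: {self.seed_uri!r}")
        if self.scope not in VALID_SCOPES:
            raise ValueError(f"unknown scope: {self.scope!r}")
        if self.schedule not in VALID_SCHEDULES:
            raise ValueError(f"unknown schedule: {self.schedule!r}")


profile = HarvestProfile("http://www.example.org/blog/", scope="directory", schedule="weekly")
```

Validating at construction time means a malformed seed or an unknown scope is rejected before any crawl is scheduled.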

Page 27

Web Archiving Steps

1. Create a harvest profile
  Identify website URI (“seed”), define scope and schedule
2. Harvest web site
3. QA harvest
4. Send harvest to DRS
5. Index harvest
  Becomes searchable and viewable by users

A lot of work per website – which steps can be automated?

Page 28

Web Archiving Steps

Manual by curator → 1. Create a harvest profile

Automated by scheduler and crawler software → 2. Harvest web site

Manual by curator → 3. QA harvest

Manual by curator → 4. Send harvest to DRS

Automated by indexing software → 5. Index harvest

Page 29

Workflow Efficiencies

• Curator’s manual tasks:
  • Create a harvest profile
    • 3 scopes: Directory, host and host+1
    • Schedules
    • Global excluded URIs
  • QA harvests
    • Remove unwanted pieces
    • Detect missing pieces
    • Refinement of seed scope
  • Send harvests to DRS

How can the system help with these tasks?

Page 30

Efficiencies: QA Harvests

Page 31

• Exclude URIs from future crawls

• Delete URIs from harvest

• Delete URIs from harvest and Exclude them from future crawls
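The three QA actions differ only in which of two collections they touch: the current harvest and the exclusion list consulted by future crawls. A minimal sketch, assuming both can be modeled as sets of URIs; the function and action names are hypothetical and are not WAX's interface.

```python
def apply_qa_action(harvest, exclusions, uris, action):
    """Return (new_harvest, new_exclusions) after a curator's QA decision.

    harvest    -- URIs captured in the current harvest
    exclusions -- URIs that future crawls should skip
    uris       -- the URIs the curator selected
    action     -- "exclude", "delete", or "delete_and_exclude" (assumed names)
    """
    harvest, exclusions, uris = set(harvest), set(exclusions), set(uris)
    if action == "exclude":
        return harvest, exclusions | uris
    if action == "delete":
        return harvest - uris, exclusions
    if action == "delete_and_exclude":
        return harvest - uris, exclusions | uris
    raise ValueError(f"unknown action: {action!r}")


harvest = {"http://example.org/", "http://example.org/ads.js"}
new_harvest, new_exclusions = apply_qa_action(
    harvest, set(), ["http://example.org/ads.js"], "delete_and_exclude"
)
```

The combined action saves the curator from repeating the same cleanup after every scheduled crawl.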

Page 32

Efficiencies: Send Harvests to DRS

Page 33

The Ultimate Shortcut?

• Can pre-configure WAX to send harvests directly to the DRS
  • Skip QA step
  • Skip push to archive step

Page 34

Web Harvest Objects: Unit of Preservation in the DRS

• For each crawl starting from a seed URI:
  • One or more ARC files (*.arc.gz)
    • contain one or more “resources” - the individual HTML, JPEG, JavaScript, etc. files that make up the harvested web pages
  • Crawl log
    • records all URI requests, regardless of result
  • Crawler configuration
  • Metadata
    • descriptive, administrative, technical
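The parts listed above can be read as a completeness checklist for one crawl. The sketch below is an illustration under assumptions, not the DRS object model: the class name, field names, and file names are hypothetical, and the only rules checked are the ones the slide states (at least one `*.arc.gz` file, and three kinds of metadata).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WebHarvestObject:
    """Hypothetical container for the parts of one crawl from a seed URI."""
    seed_uri: str
    arc_files: List[str]    # one or more *.arc.gz files
    crawl_log: str          # records all URI requests, regardless of result
    crawler_config: str
    metadata: Dict[str, dict] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means all parts are present."""
        problems = []
        if not self.arc_files:
            problems.append("at least one ARC file is required")
        problems += [f"not an ARC file: {name}"
                     for name in self.arc_files if not name.endswith(".arc.gz")]
        for kind in ("descriptive", "administrative", "technical"):
            if kind not in self.metadata:
                problems.append(f"missing {kind} metadata")
        return problems


obj = WebHarvestObject(
    "http://example.org/",
    ["crawl-0001.arc.gz"],          # hypothetical file names
    "crawl.log",
    "crawler-config.xml",
    {"descriptive": {}, "administrative": {}, "technical": {}},
)
print(obj.validate())
```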

Page 35

WAX Legal Mitigations: Crawls

• Polite crawling
  • Obey robots.txt
  • Leave WAX crawler information in logs
• Employ a respectful “request frequency” during crawls
  • Don’t overload web servers
• Capture surface web only
  • No attempt to crawl protected content
• Choice of offsite crawler for curators
  • Non-Harvard IP address
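Obeying robots.txt and keeping a respectful request frequency can both be illustrated with Python's standard `urllib.robotparser` module. This is a generic sketch of polite crawling, not how WAX or Heritrix implements it. The user-agent string, the sample robots.txt, and the one-second fallback delay are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-archive-crawler"  # hypothetical crawler name

# A sample robots.txt as a site owner might publish it (assumed content).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str) -> bool:
    """True if the site's robots.txt allows this crawler to request the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds to wait between requests: honor Crawl-delay if the site sets one,
# otherwise fall back to an assumed default of one second.
delay_seconds = parser.crawl_delay(USER_AGENT) or 1.0

print(may_fetch("http://example.org/index.html"))
print(may_fetch("http://example.org/private/report.html"))
print(delay_seconds)
```

A real crawler would fetch each site's robots.txt over HTTP and sleep for `delay_seconds` between requests to the same host; both are omitted here so the sketch runs offline.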

Page 36

WAX Legal Mitigations: Use

• Don’t compete with or divert traffic from live site
  • Exclude robots from the WAX archive
• Add transformative content
  • Framing
  • Presentation pages with original intellectual content
• Embargo display for 3 months
• Link to live site
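The three-month display embargo amounts to a date comparison at the time a user requests an archived page. A minimal sketch, assuming the embargo is counted in calendar months from the harvest date; the slide does not say how WAX counts it, and the function names are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 3  # from the slide


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day
    to the end of the target month (e.g. Nov 30 + 3 months -> Feb 28)."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_displayable(harvested_on: date, today: date) -> bool:
    """True once the embargo period since the harvest has elapsed."""
    return today >= add_months(harvested_on, EMBARGO_MONTHS)


print(is_displayable(date(2009, 2, 1), date(2009, 4, 30)))
print(is_displayable(date(2009, 2, 1), date(2009, 5, 1)))
```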

Page 37

The Collections

• 191 “seeds” identified by curators for harvesting
• Stored in DRS:
  • Over 8 million web archive resources
  • 365.17 gigabytes of storage ($913/year)
  • 291 mime types
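The storage figures can be combined into two derived numbers. Neither appears on the slide: the per-gigabyte rate is inferred from the two stated values, and the average size assumes decimal gigabytes and treats "over 8 million" as exactly 8 million, so it is only an upper bound.

```python
stored_gb = 365.17          # from the slide
annual_cost_usd = 913       # from the slide
resources = 8_000_000       # "over 8 million", so a lower bound on the count

# Implied storage rate in dollars per gigabyte per year (derived, not stated).
implied_rate = annual_cost_usd / stored_gb

# Upper bound on the average stored size per resource, in kilobytes,
# assuming 1 GB = 1,000,000 KB.
avg_kb_upper_bound = stored_gb * 1_000_000 / resources

print(f"about ${implied_rate:.2f} per GB per year")
print(f"at most about {avg_kb_upper_bound:.1f} KB per resource")
```

On these assumptions the rate works out to roughly $2.50 per gigabyte per year, and the average resource to under about 46 KB.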

Page 38

application/x-download

message/rfc822

image/x-portable-anymap

javascript/x-javascript

application/bds

image/png?ver=074219b2138e87ecf980914471183dfc

application/xrds+xml

"text/xml"

image/x-bmp

gif

application/x-rar-compressed

Image/png

mime/type

image/null

text/troff

application/vnd.sun.xml.impress

text/enriched

application/icalendar

application-x/javascript

x-mapp-php4

imag/x-icon

application/x-shockwave-flash2-preview

Swish

image/x-photoshop

application/x-quicktimeplayer

application/x-java-vm

text/Javascript

text\css

application/x-Shockwave-Flash

png

text/x-c++

image/x-cmu-raster

httpd/yahoo-send-as-is

application/x-mpeg

Video/X-Flv

text/x-python

audio/x-scpls

application/pgp-keys

text/calendar

text/x-vcard

application/octet-string

application/x-troff-me

video/x-m4v

application/pgp-signature

image/x-portable-graymap

image/#{favicon_formats[format]}

image/files/curryjpg

test/xml

text/x-invalid

video/x-flv

text/javascript+json

Shockwave

audio/x-realaudio

chemical/mdl-rdf

content-type

text/text

Text/HTML

audio/mid

text/Calendar

application/x-wais-source

application/x-perl

image/txt

Applicationxm

PNG

x-png

unknown/unknown

text/x-javascript

application/octetstream

Image

application/x-sh

audio/x-mpegurl

audio/unknown

chemical/x-xyz

application/perl

application/x.atom+xml

application/octet_stream

video/mp4
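Many of the values above differ only in case, quoting, or stray parameters, while others are not valid `type/subtype` pairs at all. The sketch below shows one possible cleanup pass; it is an illustration, not the processing WAX performs. The function names are hypothetical, and the allowed-character set is an assumption based on the restricted-name rule in RFC 6838.

```python
import re

# Characters allowed in a media type or subtype name (assumed, per RFC 6838).
_TOKEN = r"[a-z0-9!#$&^_.+-]+"
_WELL_FORMED = re.compile(rf"{_TOKEN}/{_TOKEN}")


def normalize_mime(raw: str) -> str:
    """Strip quotes and parameters, repair a backslash separator, lowercase."""
    value = raw.strip().strip("\"'")
    value = re.split(r"[;?]", value, maxsplit=1)[0]   # drop ';charset=...' or '?ver=...'
    return value.replace("\\", "/").strip().lower()


def is_well_formed(raw: str) -> bool:
    """True if the normalized value has the shape type/subtype.

    Well-formed does not mean correct: 'test/xml' passes this check
    even though it is presumably a typo for 'text/xml'.
    """
    return _WELL_FORMED.fullmatch(normalize_mime(raw)) is not None


for sample in ['"text/xml"', "Text/HTML", "text\\css", "gif",
               "image/png?ver=074219b2138e87ecf980914471183dfc"]:
    print(sample, "->", normalize_mime(sample), is_well_formed(sample))
```

Normalizing before counting would collapse variants such as `Image/png` and `image/png`, but values like `gif` or `image/null` still need format identification from the file content itself.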

Page 39

The Light: The Collections

Page 40

The Partners

Megan Sniffin-Marinoff, University Archivist

A-Sites: Archived Harvard Web Sites collected by the Harvard University Archives

Marilyn Dunn, Executive Director of the Schlesinger Library and Librarian of the Radcliffe Institute

Blogs: Capturing Women's Voices collected by the Arthur and Elizabeth Schlesinger Library on the History of Women in America

Helen Hardacre, Reischauer Institute Professor of Japanese Religions and Society

Web Archiving Project on Constitutional Revision collected by the Reischauer Institute of Japanese Studies with Sponsorship from the Harvard College Library Documentation Center on Contemporary Japan

Page 41

To Participate

http://hul.harvard.edu/ois/systems/wax

Page 42

Questions?

“…we have rather chosen to fill our hives with honey and wax, thus furnishing mankind with the two noblest of things, which are sweetness and light.”

Jonathan Swift

Page 43

Image Credits

Title slide: http://www.flickr.com/photos/lwr/59014972/in/set-1552655/

The darkness: http://www.melegraph.com/images/outerspace.jpg

The candle: http://www.sxc.hu/pic/m/a/as/asolario/472153_peach_votive_candle.jpg

The Web: http://projecta-z.com/Internet_map_1024.jpg

The light: http://i252.photobucket.com/albums/hh2/habeba2007/candles-1-1.gif