caught in the web: web archiving at u of a libraries

Caught in the Web: Web Archiving at U of A Libraries

Geoff Harder and Kenton Good
Digital Preservation Seminar | March 5, 2010 | University of Alberta

Official children’s site of the 2000 Sydney Olympics - MIA: http://www.olympics.com/eng/kids/index.html?/eng/kids/home.html

GeoCities: 1995-2009

http://www.pcworld.com/article/163765/so_long_geocities_we_forgot_you_still_existed.html

Mind the Gap - UK

“If websites continue to disappear in the same way as those on President Bush and the Sydney Olympics - perhaps exacerbated by the current economic climate that is killing companies - the memory of the nation disappears too. Historians and citizens of the future will find a black hole in the knowledge base of the 21st century.”

Quote: http://www.guardian.co.uk/technology/2009/jan/25/internet-heritage

“New definitions need to be created for determining the scope of digital special collections, so that stakeholders can understand the nature of special collections professionals’ responsibilities. These include a responsibility for harvesting and preserving endangered web sites, wikis and other dynamic information resources.”

Digital Special Collections

Special Collections in ARL Libraries – March 2009
A Discussion Report from the ARL Working Group on Special Collections

Looking ahead…

234 million – The number of websites as of December 2009.

47 million – Added websites in 2009.

126 million – The number of blogs on the Internet (as tracked by BlogPulse).

27.3 million – Number of tweets on Twitter per day (November 2009).

350 million – People on Facebook.

4 billion – Photos hosted by Flickr (October 2009).

12.2 billion – Videos viewed per month on YouTube in the US (November 2009).

http://royal.pingdom.com/2010/01/22/internet-2009-in-numbers/

Does the web matter?
Only if our cultural, historical, political, economic, and social memories matter.

Valuable BUT vulnerable – e.g. foundation loses funding; can only afford digital publishing.

Research and analysis – longitudinal view requires a complete picture.

SOMEONE needs to take responsibility for it.

Web Archiving

Web Archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection.
Wikipedia, “Web Archiving”

how web archiving works

A web crawler (ant, bot) is a computer program that browses and harvests (captures, collects) the World Wide Web in a methodical, automated manner.
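To make the crawler definition above concrete, here is a minimal Python sketch of the browse-and-harvest loop. It is an illustration only, not the crawler that Archive-It or the Internet Archive actually runs. The names `crawl`, `LinkExtractor`, and the `fetch` callable are assumptions made for this example, and the "web" is a small in-memory dictionary so the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first harvest starting from the seed URLs.

    `fetch` is a hypothetical callable mapping a URL to its HTML text,
    or to None when the page cannot be retrieved. Only links on the
    same hosts as the seeds are followed.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue = deque(seeds)
    seen = set(seeds)
    archive = {}  # captured pages, keyed by URL
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # page is gone or unreachable; nothing to capture
        archive[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Resolve relative links and drop any #fragment.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# Toy stand-in for the live web (made-up pages).
site = {
    "http://example.org/": '<a href="/about.html">About</a> '
                           '<a href="http://other.example/">Elsewhere</a>',
    "http://example.org/about.html": '<a href="/">Home</a> '
                                     '<a href="missing.html">Gone</a>',
}
archive = crawl(["http://example.org/"], site.get)
```

In this sketch the seed list plays the role of a collection's seeds: the crawler captures the two pages on the seed host, ignores the link to another host, and skips the page that can no longer be fetched.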

ARCHIVE-IT

Web Archive Admin Screen

HCF Collection

Seed Management

Reports

File Type Report

Blocked Content Robots.txt
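For readers unfamiliar with robots.txt: it is a plain-text file in which a site tells crawlers which paths they should not fetch, which is presumably why some content appears as blocked in a report like this one. The sketch below uses Python's standard `urllib.robotparser` to show the check a polite crawler might make. The function name `blocked_urls`, the user agent string, and the sample rules are assumptions for illustration, not Archive-It's behaviour.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="example-archiver"):
    """Return the URLs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Made-up rules: every crawler is asked to stay out of /private/.
rules = "User-agent: *\nDisallow: /private/\n"
candidates = [
    "http://example.org/index.html",
    "http://example.org/private/report.html",
]
blocked = blocked_urls(rules, candidates)
```

Here only the URL under `/private/` ends up in `blocked`; the home page remains eligible for capture.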

Web Archive Launch Page

Exposing Hidden Content

U of A Web Archive

Partner with Internet Archive on the use of Archive-It

Three targets: (criteria: thematic, regional, event-based, organizational)
1) Heritage Community Foundation (collection at risk)
2) University of Alberta websites
3) Western Canadian materials (e.g. political websites)

A few resources

University of Alberta Web Archive: <www.archive-it.org/home/ualwebarchive>

Archive-it! and Wayback Machine <www.archive.org/web/web.php>

IIPC – International Internet Preservation Consortium

Use Cases for Access to Internet Archives, IIPC Access Working Group, <netpreserv.org>

Special Collections in ARL Libraries, Report March 2009

GoC Web Archive <http://www.collectionscanada.gc.ca/webarchives/index-e.html>

thanks

Geoff Harder
Digital Initiatives
[email protected]

Kenton Good
Web Development
[email protected]
