Caught in the Web: Web Archiving at U of A Libraries

Geoff Harder and Kenton Good
Digital Preservation Seminar | March 5, 2010 | University of Alberta



TRANSCRIPT

Page 1: Caught in the Web:   Web Archiving at U of A Libraries

Caught in the Web: Web Archiving at U of A Libraries

Geoff Harder and Kenton Good
Digital Preservation Seminar | March 5, 2010 | University of Alberta

Page 2: Caught in the Web:   Web Archiving at U of A Libraries

Official children’s site of the 2000 Sydney Olympics - MIA: http://www.olympics.com/eng/kids/index.html?/eng/kids/home.html

Page 3: Caught in the Web:   Web Archiving at U of A Libraries

GeoCities: 1995-2009

http://www.pcworld.com/article/163765/so_long_geocities_we_forgot_you_still_existed.html

Page 4: Caught in the Web:   Web Archiving at U of A Libraries

Mind the Gap - UK

“If websites continue to disappear in the same way as those on President Bush and the Sydney Olympics - perhaps exacerbated by the current economic climate that is killing companies - the memory of the nation disappears too. Historians and citizens of the future will find a black hole in the knowledge base of the 21st century.”

Quote: http://www.guardian.co.uk/technology/2009/jan/25/internet-heritage

Page 5: Caught in the Web:   Web Archiving at U of A Libraries

Digital Special Collections

“New definitions need to be created for determining the scope of digital special collections, so that stakeholders can understand the nature of special collections professionals’ responsibilities. These include a responsibility for harvesting and preserving endangered web sites, wikis and other dynamic information resources.”

Special Collections in ARL Libraries: A Discussion Report from the ARL Working Group on Special Collections, March 2009

Page 6: Caught in the Web:   Web Archiving at U of A Libraries

Looking ahead…

234 million – The number of websites as of December 2009.
47 million – Added websites in 2009.
126 million – The number of blogs on the Internet (as tracked by BlogPulse).
27.3 million – Number of tweets on [Twitter] per day (November 2009).
350 million – People on [Facebook].
4 billion – Photos hosted by [Flickr] (October 2009).
12.2 billion – Videos viewed per month on [YouTube] in the US (November 2009).

http://royal.pingdom.com/2010/01/22/internet-2009-in-numbers/

Page 7: Caught in the Web:   Web Archiving at U of A Libraries

Does the web matter?

Only if our cultural, historical, political, economic, and social memories matter.

Valuable BUT vulnerable – e.g. a foundation loses funding; can only afford digital publishing.

Research and analysis – a longitudinal view requires a complete picture.

SOMEONE needs to take responsibility for it.

Page 8: Caught in the Web:   Web Archiving at U of A Libraries

Web Archiving

“Web Archiving is the process of collecting portions of the World Wide Web and ensuring the collection is preserved in an archive, such as an archive site, for future researchers, historians, and the public. Due to the massive size of the Web, web archivists typically employ web crawlers for automated collection.”

Wikipedia, “Web Archiving”

Page 9: Caught in the Web:   Web Archiving at U of A Libraries

how web archiving works

A web crawler (ant, bot) is a computer program that browses and harvests (captures, collects) the World Wide Web in a methodical, automated manner.
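The "methodical, automated manner" described above can be illustrated with a small sketch. This is not the crawler behind Archive-It; it is a minimal breadth-first harvester written for illustration, assuming the caller supplies a `fetch` function and that the crawl should stay on the seed's host. The names `crawl`, `LinkExtractor`, and `FAKE_SITE`, and the `example.org` pages, are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first harvest starting at `seed`, staying on the seed's host.

    `fetch` is a caller-supplied function (an assumption of this sketch)
    that maps a URL to its HTML text, or to None if it cannot be retrieved.
    Returns a dict of captured pages keyed by URL.
    """
    host = urlparse(seed).netloc
    queue = deque([seed])
    seen = {seed}
    archive = {}
    while queue and len(archive) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        archive[url] = html  # "harvest" the page
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before comparing.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# Hypothetical three-page site held in memory, so no network access is needed.
FAKE_SITE = {
    "http://example.org/": '<a href="/about.html">About</a> <a href="http://other.example/">Out</a>',
    "http://example.org/about.html": '<a href="/">Home</a> <a href="news.html#top">News</a>',
    "http://example.org/news.html": "<p>No links here.</p>",
}

captured = crawl("http://example.org/", FAKE_SITE.get)
print(sorted(captured))
```

Passing `fetch` in as a parameter keeps the sketch testable offline; a real harvester would also need politeness delays, robots.txt handling, and an archival storage format, none of which are shown here.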

Page 10: Caught in the Web:   Web Archiving at U of A Libraries

ARCHIVE-IT

Page 11: Caught in the Web:   Web Archiving at U of A Libraries

Web Archive Admin Screen

Page 12: Caught in the Web:   Web Archiving at U of A Libraries

HCF Collection

Page 13: Caught in the Web:   Web Archiving at U of A Libraries

Seed Management

Page 14: Caught in the Web:   Web Archiving at U of A Libraries

Reports

Page 15: Caught in the Web:   Web Archiving at U of A Libraries

Reports

Page 16: Caught in the Web:   Web Archiving at U of A Libraries

File Type Report

Page 17: Caught in the Web:   Web Archiving at U of A Libraries

Blocked Content: Robots.txt
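The slide itself is a screenshot, so the details of the report are not recoverable here. As a general illustration of how a crawler can tell which URLs a site's robots.txt blocks, the following sketch uses Python's standard `urllib.robotparser`. The robots.txt rules, the URLs, the `example-archiver` user agent, and the helper `blocked_urls` are all hypothetical, and this is not a description of how Archive-It produces its report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def blocked_urls(robots_txt, urls, user_agent="example-archiver"):
    """Return the URLs that the given robots.txt rules forbid for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules without any network access
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


candidates = [
    "http://example.org/index.html",
    "http://example.org/private/report.pdf",
]
blocked = blocked_urls(ROBOTS_TXT, candidates)
print(blocked)
```

A crawler that honours robots.txt would skip the URLs in `blocked` and could list them in a report so that curators know what was left out of the capture.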

Page 18: Caught in the Web:   Web Archiving at U of A Libraries

Web Archive Launch Page

Page 19: Caught in the Web:   Web Archiving at U of A Libraries
Page 20: Caught in the Web:   Web Archiving at U of A Libraries
Page 21: Caught in the Web:   Web Archiving at U of A Libraries
Page 22: Caught in the Web:   Web Archiving at U of A Libraries
Page 23: Caught in the Web:   Web Archiving at U of A Libraries

Exposing Hidden Content

Page 24: Caught in the Web:   Web Archiving at U of A Libraries

U of A Web Archive

Partner with Internet Archive on the use of Archive-It

Three targets (criteria: thematic, regional, event-based, organizational):
1) Heritage Community Foundation (collection at risk)
2) University of Alberta websites
3) Western Canadian materials (e.g. political websites)

Page 25: Caught in the Web:   Web Archiving at U of A Libraries

A few resources

University of Alberta Web Archive: <www.archive-it.org/home/ualwebarchive>
Archive-it! and Wayback Machine: <www.archive.org/web/web.php>
IIPC – International Internet Preservation Consortium
Use Cases for Access to Internet Archives, IIPC Access Working Group, <netpreserv.org>
Special Collections in ARL Libraries, Report, March 2009
GoC Web Archive: <http://www.collectionscanada.gc.ca/webarchives/index-e.html>

Page 26: Caught in the Web:   Web Archiving at U of A Libraries

thanks

Geoff Harder
Digital Initiatives
[email protected]

Kenton Good
Web Development
[email protected]