The Web is a Mess: or How I Learned to Stop Worrying and Love Web Archiving Lori Donovan, Internet Archive

Post on 27-Dec-2015

Page 1: The Web is a Mess: or How I Learned to Stop Worrying and Love Web Archiving Lori Donovan, Internet Archive

The Web is a Mess: or How I Learned to Stop Worrying and

Love Web Archiving

Lori Donovan, Internet Archive

Page 2

We are a Digital Library

Mission Statement: Universal access to all knowledge

o Founded by Brewster Kahle in San Francisco, California in 1996

o Largest publicly available web archive in existence

o Officially designated a Library by the State of California in 2007

About Internet Archive

Page 3

What is Web Archiving?

The goal of web archiving is to document changes to web resources over time, archive them and make them accessible.

Page 4

What is a Web Archive?

A web archive is a collection of archived URLs grouped by theme, event, subject area, or web address.

A web archive contains as much as possible from the original resources. It is a priority to recreate the same experience a user would have had if they had visited the live site.
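The definition above can be made concrete with a small sketch. It is an illustration only, not how Internet Archive or Archive-It store data: the `Capture` and `Collection` names, the text timestamps, and the SHA-1 comparison are all assumptions introduced here.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Capture:
    """One archived copy of a URL at a point in time (hypothetical model)."""
    url: str
    timestamp: str  # e.g. "20090601"; sorts correctly as plain text
    body: bytes

    @property
    def digest(self) -> str:
        # Fingerprint of the content, used to compare captures.
        return hashlib.sha1(self.body).hexdigest()


class Collection:
    """Archived URLs grouped under one theme, event, or subject area."""

    def __init__(self, theme: str) -> None:
        self.theme = theme
        self._captures: dict[str, list[Capture]] = {}

    def add(self, capture: Capture) -> None:
        self._captures.setdefault(capture.url, []).append(capture)

    def history(self, url: str) -> list[Capture]:
        """All captures of a URL, oldest first."""
        return sorted(self._captures.get(url, []), key=lambda c: c.timestamp)

    def changes(self, url: str) -> list[str]:
        """Timestamps at which the content differs from the previous capture."""
        changed = []
        previous = None
        for capture in self.history(url):
            if previous is not None and capture.digest != previous:
                changed.append(capture.timestamp)
            previous = capture.digest
        return changed
```

Keeping every capture, rather than only the latest one, is what lets the collection show how a resource changed over time.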

Page 5

Why Web Archiving?

o Billions of people around the world have grown accustomed to using the web as their primary resource to acquire information.

o The web is a crucial part of our culture and our social fabric, and we don’t want to lose any of it, so it is essential that we collect and preserve these digital resources and make them accessible in creative ways.

o The availability of this digital information is taken for granted and it is a fallacy that if something is on the web it will be there forever.

Page 6

Limited lifespan of a webpage

It is a fairly common misconception that content that exists on the web will remain there forever.

Estimates of the average lifespan of a webpage vary:

A report in Scientific American claims 44 days.

A subsequent academic study in IEEE suggests 75 days.

A Washington Post article indicates the number is 100 days.

Over 95% of government information today is born-digital. But less than 50% is being maintained with an active preservation plan.

Source: State of the Federal Web Report

Page 7

Historically important events for researchers and scholars

Much of the record of any historic event in today’s world is “born digital.” And many items born in print are also available in digital form, or soon will be. To understand major world events—not only disasters but political upheavals—and to keep a record and a memory of them for survivors, for scholars, for policy-makers, and for a wider public, it is simply essential that we collect and preserve these digital resources and make them accessible in creative ways.

Andrew Gordon, Harvard University.

Page 8

It’s a requirement.

o Records Retention policy. Several state and federal laws or policies require universities to maintain various statistics and reports.

o Responsibility: preserve things like course information, course roster information and policies — documents now showing up only as digital content

Page 9

The Role of Libraries

o Libraries and archives have long collected information that serves scholars and the general public in understanding history, culture, and society.

o So much of today's information is easily (and only) found on the world wide web -- web pages have replaced hard copy records and documents, blogs are today's diaries, and newspapers and socio-political commentary exist solely online.

o As part of an effort to appropriately document and capture today's information for tomorrow's use, institutions must adopt a web archiving strategy.

o However, for many institutions, capturing and storing web pages, websites, or entire web domains is a daunting prospect.

Page 10

o First deployed in February 2006

o Web based application allowing users to create, manage and preserve collections of digital content

o Includes tools for selection and scoping, harvesting, cataloging with metadata, full text search, and QA

o Ability to capture content using 10 different crawl frequencies

o Archived content includes: html, videos, audio, PDF, images, social networking sites, online newspapers

o View archived content within 24 hours after a capture is complete

o Annual subscription service, includes hosting, access and storage (primary and back-up)

About Archive-It

Page 11

205 partners around the world in 43 U.S. States and 15 countries

Who Uses Archive-It?

Page 12

How Partners Use Archive-It

Page 13

o Essential part of a mandate to capture and preserve institutional memory and history. Construct an historical record of an institution’s web presence over time.

o Capture state/local agency publications that aren’t being deposited in print form. Collect and aggregate state/local government websites and presence.

o Capture websites that relate to historical/traditional collections and link them with existing collections around the same thematic focus.

o Create a thematic/topical web archive on a specific subject or event, including different perspectives and social commentary (tweets, blogs, comments). Gather thematically related resources of value to researchers and scholars.

o Support an electronic records system to meet records retention requirements.

o Closure crawls

Archive-It Use Cases

Page 14

Stanford University/New York University

Islamic & Middle Eastern Collection

Purpose: harvest and preserve Iranian Blogs

o Archiving 300+ blogs written by and for Iran and the Iranian people

o Includes coverage of 2009 Iranian elections and the current Middle East unrest

Page 15

Stanford University/New York University

Islamic & Middle Eastern Collection

Page 16

Page 17

University of Texas at Austin: LAGDA

Purpose: Archive documents from 18 different countries, 300 government ministries/presidencies.

Content includes:

o Full-text versions of official documents

o Original video and audio recordings of key regional leaders

o Thousands of annual and "state of the nation" reports

o Specific collections for Latin American elections and political parties

Page 18

University of Texas at Austin: LANIC

Honduras Presidential site 2008 (before the Coup)

Page 19

University of Texas at Austin: LANIC

Honduras Presidential site 2009 (during the Coup)

Page 20

University of Texas at Austin: LANIC

Honduras Presidential site (after the Coup)

Page 21

Purpose: archive born-digital literature (works created explicitly for the computer).

o ELO seeks to foster and promote the reading, writing, teaching, and understanding of literature as it develops in a digital environment

o Content includes: individual works, collections and journals, poems and stories

Electronic Literature Organization

Page 22

Electronic Literature Organization

Page 23

Indiana University

Purpose: archive all university records to maintain strong electronic records systems

o Main university website, 8 different campus websites, and other organizations on campus: university culture, teacher blogs, student groups, and online publications

Page 24

Indiana University

Main University website

Page 25

Columbia University

Purposes: Archive copies of its university web presence in order to meet required mandates

Archive websites on thematic/topical subjects.

Page 26

Columbia University Human Rights Collection

Page 27

Columbia University

Avery Architectural & Fine Arts Library

Page 28

Columbia University Archives Collection

Page 29

North Carolina State Archives & State Library of North Carolina

Purpose: archive state agency websites and publications

o Includes pages in a variety of formats: text, images, audio, video and social networking sites

Page 30

North Carolina State Archives & State Library of North Carolina

Page 31

Access to Collections

Partners:

o Can view through private web application with login/password

General Public:

o Can view from Archive-It website: http://www.archiveit.org/

o Can view from organization’s website from a landing page that links back to Archive-It hosted data

o Host from organization’s own servers

- Restricted and private access options are available

Page 32

What’s next for Archive-It

Collaboration and Partnerships

Web application development

o Continue to develop features and functionalities requested by partners

o Enhance our preservation policy/access model

o Integrate our data with partner’s external services, systems and catalogs

Page 33

Thank you!

Lori Donovan

Partner [email protected]

Questions?