Trends in Use of Pandora Archive. Presentation at IIPC Open Day "The Broad Value of Web Archives", 30th April, 2012, Library of Congress. Monica Omodei, Director, Web Archiving and Digital Preservation, National Library of Australia, momodei @ nla.gov.au

Upload: national-library-of-australia

Post on 20-Jun-2015


TRANSCRIPT

Page 1: Pandora

Trends in Use of Pandora Archive

Presentation at IIPC Open Day “The Broad Value of Web Archives”

30th April, 2012, Library of Congress

Monica Omodei, Director, Web Archiving and Digital Preservation, National Library of Australia

momodei @ nla.gov.au

Page 2: Pandora

About the Pandora Archive

• Selective, collaborative approach
  – High value, discrete, timely collecting
  – A number of partners contribute to Pandora
• Targeted Australian content
  – Selection policy; nominations are reviewed
• Historical: started 1996
• Bibliocentric approach
  – Archived sites/publications are fully catalogued
• Publicly accessible
  – Full content keyword search through national resource discovery service trove.nla.gov.au
  – Browse is of reconstituted version of original site
  – Metadata indexed in Google

Page 3: Pandora

Pandora Archive Stats

• Size: 6.32 TB
• Number of files > 140 million
• Number of 'titles' > 30.5K
• Number of title instances > 73.5K

Page 4: Pandora

Whole domain archive

• We have also commissioned the IA to crawl the .au domain for us annually since 2005
• Legislation prevents us from making this accessible yet
• Hopefully soon we will be able to allow access to researchers

Page 5: Pandora

Australian web domain crawls

Year   Files         Hosts crawled   Size (TB)
2005   185 million     811,523        6.69
2006   596 million   1,046,038       19.04
2007   516 million   1,247,614       18.47
2008   1 billion     3,038,658       34.55
2009   765 million   1,074,645       24.29
2011   660 million   1,346,549       30.71

Page 6: Pandora
Page 7: Pandora
Page 8: Pandora
Page 9: Pandora
Page 10: Pandora
Page 11: Pandora

The Bad News

• We have no legal deposit legislation for electronic publications, so permission to archive must be obtained
  – Significant content missed because permission to copy refused
• QA and fixing process can be labour intensive
  – Technical infrastructure ten years old
• Selection guidelines outdated and don't align
• Significant content missed because of resourcing constraints and high labour cost
• Search and browse functionality very limited
  – No URL search, no time-based searching
• Current infrastructure doesn't scale for broader themed collections with multiple sites or for domain-scale archiving

Page 12: Pandora

Glass half full

• Situation will improve markedly if Legal Deposit provisions are extended to digital publications
  – The Australian Attorney-General has released a consultation paper with a model for this extension
• Broader coverage will be achieved when infrastructure is upgraded, improving scalability and reducing labour costs for QA/fixing
  – We have commenced a multi-year Digital Library Infrastructure Replacement Project which includes upgrading our web archiving tools
  – We are currently trialling Heritrix for collaborative thematic collecting, and Wayback for access to our commissioned .gov.au sub-domain archive

Page 13: Pandora

DLIR Project

• Digital Library Infrastructure Replacement
• RFP was followed by RFT for components where reasonable solutions had been proposed (including core repository)
• The RFT evaluation recommended proceeding to contract negotiations with the selected tenderer for each component
• Currently preparing a submission for ministerial approval prior to contract negotiations with vendors

Page 14: Pandora

Patterns of Use

• Which archived sites are popular and why?
• Is use of our archive growing?
• What is the relative interest in older vs more recent captures?
• Who is using our archives?
• And what for?

Page 15: Pandora

Which archived sites are popular?

• Data source: filtered, aggregated web access log data which counts access to "titles"
• Examined top 30 archived titles (by number of accesses) for each year 2009 to 2012
• Selected some to examine and speculate as to why they might be popular
• Included consistently high-ranking titles, and ones that were very variable between years
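The per-year ranking described on this slide can be sketched as follows. This is a minimal illustration, not the Library's actual tooling: it assumes access logs in Common Log Format, assumes archived pages sit under a path of the form /pan/<title id>/..., and treats "filtered" as simply dropping malformed lines and non-200 responses. The function name is hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumed Common Log Format line, e.g.
# 1.2.3.4 - - [12/Mar/2010:10:00:00 +1100] "GET /pan/10421/20090101-0000/index.html HTTP/1.1" 200 512
LOG_LINE = re.compile(
    r'\[\d{2}/\w{3}/(?P<year>\d{4}):[^\]]*\] "GET (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)
# Assumed URL layout for archived content: /pan/<title id>/<instance>/...
TITLE_PATH = re.compile(r"^/pan/(?P<title>\d+)/")


def top_titles_by_year(lines, n=30):
    """Count successful accesses per title for each year; return the top n per year."""
    counts = defaultdict(Counter)
    for line in lines:
        m = LOG_LINE.search(line)
        if not m or m.group("status") != "200":
            continue  # drop malformed lines and unsuccessful requests
        t = TITLE_PATH.match(m.group("path"))
        if t:
            counts[int(m.group("year"))][t.group("title")] += 1
    return {year: c.most_common(n) for year, c in counts.items()}
```

Comparing the returned rankings across years would surface both the consistently high-ranking titles and the ones whose rank varies widely.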

Page 16: Pandora

Reasons for popularity of archived version

• Were once popular and are now decommissioned, particularly if domain name continues to exist and redirects to the archive
• May not be that popular as live sites but their live site links prominently to Pandora as an archive for their content
• Popular referencing sources cite the archive as well as the live site (if it still exists)

Page 17: Pandora
Page 18: Pandora
Page 19: Pandora
Page 20: Pandora
Page 21: Pandora
Page 22: Pandora
Page 23: Pandora
Page 24: Pandora
Page 25: Pandora
Page 26: Pandora
Page 27: Pandora

Conclusions

• Be more proactive in identifying unresponsive domains
• Market automatic redirect services to web site owners/managers
• Allow Google to index archive content for sites which are no longer 'live'

Page 28: Pandora

Is use of Pandora growing?

Annual access figures for Pandora Web Site and Archive

NB: robots.txt was not introduced on the site until 2005. A web site design change in 2008 affected the measure downward.

Page 29: Pandora

Interest in older vs recent content

• Filtered access logs by reference from the entry page to the archived instance
• Aggregated accesses by age (year) of archived instance
• Added number of instances of that age in the archive as a reference
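A minimal sketch of the aggregation above, assuming each filtered access has already been reduced to the capture year of the instance it reached, and that the capture year of every instance held in the archive is available as a list. The function name and data shapes are hypothetical.

```python
from collections import Counter


def accesses_by_capture_year(accessed_years, held_years):
    """Return (capture year, accesses, instances held) rows, oldest year first.

    accessed_years: one capture year per access to an archived instance
    held_years: one capture year per instance held in the archive
    """
    accesses = Counter(accessed_years)
    holdings = Counter(held_years)
    years = sorted(set(accesses) | set(holdings))
    # A Counter returns 0 for years it has not seen, so gaps show up as zeros.
    return [(year, accesses[year], holdings[year]) for year in years]
```

Keeping the holdings count beside the access count lets a reader judge whether a year attracts many accesses simply because the archive holds many instances from that year.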

Page 30: Pandora

Age of instances accessed

Page 31: Pandora

Who is using the archive?

• Online survey linked to from search service: approx. 450 respondents
• Age, gender, location, education
• How did they arrive?
• What type of information and for what purpose?
• Is it still available on the live web?

Page 32: Pandora

But first an anecdote

Article in major newspaper – quote:

"WE at Spring Loaded are no conspiracy theorists, but the disappearance of Liberal Party policies is curious. First went the policy documents. A recent revamp of the website saw the pre-election press releases go. But thanks to the National Library of Australia’s Internet archive, many of the policies can be seen at http://pandora.nla.gov.au. When Spring Loaded asked about the missing policies, the Liberal Party said there was “nothing untoward”."

Page 33: Pandora

Examples of lost web sites

• Qantas' own special web site presenting their case during the major dispute with pilots', engineers' and cabin crew unions that grounded the airline in 2011
• Jeff Kennett's campaign web site in the 1999 Victorian State election, the first use of the web by a politician during a campaign in Australia

Page 34: Pandora
Page 35: Pandora
Page 36: Pandora
Page 37: Pandora

About the respondents

Page 38: Pandora

How did they arrive?

Page 39: Pandora

What information was sought?

Page 40: Pandora

What for?

Page 41: Pandora

Other questions

• Did you realise that you were going to enter an archived version of a web site, not the live one? (60% yes, 40% no)
• Was the resource you were looking for no longer available on the live web? (50% yes, 50% no)
• Have you visited other web archives? (60% yes, 40% no)

Page 42: Pandora

Conclusions

• We need to market our archive better
• Promote redirects for closing, unsupported web sites
• Convert archives to ARC/WARC so the Memento API will find content
• Allow Google indexing of content for archived web sites where the live version is extinct or substantially altered
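For context on the Memento point, the sketch below shows roughly what a Memento-style lookup looks like from the client side: the client asks a TimeGate for the capture of an original URL nearest a chosen date by sending an Accept-Datetime header. The TimeGate address is a placeholder, not a real Pandora or NLA endpoint, and the function name is hypothetical. The request is built but not sent.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request

# Hypothetical TimeGate endpoint, used for illustration only.
TIMEGATE = "http://timegate.example.org/timegate/"


def memento_request(original_url, when):
    """Build (without sending) a TimeGate request for original_url near `when`."""
    # Accept-Datetime carries an HTTP-style date expressed in GMT.
    stamp = format_datetime(when.astimezone(timezone.utc), usegmt=True)
    return Request(TIMEGATE + original_url, headers={"Accept-Datetime": stamp})


req = memento_request(
    "http://www.example.com/", datetime(2012, 4, 30, tzinfo=timezone.utc)
)
```

A Memento-aware archive would normally answer such a request by redirecting to the capture closest to the requested date, which is why content needs to be held in a form that such a service can index.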