CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

CLEAR: a Credible Live Evaluation Method of Website Archivability
Vangelis Banos 1, Yunhyong Kim 2, Seamus Ross 2, Yannis Manolopoulos 1
3 SEPT 2013 ▪ LISBON
1 Department of Informatics, Aristotle University, Thessaloniki, Greece
2 University of Glasgow, United Kingdom
ARCHIVEREADY.COM

Upload: vangelis-banos

Post on 25-Dec-2014


DESCRIPTION

Abstract: Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that a website could be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.

TRANSCRIPT

Page 1: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

CLEAR: a Credible Live Evaluation Method of Website Archivability

Vangelis Banos1, Yunhyong Kim2, Seamus Ross2, Yannis Manolopoulos1

3 SEPT 2013 ▪ LISBON

1 Department of Informatics, Aristotle University, Thessaloniki, Greece
2 University of Glasgow, United Kingdom

ARCHIVEREADY.COM

Page 2: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Table of Contents
1. Problem definition and related work,
2. Our contributions,
3. Website Archivability,
4. CLEAR: A Credible Live Method to Evaluate Website Archivability,
5. Demonstration: http://archiveready.com/,
6. Limitations and Future Work.

Page 3: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Problem definition
• Web content acquisition is a critical step in the process of web archiving;
• If the initial Submission Information Package lacks completeness and accuracy for any reason (e.g. missing or invalid web content), the rest of the preservation processes are rendered useless;
• There is no guarantee that web bots dedicated to retrieving website content can access and retrieve it successfully;
• Web bots face increasing difficulties in harvesting websites.

Page 4: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Problem definition
• After web harvesting, administrators manually review the content and endorse or reject the harvested material.
• Web harvesting is automated, while Quality Assurance (QA) is manual.
• Efforts to deploy crowdsourced techniques to manage QA provide an indication of how significant the bottleneck is.

Page 5: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

There is a need for a method to assess website archive readiness in order to support the web archiving workflow.

Inspired by our work at http://blogforever.eu, building a blog preservation software platform.

Page 6: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Our Contributions
1. The introduction of the notion of Website Archivability,
2. The definition of the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure Website Archivability,
3. ArchiveReady.com, a web application which implements the proposed method.

Page 7: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Our Aims
1. Provide a mechanism to improve the quality of web archives.
2. Expand and optimize the knowledge and practices of web archivists, supporting them in their decision making and risk management.
3. Standardize the web aggregation practices of web archives, especially QA.
4. Foster good practices in web development that make sites more amenable to harvesting, ingesting, and preserving.
5. Raise awareness among web professionals regarding preservation.

Page 8: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

What is Website Archivability?

Website Archivability captures the core aspects of a website crucial in diagnosing whether it has the potentiality to be archived with completeness and accuracy.

Attention! It must not be confused with website dependability, reliability, availability, safety, security, survivability or maintainability.

Page 9: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

CLEAR: A Credible Live Method to Evaluate Website Archivability
• An approach to producing on-the-fly measurement of Website Archivability,
• Web archives communicate with target websites via standard HTTP,
• Information such as file types, content and transfer errors could be used to support archival decisions,
• We combine this kind of information with an evaluation of the website's compliance with recognised practices in digital curation,
• We generate a credible score representing the archivability of target websites.

Page 10: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

CLEAR: A Credible Live Method to Evaluate Website Archivability

Archivability Facets:
• Accessibility
• Cohesion
• Standards Compliance
• Performance
• Metadata

Page 11: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013


Website attributes evaluated using CLEAR

Page 12: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

CLEAR
• The method can be summarised as follows:
1. Perform specific Evaluations on Website Attributes.
2. Calculate each Archivability Facet's score from these Evaluations:
   • Scores range from 0% to 100%,
   • Not all evaluations are equal: if an important evaluation fails, score = 0%; if a minor evaluation fails, score = 50%.
3. Produce the final Website Archivability as the equally weighted sum (that is, the mean) of all Facets' scores.
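The scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the ArchiveReady.com implementation: the function name `facet_score` is hypothetical, and it assumes that every evaluation inside a facet carries the same weight, which is consistent with the totals shown for the iPRES 2013 website example later in this deck.

```python
def facet_score(ratings):
    """Return a facet score as the plain mean of its evaluation ratings.

    Assumption (not stated on the slide): all evaluations inside one
    facet are weighted equally. Ratings are percentages from 0 to 100.
    """
    if not ratings:
        raise ValueError("a facet needs at least one evaluation")
    return sum(ratings) / len(ratings)


# Evaluation ratings copied from the iPRES 2013 website example slides.
accessibility = [50, 50, 0, 100]  # RSS feed, robots.txt, sitemap.xml, links
cohesion = [0, 80, 100, 100]      # scripts, images, proprietary files, CSS

print(facet_score(accessibility))  # 50.0
print(facet_score(cohesion))       # 70.0
```

Both results match the facet totals reported on the Accessibility and Cohesion slides (50% and 70%).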

Page 13: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Accessibility


Page 14: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Accessibility
• A website is considered accessible only if web crawlers are able to visit its home page, traverse its content and retrieve it via standard HTTP requests.
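One ingredient of crawler access is whether the site's robots.txt permits a given bot to fetch a URL. The sketch below shows such a check with Python's standard `urllib.robotparser`; it is only an illustration of the idea, not the evaluation CLEAR performs. The helper name `crawler_may_fetch`, the user agent string and the robots.txt content are all made up for the example, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser


def crawler_may_fetch(robots_txt, user_agent, url):
    """Tell whether robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(crawler_may_fetch(ROBOTS, "example-archiver", "http://example.org/"))
# True
print(crawler_may_fetch(ROBOTS, "example-archiver",
                        "http://example.org/private/page.html"))
# False
```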

Page 15: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Accessibility

Facet          Evaluation           Rating   Total
Accessibility  No RSS feed          50%      50%
               No robots.txt        50%
               No sitemap.xml       0%
               6 links, all valid   100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013

Page 16: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Cohesion


Page 17: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Cohesion
• Relevant to:
   • Efficient operation of web crawlers,
   • Management of dependencies within digital curation.
• If files constituting a single website are dispersed across different web locations, acquisition and ingest are likely to suffer if one or more of those locations fail.
• Changes that occur outside the website are not going to affect it if it does not use 3rd party resources.
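A cohesion check of this kind can be sketched as the share of a page's resources that are served from the page's own host. The code below is an assumption-laden illustration: the function name `local_share` and the image paths are hypothetical, and the rule "rating = percentage of local resources" is inferred from the example ratings in this deck (4 local and 1 external images rated 80%), not quoted from the method's definition.

```python
from urllib.parse import urljoin, urlparse


def local_share(page_url, resource_urls):
    """Percentage of resources served from the same host as the page."""
    if not resource_urls:
        # Assumption: no resources at all means nothing is external.
        return 100.0
    page_host = urlparse(page_url).netloc.lower()
    local = 0
    for ref in resource_urls:
        # Resolve relative references against the page URL first.
        absolute = urljoin(page_url, ref)
        if urlparse(absolute).netloc.lower() == page_host:
            local += 1
    return 100.0 * local / len(resource_urls)


PAGE = "http://ipres2013.ist.utl.pt/"
# Hypothetical resource references: 4 local images and 1 external image.
images = [
    "img/logo.png",
    "/img/a.jpg",
    "img/b.jpg",
    "http://ipres2013.ist.utl.pt/img/c.png",
    "http://cdn.example.com/banner.png",
]
scripts = ["http://cdn.example.com/lib.js"]  # 1 external, no local scripts

print(local_share(PAGE, images))   # 80.0
print(local_share(PAGE, scripts))  # 0.0
```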

Page 18: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Cohesion

Facet     Evaluation                                  Rating   Total
Cohesion  1 external and no internal scripts          0%       70%
          4 local and 1 external images               80%
          No proprietary (Quicktime & Flash) files    100%
          1 local CSS file                            100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013

Page 19: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Metadata


Page 20: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Metadata
• The adequate provision of metadata has been a continuing concern within digital curation.
• The lack of metadata impairs the archive’s ability to manage, organise, retrieve and interact with content effectively.

Page 21: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Metadata
1. What is it
2. How is it calculated

Facet     Evaluation                       Rating   Total
Metadata  Meta description found           100%     87%
          HTTP Content type                100%
          HTTP Page expiration not found   50%
          HTTP Last-modified found         100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
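The HTTP-related rows of this table can be illustrated with a small header check. This is a sketch under stated assumptions: the function name `rate_metadata_headers` is hypothetical, "page expiration" is assumed to correspond to the `Expires` response header, and a missing header is assumed to rate 50% in every case because that is the only "not found" rating the slide shows.

```python
def rate_metadata_headers(headers):
    """Rate the presence of selected HTTP response headers.

    Assumptions: a found header rates 100 and a missing one rates 50,
    mirroring the "Page expiration not found: 50%" row on the slide;
    "page expiration" is taken to mean the Expires header.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    present = {name.lower() for name in headers}
    checked = ["Content-Type", "Expires", "Last-Modified"]
    return {name: 100 if name.lower() in present else 50 for name in checked}


# Hypothetical response headers, for illustration only.
response_headers = {
    "content-type": "text/html; charset=utf-8",
    "Last-Modified": "Tue, 23 Apr 2013 10:00:00 GMT",
}

ratings = rate_metadata_headers(response_headers)
print(ratings)
# {'Content-Type': 100, 'Expires': 50, 'Last-Modified': 100}

# Adding the "Meta description found" rating of 100 from the table gives
# a mean of 87.5, which the slide reports as 87%.
print((100 + sum(ratings.values())) / 4)  # 87.5
```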

Page 22: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Performance


Page 23: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Performance
Performance is an important aspect of web archiving. The throughput of data acquisition of a web spider directly affects the number and complexity of web resources it is able to process.

Facet        Evaluation                                 Rating   Total
Performance  Average network response time is 0.546ms   100%     100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013
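Measuring an average response time can be sketched as follows. The helper `average_response_time` and the stand-in `fake_fetch` are hypothetical names; the slides do not say how many requests are averaged or how the time maps to a rating, so the sketch only measures and does not rate.

```python
import time


def average_response_time(fetch, url, attempts=3):
    """Average wall-clock seconds taken by fetch(url) over several attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    total = 0.0
    for _ in range(attempts):
        started = time.perf_counter()
        fetch(url)
        total += time.perf_counter() - started
    return total / attempts


def fake_fetch(url):
    """Stand-in for a real HTTP request so the example runs offline."""
    time.sleep(0.01)


avg = average_response_time(fake_fetch, "http://example.org/")
print(f"average response time: {avg * 1000:.1f} ms")
```

In real use, `fetch` could be any callable that performs the request, for example one built on `urllib.request.urlopen`.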

Page 24: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Standards Compliance

Page 25: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Standards Compliance
• Compliance with standards is a recurring theme in digital curation practices. It is recommended that digital resources to be preserved are represented in known and transparent standards.

Page 26: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Standards Compliance

Facet                 Evaluation                                  Rating   Total
Standards Compliance  1 Invalid CSS file                          0%       87%
                      Invalid HTML file                           0%
                      Meta description found                      100%
                      No HTTP Content encoding                    50%
                      HTTP Content Type found                     100%
                      HTTP Page expiration found                  100%
                      HTTP Last-modified found                    100%
                      No Quicktime or Flash objects               100%
                      5 images found and validated with JHOVE     100%

http://ipres2013.ist.utl.pt/ Website Archivability evaluation on 23rd April 2013

Page 27: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

iPRES 2013 Website Archivability Evaluation

Facet                 Rating   Website Archivability
Accessibility         50%      77%
Cohesion              70%
Standards Compliance  77%
Metadata              87%
Performance           100%
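As a worked check, the overall figure follows from the five facet ratings listed on this slide if all facets are weighted equally (the symbols $WA$ and $F_i$ are introduced here for illustration only):

```latex
\[
WA = \frac{1}{5}\sum_{i=1}^{5} F_i
   = \frac{50 + 70 + 77 + 87 + 100}{5}
   = \frac{384}{5}
   = 76.8\% \approx 77\%
\]
```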

Page 28: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

ArchiveReady.com Demonstration

- Web application implementing CLEAR,
- Web interface and a Web API in JSON,
- Running on Linux, Python, Nginx, Redis, MySQL.

Page 29: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013


Page 30: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Impact

1. Web professionals
   - evaluate the archivability of their websites in an easy but thorough way,
   - become aware of web preservation concepts,
   - embrace preservation-friendly practices.

2. Web archive operators
   - make informed decisions on archiving websites,
   - perform large scale website evaluations with ease,
   - automate web archiving Quality Assurance,
   - minimise wasted resources on problematic websites.

Page 31: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

Limitations & Future Work
1. It is not optimal to treat all Archivability Facets as equal.
2. Only a single web page is evaluated, based on the assumption that web pages from the same website share the same components and standards. Sampling would be necessary.
3. Certain classes and specific types of errors create lesser or greater obstacles to website acquisition and ingest than others. The method needs to be enhanced to reflect this differential valuing of error classes and types.

Page 32: CLEAR: a Credible Live Evaluation Method of Website Archivability, iPRES2013

THANK YOU
Vangelis Banos
Web: http://vbanos.gr/
Email: [email protected]

ANY QUESTIONS?

The research leading to these results has received funding from the European Commission Framework Programme 7 (FP7), BlogForever project, grant agreement No. 269963.