the iipc web curator tool -...

27
1 1 The IIPC Web Curator Tool: Steve Knight The National Library of New Zealand Philip Beresford and Arun Persad The British Library An Open Source Solution for Selective Web Harvesting 2 What is the Web Curator Tool? How does it work? Who built it? Where can I get it? Today’s Topics

Upload: hatuyen

Post on 01-Apr-2018

222 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

1

1

The IIPC Web Curator Tool:

Steve Knight

The National Library of New Zealand

Philip Beresford and Arun Persad

The British Library

An Open Source Solution for

Selective Web Harvesting

2

• What is the Web Curator Tool?

• How does it work?

• Who built it?

• Where can I get it?

Today’s Topics

Page 2: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

2

3

What is the WCT?

• The Web Curator Tool (WCT) is a tool for managing

the selective web harvesting process.

• It is designed for use in libraries by non-technical users.

• Heritrix is used to download web material (but technical

details are handled behind the scenes by system

administrators).

4

What does it do?

• The WCT supports:

– Harvest Authorisation: getting permission to harvest web

material and make it available.

– Selection, scoping and scheduling: what will be harvested,

how, and how often?

– Description: Dublin Core metadata.

– Harvesting: Downloading the material at the appointed time

with the Heritrix web harvester deployed on multiple machines.

– Quality Review: making sure the harvest worked as expected,

and correcting simple harvest errors.

– Submitting the harvest results to a digital archive.

Page 3: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

3

5

What is it NOT?

• It is NOT a digital archive or document repository

– It is not appropriate for long-term storage

– It submits material to an external archive

• It is NOT an access tool

– It does not provide public access to harvested material

– (But it does let you review your harvests)

– You should use Wayback or WERA as access tools

6

What is it NOT?

• It is NOT a cataloguing system

– It does allow you to record external catalog numbers

– And it does allow you to describe harvests with Dublin Core

metadata

• It is NOT a document management system

– It does not store all your communications with publishers

– But it may initiate these communications

– And it does record the outcome of these communications

Page 4: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

4

7

The Web Curator Tool Project

• A joint project undertaken by National Library of New Zealand

and the British Library

• Each partner provided around half the funding

• Each partner contributed personel to the project team

8

The Web Curator Tool Project

• The project was initiated by NLNZ and BL under the auspices

of the IIPC Content Management Working Group

– Including requirements from several IIPC members

– BL project team members visted New Zealand for design workshops

in April 2006

– Part of the project’s success has been the speed of the design and

development phases

Page 5: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

5

9

Project Objectives

• Design and build a Web Curator Tool that:

– meets the needs of the National Library of New Zealand

– meets the needs of the British Library

– is modular and can be extended to meet the needs of IIPC membersand other organisations engaging in web harvesting

– manages permissions, selection, description, scoping, harvesting andquality review

– provides a consistent, managed approach allowing users with limitedtechnical knowledge to easily capture web content for archivalpurposes.

10

Project Timeline

May–July 2006Implementation

July–August 2006User Acceptance Testing

TodayOpen Source Release

March–April 2006Detailed Design

January–February 2006Procurement

November–December

2005

Project Definition, Solution Scope

Software Requirements Specification

To June 2005

To October 2005

IIPC Functional Requirements

IIPC Use Cases

Page 6: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

6

11

Technology

• Implemented in Java

• Runs in Apache Tomcat

• Incorporates parts or all of

– Acegi Security System

– Apache Axis (SOAP data transfer)

– Apache Commons Logging

– Heritrix (version 1.8)

– Hibernate (database connectivity)

– Quartz (scheduling)

– Spring Application Framework

– Wayback

12

More technology

• Platform:

– Tested on Solaris (version 9) and Red Hat Linux

– Developed on Windows

– Should work on any platform that supports Apache Tomcat

• Database:

– A relational database is required

– Tested on Oracle and PostgreSQL

– Installation scripts provided for Oracle and PostgreSQL

– Should work with any database that Hibernate supports

• Including MySQL, Microsoft SQL Server, and about 20 others

Page 7: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

7

13

How does it work?

14

Page 8: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

8

15

16

Harvest Authorisations

• WCT Harvest Authorisation is concerned with:

– Permission to harvest web material

– Permission to make web material accessible to users

– Any and all special conditions that apply

Page 9: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

9

17

Harvest Authorisation• A record is kept of:

– What authorisations have been granted (or rejected)?

• And what special conditions apply?

– Who has authorised us?

• The government, via legislation.

• Publishers, creators and copyright holders.

– Where (i.e. what part of the web) does the authorisation apply?

– When does the authorisation start and end?

• Communications…

– Can be initiated from within the tool,

– Documentation should be stored in your document management

system.

18

Page 10: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

10

19

20

Page 11: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

11

21

22

Page 12: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

12

23

24

Targets• A Target is a portion of the web you want to harvest.

• The Target is the “unit of selection”:

– If there is something you want to harvest and archive and describe,

then it is a Target.

• You can attach a Schedule to a Target to specify when (and

how often) it will be harvested.

– But you can’t harvest until you have permission to harvest,

and you can’t harvest until the selection is approved.

Page 13: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

13

25

26

Page 14: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

14

27

28

Page 15: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

15

29

30

Page 16: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

16

31

Target Instances

• Target instances represent individual harvests that

– Are scheduled to happen, or

– Are in progress, or

– Have finished.

• Target Instances are created automatically for a Target when

that Target is Approved.

– A Target Instance is created for each harvest that has been scheduled.

32

Target Instances - the Queue• Scheduled Target Instances are put in a queue

• When their scheduled start time arrives

1. The WCT allocates the harvest to one of the harvesters

2. The harvester invokes Heritrix and harvests the requested material

3. When the harvest is complete, the User is notified

• Examining the Queue gives you a good idea of the current

state of the system

– The WCT provides a quick view of the instances in the Queue,

including Running, Paused, Queued, and Scheduled Instances

Page 17: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

17

33

Target Instances

34

Target Instances - the User’s view• When a harvest is complete, its Owner is notified

• The Owner (or another User) then has to

1. Quality Review the harvest result to see if it was successful

• Browse Tool: Browse the harvest result to ensure all the content is there

• Prune Tool: Delete unwanted material from the harvest

2. Endorse or Reject the harvest

3. Submit the harvest to an Archive (if it has been endorsed)

• The User view of the Target Instances shows all the instances

that the user owns.

Page 18: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

18

35

36

Page 19: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

19

37

38

Page 20: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

20

39

40

Page 21: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

21

41

42

Page 22: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

22

43

44

Page 23: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

23

45

46

Page 24: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

24

47

48

Page 25: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

25

49

50

Immediate directions

• Streamlining user interface and workflow

• Develop additional quality review tools

• Improve support for other platforms and databases

• Integration with Wayback/WERA for access

Page 26: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

26

51

Future plans

• Development strategy

– Distribute the tool more widely

– Incorporate feedback from wider user bases

– Form a governance group to determine development strategy

• Tool development

– Fix bugs

– Develop new features

– Support new platforms and databases

52

Where can I get it?

• Available from http://webcurator.sourceforge.net

– Source code

– Documentation: user and administrator guides, FAQ

– Mailing lists

• Released under Apache License, Version 2.0

Page 27: The IIPC Web Curator Tool - iwaw.europarchive.orgiwaw.europarchive.org/06/PDF/iwaw06-knight-beresford.pdfThe IIPC Web Curator Tool: ... – Acegi Security System – Apache Axis

27

53

Questions?