August 2005 IFLA - CDNL 1
The International Internet Preservation Consortium
(IIPC)
Synopsis
• The IIPC - what is it?
• Background
• IIPC goals and organisation
• IIPC issues
• IIPC future?
• Concluding remarks
The IIPC - What is it?
• International collaboration for preserving Internet content
• Mission: Acquire, preserve and make accessible Internet (WWW) content for future generations
• 12 participating institutions
– National libraries of: Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden
– The British Library (UK), The Library of Congress (USA) and the Internet Archive (USA)
• Chartered in Paris in July 2003; agreement in effect for 3 years
• Future not decided but IIPC seeks to involve national libraries
• IIPC welcomes inquiries about future membership
Background
• The Internet is a specific medium with attributes of:
– Books, journals, radio, images, video
• Characterised by
– Exponential growth since 1994
– Proliferation
– Immense volume
– Anybody can publish
– Accessible everywhere
Archiving the Web – WHY - Who
Presently and in the future, a large and significant part of our culture will exist ONLY on the Internet
• If Web pages are not collected in an orderly and continuous manner they will disappear, and with them an important part of the world's cultural and intellectual heritage
Therefore we should:
• Preserve material that is only available on the Web
• Preserve scholarly data and secure access to it because it is:
– Important and valuable
– Cited
– Difficult to find and locate
A logical extension of national libraries' mission and goals
LEGAL DEPOSIT LAW
Evolution of Legal Deposit Law in Iceland
[Timeline figure; milestones shown: 1662, 1697, 1886, 1909-1941, 1949, 1977, 2003; label: WWW]
Pre-IIPC Development
• 1996 - Internet Archive, Sweden, Australia
• 1998 – Nordic co-operation
• 2000 - 2003 – LoC, BnF, UK, Austria, Slovenia, Czech Republic, Lithuania, Canada
• IFLA 2002: Brewster Kahle presents the IA and Web archiving
• September 2002 – IA proposes a project with a few libraries
• September 2002 – Meeting in Rome (during ECDL)
• January 2003 – Meeting in Paris (COBRA +)
• July 2003 – IIPC incorporated
IIPC Goals
To build a virtual global distributed collection to ensure that the distributed and linked nature of the original web material is not lost forever
Find a new way of collaborating among national heritage institutions
In order to create a network of heritage institutions
That can build and preserve the global distributed collection
[Diagram labels: Global information space of the Internet; Global Distributed Collection]
IIPC Organisation
• Steering group: one person from each institution
• Working groups
– Access
– Content Management
– Deep Web
– Framework
– Metrics and Testbed
– Researchers Requirements
IIPC Objectives
Collaborative work, within each country's legislative framework, to identify, develop and facilitate implementation of solutions for selecting, collecting, preserving and providing access to internet content
Facilitate international coverage of internet content archive collections within national legal frameworks, in accordance with national collection policies
International advocacy for initiatives that encourage the collection and preservation of, and access to, internet content
Provide a forum for sharing knowledge about internet content archiving both within the Consortium and beyond
Develop and recommend standards
Develop interoperable tools and techniques to acquire, archive and provide access to web sites
Raise awareness of internet preservation issues and initiatives through conferences, workshops, training events and publications
IIPC Results
Intangible
• Common understanding and clarification of issues
• Definition of the overall architecture for web archiving with system interface specifications
• Proposed standards for Web Archive file format and Metadata
• Access requirements with Use cases illustrating common understanding of the functionality of a web archive
• Identification and requirement specification of new access tools
• Curator tool for controlling and scheduling the collection of web content
• Definition of the WARC (Web ARChive) file format to store information blocks harvested by web crawlers
IIPC Results
Tangible
• Heritrix Crawler/Harvester
– "Smart crawling"
– Continuous harvesting
• Full Text Indexer/Search Engine – searching/browsing the content of a Web Archive
• Extract data from an archived database
• ARC file manipulation tool
IIPC Future - Issues
Collection building
• Broad scope: representative collection of the Web
• Narrow scope: in-depth collection of selected sites
Registration
• Cataloguing is not possible
• Indexing of text (with time element)
Access
• Direct, using a URL
• Search engine (Google type)
• Data mining (analytical and statistical methods)
Long-term preservation of a web archive: a conscious omission
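The Registration and Access items above (text indexing with a time element, direct access by URL, search-engine access) can be illustrated with a minimal Python sketch. It is a toy in-memory model, not an IIPC tool; the class name, method names and example URL are all hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


class TinyWebArchiveIndex:
    """Toy in-memory index; every name here is hypothetical."""

    def __init__(self):
        # url -> list of (capture_time, text), kept sorted by capture_time
        self._captures = defaultdict(list)
        # word -> set of (url, capture_time) pairs where the word occurs
        self._words = defaultdict(set)

    def add_capture(self, url, capture_time, text):
        """Register one harvested copy of a page and index its words."""
        captures = self._captures[url]
        captures.append((capture_time, text))
        captures.sort(key=lambda c: c[0])
        for word in text.lower().split():
            self._words[word].add((url, capture_time))

    def get(self, url, as_of):
        """Direct access: latest capture of url taken at or before as_of."""
        captures = self._captures.get(url, [])
        times = [t for t, _ in captures]
        pos = bisect_right(times, as_of)
        return captures[pos - 1] if pos else None

    def search(self, word, start, end):
        """Text search limited to captures taken between start and end."""
        hits = self._words.get(word.lower(), set())
        return sorted((u, t) for u, t in hits if start <= t <= end)


# Usage with made-up captures of a placeholder URL
index = TinyWebArchiveIndex()
index.add_capture("http://example.org/", datetime(2003, 7, 1),
                  "Consortium chartered in Paris")
index.add_capture("http://example.org/", datetime(2005, 8, 1),
                  "Consortium presents results")
print(index.get("http://example.org/", datetime(2004, 1, 1)))
print(index.search("consortium", datetime(2005, 1, 1), datetime(2005, 12, 31)))
```

The point of the sketch is that every entry carries a capture time, so the same URL or search term can return different results depending on the date asked for. A real archive would need persistent storage and far more robust text processing.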
IIPC Future
Current IIPC charter ends in July 2006
Proposals for continuation will be discussed at the next meeting in late October 2005
Challenge is to keep the work focused and effective
Many unsolved problems remain; hopefully new members can help
Concluding remarks
Creating and accessing a Web Archive is:
• Very complex, challenging and exciting - not a problem nor a burden
• Collection – Preservation – Access
The first phase has started
Our knowledge of the Web and its contents is incomplete
All present software and tools must be improved
International cooperation needed to:
• Define and develop standards, techniques and methods
• Create national and even global Web Archives
• Provide access to the archives
EXTRACT FROM THE ARCHIVE
[Diagram labels: ARCHIVE; TXT, SOUND, VIDEO, IMAGE; Create index (one per content type); INDEX DATABASE; INTERNET BROWSER]
National Bibliography - from Print to Digital
[Diagram labels: Books/Journals, Sound Rec., Video, Micro/CDs, Manuscr., Internet, Films; INDEX; Gallery, Archive, Museum, National; Present National Bibliography; National Bibliography reflecting new law; Bibliography of National Cultural Heritage]