Managing the Digitization of Large Press Archives
DESCRIPTION
From the 2014 DLF Forum in Atlanta, GA.
Session Leaders: Bassem Elsayed, Bibliotheca Alexandrina; Ahmed Samir, Bibliotheca Alexandrina
Managing the digitization of press material is a challenge not only in quantity but also in text and material quality, in designing the workflow system that organizes the operations, and in handling the metadata. This challenge has been the focus of the Bibliotheca Alexandrina’s digitization work during the past year, in the course of its partnership with the Center for Economic, Judicial, and Social Study and Documentation (CEDEJ). With more than 800,000 pages of press articles to be digitally preserved and made publicly accessible, a workflow was needed that could manage such a massive collection and handle its attributes proficiently. The endeavor required work on four fronts at once: data analysis of the collection, development of a digitization workflow for the collection, implementation and installation of the software tools needed for metadata entry, and finally publication of the digital archive online for researchers and the public.
The presentation will demonstrate the workflow system being implemented to manage this press collection, which has yielded more than 400,000 pages to date. It will shed some light on the BA’s Digital Assets Factory (DAF), the nucleus upon which the digitization process for the CEDEJ collection has been built. Additionally, the presentation will discuss the tools implemented for ingesting data into the digitization process, from indexing through the creation of the batches that are ingested into the system. The outflow will also be discussed in terms of organizing and grouping multipart press clips, as well as reviewing, validating, and correcting the output.
The presentation will also cover the challenges encountered in pairing the online archive with a powerful search engine that supports multidimensional search while maintaining a user-friendly navigation experience.
TRANSCRIPT
The New Library of Alexandria Overview
Bibliotheca Alexandrina (BA)
Ø Center of excellence in the production and dissemination of knowledge
Ø Place of dialogue, learning and understanding between cultures and peoples
Ø The World’s Window on Egypt
Ø Egypt’s Window on the World
Ø Instrument for Rising to the Challenges of the Digital Age
Ø Center for Dialogue Between Peoples and Civilizations
Not just a Library of Books but rather a vast cultural and scientific complex
A library that can accommodate millions of books
http://archive.bibalex.org
http://descegy.bibalex.org
http://lartarab.bibalex.org
More than 230,000 Arabic books are freely available online for Arabic readers worldwide
http://suezcanal.bibalex.org
http://naguib.bibalex.org/
http://nasser.bibalex.org
http://sadat.bibalex.org
Ø Project Overview
Ø Collection Overview
Ø Data Representation
Ø System Workflow
  - DAF (Digital Assets Factory)
  - Cataloguing
  - Website
    § Solr search engine
    § Article Viewer
Ø The Centre for Economic, Judicial, and Social Study and Documentation (CEDEJ) collaborated with Bibliotheca Alexandrina (BA) to digitize its massive archive of press articles
Ø The project consists of multiple modules to:
  - Index the press archive collection
  - Control the data entry workflow
  - Digitize and process data
  - Catalogue and review articles
  - Publish the archive on the web
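The modules listed above can be pictured as stages that a batch of press clips moves through. The following is only a minimal sketch: it assumes the modules run strictly in sequence, which the slides do not state, and the names `Stage` and `advance` are hypothetical rather than taken from the BA system.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Project modules from the slide, in listed order (names are illustrative)."""
    INDEXED = 1
    DATA_ENTRY = 2
    DIGITIZED = 3
    CATALOGUED = 4
    PUBLISHED = 5


def advance(stage: Stage) -> Stage:
    """Move a batch to the next stage; a published batch stays published."""
    if stage is Stage.PUBLISHED:
        return stage
    return Stage(stage + 1)


# Walk one hypothetical batch from indexing to web publishing.
batch_stage = Stage.INDEXED
history = [batch_stage]
while batch_stage is not Stage.PUBLISHED:
    batch_stage = advance(batch_stage)
    history.append(batch_stage)

print(" -> ".join(s.name for s in history))
```

A real workflow would likely allow a batch to be sent back, for example from review to data entry, so a linear enum is the simplest possible model rather than a faithful one.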
Ø Package of press archive
  - 800,000+ press clips varying between
    § Press
    § Reports
  - 500+ publishers
  - 60,000+ writers and reporters
  - 200 different subjects
    § Economics, politics, social life, etc.
  - Archive languages:
    § Arabic, English and French
  - Date range from 1966 to 2009
Ø Finished so far
  - 115,000 press clips varying between
    § Press
    § Reports
  - 200 publishers
  - 14,000 writers and reporters
  - 100 different subjects
    § Economics, politics, social life, etc.
  - Archive languages:
    § Arabic, English and French
  - Date range from 1966 to 2009
Ø A list of the packaged press archive is submitted to Bibliotheca Alexandrina to be scanned and catalogued
Ø The source of data is a collection of boxes
Ø Each box is organized in the following hierarchy:
  - Folder
  - File
  - Sub-File
  - Document
Ø A Document represents a single page of press
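The box hierarchy described above maps naturally onto nested records. The sketch below is one possible representation, not the project's actual data model: the class and field names (`label`, `name`, `page_id`, `page_count`) and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Document:
    page_id: str  # one scanned page of press (identifier format is hypothetical)


@dataclass
class SubFile:
    name: str
    documents: List[Document] = field(default_factory=list)


@dataclass
class File:
    name: str
    sub_files: List[SubFile] = field(default_factory=list)


@dataclass
class Folder:
    name: str
    files: List[File] = field(default_factory=list)


@dataclass
class Box:
    label: str
    folders: List[Folder] = field(default_factory=list)

    def page_count(self) -> int:
        """Count every Document (single page) anywhere under this box."""
        return sum(
            len(sub_file.documents)
            for folder in self.folders
            for file in folder.files
            for sub_file in file.sub_files
        )


# Build a tiny hypothetical box: one folder, one file, two sub-files.
box = Box("box-001", [
    Folder("economy", [
        File("1975", [
            SubFile("january", [Document("p1"), Document("p2")]),
            SubFile("february", [Document("p3")]),
        ]),
    ]),
])

print(box.page_count())  # 3
```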
Article Creation
Article Metadata
Lookups Management
Reports
Ø Based on Apache Lucene project v4.1
Ø SolrNet API is used to connect to Solr server
Ø Features
  - Simple/Advanced search
  - Results highlighting
  - Fields autocomplete
  - Text search (Article Viewer)
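To make the search and highlighting features concrete, here is a rough sketch of how a highlighted phrase query against Solr's standard `/select` handler could be assembled. The project itself uses SolrNet from .NET; this example only builds the request URL in Python. The core name `articles`, the field name `content`, and the host are assumptions, while `q`, `hl`, `hl.fl`, `rows` and `wt` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_search_url(base_url: str, text: str, rows: int = 10) -> str:
    """Build a Solr /select URL for a phrase search with result highlighting."""
    # Escape backslashes and quotes so the text stays inside one quoted phrase.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    params = {
        "q": f'content:"{escaped}"',  # "content" is a hypothetical field name
        "hl": "true",                 # turn on highlighting
        "hl.fl": "content",           # highlight matches in the same field
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical local Solr core named "articles".
url = build_search_url("http://localhost:8983/solr/articles", "Suez Canal")
print(url)
```

Building the URL separately from sending it keeps the sketch runnable without a Solr server; any HTTP client could then fetch the URL and read the `highlighting` section of the JSON response.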
Ø Article viewer is used for previewing articles
  - It is one of multiple viewers developed at BA
Ø Architecture
  - Server side: RESTful services
  - Client side: JavaScript using JSONP
Ø Features
  - Image preview
  - Metadata preview
  - Text selection
  - Searching/highlighting
  - Zooming options: fit width/height
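With JSONP, the server wraps its JSON payload in a call to a function named by the client, so a script tag on another origin can load it. The sketch below shows that wrapping step in general terms, assuming a server-side helper; `to_jsonp`, the callback name, and the payload fields are hypothetical and do not come from the BA viewer.

```python
import json
import re

# Allow only plain (optionally dotted) JavaScript identifiers as callback
# names, so a request cannot inject arbitrary script into the response.
_CALLBACK_RE = re.compile(r"[A-Za-z_$][\w$]*(\.[A-Za-z_$][\w$]*)*")


def to_jsonp(callback: str, payload: dict) -> str:
    """Wrap a JSON payload in a call to the client-supplied callback."""
    if not _CALLBACK_RE.fullmatch(callback):
        raise ValueError("invalid JSONP callback name")
    return f"{callback}({json.dumps(payload)});"


body = to_jsonp("handleMetadata", {"width": 1200, "height": 1800, "pageCount": 3})
print(body)  # handleMetadata({"width": 1200, "height": 1800, "pageCount": 3});
```

Validating the callback name matters because the response is executed as script by the browser.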
Ø Viewer Web Services
  - Metadata Web Service:
    § Retrieve article catalogue metadata
    § Return technical information (width, height, page count, etc.)
  - Content Web Service:
    § Retrieve the image of each single page in the article, scaling it responsively to a custom width and height
    § Return the selected text based on the user-highlighted area
  - Search Web Service:
    § Perform the search in the content of the articles using Solr engine APIs
    § Highlight the matching phrases in the article image
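One way a service could return the text under a user-highlighted area is to keep a bounding box for every recognized word and return the words whose boxes overlap the selection. The slides do not say how the Content Web Service does this, so the following is purely an illustrative sketch: the per-word boxes, the `Rect` and `Word` types, and the sample coordinates are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    left: int
    top: int
    right: int
    bottom: int

    def intersects(self, other: "Rect") -> bool:
        """True when the two rectangles share any interior area."""
        return (self.left < other.right and other.left < self.right
                and self.top < other.bottom and other.top < self.bottom)


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect  # position of the word on the page image, in pixels


def selected_text(words: List[Word], selection: Rect) -> str:
    """Return the words overlapping the selection, top-to-bottom, left-to-right."""
    hits = [w for w in words if w.box.intersects(selection)]
    # Left-to-right ordering is an assumption; Arabic pages would need
    # right-to-left ordering within each line.
    hits.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in hits)


# Hypothetical word boxes for a two-line snippet.
page_words = [
    Word("Suez", Rect(10, 10, 60, 30)),
    Word("Canal", Rect(70, 10, 130, 30)),
    Word("reopens", Rect(10, 40, 90, 60)),
]

print(selected_text(page_words, Rect(0, 0, 200, 35)))  # Suez Canal
```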