Digitizing newspapers - WordPress.com · 2017-09-13



Digitizing newspapers: A medium-oriented workflow

The Indiana State Library began digitizing newspapers as part of the National Digital Newspaper Program (NDNP), a program administered by the National Endowment for the Humanities and the Library of Congress. Our participation spurred the creation of our own website, Hoosier State Chronicles (HSC), a database of digitized Indiana newspapers composed of our selections for the NDNP, newspapers digitized using LSTA grants, and projects supplied by libraries and history organizations within the state of Indiana. Though Hoosier State Chronicles draws from a number of sources, we follow the standards developed for NDNP as much as possible.

Interested in sharing your newspapers? Contact Chandler Lighty ([email protected]) or Connie Rendfeld ([email protected]).

When digitizing newspapers, you can work with either paper copies or microfilm. Regardless of medium, a project will start and finish the same way.

The first step is to research the title you are interested in digitizing. You need to determine the copyright status of the newspapers and whether the title has already been digitized by another party. Digitization projects require a significant outlay of time or money, and discovering that your efforts duplicated content already available would be disappointing.

The final files will also be the same regardless of starting medium. Hoosier State Chronicles requires a PDF, a JPG2000 (JPEG 2000) image, and an XML file for each page of an issue. Each file type serves a different purpose.

The PDFs are usually PDF/A, a type of PDF where the Optical Character Recognition (OCR) text is embedded in the PDF. These files are available for researchers to download.

The JPG2000 images are used for display in the Hoosier State Chronicles image viewer. They are compressed images but have greater data stability than regular JPGs.

Two different types of XML files are produced. One set is named for each image of the issue and holds the OCR data. The other set is named by the issue date and holds the structural metadata that organizes a collection of images into a single newspaper issue.

A TIFF file will also be created. It isn’t needed for HSC, but your organization should retain it for preservation purposes.

Starting with Microfilm

For multiple rolls, it will be worth your time to work with a business that specializes in microfilm digitization. Companies offer different levels of output, so you’ll need to get cost estimates that include the creation of all of the files mentioned above.

Working with microfilm adds a step to the research phase of the project. As part of your research, audit all the film you wish to send so that you know exactly how many images your vendor will be processing per roll. It is also good to note the dates published and whether there are multiple editions for a date. I set up a spreadsheet where I enter the following information for each issue: the Reel number, Title, Date, Volume number, Issue number, Edition number, Missing pages, Duplicate pages, Total page count, and a Notes field. I use the Notes field to enter any information I think might help at a later date.

You’ll want to have the best possible image for your final product. Film that sees a lot of use develops scratches that can obscure the text. We recommend that libraries obtain new copies of their microfilm and forward them to the vendor for digitization.

In addition to the film, you’ll need to send an external drive for saving all your output files. When scanned, a single microfilm reel can produce 40-50 GB of data, so keep this in mind when deciding what size drive to purchase.

Send the files you receive from your chosen vendor to the Indiana State Library. We edit the files to remove duplicate pages, because we try to present a newspaper as closely as possible to how it appeared when it was originally published.

Starting with Newspapers

The number of microfilmed newspaper titles is small compared to the total number of newspapers published in the United States, so there is a chance that no microfilm exists for a title you wish to digitize. The Indiana State Library has digitized a number of titles from donated bound books and randomly discovered stacks of old newspapers. It is time-consuming but worthwhile to digitize from paper, since these might be the only copies of a newspaper title that still exist.

Scan at 300 dpi or greater. Old newspapers are incredibly fragile. I’ve found it helpful to use a stiff sheet of poster board to move individual newspaper pages from the worktable to the scanner bed.

I run the images through Photoshop, making any corrections, including converting images to grayscale, which helps make faded newspapers more legible.

The Indiana State Library has one workstation equipped with DocWorks, a document processing application that can create metadata output in a number of formats, including the METS/ALTO format favored for digitized newspapers.

The final step is to load all the files onto the server and notify Veridian, who hosts our website, to harvest the files and update the site.
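Because Hoosier State Chronicles expects a PDF, a JPG2000 image, and an XML file for every page, it can be worth checking a finished issue folder for gaps before the files are loaded onto the server. The sketch below is one possible way to do that in Python. The folder layout, the `.pdf`/`.jp2`/`.xml` extensions, and the function and parameter names are assumptions made for illustration; they are not part of any HSC or NDNP specification, so adjust them to match what your scanner or vendor actually produces.

```python
from pathlib import Path

# Assumed extensions for the three per-page deliverables (hypothetical;
# change these to match the files your workflow really creates).
REQUIRED_EXTENSIONS = (".pdf", ".jp2", ".xml")


def find_incomplete_pages(issue_dir, skip_stems=()):
    """Return {page_name: [missing extensions]} for one issue folder.

    Assumes every page's files share a base name, e.g. 0001.pdf,
    0001.jp2, 0001.xml. The issue-level XML (named by issue date) is
    not a page, so pass its base name in skip_stems to ignore it.
    """
    found = {}
    for path in Path(issue_dir).iterdir():
        ext = path.suffix.lower()
        if not path.is_file() or ext not in REQUIRED_EXTENSIONS:
            continue
        if path.stem in skip_stems:
            continue
        found.setdefault(path.stem, set()).add(ext)

    incomplete = {}
    for stem, extensions in found.items():
        missing = [e for e in REQUIRED_EXTENSIONS if e not in extensions]
        if missing:
            incomplete[stem] = missing
    return incomplete
```

Calling `find_incomplete_pages("some_issue_folder", skip_stems={"1895-01-03"})` would return an empty dictionary when every page has all three files. The folder name and date here are made up.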

Using a vendor can hide some of the time costs of digitization. We recently completed a project on behalf of the Yellow Trail Museum in Hope, Ind., digitizing two volumes of the Hope Republican, an 8-page newspaper published weekly. Below is a breakdown of how long it took to accomplish each step.

Time in hours    Task
6-8              Scan 2 volumes
4                Deskew, crop, set grayscale & batch files
8-10             Process one batch in CSS DocWorks

A batch is based on what our DocWorks workstation can comfortably handle. We aim to keep a batch under 100 pages, and we were able to break the 104 issues into 11 batches. Based on the estimates above (6-8 hours of scanning, 4 hours of image preparation, and 8-10 hours of DocWorks processing for each of the 11 batches), we spent 98-122 hours digitizing two years of a weekly newspaper.
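The 98-122 hour figure can be reproduced, and adapted to a project of a different size, with a few lines of Python. This is only a planning sketch. The hour ranges come from the breakdown above, while the function names and the assumptions that scanning and image preparation are one-time costs and that every issue has exactly 8 pages are ours.

```python
# Hour ranges (low, high) taken from the Hope Republican breakdown.
SCAN_HOURS = (6, 8)        # scan 2 volumes, counted once
PREP_HOURS = (4, 4)        # deskew, crop, set grayscale & batch files, counted once
DOCWORKS_HOURS = (8, 10)   # process ONE batch in DocWorks


def estimate_total_hours(batches):
    """Return (low, high) total hours for a project split into `batches`."""
    low = SCAN_HOURS[0] + PREP_HOURS[0] + batches * DOCWORKS_HOURS[0]
    high = SCAN_HOURS[1] + PREP_HOURS[1] + batches * DOCWORKS_HOURS[1]
    return low, high


def average_pages_per_batch(issues, pages_per_issue, batches):
    """Average batch size, assuming every issue has the same page count."""
    return issues * pages_per_issue / batches


if __name__ == "__main__":
    low, high = estimate_total_hours(batches=11)
    print(f"Estimated effort: {low}-{high} hours")  # 98-122 hours

    pages = average_pages_per_batch(issues=104, pages_per_issue=8, batches=11)
    print(f"Average batch size: about {round(pages)} pages")  # about 76 pages
```

At roughly 76 pages each, the 11 batches stay under the 100-page target. Most of the total comes from DocWorks processing, so the number of batches matters more than the scanning time when budgeting a similar project.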
