Chapter 4

Technical Aspects of Digitization

4.1 Digitization

Digitization is the process of converting a printed resource into a digital resource. What advantages have encouraged the digitization of vast amounts of analogue resources? “The aim of expressing an object in numbers is that it can be stored and manipulated by computers. Computers are number crunchers, performing millions of calculations per second. By digitizing an original and placing a digital copy of it on a computer, the file can be manipulated, transferred, and stored with ease” (Wentzel, 2006, p. 11).

4.2 Process of Digitization

Digitization is the process of converting something into a numerical form that can be processed by computers. Its main objectives are the storage and manipulation of resources and access to them. Reduced storage space and access by a large number of users without wear and tear on the originals have encouraged the digitization of printed resources.

“The process of digitization involves two major sets of activities: (i) The process of digital

conversion whereby source materials are converted into digital form, and (ii) The

processing of the digitized information, which involves several activities related to the

storage, organization, processing and retrieval of digitized information” (Choudhury &

Choudhury, 2007, p. 104). The stages involved in the digitization process are scanning, indexing, storing, and retrieval, each of which is discussed below.

Scanning

Using an electronic image scanner or a digital camera, the printed source document is converted to an electronic image. The source document is scanned at a predefined resolution and bit depth. The resulting image is stored in a file that records the binary digits (bits) for each pixel; such a file is called a “bit-map page image”. The scanning software is also used for formatting, tagging, storage and retrieval of the scanned image.

Indexing

In this step, the scanned image files are indexed by linking the database of scanned images to a text database. The text database links each set of images to keywords and to the location of the images in the image database. Some scanning software requires the indexing terms to be keyed in manually for each image file, while other software facilitates the selection of indexing terms from the image files themselves.

Storing

The file of the scanned image is saved for further processing. Its size depends on many factors, such as the scanning resolution, the scan area, the compression technique and the file format used. The scanned image is stored on offline storage media such as CD-ROM or DVD-ROM, external hard disks, snap servers, etc.

Retrieval

Retrieval is an essential part of the process; without it, scanning would be of no use. When a document is scanned, two files are stored in the machine. The first file holds the image along with a key to the second file, which stores the location of the document. To retrieve a scanned document, the system uses the key that links the two files to look up the document's location in the second file (Arora, 2001).
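
The indexing and retrieval stages can be illustrated with a minimal sketch. It assumes a toy in-memory index that maps keywords to image file locations; the class name `ImageIndex` and the file paths are hypothetical and do not come from any particular scanning package.

```python
from collections import defaultdict


class ImageIndex:
    """Toy text index that maps keywords to scanned-image locations."""

    def __init__(self):
        # keyword -> set of image file locations (hypothetical paths)
        self._keyword_to_paths = defaultdict(set)

    def add(self, image_path, keywords):
        """Register one scanned image under each of its indexing terms."""
        for keyword in keywords:
            self._keyword_to_paths[keyword.lower()].add(image_path)

    def find(self, keyword):
        """Return the locations of all images indexed under a keyword."""
        return sorted(self._keyword_to_paths.get(keyword.lower(), set()))


index = ImageIndex()
index.add("scans/vol1/page_001.tif", ["census", "1901"])
index.add("scans/vol1/page_002.tif", ["census", "maps"])
print(index.find("Census"))
```

A production system would keep the index in a database rather than in memory, but the link between indexing term and image location is the same.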

4.3 Technological Background of Digitization

Digital images are represented by a set of pixels, each described by one or more bits. Such bit-mapped images cannot be searched like an ASCII text file. However, by applying Optical Character Recognition (OCR) technology, the text in a bit-mapped image file can be converted to an ASCII file. The technical specifications that control the quality of the scanned image are discussed below.

Bit Depth

Bit is the abbreviated form of binary digit; a bit takes one of two values, 0 or 1. Bits are used to describe the range of shades between pure black and pure white. Black and white files are called 1-bit, as there are only two shades, black and white. The bit depth or colour depth of a scanner indicates the range of colours that the scanner can capture. It does not define the limits of the colour range readable by the device but simply specifies the number of distinct colours. A higher figure equates to a more accurate description of the colours available to the scanner but does not necessarily mean that they are available to the user at the end of the process.

Scanners often capture at a larger ‘bit depth’ of 36-42 bits and then save or export in standard 24-bit RGB (Red Green Blue) colour. This extended colour depth is used internally by the scanner to produce the best possible quality original image data but is not normally available to the user. Recently, however, some scanners have begun to allow the full ‘hi-bit’ version of the file to be saved and edited as a 48-bit TIFF (Tagged Image File Format) or PNG (Portable Network Graphics). The colour depth in itself does not provide much evidence of the quality of the scanner; however, it does give some guidance on how capable the scanner might be if it can use all the colour data it produces.
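
As a rough illustration of what bit depth means, the number of distinct values that n bits can encode is 2 to the power n. The short sketch below tabulates this for a few common depths; the function name is an assumption made for illustration.

```python
def distinct_values(bit_depth):
    """Number of distinct shades or colours representable with bit_depth bits."""
    return 2 ** bit_depth


# 1-bit bitonal, 8-bit grayscale, 24-bit RGB and 48-bit 'hi-bit' colour
for bits in (1, 8, 24, 48):
    print(f"{bits:>2}-bit: {distinct_values(bits):,} distinct values")
```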

Resolution

Before a document is scanned, the resolution must be decided, expressed in dpi (dots per inch) or ppi (pixels per inch), which indicates the quality of the scanned document. The higher the dpi/ppi value, the higher the resolution (Wentzel, 2006). The appropriate level varies from document to document. “The higher the resolution, the finer the grid used to segment the image”. However, the higher the resolution, the larger the file: for an uncompressed image, doubling the resolution doubles the pixel count along both dimensions and so roughly quadruples the file size.
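
The relationship between resolution, scan area, bit depth and file size can be sketched with a simple estimate. This is a minimal sketch, assuming an uncompressed image and ignoring file headers and row padding; the function name and the page size used are illustrative only.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bit_depth):
    """Estimate raw pixel-data size; ignores file headers and row padding."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bit_depth
    return (total_bits + 7) // 8  # round up to whole bytes


# Hypothetical example: a letter-size page (8.5 x 11 inch) in 24-bit colour.
for dpi in (150, 300, 600):
    size_mb = uncompressed_size_bytes(8.5, 11, dpi, 24) / 1_000_000
    print(f"{dpi} dpi: {size_mb:.1f} MB")
```

Each doubling of the dpi value multiplies the estimate by four, which is why high-resolution archival scans are usually compressed before storage.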

Optical and interpolated resolution are two different types of resolution, distinguished by how they are generated. Optical resolution is the maximum resolution a scanner is physically capable of capturing. Interpolated resolution is generated artificially: the software takes the pixels captured by the scanner, expands the grid pattern, and estimates the values of the pixels that were not captured by the scanner.
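
Interpolation can be illustrated on a single row of gray values. The sketch below uses simple linear interpolation between neighbouring captured pixels; this is an assumption made for illustration, as scanner software may use other methods such as bicubic interpolation.

```python
def upsample_row(row, factor):
    """Linearly interpolate new values between neighbouring captured pixels."""
    if len(row) < 2:
        return [float(value) for value in row]
    out = []
    for left, right in zip(row, row[1:]):
        for step in range(factor):
            # step 0 reproduces the captured pixel; later steps are estimates
            out.append(left + (right - left) * step / factor)
    out.append(float(row[-1]))
    return out


captured = [0, 100, 200]          # gray values actually measured by the sensor
print(upsample_row(captured, 2))  # estimated values appear between them
```

The interpolated values add pixels but no new detail, which is why optical resolution is the better guide to what a scanner can really capture.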

Many recommendations have been put forward for selecting a resolution that gives good-quality scans for different types of documents. Wentzel (2006), in “Scanning for Digitization Projects”, has put forward the following recommendations.

• Normal web image: 72 dpi GIF/JPEG
• Minimum gray/color print setting: 150 dpi JPEG
• Optimal color print setting: 300 dpi TIFF
• Optimal setting for running pages of text through OCR: 300 dpi TIFF
• Best black and white print setting: 600 dpi TIFF
• Archival setting (all colors): 600 dpi TIFF

The Digital Library Federation (DLF) has likewise recommended 300 dpi 24-bit colour TIFF for images and 600 dpi 1-bit bitonal TIFF for pages of text <www.diglib.org/standards/bmarkfin.htm#benchmark>. The resolution may be adjusted according to the facilities available and the type of documents to be scanned.

Threshold

Bitonal scanning is used for pages containing text or line drawings. It is also known as binary or black and white scanning, where each pixel is represented by one bit. Grayscale scanning is used for black and white photographs, which contain intermediate or continuous tones. Colour scanning is used for colour photographs. Bitonal scanning is the fastest to process. Grayscale scanning, on the other hand, provides more accurate results, especially for documents with degraded or shaded backgrounds. Colour scanning helps to retain colour information and/or colour graphics in the source document. “The threshold setting in bitonal scanning defines the point on a scale, usually ranging from 0 to 255, at which gray values will be interpreted as black or white pixels” (Arora, 2001, p. 17). The threshold setting therefore determines the image quality in bitonal scanning.
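
The effect of the threshold setting can be sketched as follows. The sketch assumes 8-bit gray values in which 0 is black and 255 is white, and treats values at or above the threshold as white; both conventions are assumptions, and actual scanner software may differ.

```python
def to_bitonal(gray_rows, threshold=128):
    """Map 0-255 gray values to 1 (white) or 0 (black) using a fixed threshold."""
    return [
        [1 if value >= threshold else 0 for value in row]
        for row in gray_rows
    ]


# Hypothetical 2 x 4 patch of gray values from a shaded page
patch = [
    [30, 120, 135, 250],
    [90, 127, 128, 200],
]
print(to_bitonal(patch))                 # default threshold of 128
print(to_bitonal(patch, threshold=100))  # lower threshold keeps more pixels white
```

Lowering the threshold turns more mid-gray pixels white, which can remove background shading but may also thin out faint text.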

Compression

A scanned image is very large if the source document is scanned at high resolution. Therefore, to make the files manageable by the computer system and by the user, it is necessary to reduce, or compress, the file size.

“Compression is the process of reducing the size of a data file or an image by abbreviating

the repetitive information such as one or more rows of white bits to a single code” (Arora,

2001, p. 17). Compression makes storage, processing and transmission over a network more economical.

Data compression algorithms are of two types: lossless and lossy.

a) Lossless compression

Lossless compression uses algorithms that encode repeating elements or patterns within an image. In run-length encoding, for example, if the same colour occurs in several adjacent pixels, the run is stored in two bytes: the first byte holds the colour and the second the number of adjacent pixels. When the file is decompressed, the original image is restored exactly.
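
The two-byte scheme described above corresponds to run-length encoding. A minimal sketch follows, assuming 8-bit pixel values and a run length capped at 255 so that each count fits in one byte; the function names are illustrative.

```python
def rle_encode(pixels):
    """Encode a bytes object as (value, run length) byte pairs."""
    encoded = bytearray()
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        # extend the run while the colour repeats, up to one byte's capacity
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        encoded += bytes((value, run))
        i += run
    return bytes(encoded)


def rle_decode(encoded):
    """Rebuild the original pixel values from (value, run length) pairs."""
    decoded = bytearray()
    for k in range(0, len(encoded), 2):
        value, run = encoded[k], encoded[k + 1]
        decoded += bytes([value]) * run
    return bytes(decoded)


row = bytes([255] * 300 + [0] * 5)  # a mostly white scan line
packed = rle_encode(row)
print(len(row), "bytes ->", len(packed), "bytes")
```

Decoding the packed bytes returns the original row unchanged, which is what makes the method lossless. On rows with few repeated values the encoded form can be larger than the input.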

b) Lossy compression

In this type, the compression ratio is much higher than in lossless compression, but the quality of the image degrades.

Some of the commonly used compression protocols are:

i) ITU-G4: Developed by the International Telecommunication Union (ITU), it is a popular standard protocol for black and white images.

ii) JPEG: The Joint Photographic Experts Group (JPEG) standard is the ISO/IEC 10918-1 compression protocol. It represents an area that has the same tone, shade, colour, or other characteristics by a code.

iii) LZW: Lempel-Ziv-Welch (LZW) uses a table-based lookup algorithm invented by Abraham Lempel, Jacob Ziv, and Terry Welch. Two commonly used file formats in which LZW compression is used are the Graphics Interchange Format (GIF) and the Tagged Image File Format (TIFF) (Arora, 2001, p. 19).

iv) Fractal and wavelet compression: These lossy compression formats offer advantages for providing access to digital images of oversized materials on the web. They convert the image into mathematical models instead of an array of pixels and thus save storage space.
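
The table-based lookup used by LZW (item iii above) can be sketched in a simplified form. The sketch returns a plain list of integer codes and omits the variable-width bit packing and table-reset codes that GIF and TIFF add on top; it is an illustration of the idea, not the encoder used by either format.

```python
def lzw_compress(data):
    """Return a list of integer codes for the given bytes."""
    table = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate           # keep growing the known sequence
        else:
            codes.append(table[current])  # emit code for the longest known sequence
            table[candidate] = next_code  # learn the new, longer sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original bytes, reconstructing the table on the fly."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = bytearray(previous)
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            entry = previous + previous[:1]  # code refers to the entry being built
        else:
            raise ValueError("invalid LZW code")
        out += entry
        table[next_code] = previous + entry[:1]
        next_code += 1
        previous = entry
    return bytes(out)


sample = b"ABABABABAB"
codes = lzw_compress(sample)
print(len(sample), "bytes ->", len(codes), "codes")
```

The decoder never receives the table; it rebuilds the same entries in the same order as the encoder, so only the codes need to be stored.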

Image Enhancement

The image enhancement process can improve the quality of the image captured by the scanning device. Image editing software helps in this process. “For archiving and online publishing of images image editor is a must. We can resize images, crop, create image for website, save in multiple formats” (Deka, 2008, p. 171). According to Arora (2001), the scan area can be decomposed into small areas, each of which can be treated separately to improve the image quality further. Many image editing programs can be used for image enhancement, such as Adobe Photoshop and PaintShop Pro.

4.4 File Formats

The file format used for the storage, dissemination and preservation of digital resources is one of the most important technical issues to be taken into consideration. “One of the key components in ensuring resource longevity is the choice of file and media formats used to create, store, and deliver digital content, and the strategies that are employed to manage these in the long term” (Williamson, 2005, p. 508). File formats store information such as size, resolution and compression protocol. The scanned image can be stored in different file formats for easy storage and retrieval. PDF, SGML, TIFF, MPEG and WAVE are some popular file formats used for storing digitized content. There are two main types of file format, described below.

Open File Format

An open file format is freely available for use, is free from patent or license restrictions, and can be used by anyone in proprietary, free or open source software.

An open standard approach brings a wide range of benefits (Williamson, 2006). These are:

• Resources are freed from dependencies on a single application or particular

hardware platforms;

• Resources can be preserved and accessed over the long term.

OpenDocument, Office Open XML, PNG, JPEG 2000 and ZIP are some examples of open file formats.

Proprietary File Format

A proprietary file format is owned by an individual or an organization, which protects it from unauthorized use by means of patents or licenses. “These formats are owned by an

organization or group (e.g. Microsoft), may sometimes be accepted as de facto standards

through sheer ubiquity, and might even be referred to as standards, but cannot be regarded

as open since the owner could theoretically choose to change the format or conditions of

usage at any time” (Williamson, 2005, p. 509).

A list of file formats for different media types, along with the creator, date of creation, media type and format, is given in Table 4.1.

Table 4.1 List of File Formats

| Sl. No | File Name | File Extension | Creator | Creation Date | Media Type | Format |
|---|---|---|---|---|---|---|
| 1 | Advanced Audio Coding | .aac | Collaboration between corporations approved by MPEG | 1997 | Sound | Lossy Compression |
| 2 | Advanced Authoring Format | .aaf | Advanced Media Workflow Association | 2000 | Moving Image | Uncompressed |
| 3 | Apple QuickTime | .mov | Apple Computer, Inc. | 1991 | Moving Image | Container |
| 4 | Audio Interchange File Format | .aiff | Electronic Arts Interchange and Apple Computer, Inc. | 1988 | Sound | Uncompressed |
| 5 | Audio Video Interleave | .avi | Microsoft | 1992 | Moving Image | Container |
| 6 | Bitmap | .bmp | IBM and Microsoft | 1988 | Still Image | Compressed or Uncompressed |
| 7 | Broadcast Wave File | .bwav | IBM and Microsoft | 1997 | Sound | Uncompressed |
| 8 | Digital Video File | .dv or .dif | Sony | 1994 | Video | Uncompressed |
| 9 | Extensible Music Format | .xmf | The MIDI Manufacturers Association, XMF Working Group | 2001 | Moving Image | Container |
| 10 | Final Cut Pro | .fcp | Final Cut Pro/Apple Computer, Inc. | 1999 | Moving Image | Uncompressed |
| 11 | Flash Video | .swf (or .flv) | Adobe/Macromedia | 1997 | Moving Image | Moving Image/Dynamic |
| 12 | Graphics Interchange Format | .gif | CompuServe | 1987 | Still Image | Lossless Compression |
| 13 | JPEG | .jpg | Joint Photographic Experts Group | 1990 | Still Image | Lossy Compression |
| 14 | Keynote | .key | Apple Computer, Inc. | 2003 | Presentation | Container |
| 15 | Material Exchange Format | .mxf | Pro-MPEG Forum | 2004 | Moving Image | Container |
| 16 | MPEG-1 or MPEG-2 | .mpg | Motion Picture Experts Group | 1988 | Moving Image | Container |
| 17 | MPEG-1/2 Audio Layer 3 | .mp3 | Motion Picture Experts Group | 1991 | Sound | Lossy Compression |
| 18 | MPEG-4 | .mp4 | Motion Picture Experts Group | 1998 | Moving Image | Container |
| 19 | Ogg Vorbis Compressed Video | .ogm | Ogg Vorbis | 2003 | Moving Image | Container |
| 20 | Open Office Impress | .odp | Sun Microsystems | 2000 | Presentation | Container |
| 21 | Photoshop Document | .psd | Adobe | 1990 | Still Image | Uncompressed |
| 22 | Portable Network Graphics | .png | The Portable Networks Graphics Development Group of the World Wide Web Consortium | 1996 | Still Image | Lossless Compression |
| 23 | Power Point Document | .ppt | Microsoft | 2003 | Presentation | Container |
| 24 | Raw Image File | .dng, .cr2, .nef, .arw, and .srf | Depends on equipment manufacturer | 2000 | Still Image | Uncompressed |
| 25 | RealAudio File Format | .ra | RealMedia | 1995 | Sound | Compressed |
| 26 | Scalable Vector Graphics | .svg | The World Wide Web Consortium | 1999 | Still or Moving Image | Uncompressed |
| 27 | Tagged Image File Format | .tiff | Aldus | 1985 | Still Image | Container or Uncompressed |
| 28 | WAVE Form Audio Format | .wav | IBM and Microsoft | 1992 | Sound | Uncompressed |

(Source: www.nyu.edu/tisch/preservation/...2/07f_1807_nmartin_a2.doc)

4.5 Hardware Used for Digitization

Devices are needed to capture the image of the source document. A scanner is generally used to capture images from textual documents, pictures or other sources. The hardware used in the process of digitization is discussed below.

4.5.1 Scanner

A scanner works much like a photocopier. In a flatbed scanner, a moving lamp throws light onto the object to be digitized, and the reflected light is focused through a series of mirrors and lenses onto the recording medium. The recording medium is a compact light sensor, either a CCD (Charge-Coupled Device) or a CIS (Contact Image Sensor), each of which is composed of hundreds or thousands of elements. When light strikes an element, the intensity of the light is assigned a number. The numeric readings of light intensity and the element positions are recorded in sequence into a file, which forms the digital version of the original. The following features should be analysed first when selecting a scanner.

a) Driver of scanner

A driver is software that operates the scanner and transfers the digitized file to the hard drive or to another program. The scan driver may be standalone or a plug-in, a specialized version of the driver that is accessible through Photoshop, Word or another program. A standalone driver runs the scanner without involving other software and saves the file to the hard drive. A plug-in is opened within Photoshop or Word, and after scanning the files can be used immediately in that program.

Scan drivers fall into two groups: native and third-party. Flatbed scanner manufacturers provide their own native drivers for their scanners and provide driver updates through their websites. For specialized scanners, such as overhead book scanners or digital cameras, the native driver is the only driver available.

Third-party scan drivers offer better control over the scanner and the scanned image than native drivers. These drivers have to be purchased unless they are supplied with the scanner as an incentive. Windows Image Acquisition (WIA) is a third-party scan driver provided with Microsoft Windows XP. It offers the features most commonly available on flatbed scanners. However, specialized scanners cannot be operated with WIA.

b) Scanning speed

Scanning time varies depending on the type of scanner used. Within a busy workflow, scanning speed can often be a deciding factor in scanner choice and should always be researched and considered before a choice is made. Many scanners offer a choice of scan qualities that depend on the number of passes and/or the speed of the CCD: the more passes the CCD makes, the higher the quality and the slower the scan. Some early scanners were unable to capture red, green and blue data in one go (one-pass) and had to make three separate scans (three-pass). This did not normally affect the quality but was very slow. Some scanners offer functions such as dust and noise reduction; however, these also slow down the process significantly.

c) Scan area

The scan area is the area that the scanner is capable of scanning. Scan areas are specified in inches and/or as media sizes such as:

8 ½ X 11 inch (standard letter)

8 ½ X 14 inch (legal)

11 X 17 inch (ledger)

Most flatbed scanners have a nominal size of A4 but can scan an area of about 8.5" by 12-14". A3-sized scanners are available, but they can take up a considerable amount of space. They are, of course, essential if it becomes necessary to capture works larger than A4, although if the objects are very large or difficult to handle a digital camera might well offer a more pragmatic alternative. High-end A3 flatbed scanners are very popular for commercial digitization, as they can be set up to scan a number of images in one go. This greatly increases efficiency and throughput, but these machines are very costly.

Some flatbed scanners offer dual optics, where the optical system can be switched to scan a ‘sweet-zone’, a smaller scan area with greatly increased resolution. This is normally of use when scanning small to medium-sized transparencies within the full size of the scanner bed.

A range of optional add-on parts can provide additional functionality and productivity for many mid-range to high-end scanners. Two of the most common options for flatbed scanners are the automatic sheet/transparency feeder (ASF/ATF) and the transparency media adapter (TMA). An ASF or ATF is used to batch scan quantities of single sheets or transparencies. Normally an ASF/ATF is best for creating small, low-quality scans, either 1-bit black and white images of text for later optical character recognition (OCR) or small scans for thumbnail creation. A TMA provides an alternative light source within the scanner, which enables transparent originals such as photographic slides and larger colour transparencies to be scanned.

d) Scanner types

Selecting the right scanner is more difficult than selecting the right computer. Scanners are used to capture images of resources in printed form or on microfilm. There are two types of image scanner, based on how the image is interpreted: vector scanners and raster scanners. A vector scanner interprets the image as a set of x, y coordinates. A raster scanner captures the image by passing light down the page and digitally encoding it row by row.

i) Drum scanner: Drum scanners use photo-multiplier tubes (PMT) to produce very high quality results. They typically have a density range of 3.4-4.0, with a ‘dMax’ at the top of that range. They can offer an optical resolution of up to 8000 samples per inch (spi). Drum scanners are the tool of choice of the print industry and are normally used by professional digitization bureaux, because they are expensive and complex and require skilful operation to get the best from them. Only flexible originals can be scanned in a drum scanner, as the original has to be mounted on a transparent acrylic cylinder (drum), which is then spun at high speed around the photo-multipliers within the cylinder. Mounting transparencies on the drum is a slow and skilled operation, and it is normal to have at least two drums in use so that one can be mounted whilst the other is being scanned.

Fig. 4.1: A Drum Scanner

Although the quality from these scanners is exemplary, they tend to be slow and cannot normally provide the level of productivity required by most digitization projects. There are also some preservation issues with the standard use of mounting oil to avoid Newton’s rings between the transparency and the drum. If mounting oil is used, the transparencies must be scrupulously cleaned after scanning.

ii) Flatbed scanner: It works like a photocopier: a lamp moves slowly across the face of the original, and the reflected light is focused through a series of mirrors and lenses onto the recording medium. Here, the recording medium is a compact light sensor, either a Charge-Coupled Device (CCD) or a Contact Image Sensor (CIS), each of which is composed of hundreds or thousands of elements. When light strikes an element, the intensity of the light is assigned a number. The numeric readings of light intensity and the element positions are recorded in sequence into a file, which forms the digital version of the original. To capture colour, the scanner must either make three passes, with a red, green or blue filter in front of the CCD on each pass, or have three lines of CCD elements, each with a red, green or blue filter on top.

Fig. 4.2: Flatbed Scanner (HP Scanjet G2410)

Flatbed scanners are much cheaper than drum scanners and also much easier to operate. The technology and quality of CCDs have improved a great deal, yet flatbed scanners remain cheaper than drum scanners. Because their functions are simple, they can be operated by unskilled operators. The document to be scanned does not need to be bent around a drum. Flatbed scanners also scan faster than drum scanners. Many flatbed scanners are available on the market. The major printer manufacturers offer low-cost flatbed scanners that can be used for scanning photographs and loose sheets.

iii) Overhead scanner: This type of scanner is quite expensive compared with a flatbed scanner, but it is helpful for capturing images of extremely fragile materials. Overhead scanners that scan only in black and white should be avoided. The Zeutschel overhead scanner, a popular scanner used by LICs and resource centres for digitization, is shown below.

Fig. 4.3: An Overhead Scanner (Zeutschel os 5000)

Zeutschel scanners can be used to digitize books, magazines and other large documents. Special, careful procedures and functions for books are used during scanning. These include book cradles, radiographic tables, innovative light systems and the capture of documents with the text facing upwards. Depending on customer needs, Zeutschel offers different models for colour, greyscale and black/white.

iv) Sheet-fed scanner: In this type of scanner, sheets of paper have to be fed through the scanner. It is not suitable for capturing images of loose manuscripts, photographs, fragile materials, etc.

v) Microfilm scanner: It is a good choice for microfilm, photographs, slides and negatives, but it is limited in the size of material it can scan. The microfilm produced from the original documents can be preserved in ideal conditions for a very long time.

Fig. 4.4: Microfilm scanner (B-M-I EYECOM MIC5M)

The steady growth of digital imaging technology over the last five years has led to a vast range of professional and consumer scanners on the market. Quality and speed are steadily rising and cost is slowly falling. However, it remains true that although it is possible to buy fast low-quality scanners or slow high-quality scanners at a lower price, scanners that are both productive and high-quality still tend to be very expensive.

4.5.2 Digital Camera

A digital camera is a good choice not only for digitizing the valuable documents of an organization but also for other purposes, such as photographing the organization, its sections and its staff for upload to the organization's website. When damaged materials that cannot be moved have to be digitized, so that the image must be captured without disturbing their position, investing in a digital camera is the better choice. Any modern DSLR (Digital Single Lens Reflex) camera or point-and-shoot digital camera can be used as a document scanner. A DSLR can be used with a dedicated flash and a lens with some measure of zoom (18-55mm or 18-200mm). For good results, the room where the scanning is done should be well lit.

Fig. 4.5: Digital Camera Used as Document Scanner

The camera must be properly aligned with the document; otherwise the shots will be slightly skewed, which could be a problem. Holding arms can be used to fasten the camera in place while taking the photographs. Most tripods will not angle down far enough for this to work, but if the document is placed on an easel, it becomes feasible to find the right angle for alignment. During a visit to the University Library of Osmania University, the researcher observed a Sony digital camera being used to capture images of rare documents.

4.6 Software Used for Digitization

The scanner can only capture the image of the source document, which has to be processed further to enhance its quality and clarity or to make it searchable and accessible to users in future. For these purposes, software such as scanning software and Optical Character Recognition (OCR) software is needed.

4.6.1 Scanning Software

For the proper operation of a scanner, the driver and the scanning software for that particular scanner have to be installed. Scanner software controls the scanning process and drives the hardware that captures the image data and passes it on to the next stage of the image workflow. This software usually offers a range of image processing features. It can either be a device-specific program designed to work with one scanner or a plug-in based on a driver interface such as TWAIN or ISIS, which can be accessed from within a host program.

Software can play an important role within a workflow in terms of productivity and quality of the scan, so it is important to consider how best to combine the work undertaken by scanning software with that done by image processing software. In addition to setting the resolution, scan area, colour or greyscale mode, and reflective or transmissive mode, the scanner software can also be used to control colour optimization, colour transmission, sharpening, tonal optimization, automated dust/scratch removal, negative to positive image selection, scan quality control, image rotation, batch scanning, etc.

Using any of these facilities at the time of acquiring the image can save a lot of time in corrective manipulation later in the workflow, but it is worth comparing the performance of these functions in the scanner software and in the image processing software when deciding which is going to be more effective. Examples of scanning software include FreeKapture, VueScan and Scanitto Pro.

FreeKapture 2.0: It is a free TWAIN image capture application from TSoft that works on any TWAIN-compliant Windows system (Windows 98 onwards). TWAIN is, allegedly, an acronym for Technology Without An Interesting Name; the TWAIN driver is software supplied by the manufacturer of a TWAIN-compliant device. Using this driver, FreeKapture is able to scan, save and print images (photographs etc.). Images are saved in JPG or BMP format.

VueScan: It is an easy-to-use replacement for the software that comes with a scanner and supports most flatbed scanners, printer/scanners and film scanners. Over 10 million people have downloaded VueScan since it was first released in 1998. VueScan is a powerful scanning tool with many useful features and currently supports over 1200 scanners and 321 digital camera RAW files.

Scanitto Pro: Scanitto Pro provides one-click scanning and copying using TWAIN drivers, which provide exceptional scan and copy quality. In addition, Scanitto Pro integrates with all major operating systems to provide a seamless document management environment that is intuitive and very simple to use. Scanitto Pro is extremely stable and has passed all the major security and operational tests. It supports multiple file formats, including PDF, BMP, JPG, TIFF, JP2 and PNG. Scanitto Pro supports all major European languages, including English, French, German, Italian, Spanish and Russian.

4.6.2 OCR Software

A scanned document is nothing but a picture of a printed page. It cannot be edited, manipulated, managed or searched on the basis of its content. In other words, scanned documents have to be referred to by their labels rather than by the characters they contain. OCR (Optical Character Recognition) software is used to transform a scanned page image of text into a word processing file. Its function is to analyse the captured image or set of images and generate a file containing the text in ASCII code or in a specified word processing format, leaving the image intact in the process.

OCR does not actually convert an image into text but rather creates a separate file containing the text. There are four types of OCR technology, namely matrix matching, feature extraction, structural analysis and neural networks. In matrix matching, each scanned character is compared with stored character templates. In feature extraction, a character is recognized from its structure and shape on the basis of a set of rules. In structural analysis, characters are determined on the basis of density gradations or character darkness. Neural network technology uses a form of artificial intelligence, drawing on fuzzy logic, to minimize human effort; it is also known as ICR (Intelligent Character Recognition). Many OCR software packages are available in the market nowadays. ABBYY FineReader 11 and OmniPage Pro are two of the most widely used.
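The matrix matching approach described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method used by any of the OCR packages named in this chapter: the 3x3 templates, the scoring rule (fraction of agreeing pixels) and the names TEMPLATES, match_score and recognize are all hypothetical and chosen for brevity.

```python
# Hypothetical 3x3 binary templates (1 = ink, 0 = background), for illustration only.
TEMPLATES = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}


def match_score(glyph, template):
    """Return the fraction of pixels on which glyph and template agree."""
    pairs = [(g, t)
             for g_row, t_row in zip(glyph, template)
             for g, t in zip(g_row, t_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return the best-matching character and its score."""
    best = max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))
    return best, match_score(glyph, TEMPLATES[best])


# A scanned "L" with one pixel lost to noise (assumed example input).
noisy_l = ((1, 0, 0),
           (1, 0, 0),
           (1, 1, 0))

character, score = recognize(noisy_l)
print(character, round(score, 2))
```

Practical systems work on much larger bitmaps and must first normalize the size, position and skew of each character. Matrix matching is therefore best suited to documents printed in a known, uniform typeface.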

ABBYY FineReader 11: With new support for Arabic (Modern Standard), Vietnamese and Turkmen (Latin), ABBYY FineReader 11 detects any combination of 189 languages. It supports a wide range of output formats, and the OCR results can also be sent directly to applications such as Microsoft Word, Excel and PowerPoint, Adobe Acrobat, Corel WordPerfect and OpenOffice.org Writer. Its image correction tools adjust motion blur, ISO noise, 3D image distortion, brightness, contrast, colour levels and curved text to improve the recognition results.

OmniPage: The newest version of OmniPage uses the latest OCR technology, with greatly increased accuracy, cloud service capabilities and recognition of 123 languages. OCR loses its convenience if the software is too difficult or confusing to use, which is a risk with any feature-rich software. OmniPage avoids this risk with its intuitive design and logical layout, so that even a novice can navigate its many features.

4.7 Storage Space of Scanned Image: An Experimental Study

Two files, a text file of 19.2 KB and an image file of 577 KB, both in docx format, were created and printed. Both pages were scanned using two different flatbed scanners: the Avision FB6280E, an A3 book-edge scanner, and the Canon imageCLASS D520. The text page was scanned at different resolutions using the black-and-white option, and the output was saved in different file formats on both scanners. The sizes of the resulting files are given in the following table.

Table 4.2 File Size of B/W Scanned Image

Sl No.  File format  File size in FB6280E            File size in Canon imageCLASS D520
                     200 dpi   300 dpi   600 dpi     200 dpi   300 dpi   600 dpi
1       pdf          152 KB    315 KB    1.04 MB     34.1 KB   95 KB     67 KB
2       bmp          10.6 MB   23.1 MB   95.8 MB     464 KB    1 MB      4.02 MB
3       tiff         6.08 MB   15.5 MB   59.1 MB     457 KB    1 MB      1.78 MB
4       jpg          169 KB    350 KB    1.22 MB     --        --        --
5       gif          430 KB    1.29 MB   4.12 MB     --        --        --

A similar process was applied to the image printout, which was scanned using the colour option. The sizes of the files produced by the two scanners in different file formats are presented in the following table.

Table 4.3 File Size of Colour Scanned Image

Sl No.  File format  File size in FB6280E            File size in Canon imageCLASS D520
                     200 dpi   300 dpi   600 dpi     200 dpi   300 dpi   600 dpi
1       pdf          127 KB    283 KB    1.09 MB     57.3 KB   101 KB    1 MB
2       bmp          6.32 MB   14.2 MB   56.9 MB     10.7 MB   24 MB     96.2 MB
3       tiff         5.28 MB   12.4 MB   48.6 MB     10.7 MB   24 MB     96.2 MB
4       jpg          459 KB    336 KB    1.28 MB     277 KB    629 KB    2.70 MB
5       gif          507 KB    1.20 MB   4.75 MB     --        --        --

Tables 4.2 and 4.3 show that, for the same document scanned at the same resolution, the file size differs between the two scanners and between file formats. File size also varies with resolution: in general, the higher the scanning resolution, the larger the file. The quality of the scanned images was found to be better at higher resolutions.
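The dependence of file size on resolution can be illustrated with a rough estimate for uncompressed formats such as BMP. The following Python sketch is illustrative only and is not part of the study. It assumes a US Letter scan area (8.5 x 11 inches), which the study does not state, and it ignores file headers and row padding. The function name uncompressed_size_bytes and its parameters are hypothetical.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the raw bitmap size of a scan, ignoring headers and row padding."""
    width_px = round(width_in * dpi)    # pixels across the page
    height_px = round(height_in * dpi)  # pixels down the page
    return width_px * height_px * bits_per_pixel // 8


# Assumed scan area: US Letter, 8.5 x 11 inches (not stated in the study).
for dpi in (200, 300, 600):
    colour = uncompressed_size_bytes(8.5, 11.0, dpi, 24)  # 24-bit colour
    bilevel = uncompressed_size_bytes(8.5, 11.0, dpi, 1)  # 1-bit black and white
    print(f"{dpi} dpi: colour {colour / 2**20:.1f} MB, "
          f"black and white {bilevel / 2**10:.0f} KB")
```

Because the pixel count grows with the square of the resolution, doubling the dpi roughly quadruples the uncompressed size. Under the assumed scan area, the 24-bit estimates are close to the Canon BMP figures in Table 4.3. Compressed formats such as PDF and JPG depend on the page content and do not follow this rule.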

4.8 Summing Up

Digitization has many aspects to be dealt with, from scan area and resolution to the file formats used for storage. The selection of hardware and software is also a factor in a successful digitization project. University libraries can opt for either in-house digitization or outsourcing to digitize their rich collections. They can approach institutes such as CDAC-Noida, CDAC-Pune, IIIT Allahabad and the Indira Gandhi National Centre for the Arts, New Delhi, to provide the necessary infrastructure and manpower for digitizing their valuable and rare documents, provided the conditions laid down by the respective bodies are acceptable to the university libraries.
