IMPACT Final Conference - Apostolos Antonacopoulos
Case Study: Scanning Parameters
The Effect of Scanning Parameters on OCR Results: A Case Study
Apostolos Antonacopoulos
PRImA Lab, The University of Salford, United Kingdom
www.primaresearch.org
Outline
Background
Image selection
Methods and procedures
Experiments
  Experiment 1: Colour vs. greyscale vs. bitonal
  Experiment 2: Effects of resolution
  Experiment 3: Comparison with NLNZ images
Conclusions
Background
Cost of storage is a real issue for Content Holders
Study by Tracy Powell and Gordon Paynter of the National Library of New Zealand (D-Lib Magazine, 2009) opened a number of questions
Aims:
  Examine the effects of colour in addition to greyscale and bitonal
  Examine the effects of producing bitonal images in different ways
  Examine the effects of different resolutions
  Study the results by image rather than average
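To make the storage concern concrete, the following is a minimal sketch of how uncompressed image size scales with bit depth and resolution. The page dimensions and resolutions are hypothetical values chosen only for illustration; they are not taken from the study, and real files will differ once compression and format overheads are applied.

```python
def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scan: pixel count times bit depth, in bytes.

    Ignores file headers and per-row padding, so it is a lower-bound estimate.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Bit depths for the three image types compared in the study.
BIT_DEPTHS = {"colour": 24, "greyscale": 8, "bitonal": 1}

if __name__ == "__main__":
    # Hypothetical 8 x 10 inch region at two hypothetical resolutions.
    for dpi in (300, 400):
        for name, bits in BIT_DEPTHS.items():
            size = raw_size_bytes(8, 10, dpi, bits)
            print(f"{dpi} dpi {name}: {size / 1e6:.1f} MB")
```

Before compression, colour takes three times the space of greyscale and 24 times the space of bitonal, and size grows with the square of the resolution.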
Image Selection
Qualitative selection
Parts of newspaper articles (no layout issues)
Variety of newspapers from British Library collection
Quality of overall page taken into account
Regions of different quality selected from same page
Only text regions selected (no graphics present)
No additional artefacts (e.g. warping) present
Methods and Procedures
Regions marked using Aletheia and extracted from the main image as separate PAGE files
Text was keyed and represented in PAGE files
Selected (“standard”) colour reduction and binarisation methods were applied
ABBYY FineReader Engine 9 used for OCR
IMPACT OCR evaluation tool used
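The slides do not name the "standard" colour reduction and binarisation methods that were applied. As an illustration only, the sketch below assumes two widely used stand-ins: a luminance-weighted greyscale conversion (ITU-R BT.601 weights) and Otsu's global threshold. The function names are hypothetical and the code is not the pipeline used in the study.

```python
def to_greyscale(rgb_rows):
    """Reduce rows of (r, g, b) pixels to 8-bit grey with BT.601 luma weights."""
    return [[min(255, round(0.299 * r + 0.587 * g + 0.114 * b))
             for r, g, b in row]
            for row in rgb_rows]


def otsu_threshold(grey_rows):
    """Return the grey level that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for row in grey_rows:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0      # number of pixels at or below the candidate threshold
    sum_dark = 0    # sum of their grey levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0 or w_light == 0:
            continue  # one class is empty, so the variance is undefined
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarise(grey_rows, threshold):
    """Map pixels at or below the threshold to 0 (ink) and the rest to 1 (paper)."""
    return [[0 if value <= threshold else 1 for value in row]
            for row in grey_rows]


if __name__ == "__main__":
    # Tiny hypothetical image: dark "ink" pixels on a light background.
    rgb = [[(20, 20, 20), (200, 200, 200)],
           [(200, 200, 200), (20, 20, 20)]]
    grey = to_greyscale(rgb)
    t = otsu_threshold(grey)
    print(t, binarise(grey, t))
```

A global threshold is only one way of producing a bitonal image. Locally adaptive methods and a scanner's built-in bitonal mode can give different results on the same original.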
Experiment 1: Colour/Grey/Bitonal
Accuracy Variation per Image
Bitonal: Best Algorithm Vs. Scanner
Original with Large Bitonal Variation
BL9_r0
Experiment 2: Effects of Resolution
Experiment 3: Examine NLNZ Images
Variations in Quality and Accuracy
Other bitonal algorithm better: NLNZ1_r1
Scanner bitonal better: NLNZ4_r0
Conclusions
Averages do not give an accurate picture. Different decisions should be taken for different document types
Better quality images leave room for improvement (re-OCR), especially when accuracy is far from the high 90s%
Current OCR systems are not taking advantage of extra quality?
Higher quality (at least greyscale) is an investment
Perhaps not so high resolution for “routine” material
“Lossy” compression is a real option
Better to have a high quality image with an imperceptible “loss” than a perfect low quality image!
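The point about averages can be illustrated with a small sketch that reports accuracy per image alongside the mean. It assumes a common definition of character accuracy (one minus the edit distance divided by the ground-truth length). The slides do not state the exact metric computed by the IMPACT OCR evaluation tool, and the region names and strings below are hypothetical, not results from the study.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr):
    """Character accuracy in [0, 1] relative to the ground-truth length."""
    if not truth:
        return 1.0 if not ocr else 0.0
    return max(0.0, 1.0 - edit_distance(truth, ocr) / len(truth))


def report(pairs):
    """Return ({image: accuracy}, mean accuracy) for {image: (truth, ocr)}."""
    per_image = {name: char_accuracy(t, o) for name, (t, o) in pairs.items()}
    mean = sum(per_image.values()) / len(per_image)
    return per_image, mean


if __name__ == "__main__":
    # Hypothetical regions: one recognised perfectly, one badly degraded.
    pairs = {
        "region_a": ("abcd", "abcd"),
        "region_b": ("abcd", "abxx"),
    }
    per_image, mean = report(pairs)
    for name, acc in per_image.items():
        print(f"{name}: {acc:.2f}")
    print(f"mean: {mean:.2f}")
```

Neither region sits at the mean, so a decision based on the average alone would suit neither of them.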
Further Information
PRImA: http://www.primaresearch.org
IMPACT: http://www.impact-project.eu