Performance Evaluation and Quality Assessment
Stefan Pletschacher
Europeana Newspapers Information Day
London, 9 June 2014
This project is partially funded under the ICT Policy Support Programme (ICT PSP) as part of the
Competitiveness and Innovation Framework Programme by the European Community
http://ec.europa.eu/ict_psp
Scope
• Intelligent/value-adding processes in digitisation projects (as opposed to “mechanical” tasks)
• Objective performance indicators for individual processing steps
• Objective quality measures for overall results of refinement processes
• In-depth analysis of specific problems (not just Pass/Fail QA)
Importance of PE&QA in Digitisation Projects
• Planning
  • Feasibility
  • Prioritisation
  • Costs, time, manual steps, specialist software
  • Services, output formats
• Implementation
  • Setup of workflows
  • Identification of bottlenecks
  • Optimisation of individual processing steps
• Monitoring and controlling
  • Agreed quality levels (OCR tools and commissioned services)
Digitisation Workflows and Evaluation Approaches
① Scanning
② Image enhancement
  • Page splitting
  • Border removal
  • Dewarping (page curl, arbitrary warping)
  • Noise removal
  • Binarisation
③ Layout analysis
  • Segmentation of regions, lines, words and characters
  • Region classification
  • Logical layout analysis
④ OCR
⑤ Post-processing
Evaluation approaches:
• Individual processing steps vs. entire workflow
• Direct vs. indirect
• Based on use scenarios
Performance Evaluation Overview
[Diagram: Image Repository, Evaluation Tools and Evaluation Results; compatibility through one common format (PAGE)]
Layout Analysis (Segmentation and Classification)
Types of errors:
• Miss / partial miss
• Split
• Misclassification
• Merge
• False detection
[Figure: ground truth vs. result, with the types of errors marked. Source: NLT/USAL]
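The error types listed on this slide can be illustrated with a minimal sketch that compares ground-truth regions and result regions by overlap. It assumes axis-aligned rectangles `(x1, y1, x2, y2)` and result regions that do not overlap each other; the function `classify_errors` and the data layout are hypothetical and not taken from the evaluation tools, which work on arbitrary region outlines.

```python
from collections import defaultdict


def overlap_area(a, b):
    """Intersection area of two rectangles given as (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def rect_area(r):
    return (r[2] - r[0]) * (r[3] - r[1])


def classify_errors(ground_truth, result):
    """Both arguments map a region id to (rectangle, region_type).

    Returns a list of (error_type, region_id) tuples.
    """
    gt_hits = defaultdict(list)   # ground-truth id -> overlapping result ids
    res_hits = defaultdict(list)  # result id -> overlapping ground-truth ids
    for g, (g_rect, _) in ground_truth.items():
        for r, (r_rect, _) in result.items():
            if overlap_area(g_rect, r_rect) > 0:
                gt_hits[g].append(r)
                res_hits[r].append(g)

    errors = []
    for g, (g_rect, g_type) in ground_truth.items():
        hits = gt_hits[g]
        # Assumes result regions are disjoint, so overlaps can be summed.
        covered = sum(overlap_area(g_rect, result[r][0]) for r in hits)
        if not hits:
            errors.append(("miss", g))
        elif covered < rect_area(g_rect):
            errors.append(("partial miss", g))
        if len(hits) > 1:
            errors.append(("split", g))
        if any(result[r][1] != g_type for r in hits):
            errors.append(("misclassification", g))

    for r in result:
        if not res_hits[r]:
            errors.append(("false detection", r))
        elif len(res_hits[r]) > 1:
            errors.append(("merge", r))
    return errors


# Illustrative data only: four ground-truth regions, four detected regions.
ground_truth = {
    "g1": ((0, 0, 10, 10), "text"),
    "g2": ((20, 0, 30, 10), "text"),
    "g3": ((0, 20, 10, 30), "image"),
    "g4": ((40, 0, 50, 10), "text"),
}
result = {
    "r1": ((0, 0, 30, 10), "text"),    # covers g1 and g2
    "r2": ((0, 20, 10, 25), "text"),   # upper half of g3
    "r3": ((0, 25, 10, 30), "text"),   # lower half of g3
    "r4": ((60, 60, 70, 70), "text"),  # matches nothing
}
for error in classify_errors(ground_truth, result):
    print(error)
```

In this example `r1` merges `g1` and `g2`, `g3` is split and misclassified, `g4` is missed, and `r4` is a false detection.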
Reading Order Detection
[Figure: reading order in the ground truth vs. in the result]
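One simple way to compare a detected reading order with the ground truth is to count how many pairs of regions keep their relative order. The sketch below is an illustration only: the slide does not describe the measure used by the evaluation tools, and the function name `pairwise_order_agreement` and the flat list representation are assumptions.

```python
from itertools import combinations


def pairwise_order_agreement(gt_order, result_order):
    """Fraction of region pairs whose relative order matches the ground truth.

    Both arguments are lists of region ids in reading order. Regions that
    are missing from the result are ignored in this simplified sketch.
    """
    position = {region: i for i, region in enumerate(result_order)}
    common = [region for region in gt_order if region in position]
    pairs = list(combinations(common, 2))
    if not pairs:
        return 1.0
    agreeing = sum(1 for a, b in pairs if position[a] < position[b])
    return agreeing / len(pairs)


# Illustrative data: the result swaps the second and third region.
score = pairwise_order_agreement(["r1", "r2", "r3", "r4"],
                                 ["r1", "r3", "r2", "r4"])
print(f"{score:.3f}")
```

A flat sequence is a simplification; newspaper pages often have several independent articles, where only the order within each article matters.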
Text Recognition
• Comparison of Ground Truth and OCR output based on encoded text (ASCII, Unicode)
• Normalisation
• Character accuracy
• Distance measure: minimum number of edit operations (insertions, deletions, substitutions)
• Per character class (lower case, upper case, whitespace characters, numbers, symbols)
• Word accuracy
• Correctly recognised words vs. total word count
• Bag of words (index, ranking)
• Stop words and non-stop words
• Rejected and suspicious characters/words
• Substitution errors (higher penalty)
• OCR confidence ≠ accuracy
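The bag-of-words idea can be sketched as follows. The split into a count-based and an index-based variant mirrors the labels used in the case-study charts, but the exact definitions here are an assumption: count-based is taken to respect how often each word occurs, index-based only whether a word occurs at all. Lower-casing stands in for normalisation, and the function names are hypothetical.

```python
from collections import Counter


def words(text):
    """Very simple normalisation and tokenisation (assumption: lower-case, split on whitespace)."""
    return text.lower().split()


def bag_of_words_count_based(gt_text, ocr_text):
    """Share of ground-truth word occurrences that are also found in the OCR output."""
    gt, ocr = Counter(words(gt_text)), Counter(words(ocr_text))
    if not gt:
        return 1.0
    matched = sum((gt & ocr).values())  # multiset intersection
    return matched / sum(gt.values())


def bag_of_words_index_based(gt_text, ocr_text):
    """Share of distinct ground-truth words that occur at least once in the OCR output."""
    gt, ocr = set(words(gt_text)), set(words(ocr_text))
    if not gt:
        return 1.0
    return len(gt & ocr) / len(gt)


# Illustrative data: one of two occurrences of "the" is misrecognised.
gt_text = "the cat and the dog"
ocr_text = "the cat and tbe dog"
print(bag_of_words_count_based(gt_text, ocr_text))
print(bag_of_words_index_based(gt_text, ocr_text))
```

The example shows why the two variants can differ: a word that is recognised at least once still counts fully for an index, even if other occurrences are wrong.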
Example: ground truth “OCR is cool” vs. OCR output “OOR is cod”
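The edit-distance measure can be sketched with the two strings from the slide, assuming the first is the ground truth and the second the OCR output. The function names are illustrative, and normalising by the ground-truth length is one common convention rather than something the slide specifies.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(ground_truth, ocr_output):
    """1 - (edit distance / ground-truth length), clipped at 0."""
    if not ground_truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(ground_truth, ocr_output)
    return max(0.0, 1 - errors / len(ground_truth))


distance = edit_distance("OCR is cool", "OOR is cod")
accuracy = character_accuracy("OCR is cool", "OOR is cod")
print(distance, round(accuracy, 3))
```

For this pair the distance is 3 (one substitution in the first word, one substitution and one deletion in the last), giving a character accuracy of about 0.727 over 11 ground-truth characters, while only one of the three words is correct.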
Interpretation of Results
• Metrics
  • Measurements of conditions
  • Types and number of errors
• Scenarios
  • Application context
  • Error weights
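[Diagram: individual errors per type (Miss, Misclass., Merge, Split, False detect.) feed per-type rates, e.g. Merge Rate from M1, M2, M3 and Split Rate from S1, S2, ..., which are combined into an overall Error Rate]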
• Overall success/error rates are based on
• weighted individual results
• type and size of affected regions
• allowable vs. non-allowable errors
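How such an overall rate could be assembled is sketched below. Everything numeric here is hypothetical: the per-type weights, the reduction factor for allowable errors, and the normalisation by total region area are assumptions chosen for illustration, not values from the evaluation tools.

```python
from dataclasses import dataclass


@dataclass
class LayoutError:
    kind: str                # e.g. "merge", "split", "miss"
    area: float              # size of the affected region, in pixels
    allowable: bool = False  # e.g. a merge that does not break reading order


# Hypothetical scenario profile: weight per error type.
SCENARIO_WEIGHTS = {
    "merge": 1.0,
    "split": 0.5,
    "miss": 1.0,
    "misclassification": 0.5,
    "false detection": 0.25,
}
ALLOWABLE_FACTOR = 0.1  # assumed reduction for allowable errors


def weighted_error_rate(errors, total_area, weights=SCENARIO_WEIGHTS):
    """Sum of weight * affected area, relative to the total region area."""
    penalty = 0.0
    for error in errors:
        weight = weights.get(error.kind, 1.0)
        if error.allowable:
            weight *= ALLOWABLE_FACTOR
        penalty += weight * error.area
    return min(penalty / total_area, 1.0)


def success_rate(errors, total_area, weights=SCENARIO_WEIGHTS):
    return 1.0 - weighted_error_rate(errors, total_area, weights)


# Illustrative data only.
errors = [
    LayoutError("merge", 200),
    LayoutError("split", 100, allowable=True),
    LayoutError("miss", 50),
]
print(round(weighted_error_rate(errors, total_area=1000), 3))
print(round(success_rate(errors, total_area=1000), 3))
```

Swapping in a different weight table is what makes the same set of errors score differently from one use scenario to another.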
Use Scenarios
Use Scenario Frequency
Keyword search in full text 82%
Image browsing - provenance (incl. title, year etc) 76%
Phrase search in full text 76%
Information aggregation (linking to related resources) 59%
Metadata-based search 53%
Crowd sourcing (correction/enrichment) 53%
Semantic search (respecting context) 41%
Access via content structure (article tracking, TOC etc.) 41%
Geolocation based services 29%
Print/eBook on demand 29%
Access through mobile Apps 29%
Translation (incl. search term translation for retrieval, historical - modern language) 29%
Content-based image retrieval (textual description and/or image as query ) 29%
Image browsing - categories (like advertisement, image) 24%
Text mining 24%
Search hit highlighting 24%
Content summarisation 18%
Social media integration (and vice versa integration in social media) 18%
Repurposing/Reformatting 18%
Recommendations 12%
Screen reader (text to speech) 12%
Information retrieval (incl. queries in natural language, fuzzy search etc) 12%
Negative search (eliminate results according to unwanted keywords) 6%
Intended/conceivable use scenarios, based on a survey among 17 project partners in the Europeana Newspapers project (2013)
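Because the frequencies come from 17 respondents, each percentage corresponds to a whole number of partners. The short sketch below recovers those counts, assuming the percentages were rounded to the nearest integer; the helper name is illustrative.

```python
PARTNERS = 17  # number of surveyed project partners, as stated on the slide


def approx_partner_count(percent):
    """Number of partners that best matches a rounded percentage."""
    return round(percent / 100 * PARTNERS)


# The distinct frequencies that appear in the table.
frequencies = [82, 76, 59, 53, 41, 29, 24, 18, 12, 6]
counts = {p: approx_partner_count(p) for p in frequencies}
for percent, count in counts.items():
    print(f"{percent}% is about {count} of {PARTNERS} partners")
```

For example, 82% corresponds to 14 of 17 partners and 6% to a single partner.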
PRImA Tools
[Diagram: overview of PRImA tools and data formats centred on PAGE XML (Layout, Text Content). Legend: Tool, Prototype, Data; Java, Web, Command Line, C++]
• Evaluation outputs: Layout Quality, OCR Accuracy
• Evaluation tools: Layout Eval, Text Eval, Dewarping Eval …
• Other tools and prototypes: Aletheia, Web Aletheia, JAletheia, Crowd Prototype, Tesseract Exporter, FineReader Exporter, Typewritten OCR, Segmenter, Converter, Validator, Dewarping, Image Tool, Metadata Extractor, Extractor, Exporter, SimplePageExporter, Sandbox, PAGE to SVG, Optimiser, (PAGE Scanner)
• Data: Document Image, Repositories, PAGE XML, ALTO XML, FineReader XML, Gamera XML, XSD, Snippet, Serialised Text
• Function labels: Layout correspondence, reading order; Validation; Conversion; Filtering; Bag of Words, Character and word accuracy; Threshold, Otsu, Sauvola binarisation; Image and PAGE XML snippets
Ground Truth Production
• Aletheia
  • Page border
  • Print space
  • Layout regions (incl. metadata)
  • Text lines
  • Words
  • Glyphs
  • Unicode text
  • Reading order
  • Layers
• Ground Truth Validator
• FineReader Engine Exporter (Preproduction)
• Ground Truth Converter/Normaliser
Evaluation Tools
• Segmentation, Classification, and Reading Order
• OCR Text
• Deskewing
• Dewarping
• Border Removal
• Binarisation
• Double Page Splitting
Case Studies
Usability and Potential of Existing Material
When to re-process existing material?
• Evolutionary Improvements in OCR Technology
• Specifically Trained OCR Engines
• Dedicated Language and Font Support
• Re-scanning
YMMV – results depend strongly on the material in question
Evolutionary Improvements in OCR Technology
ABBYY FRE 9 vs. FRE 10
[Chart: OCR Performance (Bag of Words). x-axis: OCR Engine (FRE 9, FRE 10); y-axis: Success Rate. Series: count based, index based. Data labels (FRE 9 and FRE 10): 72.7% and 73.7%; 67.4% and 68.7%. Annotated change: +1%]
[Chart: Layout Analysis Performance. x-axis: OCR Engine (FineReader 9, FineReader 10); y-axis: Success Rate. Series: General Recognition; Access via Content Structure; Content-Based Image Retrieval; IMPACT - Text Structure Scenario (no reading order); Keyword Search in Full Text; Phrase Search in Full Text; Print and eBook on Demand. Data labels (FineReader 9 and FineReader 10): 80.9% and 75.5%; 69.6% and 68.0%; 98.5% and 96.4%; 85.1% and 79.0%; 90.5% and 85.1%; 71.8% and 70.8%; 67.9% and 65.9%. Annotated change: -1..-6%]
Specifically Trained OCR Engines
Layout Analysis of Historical Newspapers: off-the-shelf software vs. optimised/trained systems (HNLA2013)
[Chart: y-axis: Success Rate; label: OCR Scenario. Systems: Tesseract 3, FRE 10, EPITA, JOUVE, PAL, Fraunhofer 2013, Fraunhofer 2011. Data labels: 53.3%, 80.3%, 81.4%, 82.0%, 85.2%, 84.3%, 82.4%. Annotated change: +5%]
Dedicated Language and Font Support
Recognition of Blackletter (Fraktur) Documents: Standard vs. Gothic Mode (ABBYY FRE10)
[Chart: OCR Performance (Bag of Words). x-axis: Setting for OCR Engine (Normal, Gothic); y-axis: Success Rate. Series: count based, index based. Data labels (Normal and Gothic): 30.8% and 94.2%; 30.2% and 94.0%. Annotated change: +64%]
[Chart: Layout Analysis Performance. x-axis: FRE10 OCR Engine Setting (Normal, Gothic); y-axis: Success Rate. Series: General Recognition; Access via Content Structure; Content-Based Image Retrieval; IMPACT - Text Structure Scenario (no reading order); Keyword Search in Full Text; Phrase Search in Full Text; Print and eBook on Demand. Data labels (Normal and Gothic): 92.5% and 93.4%; 75.3% and 74.9%; 99.5% and 99.5%; 94.2% and 94.4%; 94.7% and 94.6%; 76.6% and 76.1%; 75.0% and 74.7%]
Re-scanning
OCR results on bi-tonal and re-scanned greyscale images for documents with varying contrast (ABBYY FRE10)
[Chart: OCR Performance (Bag of Words). x-axis: Dataset (Original, Re-scanned); y-axis: Success Rate. Series: count based, index based. Data labels (Original and Re-scanned): 27.7% and 64.2%; 28.0% and 62.9%. Annotated change: +35%]
More information:
PRImA
www.primaresearch.org
Europeana Newspapers
www.europeana-newspapers.eu/