Page 1:

Digital Libraries & Document Image Analysis

Henry S. Baird

Statistical Pattern & Image Analysis research

Information Sciences & Technologies Lab

Page 2:

ICDAR Aug 4, 2003 - HSB

DLs as seen by a DIA Researcher

15 years in DIA R&D

Lucky to have known/collaborated with:
– PARC DL enthusiasts: Masinter, Street, Bloomberg, et al.
– UC Berkeley Digital Library project: Wilensky, Fateman, et al.
– CMU Universal Library project: Thibadeau, Hauptmann, et al.
– Xerox Scanning Service Bureaus: Wallis, et al.
– … many others with an interest in DLs

What challenges do DLs pose to DIA R&D?

Page 3:

Digital Library Dreams

Electronic networked DLs promise to provide:
– more books, journals, etc.
– to more people
– faster
– at more places & times

than physical libraries can hope to…

The Ideal DL: an international, interoperable, sustainable body of rich cultural materials in digital form

Page 4:

Document Images’ Usefulness in DLs

display, print | raster image
+ retrieve (more or less well) | + OCRed text
+ retrieve well, reuse, summarize, translate, … | + correct text
+ Web publishing | + links (e.g. HTML)
+ “semantic web” | + functional tags (e.g. XML)
+ reprinting | + layout format (e.g. RTF)
+ index, catalogue | + metadata (title, author, …)

Page 5:

Advantages of Digital Displays versus Ink-on-Paper

Many…
– networked -- potentially unbounded content
– rapidly rewritable -- supports animation
– radiant -- legible in the dark
– sensitive -- markable, interactive

Generally thought to be overwhelming, but …

Page 6:

Advantages of Ink-on-Paper versus Digital Displays

PAPER | DISPLAYS today | DISPLAYS in future
cheap | expensive | less expensive
large, many | small, few | larger, more
high-resolution | low-resolution | higher-resolution
lightweight | heavy | lighter
thin | thick | thinner
unpowered | powered | lower power
stable | requires maintenance |

eBooks, e-paper, notebooks, laptops, PDAs, …

A. Dillon, “Reading from Paper versus Screen: a critical review of the empirical literature,” Ergonomics 35(10): 1297-1326, 1992.

Page 7:

The fact is, for many uses Paper is Still Widely Preferred

“Paper [remains today] the medium of choice for reading, even when most high-tech [display] technologies are to hand”

— Sellen & Harper (2002)

Why is this? Paper allows:
– flexible navigation through documents
– cross-referencing of several documents
– annotations
– interweaving of reading and writing

A. J. Sellen & R. H. R. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.

Page 8:

Document Images are Doubly Disadvantaged within DLs

They fail to support most uses that symbolically encoded, tagged data do

They lose many key advantages they enjoyed on paper

A Threat: ‘If it’s not in Google, I don’t need it!’

Can they be made as useful in DLs as encoded data?

Can they sometimes work better in DLs than encoded data?

…these are challenges to us, the DIA R&D community.

Page 9:

The British Library ‘The World’s Knowledge’

38.8M items catalogued

website: 18.4M page hits/year

Compare Google:
• >3B pages
• 150M searches/day

“[Reinforcing] the Library’s role as the pre-eminent global document supplier, digital scanning from print and microfilm originals will give researchers rapid, high quality delivery from over 100 million research articles, reports, and conference papers direct to their desktop.”

-- Lynne Brindley, Chief Executive, 2002-2003 Annual Report

Page 10:

Bibliothèque nationale de France

The Digital Library

– digitization of both printed books and graphic material

– primarily in image mode to begin with

– most out-of-copyright

Gallica 2000

– multimedia documents: Middle Ages -> early 20th century

– 35,000 printed volumes: images

– 1000 titles full text

– “one of the largest DLs free of charge on the web”

Page 11:

Million Book DL Project

1M books to be scanned by 2005
– bitonal, 600 dpi

Free-to-read, universally accessible

Searchable by full text (where OCR is possible)
– ABBYY FineReader OCR

Books in public domain or copyrighted but out of print

Fifteen partners:
– US, India, China; est. 4000 person-years of clerical labor
– Multinational, multilingual (mainly English)

20 Tbyte trusted repository

Research testbed for summarization, OCR, automatic extraction of metadata, machine translation

Reddy, Raj and Gloriana St. Clair, “The Million Book Project,” CMU, Dec. 1, 2001.

Page 12:

Google Catalogs

“1000’s” of scanned mail-order catalogs

free for publishers, ‘few days’ turnaround
– for a fee: link products to web sites

free to users: download page images

indexed by: vendor, date, page numbers, etc. (not by full text content)

Page 13:

Amazon.com plan ‘Look Inside the Book II’

~500k books: in-copyright, non-fiction

Scan (full color), OCR cover-to-cover

Full-text search, download sample pages

Free but limited access to page images

Can Google be far behind…?
– search document image files found on Web

David D. Kirkpatrick, “Amazon Plan Would Allow Searching Text of Many Books,” The New York Times, July 21, 2003.

Page 14:

Capturing Document Images

To digitize a book: $4 - $1000 each!

cheaply: bitonal, low quality, mass scanning, …

expensively: color, quality control, custom handling, …

“The Price of Digitization,” Proc., NINCH Symposium (National Initiative for a Networked Cultural Heritage), New York, April 8, 2003.

Breakdown of costs:
– 1/3 cataloging, description, indexing
– 1/3 scanning, OCR, correction, markup
– 1/3 quality control, file maintenance, admin

NOTE: DIA can help with all three

Page 15:

Document Image Capture Operations

Usually, large-scale batch operations

Sometimes destructive:
– cut off spines, discard covers, wear & tear
– hot debate over ‘scan-and-discard’ policies

Image quality standards are often subjective
– usually: “completeness”; no missing pages, text
– seldom: checked for human, machine legibility
– rarely: guaranteed suitable for future uses

Scan once, for ever:
– seldom rescanned (Lesk: “not for 5-10y”)

M. Lesk, Practical Digital Libraries: Books, Bytes, & Bucks. Morgan Kaufmann, San Francisco, CA, 1997.

Page 16:

The PARC Rare Book Scanner

• Bulk scanning w/out damaging books
• Zero force on binding
• Book is open 90 degrees
• Pages turned manually
• 280 dpi
• 9.25” x 11.75” field
• Throughput
  • 8-bit grey 450 pages/h
  • 24-bit color 120 pages/h

Bob Street & Steve Ready, PARC.
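As a back-of-the-envelope illustration of the scanner figures above, the sketch below works out the pixel dimensions of the scan field and the time needed to scan a book in each mode. The resolution, field size, and pages-per-hour values come from the slide; the function names and the 300-page book are assumptions made only for this example.

```python
# Figures quoted on the slide.
DPI = 280
FIELD_INCHES = (9.25, 11.75)   # width, height of the scan field
GREY_PAGES_PER_HOUR = 450      # 8-bit grey
COLOR_PAGES_PER_HOUR = 120     # 24-bit color


def field_pixels(width_in, height_in, dpi):
    """Return the (width, height) of the scan field in pixels."""
    return round(width_in * dpi), round(height_in * dpi)


def scan_hours(pages, pages_per_hour):
    """Return the hours needed to scan `pages` at a given throughput."""
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return pages / pages_per_hour


if __name__ == "__main__":
    book_pages = 300  # hypothetical book length, not from the slide
    width_px, height_px = field_pixels(*FIELD_INCHES, DPI)
    print(f"field: {width_px} x {height_px} pixels")
    print(f"grey:  {scan_hours(book_pages, GREY_PAGES_PER_HOUR):.2f} h")
    print(f"color: {scan_hours(book_pages, COLOR_PAGES_PER_HOUR):.2f} h")
```

Under these assumptions, color capture takes 3.75 times as long as grey capture for the same book, which is the ratio of the two quoted throughputs.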

Page 17:

GUI & IP for Image Capture

• Capturing Metadata
  • automatic page numbering 1,2,3,.../ i,ii,iii,.../ I,II,III,…
  • section labels
  • comments (manual)
• Image Processing
  • performed on the fly:
    • contrast, cleaning, etc.
    • crop, skew-correct
  • processing templates
• Assuring Quality
  • visual inspection
• Calibration
  • color test targets
  • per-pixel gain/offset map
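The slide does not say how the per-pixel gain/offset map is built or applied. One common way to use such a map is flat-field correction, sketched below under that assumption: the offset is taken from a dark reference frame, the gain from a white reference frame, and every raw pixel is corrected as `(raw - offset) * gain`. The function names, the reference frames, and the 8-bit target value are all hypothetical.

```python
def build_gain_offset(dark, white, target=255.0):
    """Derive per-pixel offset and gain maps from two reference frames.

    dark  -- frame captured with no light (sensor bias per pixel)
    white -- frame captured from a uniform white target
    """
    offset = [row[:] for row in dark]
    gain = [
        # Guard against dead pixels where white is not brighter than dark.
        [target / (w - d) if w > d else 1.0 for d, w in zip(dark_row, white_row)]
        for dark_row, white_row in zip(dark, white)
    ]
    return gain, offset


def apply_gain_offset(raw, gain, offset, max_value=255):
    """Correct a raw frame pixel by pixel, clamping to the valid range."""
    corrected = []
    for raw_row, gain_row, offset_row in zip(raw, gain, offset):
        corrected.append([
            min(max_value, max(0, round((r - o) * g)))
            for r, g, o in zip(raw_row, gain_row, offset_row)
        ])
    return corrected


if __name__ == "__main__":
    dark = [[10, 20]]     # hypothetical calibration frames
    white = [[210, 220]]
    gain, offset = build_gain_offset(dark, white)
    print(apply_gain_offset(white, gain, offset))
    print(apply_gain_offset(dark, gain, offset))
```

With this scheme, a pixel that reads its dark-frame value maps to 0 and one that reads its white-frame value maps to the target, so uneven illumination and sensor response are evened out across the field.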

Page 18:

DIA R&D for Image Quality Control

Measuring document image quality

– new test target designs
– image processing algorithms
– rigorous, quantitative standards

Assuring quality
– fast algorithms for on-the-fly image quality estimation

Predicting human & machine legibility

What image quality features correlate well with human and OCR legibility? … and with other, later DIA tasks?

K. Summers, “Document Image Improvement for OCR as a Classification Problem,” Proc., DR&RX, Santa Clara, CA, Jan 2003.

E. H. Barney Smith & X. Qiu, “Relating Statistical Image Differences & Degradation Features,” Proc., 5th DAS, Princeton, NJ, Aug 2002.
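One way to approach the question of which image quality features correlate with legibility is to compute a candidate feature for each page and measure its correlation with the OCR accuracy observed on that page. The sketch below is only an illustration of that workflow: the speckle-fraction feature is a hypothetical choice that the slides do not name, and the Pearson coefficient is a generic statistic, not a method attributed to the cited papers.

```python
from math import sqrt


def speckle_fraction(image):
    """Fraction of black pixels (1s) with no black 8-neighbour.

    `image` is a bitonal page as a list of rows of 0/1 values.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    black = isolated = 0
    for y in range(height):
        for x in range(width):
            if image[y][x] != 1:
                continue
            black += 1
            has_neighbour = any(
                image[ny][nx] == 1
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                isolated += 1
    return isolated / black if black else 0.0


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)
```

In use, `speckle_fraction` would be evaluated on each scanned page and `pearson` applied to the resulting feature values paired with measured per-page OCR accuracies; a strongly negative coefficient would suggest the feature is worth checking on the fly.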

Page 19:

When Quality Control Goes Wrong

Front Page, 1852 Edition of the New York Times

The Historical New York Times Project, CMU/NYT, 1999.

Scanned from microfilm.

Page 20:

Extracting & Recognizing Content

These are central DIA R&D goals

But existing doc image understanding systems cannot guarantee high accuracy across the full range of documents:

- typefaces, h/w styles
- image qualities
- layout geometries
- writing systems
- languages
- domains of discourse

DL’s scholarly & historical docs are often harder:

- old fashioned typefaces, h/w styles
- poor & variable image qualities
- deformed layout geometries
- obsolete writing systems
- rare languages
- arcane domains of discourse

S. Rice, G. Nagy, T. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers: 1999.

Page 21:

Rare Botanical Reference Book

• Jepson’s A Flora of California, 1943.

• Authoritative, still in demand by scholars

• Only a few copies are left

• Difficult to OCR well

• Scanned at PARC, all page images put on the Univ. California, Berkeley Digital Library website

Richly Meaningful Typographical Book Designs

Page 22:

Cut into Word-box Images: layout analysis without OCR

Page 23:

Reflow Word Boxes into Textlines to Fit the Display Geometry

T. Breuel, W. Janssen, K. Popat, H. Baird, “Paper to PDA,” Proc., ICPR, Quebec City, 2002.
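The slides do not give the reflow algorithm itself. As a minimal sketch of the idea, assuming the page has already been cut into word-box images in reading order, a greedy line filler places each box on the current textline until the next one would overflow the display width, then starts a new line. The `WordBox` type, the gap parameters, and the top-aligned placement (a real system would likely align baselines) are simplifying assumptions for illustration, not the method of the cited paper.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    """Size in pixels of one word image cut from the page."""
    width: int
    height: int


def reflow(words, display_width, word_gap=4, line_gap=2):
    """Greedily place word boxes into textlines.

    Returns one (x, y) top-left position per word, in reading order.
    A word wider than the display is placed alone on its own line.
    """
    positions = []
    x = y = 0
    line_height = 0
    for box in words:
        # Break the line if this box would overflow and the line is not empty.
        if x > 0 and x + box.width > display_width:
            y += line_height + line_gap
            x = 0
            line_height = 0
        positions.append((x, y))
        x += box.width + word_gap
        line_height = max(line_height, box.height)
    return positions


if __name__ == "__main__":
    words = [WordBox(30, 10), WordBox(40, 12), WordBox(50, 10)]
    print(reflow(words, display_width=80))
```

Because only the boxes move and their pixels are untouched, this kind of layout introduces no recognition errors; any mistakes come from the word segmentation and reading order supplied as input.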

Page 24:

Make Doc-Images Highly Portable, Legible Everywhere

No OCR errors! (Only layout errors.)

Preserve meaningful appearance

Challenges:
– reading order
– non-text
– navigation
– linking

Page 25:

For Text seems feasible

– Summarization of doc images w/out OCR

– Outlining, condensing, linking

– Reflowing tables

For Non-text seems dauntingly hard

– Mathematics

– Chemical formulae

– Line-art drawings

– Graphics generally

Other ‘Pure-Image’ DIA for DLs

Not Dependent on Accurate Recognition

Vitally important to try since recognition & encoding are highly problematic

Page 26:

Personal Digital Libraries

People are beginning to
– collect
– manage
– share

their own small DLs

Scanned & encoded documents, mixed together

How to assist ‘productive reading’

These users lack specialized skills

DIA tools need to be deskilled to a clerical level

… and to work together far better

Thanks to: Jon Hull et al, Ricoh Innovations; Robert Wilensky et al, UC Berkeley; Larry Spitz, DocRec; Kris Popat et al, PARC.

Page 27:

Interactive Digital Libraries

Today’s DIA tools leave many errors in recognition, encoding, tagging, etc.

How can these mistakes affordably be fixed?

Invite volunteer help:
– e.g. Gutenberg Project, Open Mind Initiative

Challenge: provide interactive tools to
– accept corrections on-line
– enforce review, verification
– efficiently make the most of every correction
– DIA tools able to benefit from correction

Thanks to: George Nagy, David Stork, Dan Lopresti.
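To make the challenge of accepting on-line corrections while enforcing review more concrete, here is a minimal sketch of one possible policy: a proposed correction is accepted only after a set number of distinct volunteers have submitted the same text for the same item. The class, its method names, and the two-vote threshold are hypothetical; the slides do not prescribe any particular verification scheme.

```python
from collections import defaultdict


class CorrectionQueue:
    """Accept a correction once enough distinct users agree on it."""

    def __init__(self, votes_needed=2):
        self.votes_needed = votes_needed
        # (item_id, proposed_text) -> set of user ids who proposed it
        self._supporters = defaultdict(set)
        # item_id -> accepted text
        self.accepted = {}

    def submit(self, item_id, proposed_text, user_id):
        """Record a proposal; return True if this text is now accepted."""
        if item_id in self.accepted:
            # Already settled: report whether the proposal matches.
            return self.accepted[item_id] == proposed_text
        supporters = self._supporters[(item_id, proposed_text)]
        supporters.add(user_id)  # repeat votes by one user count once
        if len(supporters) >= self.votes_needed:
            self.accepted[item_id] = proposed_text
            return True
        return False


if __name__ == "__main__":
    queue = CorrectionQueue(votes_needed=2)
    print(queue.submit("word-17", "flora", "alice"))
    print(queue.submit("word-17", "flora", "carol"))
    print(queue.accepted)
```

Accepted corrections collected this way could also serve as labeled examples for retraining, which is one reading of the last bullet about DIA tools benefiting from correction.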

Page 28:

Collaborative DLs: DIA for the Masses

Enable non-professionals to collaborate in improving, manually, on the best that automatic DIA tools can do, e.g.
– one person may correct thresholding
– another corrects OCR errors
– yet another adds tags

Offer DIA tools downloadable from the web, possibly under GPL-like licenses

Dimp? A document image processing toolkit, interoperable via common data structures & file formats

Thanks to: Tom Breuel, Kris Popat, Bill Janssen.

Page 29:

DIA R&D Opportunities for DLs

Making Document Images as Useful as Symbolically Encoded Data

Image capture, quality control

Image improvement, rectification, etc

Content extraction, recognition, & analysis

Legibility, presentation, reflowing

Markup, indexing, retrieval, summarization

Personal & interactive DLs

Offering DIA tools to DL users

… many more, no doubt

Page 30:

An Urgent Responsibility?

Vast, irreplaceable, culturally vital legacy collections of paper documents are competing ineffectively for attention with billions of digital documents

Thus paper archives are threatened with neglect, perceived irrelevance, … & eventually, oblivion?

The DIA community is uniquely qualified to help the DL community rescue them.

Page 31:

Contact

Henry S. Baird

Statistical Pattern & Image Analysis

[email protected]/baird

+1-650-812-4481 FAX –4374