A Ground-Truthing Engine for Proofsetting, Publishing, Re-Purposing and Quality Assurance

Steven Simske and Margaret Sturgill
Imaging Systems Laboratory, HP Labs

NOVEMBER 21, 2003



Page 2: ProofingEngineScreenCompressed

page 2: Steven Simske and Margaret Sturgill, DocEng2003 Conference

Overview of the Digital Library Creation and Fulfillment Process

Diagram components:

• Paper-based Corpus

• POD Server

• Workflow Server

• Web-Enabled Device

• Computer / Browser

• Customized Application

• Mill Servers

• Digital Library Generation: automated conversion of TIFF to PDF and XML


Page 3: ProofingEngineScreenCompressed


MIT Press Digital Library Creation Specifics

• 4000 Out-of-print books—1.2 million+ pages

• 260+ books placed in Cognet (cognitive network) web community

• The rest available for purchase via MIT Press Classics site

Scan (400 ppi, 8- or 24-bit)

Zoning Analysis

OCR

PDF and XML Output

Zoning flowchart (START to STOP). Each automated attempt is followed by a "Passes AutoQA?" check: YES goes to STOP, NO goes on to another attempt.

• AnalysisNew(), AnalysisAlterRegionPresentationAfterZoning(): mitprocess -0 outdir indir

• AnalysisForceManhattanRegions(): mitprocess -rects -o outdir indir

• AnalysisForceManhattanTextAndGrayRegions(): mitprocess -rectgray -0 outdir indir

• AnalysisForceRectangularPageRegion(): mitprocess -1bw -0 outdir indir

• AnalysisForceGrayRectangularPageRegion(): mitprocess -1gray -0 outdir indir

• Use the Ground Truthing application on the file to input desired regions via an XML file: mitprocess -gtxml -0 outdir indir
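The fallback flow on this slide can be sketched as a loop that tries each zoning mode in turn and stops at the first output that passes AutoQA. This is a minimal sketch, not the authors' implementation: the order of the attempts and the `run_mode` and `passes_auto_qa` hooks are assumptions made for illustration; only the flag strings come from the slide.

```python
from typing import Callable, Optional, Sequence

# Flag strings as they appear on the slide; the ORDER here is an assumption.
ZONING_MODES: Sequence[str] = ("", "-rects", "-rectgray", "-1bw", "-1gray")


def first_passing_mode(
    run_mode: Callable[[str], object],
    passes_auto_qa: Callable[[object], bool],
    modes: Sequence[str] = ZONING_MODES,
) -> Optional[str]:
    """Return the first mode whose output passes AutoQA, or None.

    run_mode and passes_auto_qa are hypothetical hooks standing in for the
    zoning run and the automated QA check.
    """
    for mode in modes:
        result = run_mode(mode)
        if passes_auto_qa(result):
            return mode
    return None  # caller falls back to manual ground truthing (-gtxml)
```

A `None` result corresponds to the pages that reach the ground-truthing application.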

Page 4: ProofingEngineScreenCompressed


OCR Multiple-Engine Combination

• OCR Engine 1, OCR Engine 2, OCR Engine 3: the output of each engine is reordered on BBox

• TEXT ALIGNMENT and VOTING: the reordered texts are combined into a single Text output

• Other diagram labels: Image, Grayscale, Color
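The voting stage can be illustrated with a per-position majority vote. This is only a sketch under stated assumptions: the slide does not describe the alignment or voting rules, so the code assumes the engines' tokens have already been aligned into lists of equal length, and the tie-breaking rule (earliest engine wins) is an arbitrary choice.

```python
from collections import Counter
from typing import List, Sequence


def vote(aligned: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per position over equally long, pre-aligned token lists.

    Ties go to the earliest engine in the input order (an arbitrary choice).
    """
    if len({len(tokens) for tokens in aligned}) != 1:
        raise ValueError("token lists must be aligned to the same length")
    result = []
    for column in zip(*aligned):
        counts = Counter(column)
        best = max(counts.values())
        # First token in engine order that reaches the top count.
        result.append(next(tok for tok in column if counts[tok] == best))
    return result
```

With three engines, a single engine's misread at a given position is outvoted by the other two.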

Page 5: ProofingEngineScreenCompressed


OCR Results

Statistics on 20 CogNet book pages

                               Scansoft    Iris   Abbyy   Combination
Error rate (x 10^-3)              11.29    4.50    2.86   1.67 (41.6% drop against Abbyy)
Standard deviation (x 10^-3)      16.40    4.56    4.62   3.22 (30.3% drop against Abbyy)

The Iris results are due to its SLP (statistical language processing) techniques, which improve the output
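The quoted percentage drops can be checked with a short calculation. It assumes that 2.86 and 4.62 are the Abbyy figures and 1.67 and 3.22 the combination figures, which is the pairing that reproduces the percentages on the slide.

```python
def percent_drop(baseline: float, combined: float) -> float:
    """Relative reduction of `combined` against `baseline`, in percent."""
    return 100.0 * (baseline - combined) / baseline


error_drop = percent_drop(2.86, 1.67)  # error rates, x 10^-3
std_drop = percent_drop(4.62, 3.22)    # standard deviations, x 10^-3
print(f"{error_drop:.1f}% {std_drop:.1f}%")  # prints "41.6% 30.3%"
```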

Page 6: ProofingEngineScreenCompressed


Overall Process:

PDF Generation

Zoning-Based PDF: 135 kB

• Zoning Analysis: 1-4 sec

• PDF creation: 1-2 sec

• Compression: 63X

Zoning-Based PDF w/ searchable text (in blue box): 140 kB

• Zoning Analysis: 1-4 sec

• OCR: 2-10 sec

• PDF creation: 1-2 sec

• Compression: 61X

Page 7: ProofingEngineScreenCompressed


Rationale for the Ground Truth Engine (GTE)

After the automated quality assurance (QA) step, 0.8% of the documents still need some editorial changes

This amounts to nearly 10,000 documents

Additionally, visual QA is performed through rapid PDF output screening—another 1-2% of pages fail Visual QA

Nearly 30,000 pages were “imperfect” after QA, and needed some editing

Here is the GTE…!

Page 8: ProofingEngineScreenCompressed


Ground-Truthing Engine Features

- Simple UI tools for specifying zoning & re-purposing

- Imposed region manager

- Integrated analysis engine matching that in the production (POD) system

- XML schema for zoning & layout description

Java version .NET version

We use this one to describe the design!

Page 9: ProofingEngineScreenCompressed


File Support

Standard Java JAI and ImageIO tools

Standard .NET System::Drawing::Bitmap tools

Raster auto-scaled upon opening to fit into screen

Fast resize via command menu

Page 10: ProofingEngineScreenCompressed


Zoning Engine-Based Region Generation

Regions can be auto-generated with the same zoning engine as the primary POD application to allow direct referencing to the error case reports

Regions can also be generated “on the fly” using HP “Click and Select” technologies

Page 11: ProofingEngineScreenCompressed


Zoning Engine-Based Region Generation: Click and Select Capture Model

Statistical Model for Zoning Analysis

If the set of all region classification types is $C$, where “text” = $C_1$, “drawing” = $C_2$, “photo” = $C_3$, “table” = $C_4$, etc., then for all $C_1 \ldots C_N$ ($N$ = number of region types possible), each region $R_j \in R$, where $R$ is the set of all $M$ regions formed during segmentation, is assigned probabilities $p(C_i)$, such that:

$$\forall R_j \in R,\; j = 0 \ldots M: \quad \sum_{i=0}^{N} p_j(C_i) = 1.0$$

That is, the given region $R_j$ has a summed probability of 1.0, which represents its relative probabilities over all region types. The differential statistic, $DS$, for region $x$ with respect to classification $y$ is given by:

$$DS(x|y) = p_x(C_y) - \max_{i \neq y} p_x(C_i)$$
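As a worked illustration of the differential statistic, the sketch below computes DS for one region from its class probabilities. The function name, the dictionary layout and the example probabilities are assumptions for illustration and are not taken from the slides.

```python
from typing import Mapping


def differential_statistic(probs: Mapping[str, float], label: str) -> float:
    """DS(x|y): p_x(C_y) minus the largest probability among the other classes."""
    if abs(sum(probs.values()) - 1.0) > 1e-9:
        raise ValueError("class probabilities must sum to 1.0")
    others = [p for name, p in probs.items() if name != label]
    return probs[label] - max(others)


# Hypothetical region: mostly text, with some weight on drawing and photo.
region = {"text": 0.7, "drawing": 0.2, "photo": 0.1}
ds_text = differential_statistic(region, "text")    # 0.7 - 0.2
ds_photo = differential_statistic(region, "photo")  # 0.1 - 0.7
```

A positive DS means the chosen classification is the most probable one for the region; a negative DS means some other type is more probable.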

Click and Select UI Tool is Based on the Statistical Model


Page 12: ProofingEngineScreenCompressed


UI-Based Polygonal Region Generation

Regions can be generated with a “Drag and Click” UI tool that allows successive clicks to define the region vertices

With advanced region management capabilities, this can be used to generate outlining regions

Page 13: ProofingEngineScreenCompressed


UI-Based Polygon Region Example

After definition of two polygonal text regions

These regions are now individual objects, represented as such in both Java/C# and XML

Objects could be “dragged” with an extension to auto-layout

Page 14: ProofingEngineScreenCompressed


Rectangular Region Definition (Rubber-banding)

Regions can be generated with a “Drag and Release” UI tool that allows the user to click to define a corner of the rectangle and then to drag to the far corner of the rectangle and release

Usually set to the default mouse mode
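Rubber-banding reduces to building a rectangle from the press point and the release point. The sketch below shows one way to do that so the result is the same whichever direction the user drags; the `Rect` type and the function name are hypothetical and not part of the tool described here.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def rubber_band(press: Tuple[int, int], release: Tuple[int, int]) -> Rect:
    """Build a normalized rectangle from press and release points."""
    (x0, y0), (x1, y1) = press, release
    # min/max makes the result independent of the drag direction.
    return Rect(min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))
```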

Page 15: ProofingEngineScreenCompressed


Re-Classification of Regions

Regions can be Re-classified with the Intuitive “Right Click” + “Pop-Up” Motif

For this project, the region types were {Text, Drawing, Photo, Table and Equation}

Rendering of the Classification Types:

- Text at 400 ppi, 1-bit

- Drawing at 400 ppi, 8- or 24-bit

- Table and Equation at 400 ppi, 1-bit

- Photo at 200 ppi, 8- or 24-bit
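The rendering rules on this slide amount to a lookup from region type to resolution and allowed bit depths. A minimal sketch of that lookup follows; the values are those on the slide, while the table layout and the function name are assumptions.

```python
from typing import Dict, Tuple

# (resolution in ppi, allowed bit depths) per region type, as listed on the slide.
RENDERING: Dict[str, Tuple[int, Tuple[int, ...]]] = {
    "Text": (400, (1,)),
    "Drawing": (400, (8, 24)),
    "Table": (400, (1,)),
    "Equation": (400, (1,)),
    "Photo": (200, (8, 24)),
}


def rendering_for(region_type: str) -> Tuple[int, Tuple[int, ...]]:
    """Look up the rendering parameters after a region is re-classified."""
    try:
        return RENDERING[region_type]
    except KeyError:
        raise ValueError(f"unknown region type: {region_type!r}") from None
```

Re-classifying a region from Text to Photo, for example, changes its rendering from 400 ppi, 1-bit to 200 ppi at 8 or 24 bits.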

Page 16: ProofingEngineScreenCompressed


Changing Region Bit Depth


Bit-Depths of the “Region Modalities”:

- BW = Black and White, 1-bit per pixel

- Gray = 8-bit per pixel

- Color = 24-bit per pixel, sRGB

Page 17: ProofingEngineScreenCompressed


Context-Sensitive Right-Click


Several Context-Sensitive Delete Options for Ease of Editing

- “Delete All”: Start Over (Often After Rescale)

- “Delete Previous”: Undo Delete

- “Delete Nearest”: For Tricky Layouts (Often Before Rescale)

Page 18: ProofingEngineScreenCompressed


Region Manager: No Overlap

Regions are Described by BBox, Polygon and Scan Line Segment Representation, Allowing Efficacious “Scale-Up” in Thoroughness of Comparison

Any Overlap Triggered an Error Flag (to Prevent Later Print Job Errors)

Overlap May Be Allowed in Other Publishing Applications, such as VDP
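The coarsest level of the overlap comparison, the bounding-box test, can be sketched as follows. This is an illustrative sketch rather than the region manager itself: it covers only the BBox level, the polygon and scan-line checks that would refine a positive result are omitted, and treating boxes that merely touch as non-overlapping is an assumption.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom)


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """True if the two boxes share interior area (touching edges do not count)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def overlapping_pairs(boxes: Sequence[BBox]) -> List[Tuple[int, int]]:
    """Index pairs whose bounding boxes overlap; candidates for a finer check."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if bboxes_overlap(a, b)
    ]
```

Only pairs that fail this cheap test would need the more thorough polygon or scan-line comparison.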

Page 19: ProofingEngineScreenCompressed


Region Manager: No Crossing of Line Segments During the Creation of Vertices

Sequential Vertices are Compared to all Previous Vertices for Intersection

Any Intersecting Vertices Disallow the Region Formation

This Prevents Potential “Region Fragmentation” Upon Scaling to Original Resolution

This Region Management Rule is Generally Applicable to Prevent Ill-Shaped Regions
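One way to enforce the no-crossing rule is to test the edge created by each new vertex against every earlier edge of the polygon. The sketch below uses the standard orientation (cross product) test. It is an assumption about how such a check could be written, not the authors' code; it detects only proper crossings (not collinear overlaps or touching endpoints) and does not check the final closing edge.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Cross product: >0 if a->b->c turns left, <0 if right, 0 if collinear."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if the two segments cross at a point interior to both."""
    d1, d2 = _orient(q1, q2, p1), _orient(q1, q2, p2)
    d3, d4 = _orient(p1, p2, q1), _orient(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def can_add_vertex(vertices: List[Point], new: Point) -> bool:
    """Check the edge from the last vertex to `new` against all earlier edges."""
    if len(vertices) < 3:
        return True
    last = vertices[-1]
    # Skip the edge ending at `last`: it shares an endpoint with the new edge.
    return not any(
        segments_cross(vertices[i], vertices[i + 1], last, new)
        for i in range(len(vertices) - 2)
    )
```

A click for which `can_add_vertex` returns False would be rejected, so the polygon never becomes self-intersecting while it is being drawn.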

Page 20: ProofingEngineScreenCompressed


Design Principles

SPEED OF ZONING ENGINE

ACCURACY OF ZONING ANALYSIS ENGINE

SIMPLICITY OF UI

GROUND TRUTHING

The zoning engine was further tuned to run on low-resolution data so that the analysis would run at the same nominal ppi (75) as the later QA proofing

The Ground Truthing Engine is the last step in the otherwise automatic generation of 1.2 million PDFs from TIFF

Smart scaling (integral) is performed on file load, which in most cases allows boundary information to be maintained after scaling

Zoning engine for document analysis matches that of the Ground Truthing Engine for maximum throughput

Redundancy of keys, menus, tool tips for fast training
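The slides do not define "integral" scaling. One plausible reading, assumed here, is that the raster is downscaled by a whole-number factor so that region boundaries drawn on screen map back to original pixels by simple multiplication. The function names and the example sizes below are hypothetical.

```python
import math
from typing import Tuple


def integral_scale(image_size: Tuple[int, int], screen_size: Tuple[int, int]) -> int:
    """Smallest integer downscale factor that makes the image fit the screen."""
    (w, h), (sw, sh) = image_size, screen_size
    return max(1, math.ceil(w / sw), math.ceil(h / sh))


def to_original(point: Tuple[int, int], factor: int) -> Tuple[int, int]:
    """Map a screen-space point back to original-resolution pixels."""
    return (point[0] * factor, point[1] * factor)


# Example: a 3400 x 4400 pixel scan shown on a 1280 x 1024 screen.
factor = integral_scale((3400, 4400), (1280, 1024))
```

Because the factor is an integer, a boundary placed on a screen pixel lands on an exact pixel boundary in the original scan.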

Page 21: ProofingEngineScreenCompressed


Throughput Results

• 94.6% of pages passed AutoQA after the original zoning analysis

• 99.2% of pages passed AutoQA after the iterative attempts at logical zoning analysis

• 97.7% of pages passed both AutoQA and VisualQA. Failures in the latter were generally due to bit depth mismatch for poorly-exposed originals

• Thus, 2.3%, or 27,500 pages, needed to be Ground Truthed for full re-purposing (All text OCR’d, all drawings and photos saved in best format, etc.)

• After 30 minutes of training, 2 pages/minute is an easy rate to proof, even while multi-tasking (on phone, email): this means less than 6 human-weeks for the entire set of 1.2 x 10^6 pages

• This is 2 x 10^5 pages “cleaned” per human-week of work
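The throughput figures can be reproduced with a short calculation. It assumes a 40-hour human-week, which the slides do not state; the unrounded page count comes out at 27,600, which the slide quotes as 27,500.

```python
TOTAL_PAGES = 1_200_000
FAIL_FRACTION = 0.023    # 100% - 97.7% of pages passing both QA steps
PAGES_PER_MINUTE = 2
HOURS_PER_WEEK = 40      # assumption: one human-week = 40 working hours

pages_to_fix = TOTAL_PAGES * FAIL_FRACTION
weeks = pages_to_fix / PAGES_PER_MINUTE / 60 / HOURS_PER_WEEK
pages_per_week = TOTAL_PAGES / 6  # using the slide's bound of 6 human-weeks

print(f"{pages_to_fix:.0f} pages, {weeks:.2f} weeks, {pages_per_week:.0f} pages/week")
```

Under these assumptions the manual work comes to about 5.75 human-weeks, consistent with "less than 6".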

Page 22: ProofingEngineScreenCompressed

Questions…?