A Ground-Truthing Engine for Proofsetting, Publishing, Re-Purposing and Quality Assurance
Steven Simske and Margaret Sturgill
Imaging Systems Laboratory, HP Labs
NOVEMBER 21, 2003
Steven Simske and Margaret Sturgill, DocEng2003 Conference
Overview of the Digital Library Creation and Fulfillment Process
[Diagram labels: Paper-based Corpus; POD Server; Workflow Server; Web-Enabled Device; Computer / Browser; Customized Application; Mill Servers]
Digital Library Generation: Automated conversion of TIFF to PDF and XML
MIT Press Digital Library Creation Specifics
• 4000 Out-of-print books—1.2 million+ pages
• 260+ books placed in CogNet (cognitive network) web community
• The rest available for purchase via MIT Press Classics site
Scan (400 ppi, 8- or 24-bit) → Zoning Analysis → OCR → PDF and XML Output
[Flowchart: iterative zoning analysis from START to STOP. Each automated attempt below is followed by a “Passes AutoQA?” check; YES leads to STOP, NO leads to another attempt.]
• AnalysisNew(), AnalysisAlterRegionPresentationAfterZoning(): mitprocess -0 outdir indir
• AnalysisForceManhattanRegions(): mitprocess -rects -o outdir indir
• AnalysisForceManhattanTextAndGrayRegions(): mitprocess -rectgray -0 outdir indir
• AnalysisForceRectangularPageRegion(): mitprocess -1bw -0 outdir indir
• AnalysisForceGrayRectangularPageRegion(): mitprocess -1gray -0 outdir indir
• Use the Ground-Truthing Application on the file to input desired regions via an XML file: mitprocess -gtxml -0 outdir indir
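The retry logic of this flowchart can be sketched as a loop over zoning attempts. This is a minimal sketch, not the production code: `run_attempt` and `passes_auto_qa` are hypothetical callables, the flags are the mitprocess options shown on the slide, and the order of attempts is an assumption.

```python
from typing import Callable, Optional, Sequence

# mitprocess options from the slide; "" stands for the default analysis.
# The order of the attempts is an assumption made for illustration.
ATTEMPT_FLAGS = ["", "-rects", "-rectgray", "-1bw", "-1gray"]


def first_passing_attempt(
    page: str,
    flags: Sequence[str],
    run_attempt: Callable[[str, str], object],
    passes_auto_qa: Callable[[object], bool],
) -> Optional[str]:
    """Return the flag of the first attempt that passes AutoQA.

    Returns None when every automated attempt fails, i.e. the page
    is left for manual ground truthing.
    """
    for flag in flags:
        result = run_attempt(page, flag)  # hypothetical: one zoning pass
        if passes_auto_qa(result):        # hypothetical: automated QA check
            return flag
    return None


# Toy stand-ins so the sketch runs without the real tools.
def fake_run(page: str, flag: str) -> tuple:
    return (page, flag)


def fake_qa(result: tuple) -> bool:
    return result[1] == "-rectgray"


print(first_passing_attempt("page0001.tif", ATTEMPT_FLAGS, fake_run, fake_qa))
```

Returning `None` rather than raising keeps the "needs ground truthing" case an ordinary outcome that a batch driver can count.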
OCR Multiple-Engine Combination
[Diagram: the output of each of three OCR engines (OCR Engine 1, 2, 3) is reordered on bounding box (BBox); the resulting text streams pass through TEXT ALIGNMENT and VOTING to give the combined text. Other labels: Image, Grayscale, Color.]
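The voting stage can be illustrated with a per-position majority vote. This is a minimal sketch that assumes the three engines' tokens have already been aligned to equal length; the function name `vote` and the fallback rule are illustrative choices, not details from the slides.

```python
from collections import Counter
from typing import List, Sequence


def vote(aligned: Sequence[Sequence[str]]) -> List[str]:
    """Majority vote per position over already-aligned token lists.

    If no token appears more than once at a position, the first
    engine's token is kept (an arbitrary fallback for this sketch).
    """
    if len({len(tokens) for tokens in aligned}) != 1:
        raise ValueError("token lists must be aligned to equal length")
    combined = []
    for column in zip(*aligned):
        token, count = Counter(column).most_common(1)[0]
        combined.append(token if count > 1 else column[0])
    return combined


engine_1 = ["Ground", "truthing", "eng1ne"]
engine_2 = ["Ground", "truthinq", "engine"]
engine_3 = ["Gronnd", "truthing", "engine"]
print(vote([engine_1, engine_2, engine_3]))
```

Each engine makes a different error here, so every position still has two agreeing tokens and the vote recovers the clean text.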
OCR Results
Statistics on 20 CogNet book pages

                               Scansoft   Iris   Abbyy   Combination
Error rate (×10⁻³)             11.29      4.50   2.86    1.67 (41.6% drop against Abbyy)
Standard deviation (×10⁻³)     16.40      4.56   4.62    3.22 (30.3% drop against Abbyy)
Iris results due to SLP (statistical language processing) techniques to improve output
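The quoted drops can be checked with a one-line relative-change calculation. The sketch below assumes the Abbyy figures are 2.86 (error rate) and 4.62 (standard deviation), both ×10⁻³, which is the reading consistent with the quoted percentages; `relative_drop` is an illustrative helper name.

```python
def relative_drop(baseline: float, combined: float) -> float:
    """Percentage reduction of `combined` relative to `baseline`."""
    return (baseline - combined) / baseline * 100.0


# Error rate: Abbyy 2.86 vs. combination 1.67 (both x 10^-3)
print(round(relative_drop(2.86, 1.67), 1))
# Standard deviation: Abbyy 4.62 vs. combination 3.22 (both x 10^-3)
print(round(relative_drop(4.62, 3.22), 1))
```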
Overall Process: PDF Generation
• Zoning-Based PDF: 135 kB (Compression: 63X)
  Zoning Analysis: 1-4 sec; PDF creation: 1-2 sec
• Zoning-Based PDF w/ searchable text (in blue box): 140 kB (Compression: 61X)
  Zoning Analysis: 1-4 sec; OCR: 2-10 sec; PDF creation: 1-2 sec
Rationale for the Ground Truth Engine (GTE)
After the automated quality assurance (QA) step, 0.8% of the documents still need some editorial changes
This amounts to nearly 10,000 documents
Additionally, visual QA is performed through rapid PDF output screening—another 1-2% of pages fail Visual QA
Nearly 30,000 pages were “imperfect” after QA, and needed some editing
Here is the GTE…!
Ground-Truthing Engine Features
- Simple UI tools for specifying zoning & re-purposing
- Imposed region manager
- Integrated analysis engine matching that in the production (POD) system
- XML schema for zoning & layout description
Java version .NET version
We use this one to describe the design!
File Support
Standard Java JAI and ImageIO tools
Standard .NET System::Drawing::Bitmap tools
Raster auto-scaled upon opening to fit into screen
Fast resize via command menu
Zoning Engine-Based Region Generation
Regions can be auto-generated with the same zoning engine as the primary POD application to allow direct referencing to the error case reports
Regions can also be generated “on the fly” using HP “Click and Select” technologies
Zoning Engine-Based Region Generation: Click and Select Capture Model
Statistical Model for Zoning Analysis
If the set of all region classification types is $C$, where “text” = $C_1$, “drawing” = $C_2$, “photo” = $C_3$, “table” = $C_4$, etc., then for all $C_1 \ldots C_N$ ($N$ = number of region types possible), each region $R_j \in \mathcal{R}$, where $\mathcal{R}$ is the set of all $M$ regions formed during segmentation, is assigned probabilities $p_j(C_i)$, such that:
$$\forall R_j \in \mathcal{R},\ j = 1 \ldots M: \quad \sum_{i=1}^{N} p_j(C_i) = 1.0$$
That is, the given region $R_j$ has a summed probability of 1.0, which represents its relative probabilities over all region types. The differential statistic, DS, for region $x$ with respect to classification $y$ is given by:
$$DS(x|y) = p_x(C_y) - \max_{i \neq y} p_x(C_i)$$
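As a worked illustration of the differential statistic, read as the probability of class y minus the largest probability among all other classes, the sketch below computes DS for one region. The probability values and the function name are made up for the example.

```python
import math
from typing import Mapping


def differential_statistic(probs: Mapping[str, float], y: str) -> float:
    """DS(x|y): p_x(C_y) minus the largest p_x(C_i) over all i != y.

    `probs` holds one region's class probabilities and must sum to 1.0.
    """
    if not math.isclose(sum(probs.values()), 1.0, abs_tol=1e-9):
        raise ValueError("class probabilities must sum to 1.0")
    best_other = max(p for label, p in probs.items() if label != y)
    return probs[y] - best_other


# Illustrative probabilities for a single region (not from the slides).
region = {"text": 0.7, "drawing": 0.2, "photo": 0.1}
print(differential_statistic(region, "text"))     # positive: text leads
print(differential_statistic(region, "drawing"))  # negative: text beats drawing
```

A positive DS means the region's most likely class is y, and its magnitude is the margin over the runner-up; a negative DS means some other class is more likely.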
Click and Select UI Tool is Based on the Statistical Model
UI-Based Polygonal Region Generation
Regions can be generated with a “Drag and Click” UI tool that allows successive clicks to define the region vertices
With advanced region management capabilities, this can be used to generate outlining regions
UI-Based Polygon Region Example
After definition of two polygonal text regions
These regions are now individual objects, represented as such in both Java/C# and XML
Objects could be “dragged” with an extension to auto-layout
Rectangular Region Definition (Rubber-banding)
Regions can be generated with a “Drag and Release” UI tool that allows the user to click to define a corner of the rectangle and then to drag to the far corner of the rectangle and release
Usually set to the default mouse mode
Re-Classification of Regions
Regions can be Re-classified with the Intuitive “Right Click” + “Pop-Up” Motif
For this project, the region types were {Text, Drawing, Photo, Table and Equation}
Rendering of the Classification Types:
- Text at 400 ppi, 1-bit
- Drawing at 400 ppi, 8- or 24-bit
- Table and Equation at 400 ppi, 1-bit
- Photo at 200 ppi, 8- or 24-bit
Changing Region Bit Depth
Bit-Depths of the “Region Modalities”:
- BW = Black and White, 1 bit per pixel
- Gray = 8 bits per pixel
- Color = 24 bits per pixel, sRGB
Context-Sensitive Right-Click
Several Context-Sensitive Delete Options for Ease of Editing
- “Delete All”: Start Over (Often After Rescale)
- “Delete Previous”: Undo Delete
- “Delete Nearest”: For Tricky Layouts (Often Before Rescale)
Region Manager: No Overlap
Regions are Described by BBox, Polygon and Scan Line Segment Representation, Allowing Efficacious “Scale-Up” in Thoroughness of Comparison
Any Overlap Triggered an Error Flag (to Prevent Later Print Job Errors)
Overlap May Be Allowed in Other Publishing Applications, such as VDP
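The “scale-up” in comparison thoroughness can be sketched as a cheap bounding-box test followed, only when needed, by a scan-line test. This is a minimal sketch under assumed data layouts (inclusive integer BBoxes and per-scan-line pixel runs); the type and function names are illustrative, not taken from the GTE.

```python
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]         # (left, top, right, bottom), inclusive
Runs = Dict[int, List[Tuple[int, int]]]  # scan line y -> [(x_start, x_end)], inclusive


def bboxes_overlap(a: BBox, b: BBox) -> bool:
    """Coarse test: do the two bounding boxes share any pixel?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def runs_overlap(a: Runs, b: Runs) -> bool:
    """Fine test: do any runs on a shared scan line share a pixel?"""
    for y in a.keys() & b.keys():
        for ax0, ax1 in a[y]:
            for bx0, bx1 in b[y]:
                if ax0 <= bx1 and bx0 <= ax1:
                    return True
    return False


def regions_overlap(bbox_a: BBox, runs_a: Runs, bbox_b: BBox, runs_b: Runs) -> bool:
    """Run the cheap BBox test first; only a hit triggers the scan-line test."""
    return bboxes_overlap(bbox_a, bbox_b) and runs_overlap(runs_a, runs_b)


# An L-shaped region and a rectangle sitting in its notch:
# the BBoxes overlap, but no pixel is shared.
l_bbox, l_runs = (0, 0, 9, 3), {0: [(0, 9)], 1: [(0, 9)], 2: [(0, 3)], 3: [(0, 3)]}
n_bbox, n_runs = (5, 2, 9, 3), {2: [(5, 9)], 3: [(5, 9)]}
print(bboxes_overlap(l_bbox, n_bbox), regions_overlap(l_bbox, l_runs, n_bbox, n_runs))
```

The notch case shows why the finer representation matters: a BBox-only check would raise a false overlap flag for non-rectangular regions.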
Region Manager: No Crossing of Line Segments During the Creation of Vertices
Sequential Vertices are Compared to all Previous Vertices for Intersection
Any Intersecting Vertices Disallow the Region Formation
This Prevents Potential “Region Fragmentation” Upon Scaling to Original Resolution
This Region Management Rule is Generally Applicable to Prevent Ill-Shaped Regions
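The no-crossing rule can be illustrated with a standard orientation-based segment intersection test, applied each time a vertex is added. This is a sketch of that general technique rather than the GTE's own code; the function names are illustrative, and touching endpoints, collinear overlaps and the final closing edge are deliberately left out.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(p: Point, q: Point, r: Point) -> float:
    """Signed area test: >0 if r is left of line pq, <0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a: Point, b: Point, c: Point, d: Point) -> bool:
    """True if segments ab and cd cross at a single interior point."""
    d1, d2 = _orient(c, d, a), _orient(c, d, b)
    d3, d4 = _orient(a, b, c), _orient(a, b, d)
    return d1 * d2 < 0 and d3 * d4 < 0


def can_add_vertex(vertices: List[Point], new: Point) -> bool:
    """Allow `new` only if the edge from the last vertex to it crosses
    none of the earlier edges (the adjacent edge is skipped)."""
    if not vertices:
        return True
    last = vertices[-1]
    for i in range(len(vertices) - 2):
        if segments_cross(vertices[i], vertices[i + 1], last, new):
            return False
    return True


outline = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0)]
print(can_add_vertex(outline, (0.0, 4.0)))   # edge stays clear of the first edge
print(can_add_vertex(outline, (2.0, -2.0)))  # edge cuts back across the first edge
```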
Design Principles
[Diagram labels: SPEED OF ZONING ENGINE; ACCURACY OF ZONING ANALYSIS ENGINE; SIMPLICITY OF UI; GROUND TRUTHING]
- The zoning engine was further tuned to run on low-resolution data so that the analysis would run at the same nominal ppi (75) as the later QA proofing
- The Ground Truthing Engine is the last step in the otherwise automatic generation of 1.2 million PDFs from TIFF
- Smart scaling (integral) is performed on file load, allowing the ready maintenance in most cases of boundary information after scaling
- Zoning engine for document analysis matches that of the Ground Truthing Engine for maximum throughput
- Redundancy of keys, menus, tool tips for fast training
Throughput Results
• 94.6% of pages passed Auto QA after the original zoning analysis
• 99.2% of pages passed AutoQA after the iterative attempts at logical zoning analysis
• 97.7% of pages passed both AutoQA and VisualQA. Failures in the latter were generally due to bit depth mismatch for poorly-exposed originals
• Thus, 2.3%, or 27,500 pages needed to be Ground Truthed for full re-purposing (All text OCR’d, all drawings and photos saved in best format, etc.)
• After 30 minutes of training, 2 pages/minute is an easy rate to proof, even while multi-tasking (on phone, email): this means less than 6 human-weeks for the entire set of 1.2 × 10^6 pages
• This is 2 × 10^5 pages “cleaned” per human-week of work
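The effort estimate in these bullets can be reproduced with simple arithmetic. The sketch below uses the slide's figures (27,500 pages to ground truth, 2 pages per minute, 1.2 million pages in total) and assumes a 40-hour human-week, which the slides do not state.

```python
PAGES_TO_GROUND_TRUTH = 27_500   # from the slide
PAGES_PER_MINUTE = 2             # from the slide
TOTAL_PAGES = 1_200_000          # from the slide
HOURS_PER_HUMAN_WEEK = 40        # assumption, not stated in the slides

hours = PAGES_TO_GROUND_TRUTH / PAGES_PER_MINUTE / 60
human_weeks = hours / HOURS_PER_HUMAN_WEEK
print(f"{hours:.0f} hours of proofing, about {human_weeks:.1f} human-weeks")

# Rounding up to the slide's bound of 6 human-weeks for the whole corpus:
print(f"{TOTAL_PAGES / 6:.0f} pages cleaned per human-week")
```

Under the 40-hour assumption, the estimate stays under the slide's bound of 6 human-weeks, which in turn gives the quoted 2 × 10^5 pages per human-week.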
Questions…?