Imago OCR: open-source toolkit for chemical structure image recognition
http://ggasoftware.com/opensource/imago
Presentation at the 244th ACS National Meeting & Exposition, symposium "Hunting for Hidden Treasures: Chemical Information in Patents and Other Documents".
![Page 1: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/1.jpg)
Imago OCR
Open-source toolkit for chemical structure image recognition
14/08/2012 GGA Software Services LLC
http://ggasoftware.com/opensource/imago/
![Page 2: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/2.jpg)
Project goals
• Perform optical chemical structure recognition for a wide range of raster images:
– different image formats
– varying scan quality (or even photos)
– complex structures and uncommon features
• Provide a complete toolset for embedding the recognition engine in any other application
![Page 3: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/3.jpg)
Applications
• Automated processing of articles and patents
– similarity analysis
• Chemical database search (PubChem, etc.)
• “Deep Web” indexing
– development of a universal chemical search engine;
– conversion of human-readable data to machine-readable formats
![Page 4: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/4.jpg)
Use case
Source image → imago → MOL format
Input:
• BMP, DIB, JPG, JPE, PNG, PBM, PGM, PPM, SR, RAS, TIFF
• Images from a scanner or camera
• PDF documents
Output:
• MDL Molfile
• SMILES (requires Indigo)
• Rendered image (requires Indigo)
![Page 5: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/5.jpg)
Supported features
• Multiple bonds
• Single-up & single-down bonds
• Bridged bonds
• Aromatic rings
![Page 6: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/6.jpg)
Supported features
• Superatom labels, charges, isotopes
• Abbreviation expansion
• R-group handling
• Query features
![Page 7: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/7.jpg)
Engine structure
Image loader
→ Prefilter & Binarization (raster level)
→ Vectorization & Separation (primitives level)
→ Logical layout analyzer (structural level)
→ Molecule export
![Page 8: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/8.jpg)
Preliminary filters
• Pass-through filter
– For rendered images (only binarization)
• Cross-correlation based filter
– For scanned images (quite fast)
• Logical analysis based filter
– For low-quality photos
– Takes some time for processing
• Imago can auto-detect a suitable filter
![Page 9: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/9.jpg)
Cross-correlation based filter
Figure panels: Source image | Strong threshold | Weak threshold | Filter result
Filter result: an image assembled from the weak-threshold segments whose cross-correlation (CC) value with the corresponding strong-threshold segments passes the restrictions.
![Page 10: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/10.jpg)
Logical analysis based filter
• Removes noise (spots, glare)
• Suitable for out-of-focus images
• Can process low-contrast images
• Removes unusual artifacts
• Deals with multicolor photos
• Keywords: Wiener filtering, wave algorithm, weak segmentation
![Page 11: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/11.jpg)
Preliminary separation
• Separate labels and graphics:
– Hu moments classifier (d1)
– Contours analysis (d2)
– Approximation criteria (d3)
• An object is a symbol if f(d1, d2, d3) > c0
![Page 12: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/12.jpg)
Vectorization
• Convert pixels to a matching polyline:
– Minimize the mean distance between the original and the vectorized structure
– Apply a penalty for extra segments
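The trade-off on this slide (mean distance against a penalty for extra segments) can be illustrated with a small dynamic-programming sketch. This is not Imago's implementation: the function names, the `segment_penalty` parameter, and the choice to restrict breakpoints to the input points are assumptions made for illustration only.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection so the closest point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def fit_polyline(points, segment_penalty=1.0):
    """Choose breakpoints among `points` (an ordered pixel path) that minimize
    mean distance to the polyline + segment_penalty * number of segments."""
    n = len(points)
    if n < 2:
        return list(points)
    best = [math.inf] * n   # best[j]: minimal cost of a polyline ending at points[j]
    prev = [-1] * n         # prev[j]: previous breakpoint on that polyline
    best[0] = 0.0
    for j in range(1, n):
        for i in range(j):
            # Contribution of the skipped points to the mean distance.
            err = sum(point_segment_distance(points[k], points[i], points[j])
                      for k in range(i + 1, j)) / n
            cost = best[i] + err + segment_penalty
            if cost < best[j]:
                best[j], prev[j] = cost, i
    # Walk back from the last point to recover the chosen breakpoints.
    path, j = [], n - 1
    while j != -1:
        path.append(points[j])
        j = prev[j]
    return path[::-1]

corner = [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2), (3, 3)]
two_segments = fit_polyline(corner, segment_penalty=0.1)
one_segment = fit_polyline(corner, segment_penalty=10.0)
```

With a small penalty the L-shaped path keeps its corner; with a large penalty the extra segment costs more than the error it removes, so the path collapses to a single line.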
![Page 13: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/13.jpg)
Logical layout analysis
• Mapping labels to bonds
– Grouping labels into superatoms
• Finding multiple bonds
– Dissolving short edges
– Connecting bridged bonds
• Removing captions that are clearly unrelated
• Detecting aromatic rings
– Determining stereo bond orientation and aromatizing the molecule if circles are present
![Page 14: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/14.jpg)
Adaptive methods or particular cases?
• Adaptive methods
– Based on optimization of some function
– Wider input class range
– Probably better results in hard cases
• Particular-case methods
– Based on some criteria
– Stability
– Good performance
– Easier implementation
![Page 15: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/15.jpg)
Particular case methods
• What is it?
• Line? Tested line criteria: no.
• Character? Tested against ‘A’: no. … Tested against ‘Z’: no.
• Ring? No.
• Unrecognizable object – ignore.
![Page 16: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/16.jpg)
Adaptive methods
• What is it?
• Line? Approximation: d=1.6
• Character? Compared with ‘C’: d=6.1 … Compared with ‘L’: d=3.2
• Ring? Approximation: d=653.3
• Final decision depends on neighbors
![Page 17: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/17.jpg)
Decision tree
• Branch 1: label with d=0.1 (almost surely recognized)
– The object is then a bond, and the segment group is recognized as bond + label with d = 0.1 + 1.6 = 1.7
• Branch 2: bond with d=0.0 and “C” with d=0.1
– The object is then the letter ‘l’, and the segment group is recognized as bond + two-character label with d = 0.0 + 0.1 + 3.2 = 3.3
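The comparison between the two interpretations on this slide comes down to summing the per-element distances and taking the smaller total. A minimal sketch, assuming hypothetical hypothesis names and a plain dictionary as the data structure (Imago's internal representation is not shown in the slides):

```python
def best_hypothesis(hypotheses):
    """Return (name, total distance) for the hypothesis with the smallest sum."""
    totals = {name: sum(parts) for name, parts in hypotheses.items()}
    name = min(totals, key=totals.get)
    return name, totals[name]

# Distances taken from the slide; the dictionary keys are illustrative labels.
hypotheses = {
    "bond + label": [0.1, 1.6],
    "bond + two-character label": [0.0, 0.1, 3.2],
}

winner, total = best_hypothesis(hypotheses)
```

Here the first interpretation wins because 1.7 is smaller than 3.3, which matches the preference shown in the decision tree.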
![Page 18: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/18.jpg)
Metrics
• For symbols
– Distance between Fourier descriptor sets
• For graphics
– Distance between the approximated and the source image
• For single-up bonds
– f(average fill, relative size, etc.)
• For single-down bonds
– f(distance between segments, line thickness, etc.)
• … (every recognition method has a metric function)
![Page 19: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/19.jpg)
Labels correction
• Any recognized symbol can have alternatives:
– Example: A (metric value 3.2), R (4.9), P (5.0)
• Imago keeps information about probable captions (periodic table, abbreviations)
• Labels correction: select the combination of symbol alternatives that forms a probable caption and has the minimal sum of metric values
• This allows partially broken labels to be recognized
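The correction rule on this slide can be sketched as a search over combinations of per-symbol alternatives, keeping only combinations that form a known caption. The `KNOWN_LABELS` set, the function name, and the example distances are hypothetical; the slides do not show Imago's actual caption dictionary or search strategy.

```python
from itertools import product

# Hypothetical stand-in for the table of probable captions.
KNOWN_LABELS = {"Cl", "Cd", "OH", "Ph"}

def correct_label(alternatives, known_labels=KNOWN_LABELS):
    """alternatives: one list of (character, metric value) pairs per symbol position.
    Returns (label, total metric) for the known label with the minimal sum,
    or None if no combination forms a known label."""
    best = None
    for combo in product(*alternatives):
        text = "".join(ch for ch, _ in combo)
        if text not in known_labels:
            continue
        total = sum(d for _, d in combo)
        if best is None or total < best[1]:
            best = (text, total)
    return best

# The second symbol looks most like 'I', but "CI" is not a known label.
symbol_alternatives = [
    [("C", 0.1), ("G", 2.0)],
    [("I", 1.0), ("l", 3.2), ("d", 4.0)],
]
corrected = correct_label(symbol_alternatives)
```

The best raw reading "CI" is rejected, and "Cl" is chosen over "Cd" because its metric sum is lower.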
![Page 20: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/20.jpg)
Recognition
• Image recognition is a search for the vectorized result that gives the minimal distance between the vectorized form and the original image
• Can be formalized depending on metrics
• Search is exhaustive
– Needs some restrictions to achieve good speed
![Page 21: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/21.jpg)
Trade-off: restricted adaptive methods
• Limit metric values: d < 0.5 means “surely”; d > 10.0 means “impossible”
• Limit Euclidean distances for neighbor search (up to 100 pixels)
• Limit the number of alternatives (no more than 10)
• Assume the image filling rate is less than 10%
• Assume the distance between single-down bond segments is in the range 5..10 pixels
• Assume the symbol aspect ratio is in the range 0.5..2.0
• Some more assumptions with “magic” constants
• This gains speed and stability
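One way to read the first and third limits is as a pruning step applied to each list of candidates before the exhaustive search continues. The sketch below uses the constants from the slide; treating a "sure" match as a reason to drop all other candidates is an assumption, as are the function and constant names.

```python
SURE_BELOW = 0.5          # d < 0.5: treated as surely recognized
IMPOSSIBLE_ABOVE = 10.0   # d > 10.0: treated as impossible
MAX_ALTERNATIVES = 10     # keep no more than 10 alternatives

def prune_alternatives(candidates):
    """candidates: list of (label, d) pairs. Returns the pruned list, best first."""
    ranked = sorted(candidates, key=lambda c: c[1])
    # Assumption: a sure match ends the search for this object.
    if ranked and ranked[0][1] < SURE_BELOW:
        return ranked[:1]
    feasible = [c for c in ranked if c[1] <= IMPOSSIBLE_ABOVE]
    return feasible[:MAX_ALTERNATIVES]

sure = prune_alternatives([("G", 3.0), ("C", 0.2)])
uncertain = prune_alternatives([("A", 3.2), ("R", 4.9), ("P", 5.0), ("X", 11.0)])
many = prune_alternatives([(str(i), 1.0 + 0.5 * i) for i in range(15)])
```

Each cut-off removes branches from the search, which is where the speed gain claimed on the slide would come from.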
![Page 22: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/22.jpg)
Configuration clusters
• For scanned images
– Strict limits on adaptive methods (fast, <300 ms per image)
• For photos and low-quality images
– Flexible limits (less than a second per image on average)
• For high-resolution images
– Up to 5 seconds
• For handwritten structures
– Up to 10 seconds in complex cases
• Imago supports auto-detection of a suitable configuration cluster
![Page 23: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/23.jpg)
Configuration cluster creation
• Improves the recognition success rate for a specific image type:
– different rendering styles
– images captured differently (scanner type, lighting conditions, etc.)
• The process is automated
– a test set of the target image type is required
– takes some time
– an application of machine learning
![Page 24: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/24.jpg)
Machine learning
• Test set: a number of pairs (image; related MDL molfile)
• Imago tunes the method parameters to gain the best score on the test collection
– Metrics included
– No information directly related to the test set (such as a character table) is stored
• Criteria for the complete set can be formed from a small subset of the same type
![Page 25: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/25.jpg)
Learning effectiveness
• Used the Image2Structure test set rendered with a different renderer:
• Initial results (before training): 202/944 correct, similarity value: 74.54%
• Trained on a set of 50 images from the new renderer
• Results after training: 831/944 correct, similarity value: 98.33% on the whole set
![Page 26: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/26.jpg)
Comparison: overall scores 1
• Image2Structure set from the TREC 2011 Chemical IR Track (ambiguous and partial structures removed): original files

| | OSRA 1.4.0 | Imago 1.0 | Imago 2.0 beta 1 |
|---|---|---|---|
| Absolutely correct | 769 / 944 | 540 / 944 | 861 / 944 |
| Almost correct [1] | +31 | +49 | +43 |
| Average time | 2.54 s | 0.20 s | 0.31 s |
| Average similarity [2] | 94.57% | 89.59% | 98.26% |

[1] Similarity value is greater than 95%.
[2] Ratio of correct elements (atoms and bonds); extra and missing elements are counted too.
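For readers who prefer rates to raw counts, the first two rows of the comparison on the original files can be converted to percentages. The sketch assumes that the "+N" in the "almost correct" row is added on top of the absolutely correct count; the variable and function names are illustrative.

```python
TOTAL = 944

# (absolutely correct, additional almost correct), copied from the comparison.
results = {
    "OSRA 1.4.0": (769, 31),
    "Imago 1.0": (540, 49),
    "Imago 2.0 beta 1": (861, 43),
}

def rates(correct, almost, total=TOTAL):
    """Return (exact-match rate, exact-or-almost rate) in percent."""
    return 100.0 * correct / total, 100.0 * (correct + almost) / total

summary = {tool: rates(*counts) for tool, counts in results.items()}
for tool, (exact, relaxed) in summary.items():
    print(f"{tool}: {exact:.2f}% exact, {relaxed:.2f}% exact or almost")
```

Under that reading, Imago 2.0 beta 1 recovers about 91% of structures exactly, against roughly 81% for OSRA 1.4.0 and 57% for Imago 1.0.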
![Page 27: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/27.jpg)
Comparison: overall scores 2
• Image2Structure set re-rendered from the corresponding molfiles

| | OSRA 1.4.0 | Imago 1.0 | Imago 2.0 beta 1 |
|---|---|---|---|
| Absolutely correct | 796 / 944 | 604 / 944 | 831 / 944 |
| Almost correct [1] | +20 | +58 | +29 |
| Average time | 4.57 s | 0.47 s | 1.24 s |
| Average similarity [2] | 93.45% | 95.38% | 98.33% |

[1] Similarity value is greater than 95%.
[2] Ratio of correct elements (atoms and bonds); extra and missing elements are counted too.
![Page 28: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/28.jpg)
Common issues resolved
Comparison figure. Columns: Source | OSRA | Imago. Rows: large gap; lines too close; no more symbols.
![Page 29: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/29.jpg)
Imago Library
• API: a set of methods for
– Image loading
– Configuration cluster setup
– Retrieving molfile results
– Partial processing (filtering, approximation, validation)
• Bindings for C/C++, Java
• Cross-platform implementation (Windows, Linux, Mac)
• Dependencies:
– Boost library (Boost Software License)
– OpenCV library (BSD license)
– Indigo (optional)
![Page 30: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/30.jpg)
Thank you for your attention!
• Imago OCR: http://ggasoftware.com/opensource/imago/
• Try the Imago recognition engine online: http://ggasoftware.com/opensource/imago/online/
![Page 31: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/31.jpg)
Appendix A
Imago: technical details
![Page 32: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/32.jpg)
Pass-through prefilter
• Count black, white, and other pixels
• If (black + white) > t0 ∙ others:
– recolor the other pixels to black → the image is binarized
• Otherwise:
– schedule another prefilter call
• Perform an accurate image downscale when the image is too large (>5 Mpix)
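The counting rule of the pass-through prefilter can be sketched on a plain grayscale grid. This is an illustration only: the value of `t0`, the convention that 0 is black and 255 is white, and returning `None` to signal "schedule another prefilter" are all assumptions, and the downscale step is left out.

```python
def pass_through_prefilter(pixels, t0=10.0):
    """pixels: 2D list of grayscale values in 0..255.
    Returns a binarized copy, or None if another prefilter should be tried."""
    flat = [v for row in pixels for v in row]
    black = sum(1 for v in flat if v == 0)
    white = sum(1 for v in flat if v == 255)
    others = len(flat) - black - white
    if black + white > t0 * others:
        # Nearly bilevel already: recolor every non-white pixel to black.
        return [[255 if v == 255 else 0 for v in row] for row in pixels]
    return None  # too many intermediate tones, e.g. a scan or a photo

rendered = [
    [255, 255, 255, 255],
    [255,   0, 128, 255],   # one anti-aliased gray pixel
    [255,   0,   0, 255],
]
photo_like = [[90, 120], [140, 200]]

binarized = pass_through_prefilter(rendered)
fallback = pass_through_prefilter(photo_like)
```

A rendered image with a few anti-aliased pixels passes straight through, while an image dominated by intermediate tones is handed to a heavier filter.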
![Page 33: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/33.jpg)
Cross-correlation prefilter
• Smooth the source image → smoothed
– Pyramidal reduce 2x, then pyramidal upsample 2x
• Apply an adaptive threshold binarization filter to the smoothed image:
– With threshold t0 → strong
– With threshold t1 → weak
• Segment the (strong, weak) images using the wavemap algorithm
• For each weak segment, find the corresponding strong segment and calculate their intersection:
– If the ratio of the intersection area to the original segment area is less than c0, remove this segment (bad segment)
• If the reassembled image contains a rectangular structure R, crop the image to the inner dimensions of R (locates molecules)
• Calculate the average pixel intensity of the good segments and try to add other pixels whose intensity passes this boundary (if they do not affect segment connectivity)
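The segment-filtering step at the core of this prefilter can be sketched in pure Python. The sketch makes several simplifications that are not in the slides: global thresholds stand in for adaptive thresholding, a breadth-first flood fill stands in for the wavemap segmentation, overlap is measured against the whole strong mask instead of a matched strong segment, and smoothing, cropping, and the final pixel recovery are omitted. All names and default values are hypothetical.

```python
from collections import deque

def segments(mask):
    """Yield the 4-connected components of truthy cells as lists of (row, col)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if not mask[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, cells = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            yield cells

def filter_weak_segments(gray, t_strong, t_weak, c0=0.3):
    """Keep weak-threshold segments whose overlap with strong-threshold ink
    covers at least a fraction c0 of the segment; dark pixels count as ink."""
    strong = [[v < t_strong for v in row] for row in gray]
    weak = [[v < t_weak for v in row] for row in gray]
    result = [[False] * len(row) for row in gray]
    for cells in segments(weak):
        overlap = sum(1 for y, x in cells if strong[y][x])
        if overlap / len(cells) >= c0:
            for y, x in cells:
                result[y][x] = True
    return result

gray = [
    [20, 20, 120, 20],       # a line with one faded pixel
    [255, 255, 255, 255],
    [255, 255, 130, 130],    # a faint smudge with no dark core
]
kept = filter_weak_segments(gray, t_strong=50, t_weak=150)
```

The faded pixel survives because it belongs to a segment backed by strong ink, while the smudge is dropped because nothing in it passes the strong threshold.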
![Page 34: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/34.jpg)
Separator details
• Given a binarized set of segments, classify them into two main groups: letters and chemical bond representations
• The classification result is based on the value of C = k0 ∙ r0 + k1 ∙ r1 + k2 ∙ r2
– Where (r0, r1, r2) are the results of the submethods
– And (k0, k1, k2) are weight constants (configurable)
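The weighted combination is simple enough to write out directly. In the sketch below the weights, the threshold, and the decision direction (a larger C means "bond", since the appendix describes the submethod results as evidence that a segment is a bond) are assumptions; the slides only give the formula for C.

```python
def separator_score(r0, r1, r2, k0=1.0, k1=1.0, k2=1.0):
    """C = k0*r0 + k1*r1 + k2*r2, with r0..r2 the submethod results."""
    return k0 * r0 + k1 * r1 + k2 * r2

def classify_segment(r0, r1, r2, weights=(1.0, 1.0, 1.0), threshold=1.5):
    """Hypothetical decision rule: 'bond' when C reaches the threshold."""
    c = separator_score(r0, r1, r2, *weights)
    return "bond" if c >= threshold else "symbol"

long_stroke = classify_segment(r0=1, r1=0.9, r2=0.8)   # C = 2.7
small_glyph = classify_segment(r0=0, r1=0.2, r2=0.1)   # C = 0.3
```

Because the weights are configurable, a configuration cluster can shift trust between the three submethods without changing the code.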
![Page 35: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/35.jpg)
Separator: Hu moments
• Hu moments usually differ for characters and bonds, so a classification tree can be computed
• Note: some objects cannot be classified that way
– symbols: r0 = 0
– bonds: r0 = 1
![Page 36: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/36.jpg)
Separator: contours analysis
• Extract the outer contour of the binarized segment S;
– approximate the chain contour using the Teh-Chin chain approximation algorithm;
– taking the line thickness as an approximation parameter, approximate the polygon once again;
– calculate the offsets of the contour points by a clockwise step;
– the output is a chain of sequential vectors normalized by their perimeters;
• Compare the chain result to the set of patterns describing valid structures
– The set consists of 8x8 matrices where the cell (j, k) denotes the probability of changing the jth direction to the kth.
• The result of this method is r1 – probability of {S is a bond}
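To make the 8x8 pattern matrices more concrete, the sketch below builds such a matrix for a single closed contour: each step is quantized to one of eight directions, and each row j holds the observed frequencies of turning from direction j to direction k. The direction numbering, the sign-based quantization (exact only for axis-aligned or diagonal steps), and the row normalization are assumptions; how Imago compares a chain against its stored patterns is not described in the slides.

```python
# Direction codes for the eight step directions (dx, dy); the numbering is arbitrary.
DIRECTIONS = {
    (1, 0): 0, (1, 1): 1, (0, 1): 2, (-1, 1): 3,
    (-1, 0): 4, (-1, -1): 5, (0, -1): 6, (1, -1): 7,
}

def sign(v):
    return (v > 0) - (v < 0)

def direction_chain(contour):
    """contour: list of (x, y) points of a closed polygon, in order."""
    chain = []
    n = len(contour)
    for i in range(n):
        (x0, y0), (x1, y1) = contour[i], contour[(i + 1) % n]
        step = (sign(x1 - x0), sign(y1 - y0))
        if step != (0, 0):          # skip repeated points
            chain.append(DIRECTIONS[step])
    return chain

def transition_matrix(chain):
    """8x8 matrix: cell (j, k) is the observed frequency of direction j
    being followed by direction k along the closed chain."""
    counts = [[0] * 8 for _ in range(8)]
    for j, k in zip(chain, chain[1:] + chain[:1]):
        counts[j][k] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_chain = direction_chain(square)
square_matrix = transition_matrix(square_chain)

# A thin stroke traced out and back: two opposite directions only.
stroke_matrix = transition_matrix(direction_chain([(0, 0), (1, 0), (2, 0), (1, 0)]))
```

A thin stroke concentrates its mass in two opposite directions, whereas a glyph with curves spreads it over many cells, which is the kind of difference a set of pattern matrices could capture.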
![Page 37: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/37.jpg)
Separator: approximation criteria
• For a given segment S, calculate its best approximation with n line segments (d0) and the closest distance to the most probable character (d1)
– If d1 < d0 and n > n0, the segment probably represents a character
• Check its width/height ratio and height/average_height ratio: penalty p0 if these criteria are not met
• Result is r2 = 1 - (d1 [+ p0]) – probability of {S is a bond}
– Result is r2 = d0 – probability of {S is a bond}
![Page 38: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/38.jpg)
Bonds skeleton analysis
• Dissolve short edges
• Join closest vertices
• Dissolve intermediate vertices
• Find multiple edges
• Connect bridged bonds
• Shrink short bonds
• Detect and mark suspicious edges
![Page 39: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/39.jpg)
Basic labels analysis
• Location analysis: check against the baseline
– Subscripts lie below the baseline
– Capitals lie mostly above the baseline
• Calculate distances to all possible characters
• Adjust distances using topological features
• Select the best result candidate and calculate the recognition quality
![Page 40: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/40.jpg)
Superatoms analysis
• Concatenate recognized characters into labels
• Check chemical validity
• If the validity check fails, try to find the most probable alternative using other distance map elements
• If no such alternative is found, try to recognize the less probable characters as bonds
• Handle R-semantic, special characters: X, Q, A
![Page 41: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/41.jpg)
Appendix B
Imago: workflow features
![Page 42: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/42.jpg)
Related continuous integration system
Screenshot labels: versions list, results estimation, test sets.
![Page 43: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/43.jpg)
Explanation: continuous integration
• Some logically grounded changes may decrease the recognition rate → a convenient tracking tool is required
• A good way to improve overall stability
• A useful visual representation of machine-learning progress
![Page 44: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/44.jpg)
Embedded HTML-based logging system
Screenshot labels: embedded images, performance counters, variables and parameters dump, call hierarchy.
![Page 45: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/45.jpg)
Explanation: logging system
• Structured logs (reports) offer:
– A convenient way to detect bugs;
– An exact visual representation of the internal processes;
• Several improvements may become evident just by looking through the logs
• The performance cost is comparable to that of ordinary plaintext logs
• Stability is not affected
![Page 46: Imago OCR: Open-source toolkit for chemical structure image recognition](https://reader035.vdocuments.us/reader035/viewer/2022081720/559c68bd1a28abae5f8b4631/html5/thumbnails/46.jpg)
Authors
• Rostislav Chutkov
• Michael Rybalkin
• Kliton Andrea
• Victor Smolov
• GGA Software Services LLC