http://www.inftyproject.org/
Adaptive method for the digitization of mathematical journals
Masakazu Suzuki Kyushu University, Professor emeritus
Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT)
InftyProject (http://www.inftyproject.org)
Science Accessibility Net (http://www.sciaccess.net)
IMU-WDML Workshop
June 2, 2012, Washington DC
Plan of the talk
About InftyProject
Making Rich Digital Mathematical Libraries: Process Flow and Technical Components
Current State of the Art with Demonstration
Adaptive Method: Character and Symbol Recognition
Logical Structure Analysis
Future Problems
Section 1 About Infty Project
InftyProject
R&D on Math Information Systems
Main system development:
- InftyReader: Math OCR software
- InftyEditor: Editor of math documents; data conversion (XML, LaTeX, MathML, PDF, etc.)
- ChattyInfty: InftyEditor + speech output, authoring of DAISY
URL:
- Project site: http://www.inftyproject.org/en/
- Release & user support of Infty products: Science Accessibility Net, http://www.sciaccess.net/
“InftyReader” OCR software for math documents
Demonstration. Recognition result samples (YMJ, AJM).
Section 2 Toward Rich DML
Different levels in digitization
Level 1: Bitmap images of printed materials, e.g. GIF, TIFF
Level 2: Searchable digitized document, e.g. PDF with hidden text, Bib Link
Level 3: Structured accessible document, e.g. XML, HTML(+MathML), LaTeX, …
Level 4: (Partially) executable document, e.g. Mathematica, Maple
Level 5: Formally presented document, e.g. Mizar, OMDoc
WDML achieved this level.
Infty : Level 1 → Level 3
Process Flow of Digitization
Input: PDF or image file (TIF)
1. Layout analysis: segmentation of areas (text, table, figure)
2. Recognition per line of texts & math symbols (character recognition, math/text segmentation, math structure analysis)
3. Document structure analysis (chapter, section, itemize, theorem description, references, etc.)
4. XML
Outputs: LaTeX, XHTML+MathML, PDF, Braille codes, etc.
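The flow above can be pictured as a chain of stages that each enrich one page record. The following is only a minimal sketch of that idea: the `Page` class, the stage functions, and their placeholder results are all hypothetical names for illustration and are not part of InftyReader.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical container for one scanned page and its analysis results."""
    source: str                                          # e.g. path of a TIF or PDF page
    areas: List[str] = field(default_factory=list)       # text / table / figure regions
    lines: List[str] = field(default_factory=list)       # per-line recognition results
    structure: List[str] = field(default_factory=list)   # logical elements found
    log: List[str] = field(default_factory=list)         # stages that have run, in order


def layout_analysis(page: Page) -> Page:
    # Placeholder: a real system would segment the bitmap into areas.
    page.areas = ["text", "table", "figure"]
    page.log.append("layout")
    return page


def line_recognition(page: Page) -> Page:
    # Placeholder: character recognition, math/text segmentation and
    # math structure analysis would run here, line by line, on text areas.
    page.lines = [f"recognized line in {a} area" for a in page.areas if a == "text"]
    page.log.append("recognition")
    return page


def structure_analysis(page: Page) -> Page:
    # Placeholder: label logical elements such as sections and theorems.
    page.structure = ["section", "theorem", "references"]
    page.log.append("structure")
    return page


# The stage order mirrors the numbered steps of the process flow.
STAGES: List[Callable[[Page], Page]] = [
    layout_analysis,
    line_recognition,
    structure_analysis,
]


def digitize(source: str) -> Page:
    """Run every stage in order and return the enriched page record."""
    page = Page(source)
    for stage in STAGES:
        page = stage(page)
    return page
```

Calling `digitize("sample.tif")` returns a `Page` whose `log` lists the stages in the order they ran; an XML writer and the output converters would consume that record.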
Layout Analysis
Segmentation of areas (text, table, figure)
Recognition per line (character recognition, math structure analysis)
Document structure analysis (title, chapter, section, itemize, theorem, bib, etc.)
XML
Outputs: LaTeX, HTML, human-readable TeX, Braille codes, speech data, etc.
Input: PDF or image file (TIF), after preprocessing
Layout Analysis
Segmentation of areas, table analysis
Document Structure Analysis
Detection of: Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
- Currently, naïve methods are used: line classification using a combination of features such as character size, font information (bold, italic, small capitals), keywords, indentation, and lines starting with numbers or a special pattern (e.g. “[Num]”).
- A stronger method is required in actual digitization.
Hyperlinks inside the document.
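A rule-based line classifier of the kind described above might look like the following. This is only a sketch under stated assumptions: the `Line` fields, the `BODY_SIZE` value, the keyword list, the 1.4 size ratio, and the rule order are all invented for illustration and do not come from InftyReader.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical per-line features produced by the recognition stage."""
    text: str
    char_size: float      # character height in points
    bold: bool = False
    indent: float = 0.0   # left indentation in points


BODY_SIZE = 10.0  # assumed body-text size for the target journal
KEYWORDS = ("Theorem", "Lemma", "Proposition", "Corollary", "Definition")


def classify(line: Line) -> str:
    """Assign a logical label to one line using simple hand-written rules."""
    text = line.text.strip()
    # Special pattern "[Num]" at the start of a line marks a bibliography item.
    if re.match(r"^\[\d+\]", text):
        return "BibItem"
    # Keyword at the start of a line marks a theorem-like environment.
    for keyword in KEYWORDS:
        if text.startswith(keyword):
            return keyword
    # Bold lines starting with a number are headings: "2.1 ..." before "2 ...".
    if line.bold and re.match(r"^\d+\.\d+", text):
        return "Subsection"
    if line.bold and re.match(r"^\d+\.?\s", text):
        return "Section"
    # Characters much larger than body text suggest the title.
    if line.char_size >= 1.4 * BODY_SIZE:
        return "Title"
    # Indented lines starting with "(a)" or "1)" are list items.
    if line.indent > 0 and re.match(r"^(\(\w+\)|\d+\))", text):
        return "Itemization"
    return "Body"
```

Fixed thresholds of this kind are what breaks when the journal style changes, which is one reason a stronger method is needed in practice.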
Section 3 Current state of the art with demonstration
“InftyReader” OCR software for math documents
Demonstration…
- Math recognition (already shown)
- Multilingual recognition ← FineReader OCR plug-in (Czech paper result sample)
- Matrices
- Layout analysis, table recognition
- Logical structure analysis
Section 4 Large Volume Recognition
Large Volume Digitization
The adaptive method is efficient:
1. Get information from the target document:
- Character features
- Math formula parameters
- Layout parameters, etc.
2. Use it in recognition, either directly or after manual checking (semi-automatic).
Large Volume Digitization
Process flow using BatchInfty & InftyReader pro:
1. Noise reduction, centering, etc.
2. Trial recognition
3. Extraction of features:
- Document style → Logical structure analysis
- Character cluster images → OCR engine
4. Recognition & verification
5. PDF output
Large Volume Digitization
Generation of a UserDictionary that adapts the OCR engine to the target documents:
1. Trial recognition
2. Clustering of the character images
3. CharDataA: centroids of the clusters of text characters with reliable scores (automatic)
   CharDataB: centroids of the clusters of math symbols and of text characters with low scores (manual correction)
4. User Dictionary of Character Features
Show CharImageManager
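The dictionary-building step can be sketched as follows. The slides do not say which clustering algorithm or features are used, so this sketch swaps in a simple greedy (leader) clustering over made-up feature vectors; `Sample`, `cluster`, `build_user_dictionary`, the `radius`, and the `reliable` threshold are all hypothetical.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Sequence, Tuple

Entry = Tuple[str, Tuple[float, ...]]  # (label, centroid of the cluster)


@dataclass
class Sample:
    """One character image from the trial recognition (hypothetical fields)."""
    features: Sequence[float]   # feature vector of the character image
    label: str                  # label assigned by the trial recognition
    score: float                # recognition confidence in [0, 1]
    is_math: bool = False       # True if the character came from a math area


def centroid(vectors: List[Sequence[float]]) -> Tuple[float, ...]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))


def cluster(samples: List[Sample], radius: float) -> List[List[Sample]]:
    """Greedy clustering: a sample joins the first cluster with the same
    label whose current centroid lies within `radius`, else starts a new one."""
    clusters: List[List[Sample]] = []
    for sample in samples:
        for members in clusters:
            center = centroid([m.features for m in members])
            if members[0].label == sample.label and dist(center, sample.features) <= radius:
                members.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def build_user_dictionary(
    samples: List[Sample], radius: float = 1.0, reliable: float = 0.9
) -> Tuple[List[Entry], List[Entry]]:
    """Split cluster centroids into CharDataA (reliable text characters)
    and CharDataB (math symbols and low-score text characters)."""
    char_data_a: List[Entry] = []
    char_data_b: List[Entry] = []
    for members in cluster(samples, radius):
        entry = (members[0].label, centroid([m.features for m in members]))
        trusted = all(m.score >= reliable and not m.is_math for m in members)
        (char_data_a if trusted else char_data_b).append(entry)
    return char_data_a, char_data_b
```

In this sketch, a cluster goes to CharDataA only if every member is a text character with a score at or above `reliable`; everything else lands in CharDataB, where a person can inspect it.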
Section 5 Open Problems
Problems
Further improvement of character/symbol recognition and structure analysis of math expressions:
- Touching characters, broken characters in math areas
- Low-resolution images
- Different typefaces (old books, typewriter prints, etc.)
- Bold character detection in math areas
Problems
Logical structure analysis (automatic detection and manual correction) is still difficult:
- Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
- Hyperlinks inside the document
Problems
Detection/analysis of figures and tables:
- Detection of characters in figures
- Table structure analysis (sample)
- Diagram recognition
  - Chemical diagrams ← recently under development worldwide
  - (Commutative diagrams) ← future work
Conclusion
- InftyProject: research group on math information processing.
- Demo (InftyReader) to show the current state of the art.
- Adaptive method to improve character and symbol recognition (CharImageManager).
- Proposed some problems to be attacked.
“INFTY” an integrated OCR for mathematical documents
Thank you!
Masakazu Suzuki [email protected] (current address) [email protected] (permanent address)
InftyProject: http://www.inftyproject.org/en/ Science Accessibility Net: http://www.sciaccess.net/en/