![Page 1: Projects CS 661. DAS 02, Princeton, NJ OCR Features and Systems –Degradation models, script ID, Bilingual OCR, Kannada OCR, Tamil OCR, mp versus hw checks,](https://reader030.vdocuments.us/reader030/viewer/2022032723/56649d145503460f949e93d5/html5/thumbnails/1.jpg)
Projects
CS 661
DAS 02, Princeton, NJ
• OCR Features and Systems
– Degradation models, script ID, Bilingual OCR, Kannada OCR, Tamil OCR, machine-printed (mp) versus handwritten (hw) checks, traffic ticket reading
• Handwriting Recognition
– Stochastic models, holistic methods, Japanese OCR
• Classifiers and Learning
– Multi-classifier systems
• Layout Analysis
– Skew correction, geometric methods, text/graphics separation, logical labeling
• Tables and Forms
– Detecting tables in HTML documents, use of graph grammars, semantics
• Text Extraction
• Indexing and Retrieval
• Document Engineering
• New Applications
– CAPTCHA, Tachograph chart system, accessing driving directions
ICDAR 03, Edinburgh, UK
• Multiple Classifiers
• Postal Automation and Check Processing
• Document Understanding
• HMM Classifiers
• Segmentation
• Character Recognition
• Graphics Recognition
• Non-Latin Alphabets
– Kanji/Chinese, Korean/Hangul, Arabic/Indian
• Web Documents, Video
• Word Recognition
• Image Processing
• Writer Identification
• Forms and Tables
Project Assignments
Faisal Farooq Multilingual Digital Library- Indexing, Retrieval, Script discrimination
Swapnil Khedekar Multilingual document layout analysis, OCR
Kompalli Surya Multilingual OCR using HMMs
Lei Hansheng Off-line and on-line handwriting integration and matching
Sumit Manocha Fingerprint image enhancement and minutiae extraction
Lin Yu-Hsuan ** Multiple Classifier Combination- multiple modalities
Praveer Mansukhani Interactive Handwriting Recognition Model
Amalia Rusu Handwritten Captchas
Sutanto Adi ** Indirect biometric data extraction from medical forms
Multilingual Digital Library
Query Result
Control Panel
Query Input
Telugu and Arabic modules under development
Multilingual DIA and OCR
Text/Image Separation
Intervals between peaks
Line Separation
• Ascenders & descenders interfering with lines
• Region-growing approach
• In Devanagari, single word is a single connected component
• Grow regions using horizontally adjacent components
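The region-growing idea can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes connected components are already available as bounding boxes, and the names `grow_lines`, `max_gap`, and `min_overlap` are hypothetical. Requiring vertical overlap, rather than a clean horizontal cut, is one way to tolerate ascenders and descenders that intrude on neighbouring lines.

```python
from typing import List, Tuple

# Bounding box of one connected component: (x_min, y_min, x_max, y_max).
Box = Tuple[int, int, int, int]


def vertical_overlap(a: Box, b: Box) -> float:
    """Fraction of the shorter box's height that the two boxes share."""
    shared = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return max(shared, 0) / shorter if shorter > 0 else 0.0


def grow_lines(boxes: List[Box], max_gap: int = 40,
               min_overlap: float = 0.5) -> List[List[Box]]:
    """Grow text lines by attaching each component to a line whose
    right-most component is horizontally adjacent and vertically aligned."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b[0]):  # scan left to right
        for line in lines:
            last = line[-1]
            close_enough = box[0] - last[2] <= max_gap
            if close_enough and vertical_overlap(last, box) >= min_overlap:
                line.append(box)
                break
        else:
            lines.append([box])  # no adjacent line found: start a new one
    return lines


# Toy page: three words on one line, two on the next.
components = [
    (0, 0, 50, 20), (60, 2, 110, 22), (120, 0, 170, 20),
    (0, 30, 40, 50), (50, 31, 100, 52),
]
text_lines = grow_lines(components)
```

The thresholds are placeholders; in practice they would be tuned to the scan resolution and font size.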
Word Separation
• In Devanagari, all characters in a word are joined by the Shirorekha (the horizontal headline stroke)
• Vertical projection profile easily separates words
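A vertical projection profile counts ink pixels per column; runs of empty columns mark the gaps between words. The sketch below assumes a binarized line image given as rows of 0/1 values; the function names and the `min_gap` threshold are illustrative assumptions.

```python
from typing import List, Tuple


def vertical_profile(image: List[List[int]]) -> List[int]:
    """Number of ink pixels (value 1) in each column of a binary image."""
    return [sum(column) for column in zip(*image)]


def split_words(image: List[List[int]], min_gap: int = 2) -> List[Tuple[int, int]]:
    """Return half-open column spans [start, end) of words, where words are
    separated by at least `min_gap` consecutive empty columns."""
    profile = vertical_profile(image)
    spans: List[Tuple[int, int]] = []
    start = None  # first column of the word being scanned, if any
    gap = 0       # length of the current run of empty columns
    for x, ink in enumerate(profile):
        if ink > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(profile) - gap))
    return spans


# Toy text line: the top row plays the role of the headline stroke
# that joins the characters of each word.
line_image = [
    [0, 1, 1, 1, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
]
words = split_words(line_image)
```

Because the headline stroke leaves no empty columns inside a word, only inter-word gaps produce zeros in the profile.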
Multilingual OCR using HMMs
Continuous Attributes
grapheme pos orientation angle
Down cusp
3.0 -90°
Up loop
Down arc
Stochastic Model
Observations
Integrating Online and Offline Handwriting Recognition
Structural Features
BAG
[Figure labels: Junction, Loops, Loop, Turns, End, End]
Feature Extraction and Ordering
Critical node: removal disconnects a connected component.
2-degree critical nodes keep feature ordering from left to right.
[Figure labels: Left Component, Right Component; features: Loop, End, Turns, Junction, Loops, End, Turns]
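The definition of a critical node matches what graph theory calls an articulation point. A minimal sketch, assuming the structural features form a connected undirected graph stored as an adjacency dictionary (the representation and function names are assumptions, and the brute-force check is chosen for clarity rather than speed):

```python
from typing import Dict, Hashable, List, Optional, Set

Graph = Dict[Hashable, Set[Hashable]]


def is_connected(graph: Graph, removed: Optional[Hashable] = None) -> bool:
    """True if the graph stays connected when `removed` is deleted."""
    nodes = [n for n in graph if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:  # depth-first traversal that skips the removed node
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour != removed and neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return len(seen) == len(nodes)


def critical_nodes(graph: Graph) -> List[Hashable]:
    """Nodes whose removal disconnects the (assumed connected) graph."""
    return [n for n in graph if not is_connected(graph, removed=n)]


def two_degree_critical_nodes(graph: Graph) -> List[Hashable]:
    """Critical nodes with exactly two neighbours: each splits the
    component into one left part and one right part."""
    return [n for n in critical_nodes(graph) if len(graph[n]) == 2]


# Toy stroke graph: a loop (a-b-c), a connecting stroke (c-d-e),
# and two stroke ends (f, g) leaving e.
strokes: Graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d", "f", "g"}, "f": {"e"}, "g": {"e"},
}
cuts = critical_nodes(strokes)
ordering_cuts = two_degree_critical_nodes(strokes)
```

Here `d` is the only 2-degree critical node: removing it leaves the loop on one side and the stroke ends on the other.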
Fingerprint Enhancement and Feature Extraction
Fingerprint Recognition
Orientation maps and minutiae detection
Preprocessing Operations
Filtering
• Image Enhancement
• Image Segmentation
• Correlation among fingers
Multiple Classifier Systems
Combination and Dynamic Selection
[Govindaraju and Ianakiev, MCS 2000]
[Figure labels: image, WR 1, WR 2, WR 3, +, Lexicon, Top 50, Top 5, <55, 1]
• Optimization problem
• Combinatorial explosion in
– arrangement of recognizers
– lexicon reduction levels
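One way to read the figure labels is as a cascade in which each word recognizer (WR) prunes the lexicon before passing it on. The sketch below illustrates that reading only; it is not the method of the cited paper. The three scoring functions are toy stand-ins that compare strings instead of images, and every name here is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A recognizer scores how well a lexicon word matches the input;
# higher is better. The input is a plain string in this toy example.
Recognizer = Callable[[str, str], float]


def reduce_lexicon(recognizer: Recognizer, image: str,
                   lexicon: Sequence[str], keep: int) -> List[str]:
    """Rank the lexicon with one recognizer and keep the top `keep` words."""
    ranked = sorted(lexicon, key=lambda w: recognizer(image, w), reverse=True)
    return ranked[:keep]


def cascade(image: str, lexicon: Sequence[str],
            stages: Sequence[Tuple[Recognizer, int]]) -> List[str]:
    """Apply recognizers in order, shrinking the lexicon at every stage."""
    candidates = list(lexicon)
    for recognizer, keep in stages:
        candidates = reduce_lexicon(recognizer, image, candidates, keep)
    return candidates


# Toy recognizers, ordered from cheap and coarse to precise.
def by_length(image: str, word: str) -> float:
    return -abs(len(image) - len(word))


def by_letters(image: str, word: str) -> float:
    return len(set(image) & set(word))


def by_position(image: str, word: str) -> float:
    matches = sum(a == b for a, b in zip(image, word))
    return matches - abs(len(image) - len(word))


lexicon = ["me", "he", "memo", "memory", "mellon", "so"]
stages = [(by_length, 4), (by_letters, 2), (by_position, 1)]
result = cascade("memo", lexicon, stages)
```

Both the order of `stages` and the cut-off at each stage are free parameters, which is where the combinatorial explosion named on the slide comes from.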
Lexicon Density
[Govindaraju, Slavik, and Xue, IEEE PAMI 2002]
Lexicon 1    Lexicon 2
Me           Me
He           Memo
So           Memory
To           Memoirs
In           Mellon
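The two lexicons have the same size but differ in how similar their entries are to one another. The sketch below uses mean length-normalized edit distance as a rough proxy for that similarity. This is an assumption made for illustration and is not the density measure defined in the cited paper; the function names are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_normalized_distance(lexicon: Sequence[str]) -> float:
    """Average pairwise edit distance, each divided by the longer word's
    length. Lower values mean the entries are more alike."""
    words = [w.lower() for w in lexicon]
    pairs = list(combinations(words, 2))
    total = sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs)
    return total / len(pairs)


lexicon_1 = ["Me", "He", "So", "To", "In"]
lexicon_2 = ["Me", "Memo", "Memory", "Memoirs", "Mellon"]
spread_1 = mean_normalized_distance(lexicon_1)
spread_2 = mean_normalized_distance(lexicon_2)
```

Under this proxy the entries of Lexicon 2, which all share the prefix "Me", sit closer together than those of Lexicon 1.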
Interactive Handwriting Recognition
Handwriting Recognition
Context Ranked Lexicon
Multiple Choice Question
Context Ranked Lexicon
Interactive Models
[McClelland and Rumelhart, Psychological Review, 1981]
[Figure: three levels, top to bottom]
Words: ABLE, TRIP, TRAP
Letters: A, T, N
Features
Handwritten CAPTCHAs
“CAPTCHAs”: Completely Automated Public Turing Tests to Tell Computers & Humans Apart
• challenges can be generated & graded automatically (i.e. the judge is a machine)
• accepts virtually all humans, quickly & easily
• rejects virtually all machines
• resists automatic attack for many years (even assuming that its algorithms are known?)
NOTE: the machine administers, but cannot pass the test!
L. von Ahn, M. Blum, N.J. Hopper, J. Langford, “CAPTCHA: Using Hard AI Problems For Security,” Proc., EuroCrypt 2003, Warsaw, Poland, May 4-8, 2003 [to appear].
Yahoo!’s present CAPTCHA: “EZ-Gimpy”
• Randomly pick: one English word, deformations, degradations, occlusions, colored backgrounds, etc.
• Better tolerated by users
• Now used on a large scale to protect various services
• Weaknesses: a single typeface, English lexicon
Indirect Biometrics from Medical Forms Images
Hard biometrics
Face
Eye: Retina & Iris
Fingerprint
Hand Geometry
Handwriting
Speech
DNA
Soft biometrics
Age
Ethnicity
Nationality
Build
Gait
Mannerisms
Writing style
(Semantic)
Derived biometrics
Text/News
WWW
Indirect biometrics
Driver’s License
Medical Records
INS Forms
The Biometrics Spectrum
• Biometric Consortium (www.biometrics.org) lists several products:
– Faces (30); Fingerprints (50); Hand geometry (30); Handwriting (5); Iris (5); Multimodal (6); Retinal (2); Vein (3); Voice (22); Other (20)
– NONE on soft biometrics
– NONE on the fusion of indirect and derived biometrics
NYS EMS PCR Form
NYS PCR Example
Thousands are filed a day.
Passed from EMS to Hospital.
PCR Purpose:
– Medical care/diagnosis
– Legal Documentation
– Quality Assurance
EMS Abbreviations
COPD    Chronic Obstructive Pulmonary Disease
CHF     Congestive Heart Failure
D/S     Dextrose in Saline
PID     Pelvic Inflammatory Disease
GSW     Gunshot Wound
NKA     No known allergies
KVO     Keep vein open
NaCl    Sodium Chloride
Medical Text Recognition and Data Mining