development of an ocr system nathan harmata tjhsst computer systems lab 2007-2008

12
Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Upload: jean-barrett

Post on 13-Jan-2016

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Development of an OCR System

Nathan Harmata

TJHSST Computer Systems Lab2007-2008

Page 2: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

What is OCR?

Optical Character Recognition

Font and handwriting based

Page 3: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Goals of My Project

Generic recognition for Latin-based fonts

Proper handling of most formatting

System built from scratch

Page 4: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Overview of Idocrase System

Page 5: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Image Processing

Page 6: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Transformations

Attribute

Character Model

Page 7: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Transformations

Sector Vector - image is parsed into parts that pass the vertical line test

- then each part is transformed into a collection of line segments

Gap Vector - gaps, if any, are found on the four sides of the image

Page 8: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Transformations

Pixel Concentration Vector – which sides, if any, have a higher concentration of pixels

Page 9: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Character Recognition

GCDD – Generic Character Definition Database

Averages of Character Models for every character from many different fonts

0 PixelConcentrationVector balanced balanced SectorVector 4 3 GapVector

Page 10: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Character Recognition

For a single character:

For words, dictionary and grammar references are used.

Page 11: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Idocrase Application

Page 12: Development of an OCR System Nathan Harmata TJHSST Computer Systems Lab 2007-2008

Results

-Mediocre word recognition-Doesn’t handle formatting well-Doesn’t handle small letters well-Fairly accurate single character recognition (93.7%)