ai based character recognition and speech synthesis

Seminar on

“AI Based Character Recognition and Speech Synthesis”

Developed By:

Kalyani Hadke Rani Kubetkar

Shreya Surjuse Ankita Jadhao

Kruttika Sorte

Guided By

Prof. H. N. Datir

Artificial Intelligence Based Character Recognition and Speech Synthesis

NEED

• We face many problems in daily life. When we capture an image, we sometimes do not get a clear picture and cannot recognize the words in it.

• Many people face the problem of illiteracy, so we wish that the image could be converted to text for various purposes.

• While studying, we do not read text as a regular practice, so we wish that this text could be converted into audio.

• Beyond that, we wish that whatever is captured in an image could be converted into audio, since we generally prefer listening, as we do with songs.

Introduction to CR and SS

• Optical Character Recognition (OCR) is the electronic or mechanical conversion of images of text.

• OCR converts scanned images of text into machine-encoded text.

• Speech synthesis is the artificial production of human speech.

• Speech synthesizer: a computer system used for this purpose.

• A TTS engine converts:

• normal language text into speech

• a symbolic linguistic representation into speech

Image → OCR → Recognized text → Speech engine → Speech

Overview

DFD for Character Recognition System

Pre-processing → Segmentation → Image digitization → Network implementation → Training of learning network → Recognition and network testing

Pre-processing

• De-noising

• De-skew

• Binarization
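A minimal sketch of two of the pre-processing steps, binarization and de-noising, written in plain Python. The fixed threshold of 128, the function names, and the "no inked neighbour" noise rule are assumptions made for illustration; the slides do not say which methods were used, and de-skew is left out.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to a binary matrix.

    Dark pixels (ink) become 1 and light pixels (paper) become 0.
    The threshold value is an assumption, not taken from the slides.
    """
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def denoise(binary):
    """Clear isolated ink pixels that have no ink among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    cleaned = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if binary[y][x] == 1:
                neighbours = sum(
                    binary[ny][nx]
                    for ny in range(max(0, y - 1), min(height, y + 2))
                    for nx in range(max(0, x - 1), min(width, x + 2))
                ) - 1  # subtract the pixel itself
                if neighbours == 0:
                    cleaned[y][x] = 0
    return cleaned


gray = [
    [250, 250, 250, 250, 250],
    [250,  20,  30, 250, 250],
    [250,  25,  15, 250, 250],
    [250, 250, 250, 250,  10],  # lone dark speck in the corner
]
binary = denoise(binarize(gray))
```

The 2x2 block of ink survives, while the single dark speck in the corner is treated as noise and cleared.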

Image segmentation

• Decomposes a sequence of characters into individual symbols.

• Directly affects the recognition rate of the script.

• Locates and identifies the boundaries of the image.

Types of segmentation:

1. External segmentation

2. Internal segmentation

SEGMENTATION

Image segmentation is the process of partitioning an image into multiple segments, so as to change the representation of an image into something that is more meaningful and easier to analyze.


• External segmentation: determines the character lines in the text. (Figure: the text line "Image segmentation is the process of partitioning" marked as line 1.)

• Internal segmentation: decomposes an image of a sequence of characters into images of individual symbols. (Figure: the word "Image" split into the symbols I, m, a, g, e.)
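One common way to implement both kinds of segmentation is with projection profiles: count the ink pixels in each row to find text lines, then count the ink pixels in each column of a line to find symbols. The sketch below assumes a binary page where 1 is ink, and that lines and symbols are separated by at least one blank row or column. The function names are illustrative, and the slides do not state which method was used.

```python
def split_runs(profile):
    """Return (start, end) index pairs for runs of non-zero profile values."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def external_segmentation(page):
    """Find text lines: runs of rows that contain at least one ink pixel."""
    row_profile = [sum(row) for row in page]
    return split_runs(row_profile)


def internal_segmentation(page, line):
    """Find symbols in one text line: runs of columns that contain ink."""
    top, bottom = line
    col_profile = [
        sum(page[y][x] for y in range(top, bottom))
        for x in range(len(page[0]))
    ]
    return split_runs(col_profile)


page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
]
lines = external_segmentation(page)
symbols = [internal_segmentation(page, line) for line in lines]
```

Touching or overlapping characters would need a more careful method; this sketch only handles cleanly separated symbols.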

• Maps the symbol image into a corresponding two-dimensional binary matrix

• Issue: deciding the size of the matrix

• Sampling strategy for mapping the symbol image

Image Digitization - Matrix matching

Figure: the input letter ‘a’ placed on a segmented grid and digitized into a binary matrix (0 for empty cells, 1 for cells covered by the letter).
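The digitization step can be sketched as sampling the symbol image onto a grid of fixed size. The rule used here, setting a cell to 1 when at least half of the pixels it covers are ink, is only one possible sampling strategy and is an assumption; the function name and the grid size are also illustrative.

```python
def digitize(symbol, rows, cols):
    """Sample a binary symbol image onto a fixed rows x cols grid.

    Each grid cell covers a rectangular block of source pixels and is
    set to 1 when at least half of that block is ink.
    """
    height, width = len(symbol), len(symbol[0])
    grid = []
    for r in range(rows):
        y0 = r * height // rows
        y1 = max(y0 + 1, (r + 1) * height // rows)
        grid_row = []
        for c in range(cols):
            x0 = c * width // cols
            x1 = max(x0 + 1, (c + 1) * width // cols)
            ink = sum(symbol[y][x] for y in range(y0, y1) for x in range(x0, x1))
            area = (y1 - y0) * (x1 - x0)
            grid_row.append(1 if 2 * ink >= area else 0)
        grid.append(grid_row)
    return grid


symbol = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
matrix = digitize(symbol, 2, 2)
```

A larger grid keeps more detail of the symbol but gives the network more inputs to learn from, which is the matrix-size trade-off mentioned above.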

• To feed the matrix data to the network, it must be linearized to a single dimension.

Figure: the binary matrix of the letter unrolled into a single row of bits (… 0 1 1).
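Linearizing the matrix can be as simple as reading it row by row. A short sketch, with an illustrative function name and a made-up 3x4 matrix:

```python
def linearize(matrix):
    """Flatten a 2-D binary matrix row by row into one input vector."""
    return [bit for row in matrix for bit in row]


matrix = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
vector = linearize(matrix)
```

Each element of the resulting vector then feeds one neuron of the input layer.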

Recognition pipeline (example: the word "NAME")

1. Image of scanned document: NAME

2. Sub-images of individual letters from the document: N, A, M, E

3. Binary representation of the sub-images (e.g., 0 is white and 1 is black): 001110100…., 111010011…., 11001100…., 000111101…..

4. A supervised neural network that has been trained to recognize images of characters.

5. The neural network outputs numeric values corresponding to the recognized characters: 14, 1, 13, 5

6. A file containing the text of the scanned document: NAME
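In the example, the numeric outputs 14, 1, 13, 5 match the alphabet positions of N, A, M, E. Assuming that is the encoding (the slides do not state it explicitly), the last step of the pipeline could look like this:

```python
def decode(values):
    """Map alphabet positions (1 = A ... 26 = Z) to a string."""
    return "".join(chr(ord("A") + value - 1) for value in values)


text = decode([14, 1, 13, 5])
```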

An artificial neural network (ANN) consists of a large number of highly interconnected processing elements (neurons) working in unison to solve specific problems, analogous to the biological neurons in the brain. Neurons communicate through weighted links.

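Figure: two neurons joined by a weighted link, and the model of a neuron k. The inputs X1 … Xn are multiplied by the weights Wk1 … Wkp, summed, and passed through a sigmoid function to give the output.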
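The neuron model in the figure, a weighted summation followed by a sigmoid function, can be written in a few lines. The bias term and the sample values are assumptions added for illustration; the figure shows only inputs, weights, summation, and sigmoid.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by the sigmoid function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


y = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, -0.5])
```

Here the weighted sum is 0.5 - 0.5 = 0, so the output is sigmoid(0) = 0.5.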

• Feed-forward neural network

• A multilayer perceptron

• Teaching and adaptation of the ANN

• Implementation of the ANN

Figure: neural network. Input signal → Input layer → First hidden layer → Second hidden layer → Output layer → Output signal
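A sketch of the forward pass through a feed-forward multilayer perceptron with two hidden layers, as in the figure. The layer sizes (12 inputs, 8 and 6 hidden neurons, 26 outputs for the letters A to Z), the random untrained weights, and the bias stored at the end of each neuron's weight list are all assumptions for illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_layer(n_inputs, n_neurons, rng):
    """One layer is a list of neurons; each neuron is its weights plus a bias."""
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(n_inputs + 1)]
        for _ in range(n_neurons)
    ]


def forward(layers, signal):
    """Propagate the input signal layer by layer to the output layer."""
    for layer in layers:
        signal = [
            sigmoid(sum(w * x for w, x in zip(neuron, signal)) + neuron[-1])
            for neuron in layer
        ]
    return signal


rng = random.Random(0)
sizes = [12, 8, 6, 26]  # input, first hidden, second hidden, output
layers = [make_layer(sizes[i], sizes[i + 1], rng) for i in range(len(sizes) - 1)]

input_signal = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1]
output_signal = forward(layers, input_signal)
# The strongest output neuron gives the recognized character number (1-26).
recognized = output_signal.index(max(output_signal)) + 1
```

With untrained weights the recognized value is meaningless; training is what makes the strongest output correspond to the right character.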




Recognition Network testing


Binary converted image → Input signal → Neural network → Output signal → Obtained text of scanned image

Back-propagation is used for error calculation.
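The following is a minimal sketch of back-propagation for a network with one hidden layer, trained on two made-up four-bit patterns. The network size, learning rate, squared-error measure, and training data are assumptions chosen to keep the example small; they are not the settings of the system described in the slides.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    # Each neuron holds n_in weights followed by one bias.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(layer, inputs):
    return [sigmoid(sum(w * x for w, x in zip(n, inputs)) + n[-1]) for n in layer]


def train_step(hidden, output, inputs, target, rate=0.5):
    """One back-propagation update; returns the squared error before the update."""
    h = layer_forward(hidden, inputs)
    o = layer_forward(output, h)
    # Output deltas: error times the sigmoid derivative o * (1 - o).
    d_out = [(o[k] - target[k]) * o[k] * (1 - o[k]) for k in range(len(o))]
    # Hidden deltas: output deltas propagated back through the output weights.
    d_hid = [
        h[j] * (1 - h[j]) * sum(d_out[k] * output[k][j] for k in range(len(o)))
        for j in range(len(h))
    ]
    for k, neuron in enumerate(output):
        for j in range(len(h)):
            neuron[j] -= rate * d_out[k] * h[j]
        neuron[-1] -= rate * d_out[k]
    for j, neuron in enumerate(hidden):
        for i in range(len(inputs)):
            neuron[i] -= rate * d_hid[j] * inputs[i]
        neuron[-1] -= rate * d_hid[j]
    return 0.5 * sum((o[k] - target[k]) ** 2 for k in range(len(o)))


rng = random.Random(0)
hidden = init_layer(4, 3, rng)
output = init_layer(3, 2, rng)
samples = [([1, 0, 1, 0], [1, 0]), ([0, 1, 0, 1], [0, 1])]

errors = []
for epoch in range(500):
    errors.append(sum(train_step(hidden, output, x, t) for x, t in samples))
```

The list `errors` records the total error per epoch, so comparing its first and last entries shows how far training has reduced the error.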


Speech Synthesis



Input image → File containing text of scanned document → TTS engine (NLP → DSP) → Speech

• TTS: Text-to-Speech engine

• A computer-based system that reads any text aloud.

• A TTS engine consists of:

• Front-end: NLP

• Back-end: DSP

Modules of Text-to-Speech

Text (input) → Natural language processing (text preprocessing, text analysis, linguistic analysis) → Phonemes and prosody → Digital signal processing (speech synthesizer) → Speech (output)

Figure 1. A simple but general functional diagram of a TTS system


• This step is called the high-level, front-end, or text-to-phoneme step.

• It consists of the following parts:

• Text analysis

• Automatic phonetization

• Prosody generation

NLP Module


Text Analysis

• Pre-processing

• Morphological analysis

• Contextual analysis

• Syntactic-prosodic parsing
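As an illustration of the pre-processing part of text analysis, the sketch below splits text into words, expands a few abbreviations, and spells out digits so that later stages see only pronounceable words. The abbreviation table and the digit-by-digit reading of numbers are hypothetical simplifications, not the rules of any particular TTS engine.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"dr.": "doctor", "prof.": "professor", "no.": "number"}
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def preprocess(text):
    """Turn raw text into a list of lowercase, pronounceable words."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
            continue
        token = token.strip(".,;:!?")
        if re.fullmatch(r"[0-9]+", token):
            # Read numbers digit by digit (a simplification).
            words.extend(DIGITS[int(d)] for d in token)
        elif token:
            words.append(token)
    return words


words = preprocess("Prof. Datir guided 5 students.")
```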


Automatic Phonetization

• Rule-based

• Dictionary-based

• Hybrid approach
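The three approaches can be sketched together: a dictionary lookup, a rule-based letter-to-sound fallback, and a hybrid function that tries the dictionary first. The tiny lexicon, the rules, and the ARPAbet-like phoneme symbols are toy assumptions for illustration; real systems use far larger dictionaries and context-sensitive rules.

```python
# Toy pronunciation dictionary (hypothetical entries).
LEXICON = {
    "name": ["N", "EY", "M"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Toy letter-to-sound rules: two-letter rules are tried before single letters.
DIGRAPHS = {"sh": "SH", "ch": "CH", "th": "TH", "ph": "F"}
SINGLE = {
    "a": "AE", "b": "B", "c": "K", "d": "D", "e": "EH", "f": "F", "g": "G",
    "h": "HH", "i": "IH", "j": "JH", "k": "K", "l": "L", "m": "M", "n": "N",
    "o": "AA", "p": "P", "q": "K", "r": "R", "s": "S", "t": "T", "u": "AH",
    "v": "V", "w": "W", "x": "K S", "y": "Y", "z": "Z",
}


def rule_based(word):
    """Convert a word to phonemes using only the letter-to-sound rules."""
    phonemes, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phonemes.append(DIGRAPHS[pair])
            i += 2
        else:
            phonemes.extend(SINGLE.get(word[i], "").split())
            i += 1
    return phonemes


def phonetize(word):
    """Hybrid approach: dictionary first, rules for unknown words."""
    word = word.lower()
    if word in LEXICON:
        return list(LEXICON[word])
    return rule_based(word)
```

For example, `phonetize("name")` comes from the dictionary, while `phonetize("ship")` falls back to the rules.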


Prosody Generation

• Pitch

• Intonation

• Rhythm



DSP component

• Low-level step: phoneme to speech

• There are two main technologies used for generating synthetic speech waveforms:

• Concatenative synthesis

• Formant synthesis



Formant Synthesis

• Formant synthesis is rule-based synthesis.

• It does not use any human speech samples at runtime.

• The waveform is created using an acoustic model of the human vocal tract.

• It generates artificial, somewhat robotic speech.
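A very small sketch of the idea behind formant synthesis: a pulse train standing in for the vocal cords is passed through a cascade of second-order resonators, one per formant, which stand in for the vocal tract. The resonator coefficients follow the standard two-pole digital resonator form; the formant frequencies, bandwidths, pitch, and sample rate below are rough illustrative values, not measurements.

```python
import math


def resonator(signal, freq, bandwidth, rate):
    """Second-order resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    period = 1.0 / rate
    c = -math.exp(-2.0 * math.pi * bandwidth * period)
    b = (2.0 * math.exp(-math.pi * bandwidth * period)
         * math.cos(2.0 * math.pi * freq * period))
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out


def synthesize_vowel(formants, pitch=100, rate=8000, duration_ms=200):
    """Filter a pulse train through one resonator per (frequency, bandwidth)."""
    n = rate * duration_ms // 1000
    step = rate // pitch  # samples between glottal pulses
    wave = [1.0 if i % step == 0 else 0.0 for i in range(n)]
    for freq, bandwidth in formants:
        wave = resonator(wave, freq, bandwidth, rate)
    return wave


# Two formants, roughly in the region of an "a"-like vowel (assumed values).
samples = synthesize_vowel([(700, 80), (1100, 90)])
```

No recorded speech is involved at any point, which is why this family of methods tends to sound artificial.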




• Based on the concatenation of segments of recorded speech.

• Gives the most natural sounding synthesized speech.

Concatenative Synthesis

Subtypes

• Diphone concatenation synthesis: somewhat robotic speech, sonic glitches

• Unit concatenation synthesis: natural speech

Unit Concatenation Synthesis: Algorithm

• Break the language down into small units (phonemes, syllables, etc.)

• Create a large database of recorded speech

• Label each unit: pitch, duration, prosody, position in syllable, etc. Labeling is synthesizer-dependent.

• Select the target utterance at runtime by determining the best chain of units (HMM, decision tree)

• Use DSP to smooth transitions between units

Approaches to Waveform Generation: Concatenative
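The "best chain of units" step can be sketched as a shortest-path search. Note that this sketch uses plain dynamic programming (a Viterbi-style search) with pitch-only costs rather than the HMM or decision-tree methods named above. The database entries, unit names, and cost functions are hypothetical.

```python
def select_units(targets, database, join_weight=1.0):
    """Pick one recorded unit per target phoneme with the lowest total cost.

    targets:  list of (phoneme, desired_pitch)
    database: dict mapping phoneme -> list of units {"id": ..., "pitch": ...}
    Target cost = pitch mismatch with the request; join cost = pitch jump
    between neighbouring units (both are simplifications).
    """
    best = []  # best[i][j] = (cost of best chain ending in unit j, back-pointer)
    for i, (phoneme, pitch) in enumerate(targets):
        row = []
        for unit in database[phoneme]:
            target_cost = abs(unit["pitch"] - pitch)
            if i == 0:
                row.append((target_cost, None))
                continue
            previous = database[targets[i - 1][0]]
            cost, back = min(
                (best[i - 1][k][0]
                 + join_weight * abs(unit["pitch"] - prev["pitch"]), k)
                for k, prev in enumerate(previous)
            )
            row.append((cost + target_cost, back))
        best.append(row)

    # Trace the cheapest chain back from the last target.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    chain = []
    for i in range(len(targets) - 1, -1, -1):
        chain.append(database[targets[i][0]][j]["id"])
        j = best[i][j][1]
    return chain[::-1]


database = {
    "N": [{"id": "N1", "pitch": 100}, {"id": "N2", "pitch": 140}],
    "EY": [{"id": "EY1", "pitch": 105}, {"id": "EY2", "pitch": 150}],
    "M": [{"id": "M1", "pitch": 110}, {"id": "M2", "pitch": 90}],
}
targets = [("N", 100), ("EY", 110), ("M", 110)]
chain = select_units(targets, database)
```

A real synthesizer would compare many more labels per unit (duration, prosody, position in syllable) and then smooth the joins with DSP.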





Advantages

• Machine language translation

• Information retrieval

• Visual issues (difficulty seeing text)

• Motor issues (difficulty handling a book or paper)

QUESTIONS????
