
Offline Extraction of Indic Regional Language from Natural Scene Image using Text Segmentation and Deep Convolutional Sequence

Sauradip Nag¹, Pallab Kumar Ganguly¹, Sumit Roy¹, Sourab Jha¹, Krishna Bose¹, Abhishek Jha¹ and Kousik Dasgupta²

¹Kalyani Government Engineering College, Kalyani, Nadia, India
²Faculty of Computer Science and Engineering, Kalyani Government Engineering College, Kalyani, Nadia, India

[email protected], [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

Abstract: Extracting the regional language associated with a natural scene image is a challenging proposition, because it depends entirely on the text information extracted from the image, and text extraction in turn is affected by varying lighting conditions, arbitrary orientation, inadequate text information, heavy background interference and changes in text appearance. This paper presents a novel unified method for tackling these challenges. The proposed work applies an image correction and segmentation technique on top of an existing text detection pipeline, the Efficient and Accurate Scene Text Detector (EAST). EAST uses the standard PVAnet architecture to select features and non-maximum suppression to detect text in an image. Text recognition is performed with a combined architecture of a MaxOut convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network. After the text is recognized by this deep learning based approach, the native-language text is translated to English and tokenized using standard text tokenizers. The tokens that most likely represent a location are used to find the Global Positioning System (GPS) coordinates of that location, and subsequently the regional languages spoken there are extracted. The proposed method is tested on a self-generated dataset collected from Government of India data and experimented on standard datasets to evaluate its performance. A comparative study with several state-of-the-art methods for text detection, text recognition and regional-language extraction from images shows that the proposed method outperforms the existing methods.

Keywords: Segmentation, Text Recognition, Language Extraction, Deep Learning, Bi-Directional LSTM, Tokenization, Neural Network.

1. Introduction

As the internet and the use of social media grow, the rise of easily accessible multimedia data such as video, images and text has created an unprecedented information overflow. In contrast, information extraction from natural scenes is still a daunting task and a currently active area of research, mainly because of the complexities faced during extraction. India, being a vast country, has a wide span of uncovered regions and various colloquial languages. India currently has twenty-six (26) different regional languages, many of which are native languages of mainly rural areas. Extraction of these regional languages from rural areas is itself a challenging task, since rural areas generally do not have well-defined addresses. Recent studies explored the extraction of Devanagari scripts from text [8], but these do not give information about the specific regions where those languages are spoken, or about the approximate location of the image. Modern applications focus on using the metadata of an image to find its ground location and from there the correct regional language. However, if images are captured using a digital single-lens reflex (DSLR) or other digital camera, they may not be geotagged, in which case this approach fails to obtain the location. It is therefore essential to extract location information from a natural scene image offline, which is novel in its kind. The work in this paper thus focuses on the offline extraction of location from a natural scene image containing address information, and subsequently on finding the regional language of that region. This is interesting because we use a text candidate segmentation that outperforms existing methods on both low- and high-resolution images. We perform the entire process offline by training the proposed text recognition model for different regional languages. A sample collection of images has been generated from the Government of India dataset [14] for the experimental study; some sample images are shown in Fig. 1. The main contributions of the proposed work are:

- Using image correction for low-quality images and an existing deep learning method to improve performance.


- Using a deep convolutional sequence to detect Indic languages as shown in Fig 1.
- Detecting the location from an image having address information in native languages.
- Detecting the regional languages from that address information.

The rest of the paper is laid out as follows. In Section 2, we discuss related work on text detection and text recognition using both traditional and deep learning based methods, and on the extraction of language from text. In Section 3, the proposed text segmentation and connectionist text proposal network for Indic languages are described, along with regional-language extraction from location information. In Section 4, we discuss the dataset, the experimental setup, the comparison of the proposed and existing methods, and the results in detail. Section 5 points out the efficacy and drawbacks of the proposed approach, gives some directions for future work and concludes with suggested applications.

2. Related Work

Since natural scene images are affected by varying lighting conditions, text orientation and inadequate text information, the extraction of text candidate regions from the natural scene image is very important in this literature. Tian et al. [1] proposed detecting text in natural images with a connectionist text proposal network. The method detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps, and explores the context information of the image for successful text detection; it is reliable on multi-scale, multi-language text. Shi et al. [4] proposed detecting oriented text in natural images by linking segments. The idea is that text is decomposed into segments and links: a segment is an oriented text box covering part of a text line, while a link connects two adjacent segments, indicating that they belong to the same text line. To achieve this, the method uses deep learning. Zhou et al. [5] proposed an efficient and accurate scene text detector that predicts text lines without detecting words or characters, based on deep learning concepts involving the design of the loss functions and the neural network architecture; the method works well for arbitrary orientations. He et al. [18] proposed the Deep Text Recurrent Network (DTRN), which generates an ordered high-level sequence from a word image using a MaxOut CNN and then uses a Bi-LSTM to decode the generated CNN sequence. These works on text detection in natural scene images consider high-resolution, high-contrast images. They directly or indirectly explore connected component analysis and character-level characteristics for successful detection of text of arbitrary orientation, multiple languages, font faces, font sizes, etc. However, their performance degrades when the image contains both high- and low-contrast text. To overcome the problem of low contrast and resolution, the proposed method first fixes poor-quality images using image correction techniques and then uses EAST [5] to detect text boxes; it also uses a convex hull to generate orientation-correct text segmentation. The work in this paper presents a new method of text candidate region selection and segmentation, and explores the problem of regional-language detection from both low-quality and high-quality images. Since deep learning based approaches perform better than conventional approaches, neural network based architectures are used for both text detection and text recognition, as elaborated in the upcoming sections.

Fig. 1. Sample images in different languages (Hindi, English, Tamil, Urdu, Bengali) having address information


3. The Proposed Methodology

In this work an RGB image having at least one line of address information is taken as input. This input image is first preprocessed for image correction, applying a Gaussian blur to reduce noise and a gamma filter to enhance contrast for night images. The meta-information is then checked for GPS coordinates; if they are present, we proceed directly to find the regional language. If not, we pass the corrected image into the EAST [5] text detection pipeline, which uses the standard PVAnet and a non-maximum suppression layer as parts of its architecture and returns bounding boxes over probable text regions. We binarize the image and periodically grow the boxes on both sides until neighbouring boxes overlap, which ensures that adjacent information is not lost. Since some important information may still be lost due to orientation, we apply a polygonal convex hull to correct the segmentation and subsequently mask out the colour image. After segmenting the text ROI from the image, we use DTRN [18]: because address information can be in any font, we chose a method that does not depend on the nature of the text. Motivated by the work of He et al. [18], we use a MaxOut CNN to obtain high-level features for the input text instead of the gruelling task of character segmentation, which is neither robust nor trustworthy. As shown in Figure 2, these high-level features are then passed into a recurrent neural network architecture. Since we need address information, which is quite sensitive to the context of a line, the network may suffer from the vanishing gradient problem; a Bi-LSTM is therefore used, as it solves the vanishing gradient problem and can learn meaningful long-range interdependencies, which is useful especially when the main task is to retrieve address information from a text sequence. A Connectionist Temporal Classification (CTC) layer is added to the output of the LSTM layers for sequential labelling. We have trained this architecture for Hindi and Telugu, which is our contribution. We pass the same input image to all three language architectures (English, Hindi, Telugu). The softmax layer added at the end of the MaxOut CNN layers returns confidence scores for the detected high-level features, and we pass on the high-level features of the architecture whose softmax score is above a threshold. After recognition, we detect the language using Google's offline language detector module [28] and subsequently translate it to English for a better recognition rate. The proposed approach then uses the Moses tokenizer [19], which tokenizes the words of the raw translated text without punctuation. Each of these tokens/words is queried individually against the Post Office Database collected from the Government of India to obtain the GPS coordinates of the post offices common to the locations found through tokenization.

Fig. 2. Framework of the proposed method: images from the image database are preprocessed for correction, text ROIs are selected, merged and segmented, and recognized; the recognized native-language text is automatically language-detected and translated (e.g. "State Primary School, Bhimon Dhanani Khatavas, Pt. Luni, Jodhpur"), tokenized, and each token is checked against the location database, with rejected words discarded; the accepted location words (here "Khatavas", "Luni", "Jodhpur") yield latitude 26.29, longitude 73.02 and pincode 342802 via the offline reverse geocoder, the regional-language database returns the languages spoken (Hindi, Marwari), and the image is geotagged and saved to the database


After obtaining the (Pincode, Longitude, Latitude) triple for a particular image, we use reverse geocoding [29] and the database RLDB, created by collecting the regional-language information found in [13]. From this we obtain the regional languages spoken at the particular location found in the input image; finally we geotag the image and save it. The complete framework of the proposed method can be seen in Fig 2.

3.1 Image Correction

Any picture of a natural scene can contain a certain amount of noise, blur and varying lighting conditions, which may affect any method for text candidate selection. Extra care must therefore be taken to ensure that performance does not drop, irrespective of picture quality, as long as the image contains at least one piece of address information. The input image is accordingly preprocessed before text detection. A bright or dark image is identified using the histogram of intensity; if the input is a night (dark) image, a gamma filter with gamma value 2.5 is applied. The value 2.5 is empirical, obtained after proper experimentation, and is used to correct the image as shown in Fig 3(b). Noise is removed from the colour image using a non-local denoising method [9], and an unsupervised Wiener filter [10] is used to de-blur the de-noised image as shown in Fig 3(c).
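A minimal sketch of this correction step, assuming OpenCV; only the gamma value 2.5 comes from the paper, while the darkness threshold and the denoising strengths below are illustrative placeholders:

    import cv2
    import numpy as np

    def correct_image(img, gamma=2.5, dark_mean=60):
        # Identify a dark (night) image from its mean grey-level intensity;
        # the threshold of 60 is an illustrative assumption, not from the paper.
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        if gray.mean() < dark_mean:
            # Gamma filter with gamma = 2.5 (the paper's empirical value),
            # implemented as a 256-entry lookup table.
            lut = np.array([255.0 * (i / 255.0) ** (1.0 / gamma)
                            for i in range(256)], dtype=np.uint8)
            img = cv2.LUT(img, lut)
        # Non-local means denoising on the colour image [9]; the filter
        # strengths are OpenCV defaults, not tuned values from the paper.
        return cv2.fastNlMeansDenoisingColored(img, None, 10, 10, 7, 21)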

3.2 Text Detection and Candidate Segmentation

Text detection in natural scenes has been an area of research for some years. Conventional approaches rely on manually designed features such as the Stroke Width Transform (SWT) or MSER-based methods, which identify character candidates using edge detection, extremal region extraction or contour features. Deep learning methods outperform these approaches, which fall behind in terms of robustness and accuracy.

Text Detection: Zhou et al. [5] proposed EAST, which predicts text lines without detecting words or characters, based on deep learning concepts; this approach is suitable for extracting information such as text in multiple fonts. EAST [5] also implements geometry map generation, which can detect text in arbitrary orientations without losing text information, with the help of the rotated box (RBOX) and quadrangle (QUAD) geometries. However, EAST [5] fails on low-quality images, since its trained model used high-quality images from standard datasets. As mentioned in Section 3.1, low-quality images are therefore improved before being fed to the text detection pipeline. The pipeline proposed by Zhou et al. [5] consists of only two stages, a fully convolutional neural network (FCN) and non-maximum suppression (NMS). The FCN internally uses a standard neural network architecture called PVAnet, which consists of three internal parts: a feature extraction layer, a feature merging layer, and geometry map generation for the bounding boxes after the output layer, as depicted in Fig 4. In the original pipeline the input is the uncorrected low/high-quality image, which is resized and fed to the network; the first layer extracts the features relevant to text, and the second layer merges the features and extracts possible text-region mappings using the geometry map generation losses. The NMS layer then selects the maps corresponding to text and marks a bounding box around each, as shown in Fig 4(c).

Fig. 3. How the input image is preprocessed before text detection: (a) input image, (b) gamma-corrected image, (c) noise- and blur-corrected image
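For illustration, the EAST detection stage can be driven through OpenCV's dnn module; this is a sketch, not the authors' code, the frozen-graph file name is a placeholder, and the two requested outputs are the score map and the RBOX geometry map described above:

    import cv2

    net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # placeholder path
    img = cv2.imread("corrected_scene.jpg")                 # output of Section 3.1
    # EAST expects input dimensions that are multiples of 32.
    blob = cv2.dnn.blobFromImage(img, 1.0, (640, 640),
                                 (123.68, 116.78, 103.94), swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                    "feature_fusion/concat_3"])
    # scores: per-pixel text confidence; geometry: RBOX offsets and angle.
    # Decoding `geometry` into rotated rectangles and suppressing duplicates
    # with cv2.dnn.NMSBoxesRotated yields the text boxes of Fig 4(c).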


Text Segmentation: Since the architecture mentioned above selects individual boxes, it may be difficult to extract location information: a pair of words connected by a '-' (dash) may represent a single location, while EAST alone detects the two connected words individually, which may create ambiguity in location extraction. Hence we need to process the image and segment out the text candidate region. The proposed pipeline thus contains three key steps, image correction, text detection, and merging and segmentation, as shown in Fig 5. The input image is corrected by adjusting intensity, correcting blur and denoising, and is then passed to the existing EAST [5] pipeline of Fig 4. This FCN takes the corrected image as input and outputs dense per-pixel predictions of text lines, eliminating time-consuming intermediate steps and making it a fast text detector. The post-processing steps include thresholding and NMS on the predicted geometric shapes; NMS is a post-processing algorithm responsible for merging all detections that belong to the same object. The output of this stage is a set of bounding boxes on the text regions, as shown in Fig 5(d).

After the formation of bounding boxes over the image as shown in Fig 6(c), the box sizes are increased periodically on both sides until each box overlaps with a neighbouring bounding box; in this way we make sure that neighbouring textual information is not lost, as represented in Fig 6(d). After this operation we observe that the segmented region is not uniform (Fig 6(d)) and has lost information, which may affect the overall results since address information is interlinked and the order of the text matters. The proposed approach therefore first binarizes the incorrect segmentation and then uses a polygonal convex hull, as in Fig 6(f), to correct the region, recovering the text information that was lost due to orientation and box merging. Finally, we mask out the region from the colour input image as shown in Fig 6(g) and Fig 5(e); this becomes the input to the text recognition stage.
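The merging-and-masking step can be sketched as follows, under simplifying assumptions (axis-aligned boxes, a fixed growth step, and helper names of our own; the paper does not publish this code):

    import cv2
    import numpy as np

    def overlaps(a, b):
        # Axis-aligned overlap test for boxes given as (x1, y1, x2, y2).
        return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

    def merge_and_mask(img, boxes, step=4, max_iter=100):
        # Periodically grow every box sideways until each one touches a
        # neighbour (capped by max_iter so isolated boxes cannot loop forever).
        grown = [list(b) for b in boxes]
        for _ in range(max_iter):
            for b in grown:
                b[0] -= step
                b[2] += step
            if all(any(overlaps(a, b) for b in grown if b is not a) for a in grown):
                break
        # Convex hull over all grown corners (Fig 6(f)), then mask the colour
        # image so only the text region survives (Fig 6(g)).
        pts = np.array([(x, y) for x1, y1, x2, y2 in grown
                        for (x, y) in ((x1, y1), (x2, y1), (x2, y2), (x1, y2))],
                       dtype=np.int32)
        hull = cv2.convexHull(pts)
        mask = np.zeros(img.shape[:2], dtype=np.uint8)
        cv2.fillConvexPoly(mask, hull, 255)
        return cv2.bitwise_and(img, img, mask=mask)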

Fig. 4. Architecture of the text detection pipeline of EAST [5]: (a) input, (b) PVAnet architecture (feature extractor, feature merger, output layer, geometry map, NMS layer), (c) output

Fig. 5. Architecture of the proposed text detection pipeline: (a) input, (b) corrected image, (c) existing EAST [5] architecture, (d) detected text, (e) segmented output


The text recognition method then no longer searches the entire image for text but only a smaller region, which makes the method fast and accurate. The whole process takes less than 15 seconds for images of size 2000 x 1500 on an Intel Core i5-5200U CPU @ 2.20 GHz with 8 GB RAM; the text is thus segmented in a time almost equivalent to that taken by EAST [5] alone to process an image.

3.3 Text Recognition and Key Word Extraction

Detecting text in natural scenes is a daunting task due to the changing background across images and other external factors. Several works address text detection in images, using both traditional methods and, more recently, deep learning; in most cases the deep learning methods outperform the traditional ones under complexity and orientation. However, the images used for training and testing deep learning based methods require a high DPI, so images must not be of too low a resolution. Modern approaches use a deep learning based text detection and recognition pipeline [5], but these methods fail to detect candidate regions in low-resolution images. The proposed approach detects text candidates in both high- and low-resolution images better than other existing approaches; for text recognition we use the deep learning based method proposed by Tian et al. [1], whose connectionist text proposal network detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps. Unlike the OCR problem, scene text recognition requires more robust features to yield results comparable to transcription-based OCR solutions; a novel approach combining robust convolutional features with the transcription abilities of recurrent neural networks was introduced by [18]. DTRN [18] uses a special CNN, a MaxOut CNN, to create high-level features, and an RNN to decode the features into word strings. Since the MaxOut CNN is best suited for character classification, we apply it to the entire word image instead of the whole image. We pass the bounding-box text information of the boxes that lie in the segmented text region shown in Fig 6(g) to this MaxOut CNN to extract high-level features. The features form a sequence $\{x_1, x_2, x_3, \ldots, x_n\}$, which is fed to the bidirectional LSTM one element per time step, producing the output $p = \{p_1, p_2, p_3, \ldots, p_n\}$. Since the length of the input sequence $X$ and the length of the output sequence $p$ differ, a CTC layer is used, which follows the equation

$S_w \approx \beta(\arg\max_{\pi} P(\pi \mid p))$ (1)

where $\beta$ is a projection that removes repeated labels and non-character (blank) labels, and $\pi$ is the approximately optimal path with maximum probability among the LSTM outputs. For example, $\beta(\texttt{--ddd-ee-l-hh---i-}) = \texttt{delhi}$; the projection thus produces sequences $S_w = \{s_1, s_2, s_3, \ldots, s_n\}$ of the same length as the ground-truth input. India, being a vast country, has a diverse spread of languages [13]; since collecting training data for all these languages is challenging, for the present we focus on English, Hindi and Telugu text recognition in this work.
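As an illustration of the projection β in Eq. (1), a greedy best-path decoder can be written in a few lines; this is a generic CTC decoding routine shown for clarity, not code from DTRN [18]:

    import numpy as np

    def ctc_greedy_decode(probs, charset, blank=0):
        # probs: (T, C) softmax output of the Bi-LSTM, one row per time step.
        # Best-path approximation of Eq. (1): take the arg-max label per
        # frame (the path pi), then apply beta: collapse repeats, drop blanks.
        path = probs.argmax(axis=1)
        decoded, prev = [], blank
        for label in path:
            if label != blank and label != prev:
                decoded.append(charset[label])
            prev = label
        return "".join(decoded)

    # With charset = {1: 'd', 2: 'e', 3: 'l', 4: 'h', 5: 'i'} and a best path
    # equal to --ddd-ee-l-hh---i-, this returns "delhi", as in the example.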

Fig. 6. Step-by-step process of detecting text in the input image and segmenting it for the text recognition step: (a) input image, (b) corrected image, (c) bounding-box text image, (d) text-merged image, (e) binarized merged image, (f) convex hull applied, (g) final output image


For training the CNN part of DTRN [18], we collected individual character images from various datasets [14-17] and trained the MaxOut CNN individually for English, Hindi and Telugu. Individual characters for English text were collected from [15], [16], [17], while Hindi and Telugu characters were obtained from [14], [17] and publicly available images on Google. In total we trained characters of multiple fonts, each class having 10 different samples of varying width and style, collected from around 2000 images for English, 1100 for Telugu and 1800 for Hindi scripts. For training the RNN part, we took the images from the training samples [14-17], cropped out word images individually, and collected 3700 words for English, 1900 for Telugu and 2900 for Hindi. Although He et al. [18] proposed DTRN for word recognition, we have extended it to sentence recognition with good accuracy. The hyperparameters used are a learning rate fixed at 0.0001 and a momentum of 0.9. The loss function used in the recurrent layers is defined in Equation (2):

$L(X, S_w) \approx -\sum_{t=1}^{T} \log P(\pi_t \mid X)$ (2)

where $\pi \in \beta^{-1}(S_w)$. This approximate error is backpropagated to update the parameters. The raw text is then translated from the native language to English using [12]. After this step, the proposed approach uses the Moses tokenizer [19], which is especially good at tokenizing phrases/words without punctuation compared to other existing tokenizers. Each token is then passed to Google's offline location database to check whether it is a location; if it is, we categorize the token as a key word likely to yield location information, otherwise we reject it. The reason for tokenizing the text is to remove ambiguity for place names that exist in two or more different states whose regional languages differ. For example, "Raniganj Bazar" is a place in the state of Uttar Pradesh, India, but "Raniganj" is also a place in the state of West Bengal, India, and their regional languages are Hindi and Bengali respectively. In such cases we look for secondary information, as discussed in Section 3.4.
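A sketch of the tokenization step; sacremoses, a Python port of the Moses tokenizer, stands in for the original Perl script [19] used by the paper, and the example string is the translated address from Fig 2:

    from sacremoses import MosesTokenizer

    mt = MosesTokenizer(lang="en")
    translated = "State Primary School, Bhimon Dhanani Khatavas, Pt. Luni, Jodhpur"
    tokens = mt.tokenize(translated, escape=False)
    # Keep only word-like tokens; punctuation-only tokens are discarded
    # before each token is queried against the location database.
    keywords = [t for t in tokens if any(c.isalnum() for c in t)]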

Fig. 7. How the key location words are extracted from the segmented text using DTRN, for a Hindi, a Telugu and an English natural scene: (a) input natural scene images, (b) text-region segmented images, (c) text recognition results using DTRN [18], (d) translation of the native languages to English (e.g. "State Primary School, Bhimon Dhanani Khatavas, Pt. Luni, Jodhpur"), (e) key words segmented from the translated text having location information (e.g. "Khatavas", "Luni", "Jodhpur"; "Sriramulupalli", "Kamalapar", "Karimnagar"; "Delhi")


The illustration in Fig 7 explains the importance of text recognition in key-word extraction from the segmented image; segmentation and text recognition thus play an important role in detecting the regional language. Moses tokenization is particularly important in the conversion from Fig 7(d) to Fig 7(e), since any wrong key word may result in the extraction of a different regional language. The step-by-step process from the bounding-box information of the input image to key-word extraction, including the intermediate steps, is illustrated in Fig 8.

Fig. 8. Pictorial representation of the text recognition and key-word extraction pipeline: the bounding-box text information passes through the MaxOut CNN convolution layers (with a softmax output) to give the high-level CNN features x1, ..., xn; the Bi-LSTM recurrent layers produce p1, ..., pn; the CTC layer yields the sequence s1, ..., sn; the recognized text is then run through the language detector and translator, the Moses tokenizer and the Google location module to extract the key word (here "Delhi")


3.4 Regional Language Extraction from Location Information

The key words obtained in the previous section are used for detecting the regional language. Each token is queried against the database collected from the Post Office Database of the Government of India. We choose the post office because post offices are present in every corner of India, so by targeting them we cover the entire country while targeting regional languages. The key words are queried in the database whose primary key is the composite key (Division Name, Taluk, Pincode); a Taluk is generally an administrative centre controlling several villages, hence together with the Pincode it determines a region uniquely. This database is referred to as CSDB (City and States Database) [20], as shown in Table 1. Since we need location information, we also need another database containing the GPS information corresponding to such regions. For this we use the database [21] with attributes City Id, City_Name, Latitude, Longitude and State, with City Id as primary key; it is referred to as LLDB (Latitude and Longitude Database), as shown in Table 2.

Table 1. The Post Office Database [20] CSDB with its schema and internal structure

Div_Name | Pincode | Taluk | Circle | Region | District | State
A-N Islands | 744112 | Port Blair | West Bengal | Calcutta HQ | South Andaman | Andaman and Nicobar Islands
Saharsa | 852201 | Kahara | Bihar | Bhagalpur | Saharsa | Bihar

Table 2. The GPS Database [21] LLDB with its schema and internal structure

City Id | City_Name | Latitude | Longitude | State
1 | Port Blair | 11.67 N | 92.76 E | Andaman and Nicobar Islands
255 | Saharsa | 25.88 N | 86.59 E | Bihar
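A minimal sketch of the two schemas in SQLite; the column names and the composite key come from the tables above, while the column types and the database file name are assumptions:

    import sqlite3

    con = sqlite3.connect("locations.db")  # hypothetical local database file
    con.executescript("""
    CREATE TABLE IF NOT EXISTS CSDB (
        Div_Name TEXT, Pincode INTEGER, Taluk TEXT,
        Circle TEXT, Region TEXT, District TEXT, State TEXT,
        PRIMARY KEY (Div_Name, Taluk, Pincode));
    CREATE TABLE IF NOT EXISTS LLDB (
        City_Id INTEGER PRIMARY KEY, City_Name TEXT,
        Latitude REAL, Longitude REAL, State TEXT);
    """)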

Using a nested query, shown in Fig 9, we fetch all possible (Latitude, Longitude, Pincode) tuples for each key-word token of a particular image, and then take the tuple common to all key words of the image to get the approximate location. In cases of location ambiguity, as mentioned in Section 3.3, we first look for a common tuple among the key words; if none is found, we conclude that there is ambiguity and that some information is missing. To solve this we check secondary information such as the language detected during text recognition; if the ambiguity still cannot be resolved, we combine adjacent tokens pairwise from left to right and rerun the same queries, again checking for a common tuple. If this also fails, we conclude that the location cannot be found and that more operations or information are needed. After this operation we perform reverse geocoding [29] on the (Longitude, Latitude) pair to obtain the approximate location. Subsequently we query the location in the regional-language database (RLDB), which we created from data collected from Wikipedia. Finally we obtain the set of regional languages for the input image, geotag the image with the language information, and store it in a database.

tuples = []
for kw in keywords:
    # Fetch every (latitude, longitude, pincode) candidate for this key word
    # by joining the GPS table LLDB with the post-office table CSDB.
    tuples.append(cur.execute(
        "SELECT t1.Latitude, t1.Longitude, t2.Pincode "
        "FROM LLDB t1 JOIN CSDB t2 ON t2.Div_Name = t1.City_Name "
        "WHERE t2.Taluk = ?", (kw,)).fetchall())

com_loc = common(tuples)            # candidates shared by every key word
if com_loc:
    lat, lon, pincode = com_loc[0]
    place = reverse_geocode(lat, lon)
else:
    solve_location_ambiguity()      # fallback described in Section 3.4

Fig. 9. Pseudo code for extracting a location from the key words
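For the offline reverse-geocoding step, the reverse_geocode package [29] resolves a coordinate pair to the nearest known town without any network access; a small usage sketch with the Luni/Jodhpur coordinates from Fig 2:

    import reverse_geocode

    # The library searches a bundled offline gazetteer for the nearest town.
    place, = reverse_geocode.search([(26.29, 73.02)])
    print(place["city"], place["country"])  # nearest town and its country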



4. Experimental Results

To evaluate the proposed method we created a complex dataset comprising standard datasets such as ICDAR-15, KAIST and NEOCR [14-17]. The input images contain languages of three classes, English, Hindi and Telugu, and each class has images ranging from daylight to night. The images were reportedly taken with good-quality cameras, but to make the dataset diverse, images captured with smartphone cameras of both high and low resolution were added. The total size of the dataset is 8500 images, of which 30% are low resolution (< 400 x 400 pixels) and the rest high resolution. From these, 3700 word images were taken for English, 1900 for Telugu and 2900 for Hindi. A well-shuffled 70% of the data is used for training and the remaining 30% for testing the proposed CNN and RNN models. To measure the performance of text recognition we use the standard measures Recall (R) and Precision (P) as defined in [22]; to evaluate the performance of text detection we use the standard measures defined in [23-24]. We also compare location accuracy with Google Maps to justify our offline location search. To demonstrate usefulness in text detection, we consider the standard deep learning methods EAST [5], Tian et al. [1] and Shi et al. [4]; to our knowledge, these methods work best on daylight, good-resolution images. For a fair comparative study of the effectiveness of our approach, we pass the input images from the database as they are, including night, low-resolution and high-resolution images. For text recognition we compare our method with the traditional Tesseract OCR [23] and OCRopus [24], passing the input images directly to the OCR engines [23-24] for a fair comparison. Text detection and text recognition are thus the key steps of this method; we also evaluate the accuracy of the location found by the proposed method.
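For reference, the measures reduce to the standard computation below; this is the generic definition, not the exact evaluation script of [22]:

    def precision_recall_f(tp, fp, fn):
        # Standard definitions from true positives, false positives and
        # false negatives; shown for clarity, not reproduced from [22].
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        return p, r, f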

The following figure shows sample images from the database used to test the efficacy of the proposed method.

Fig. 10. Sample images of the databases: (a) sample images for addresses in Hindi, (b) in Telugu, (c) in English, (d) in mixed Indic scripts


4.1 Evaluating Text Detection

Qualitative results of the proposed and existing methods for text detection in natural scenes are shown in Fig 11. The proposed method is observed to perform well in comparison to the existing methods [4, 5]. Although deep learning solves many complex problems such as occlusion and font size, it requires a good number of training samples and high-quality images, and it performs badly on low-light images, as depicted in Fig 11.1(a); the proposed approach, however, corrects the input image using the gamma filter explained earlier, hence its detection rate is high. The advantage of the proposed method over existing text detection methods lies in the fact that it performs exceedingly well on low-resolution and low-light images.

Quantitative results of the proposed and existing methods on the three databases are reported in Table 3, where the proposed method is best in precision, recall and F-measure compared to all the existing methods. This is expected: the image database (ICDAR-15, KAIST, NEOCR [14-17], mobile camera images) contains images ranging from low to high resolution with various fonts, occlusion and sparse text information, on which deep learning based approaches sometimes fail to detect text regions, but with correction these images produce good results. EAST [5] scores second best among the three deep learning based methods, and Seglink [4] performs poorly.

Fig. 11. Text detection on the image database for the proposed and existing approaches: rows show (1) a night image, (2) a Telugu image and (3) a Hindi image; columns show (a) the input, (b) the proposed method, (c) Seglink [4] and (d) EAST [5]


Table 3. Performance of the proposed and existing methods for text detection

Methods | Technique | Precision | Recall | F-Score
Proposed Method | Deep Learning | 0.81 | 0.78 | 0.80
Seglink [4] | Deep Learning | 0.73 | 0.74 | 0.73
EAST [5] | Deep Learning | 0.77 | 0.75 | 0.76
Yin et al. [30] | Image Processing | 0.68 | 0.86 | 0.71

4.2 Evaluating Text Recognition

Text recognition in natural scenes is a very interesting problem. Due to orientation, feature extraction sometimes becomes quite difficult. Traditional approaches such as MSER features work well on images with good lighting and high resolution but fail significantly on low-resolution and night images; preprocessing improves the accuracy, but not to the desired standard. Since we are dealing with location extraction from text, recognition of the text is very important in this context, and partial text recognition is not reliable. Huang et al. [25] proposed MSER-based text recognition, but it handles only simple text orientations; complex orientations do not give good results with MSER features. Roy et al. [26] proposed a conditional random field model defined on potential character locations and the interactions between them. This works well for individual words but not for region-based word recognition, where words are interlinked, so missing one word leads to lossy information. Many works use Tesseract OCR [23] for text recognition, but since it was originally developed for document-based extraction, its performance drops significantly on natural scenes with complex backgrounds. The scenario is similar for OCRopus [24], which was also developed for document-based extraction, but it performs better than Tesseract OCR [23], as shown in Table 4 and Fig 12, since it applies preprocessing such as deskewing, denoising and demasking before performing text recognition. Almazán et al. [31] proposed a fast text recognition technique using a combination of label embedding and attribute learning. Zhao et al. [32] proposed a text recognition pipeline with a modified VGG-Net architecture for character recognition.

Deep learning based methods thus perform significantly better in such scenarios. Although DTRN [18] was designed for individual words, in our experiments it also performed well in region-based word recognition: it uses a MaxOut CNN to extract high-level features from the region of interest and then recurrent layers to connect all the information sequentially, hence it performs better than the existing approaches. The metrics used for performance evaluation are precision and recall, as mentioned earlier.

Table 4. Performance of the proposed and existing methods for text recognition

Methods | Precision | Recall
DTRN [18] | 0.77 | 0.81
Tess-OCR [23] | 0.32 | 0.46
OCRopus [24] | 0.58 | 0.73
Almazán et al. [31] | 0.67 | 0.71
Zhao et al. [32] | 0.73 | 0.66



Location accuracy is also important, since we are investigating regional language in this work. The regional language in India does not usually change within a few kilometres unless the location is close to an international or state border; we treat those cases as exceptions. To measure location accuracy, we take the key words as locations and search for their GPS coordinates; since this approach is offline, there is a high chance of error. As ground truth we search Google's online Maps with the key words and compare the result with the proposed offline version, calculating the distance (d) between the proposed point and the ground-truth point. We run each of the 8500 images individually, calculate d, and finally take the average distance between the proposed and ground-truth points as the metric of accuracy. In our experiments, the average distance between the proposed point and the ground-truth point comes out to 11.87 km, as illustrated in Fig 13.
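The distance d between two (latitude, longitude) points can be computed with the standard haversine formula; the paper does not state which distance metric it uses, so this choice is an assumption:

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        # Great-circle distance in kilometres between two GPS points.
        dlat = radians(lat2 - lat1)
        dlon = radians(lon2 - lon1)
        a = (sin(dlat / 2) ** 2
             + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2)
        return 2 * 6371.0 * asin(sqrt(a))  # mean Earth radius of ~6371 km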

Fig. 12. Text recognition results for Indic scripts using the proposed pipeline and two existing methods: for an English, a Telugu and a Hindi natural-scene text, the transcriptions of DTRN [18], Tess-OCR [23] and OCRopus [24] are shown side by side; the OCR engines [23, 24] produce heavily garbled output on these scenes, while DTRN recovers the address text almost verbatim


Fig. 13. The proposed location obtained offline from LLDB and the ground-truth point obtained from Google Street, plotted on a map of India for the location Luni, Jodhpur

5. Conclusion and Future Work

This work presents a novel method for extracting the regional language from an image containing address information; it is the first work of its kind in which the regional language is extracted from an image without internet access (offline). The proposed method also defines a unique text detection approach in which images of low resolution and varying font size can be handled easily and the text extracted. However, the proposed method is not robust to occlusion, severe blur or very small fonts, as shown in Fig 14(c) for an image captured from a long distance, where the address cannot be extracted. The method also fails for advertisements that contain the location of an office outside the current state, as represented in Fig 14(a): here the picture location is Kolkata but the proposed method also captures New Delhi, and the regional languages of New Delhi and Kolkata differ. It likewise fails in scenarios like Fig 14(b), where the first word is a country with a different regional language. Currently the text recognition handles English, Hindi and Telugu scripts; in future, the proposed work can be extended to the 26 officially accepted languages of India to unearth the rarest regional languages. The work also needs to further address location ambiguity using other information to detect languages accurately, and the query for selecting the post office can be optimized using ground-truth data. Applications of this work lie in the identification of unknown places in India and in computer forensics.

Fig. 14. Drawbacks of the proposed method: (a) multiple locations, (b) namesake false location, (c) too small a font


References

1. Z. Tian, W. Huang, T. He, P. He and Y. Qiao, "Detecting Text in Natural Image with Connectionist Text Proposal Network", In Proc. ECCV, 2016, pp. 56-72.
2. I. B. Ami, T. Basha and S. Avidan, "Racing bib number recognition", In Proc. BMVC, 2012.
3. P. Shivakumara, R. Raghavendra, L. Qin, K. B. Raja, T. Lu and U. Pal, "A new multi-modal approach to bib/text detection and recognition in Marathon images", Pattern Recognition, Vol. 61, 2017, pp. 479-491.
4. B. Shi, X. Bai and S. Belongie, "Detecting Oriented Text in Natural Images by Linking Segments", In Proc. CVPR, 2017, pp. 3482-3490.
5. X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, and W. He, "EAST: An Efficient and Accurate Scene Text Detector", In Proc. CVPR, 2017, pp. 2645-2651.
6. H. Lee and C. Kim, "Blurred image region detection and segmentation", In Proc. ICIP, 2014, pp. 4427-4431.
7. H. Zhang, K. Zhao, Y. Z. Song, and J. Guo, "Text extraction from natural scene image: A survey", Neurocomputing, Vol. 122, 2013, pp. 310-323.
8. H. Raj and R. Ghosh, "Devanagari Text Extraction from Natural Scene Images", In Proc. Int. Conf. Adv. Comput. Informatics, 2014, pp. 513-517.
9. A. Buades, B. Coll, and J.-M. Morel, "Non-Local Means Denoising", Image Processing On Line, Vol. 1, 2011, pp. 208-212.
10. J. Chen, J. Benesty, Y. A. Huang, and S. Doclo, "New Insights Into the Noise Reduction Wiener Filter", IEEE Trans. Audio, Speech, Lang. Processing, Vol. 14, No. 4, 2006, pp. 1218-1234.
11. Y. Wu, P. Shivakumara, T. Lu, C. L. Tan, M. Blumenstein and G. H. Kumar, "Contour restoration of text components for recognition in video/scene images", IEEE Trans. Image Processing, 2016, pp. 5622-5634.
12. https://github.com/libindic/indic-trans
13. Regional Language List (https://en.wikipedia.org/wiki/Regional_language)
14. https://nrsc.gov.in/hackathon2018/
15. http://www.iapr-tc11.org/mediawiki/index.php?title=KAIST_Scene_Text_Database
16. http://www.iapr-tc11.org/mediawiki/index.php?title=NEOCR:_Natural_Environment_OCR_Dataset
17. N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, "ICDAR2015 Competition on Video Script Identification (CVSI 2015)", In Proc. ICDAR, 2015, pp. 1196-1200.
18. P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang, "Reading Scene Text in Deep Convolutional Sequences", 2015.
19. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
20. https://data.gov.in/catalog/all-india-pincode-directory
21. https://gist.github.com/gsivaprabu/5336570
22. A. Kumar, "An Efficient Approach for Text Extraction in Images and Video Frames Using Gabor Filter", Int. J. Comput. Electr. Eng., Vol. 6, No. 4, 2014, pp. 316-320.
23. https://github.com/tesseract-ocr/tesseract
24. https://github.com/tmbdev/ocropy
25. X. Huang, T. Shen, R. Wang, and C. Gao, "Text detection and recognition in natural scene images", In Proc. Int. Conf. Estimation, Detection and Information Fusion (ICEDIF), 2015, pp. 44-49.
26. U. Roy, A. Mishra, K. Alahari, and C. V. Jawahar, "Scene Text Recognition and Retrieval for Large Lexicons", In Proc. ACCV, 2014.
27. J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm", Applied Statistics, Vol. 28, No. 1, 1979, pp. 100-108.
28. https://pypi.org/project/goslate/
29. https://pypi.org/project/reverse_geocode
30. X. Yin, X. Yin, K. Huang, and H. Hao, "Robust Text Detection in Natural Scene Images", IEEE Trans. Pattern Anal. Mach. Intell., Vol. 36, No. 5, 2014, pp. 970-983.
31. J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word Spotting and Recognition with Embedded Attributes", IEEE Trans. Pattern Anal. Mach. Intell., Vol. 36, No. 12, 2014, pp. 2552-2566.
32. H. Zhao, Y. Hu, and J. Zhang, "Character Recognition via a Compact Convolutional Neural Network", 2017.