Transcript · 2018-07-10
Offline Extraction of Indic Regional Language from Natural Scene Image using Text
Segmentation and Deep Convolutional Sequence
1Sauradip Nag, 1Pallab Kumar Ganguly, 1Sumit Roy, 1Sourab Jha, 1Krishna Bose, 1Abhishek Jha and 2Kousik Dasgupta
1Kalyani Government Engineering College, Kalyani, Nadia, India
2Faculty of Computer Science and Engineering, Kalyani Government Engineering College, Kalyani, Nadia, India
[email protected], [email protected], sourab.jha@kgec.edu.in, krishna.bose02@gmail.com, j7.abhi@gmail.com, kousik.dasgupta@gmail.com
Abstract: Regional-language extraction from a natural scene image is a challenging proposition because it depends on the text information extracted from the image. Text extraction, in turn, varies with lighting conditions, arbitrary orientation, inadequate text information, heavy background influence over the text, and changes in text appearance. This paper presents a novel unified method for tackling these challenges. The proposed work adds an image-correction and segmentation technique to an existing text-detection pipeline, the Efficient and Accurate Scene Text Detector (EAST). EAST uses a standard PVAnet architecture to select features and non-maximum suppression to detect text in an image. Text recognition is performed with a combined architecture of a MaxOut convolutional neural network (CNN) and a bidirectional long short-term memory (LSTM) network. After text is recognized using this deep-learning-based approach, the native languages are translated to English and tokenized using standard text tokenizers. The tokens most likely to represent a location are used to find the Global Positioning System (GPS) coordinates of that location, from which the regional languages spoken there are extracted. The proposed method is tested on a self-generated dataset collected from Government of India data and evaluated on standard datasets. A comparative study with several state-of-the-art methods for text detection, recognition, and extraction of regional language from images shows that the proposed method outperforms the existing methods.
Keywords: Segmentation, Text Recognition, Language Extraction, Deep Learning, Bi-Directional LSTM, Tokenization, Neural Network.
1. Introduction
As the internet and the use of social media grow, the rise of easily accessible multimedia data such as video, images and text has created an unprecedented information overflow. In contrast, information extraction from natural scenes is still a daunting task and an active area of research, mainly because of the complexities faced during extraction. India, being a vast country, has a wide span of uncovered regions and various colloquial languages. India currently has twenty-six (26) different regional languages, many of them native to mainly rural areas. Extraction of these regional languages from rural areas is itself a challenging task, since rural areas generally do not have well-defined addresses. Recent studies explored the extraction of Devanagari scripts from text [8], but these give no information about the specific regions where the languages are spoken, or about the approximate location of the image. Modern applications use the metadata of an image to find its ground location and from there the accurate regional language. However, if images are captured with a digital single-lens reflex (DSLR) or digital camera, they may not be geotagged, in which case this approach fails to obtain a location. It is therefore essential to extract location information from a natural scene image offline, which is novel in its kind. Thus, the work in this paper focuses on offline extraction of the location from a natural scene image containing address information, and subsequently on finding the regional language of that region. This is interesting because we use a text-candidate segmentation that outperforms existing methods at both low and high resolution. We perform the entire process offline by training the proposed text-recognition model for different regional languages. A sample collection of images has been generated from the Government of India dataset [14] for experimental study; some sample images are shown in Fig. 1. The main contributions of the proposed work are:
- using image correction for low-quality images together with an existing deep-learning method to improve performance;
- using a deep convolutional sequence to detect Indic languages, as shown in Fig. 1;
- detecting the location from an image containing address information in native languages;
- detecting the regional languages from that address information.
The rest of the paper is laid out as follows. Section 2 discusses related work on text detection and text recognition, using both traditional and deep-learning-based methods, and on extracting language from them. Section 3 describes the proposed text segmentation and connectionist text proposal network for Indic languages, alongside regional-language extraction from location information. Section 4 details the dataset, the experimental setup, the comparison of the proposed and existing methods, and the results. Section 5 points out the efficacy and drawbacks of the proposed approach, gives some directions for future work, and concludes with suggested applications.
2. Related Work
Since a natural scene image is affected by varying lighting conditions, text orientation and inadequate text information, extraction of the text-candidate region from the image is very important in this literature. Tian et al. [1] proposed detecting text in natural images with a connectionist text proposal network. The method detects a text line as a sequence of fine-scale text proposals directly on convolutional feature maps, explores the context information of the image for successful text detection, and is reliable on multi-scale, multi-language text. Shi et al. [4] proposed detecting oriented text in natural images by linking segments. The idea is that text is decomposed into segments and links: a segment is an oriented text box covering part of a text line, while a link connects two adjacent segments, indicating that they belong to the same text line. To achieve this, the method uses deep learning. Zhou et al. [5] proposed an efficient and accurate scene text detector. The method predicts text lines without detecting words or characters, based on deep-learning concepts that involve designing the loss functions and the neural-network architecture; it works well for arbitrary orientations. He et al. [18] proposed the Deep Text Recurrent Network (DTRN), which generates an ordered high-level sequence from a word image using a MaxOut CNN and then uses a Bi-LSTM to decode the generated CNN sequence. These works on text detection in natural scene images consider high-resolution, high-contrast images. They directly or indirectly use connected-component analysis and character-level characteristics for successful detection of text with arbitrary orientation, multiple languages, font faces, font sizes, etc. However, their performance degrades when the image contains both high- and low-contrast text. To overcome the problem of low contrast and resolution, the proposed method first fixes poor-quality images using image-correction techniques, then uses EAST [5] to detect text boxes, and also uses a convex hull to generate orientation-correct text segmentation. The work in this paper presents a new method of text-candidate region selection and segmentation, and explores the problem of regional-language detection in both low-quality and high-quality images. Since deep-learning-based approaches perform better than conventional ones, neural-network-based architectures are used for both text detection and text recognition, as elaborated in the upcoming sections.
Fig. 1. Sample images in different languages (Hindi, English, Tamil, Urdu, Bengali) containing address information.
3. The Proposed Methodology
In this work an RGB image containing at least one line of address information is taken as input. The input image is first preprocessed for image correction, applying a Gaussian blur to reduce noise and a gamma filter to enhance contrast in night images. The meta information is then checked for GPS coordinates; if they are present, we proceed directly to finding the regional language. If not, we pass the corrected image into the EAST [5] text-detection pipeline, which uses a standard PVAnet and a non-maximum suppression layer as parts of its architecture and returns bounding boxes over probable text regions. We binarize the image and grow the boxes periodically on both sides until neighboring boxes overlap, ensuring that adjacent information is not lost. Some important information may still be lost due to orientation, so we apply a polygonal convex hull to correct the segmentation and subsequently mask out the color image. After segmenting the text ROI from the image, we use DTRN [18]: since address information can be in any font, we chose a method that does not depend on the nature of the text. Motivated by the work of He et al. [18], we use a MaxOut CNN to obtain high-level features for the input text instead of the grueling task of character segmentation, which is neither robust nor trustworthy. As shown in Fig. 2, these high-level features are then passed into a recurrent neural-network architecture. Since we need address information, which is quite sensitive to the context of a line, the network may suffer from the vanishing-gradient problem; a Bi-LSTM is therefore used, as it solves the vanishing-gradient problem and can learn meaningful long-range interdependencies, which is especially useful when the main task is to retrieve address information from a text sequence. A Connectionist Temporal Classification (CTC) layer is added after the LSTM layers for sequence labelling. We have trained this architecture for Hindi and Telugu, which is our contribution, and we pass the same input image to all three language architectures (English, Hindi, Telugu). The softmax layer added at the end of the MaxOut CNN layers returns confidence scores for the detected high-level features, and we pass on the high-level features of the architecture whose softmax score is above a threshold. After recognition, we detect the language using Google's offline language-detector module [28] and subsequently translate it to English for a better recognition rate.
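In the pipeline, language detection is handled by Google's offline detector module [28]. As a rough illustration of the idea only, and assuming detection can be approximated by counting Unicode script ranges (our simplification, not the module's actual algorithm), a stand-in might look like:

```python
def detect_script(text):
    """Guess the script of a string from Unicode code-point ranges.

    A rough stand-in for an offline language detector: Devanagari
    (U+0900-U+097F) suggests Hindi, the Telugu block is U+0C00-U+0C7F,
    and Basic Latin letters suggest English.
    """
    counts = {"Hindi": 0, "Telugu": 0, "English": 0}
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            counts["Hindi"] += 1
        elif 0x0C00 <= cp <= 0x0C7F:
            counts["Telugu"] += 1
        elif ("a" <= ch <= "z") or ("A" <= ch <= "Z"):
            counts["English"] += 1
    # Majority vote over recognised characters; None if nothing matched.
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

A real detector also uses n-gram statistics, but for the three scripts handled here the Unicode block alone already separates the classes.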
The proposed approach then uses the Moses tokenizer [19] to tokenize the text, as this tokenizer extracts words without punctuation from the raw translated text. Each of these tokens/words is queried individually against the Post Office database collected from the Government of India to obtain the GPS coordinates of all the common post offices of the locations found through tokenization. After obtaining (Pincode, Longitude, Latitude) for a particular image, we use reverse geocoding [29] and the database RLDB, created by collecting the regional-language information found in [13]. From this we obtain the regional languages spoken at the location found in the input image; finally we geotag the images and save them. The complete framework of the proposed method can be seen in Fig. 2.

Fig. 2. Framework of the proposed method (flowchart: image preprocessing for correction, ROI selection, ROI merging and segmentation, text recognition, automatic language detection and translation, text tokenization, per-token location query against the location database, offline reverse geocoding, language extraction, and geotagging; worked example: a Hindi school sign translated as "State Primary School, Bhimon Dhanani Khatavas, Pt. Luni, Jodhpur" resolves to Latitude 26.29, Longitude 73.02, Pincode 342802, Languages Spoken: Hindi, Marwari).
3.1 Image Correction
Any picture of a natural scene can contain a certain amount of noise, blur and varying lighting conditions, which may affect any method for text-candidate selection. Extra care must therefore be taken to ensure that performance does not drop, irrespective of picture quality, as long as the image contains at least one piece of address information. To correct the image before text detection, the input image is preprocessed. Bright or dark images are identified using the histogram of intensity; if the input is a night (dark) image, a gamma filter with gamma value 2.5 is applied. The value 2.5 is empirical, obtained after proper experimentation, and this gamma value corrects the image as shown in Fig. 3(b). Noise is removed from the color image using a non-local de-noising method [9], and an unsupervised Wiener filter [10] is used to de-blur the de-noised image, as shown in Fig. 3(c).
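The gamma step above can be sketched as follows. The power-law form of the transform and the mean-intensity darkness threshold are our assumptions for illustration; the paper itself specifies only the gamma value of 2.5:

```python
import numpy as np

def gamma_correct(img, gamma=2.5):
    """Brighten a dark image with a power-law (gamma) transform.

    img: uint8 array in [0, 255]. With gamma > 1, applying
    out = 255 * (in / 255) ** (1 / gamma) lifts the shadows.
    """
    normalized = img.astype(np.float64) / 255.0
    return (255.0 * normalized ** (1.0 / gamma)).astype(np.uint8)

def is_dark(img, threshold=60):
    """Crude day/night test from the mean of the intensity histogram.
    The threshold of 60 is an assumed value, not from the paper."""
    return float(img.mean()) < threshold

# Apply the correction only to images judged to be night shots.
img = np.full((4, 4), 30, dtype=np.uint8)   # a uniformly dark image
if is_dark(img):
    img = gamma_correct(img)
```

A pixel of intensity 30 maps to roughly 108 under gamma 2.5, which is the kind of shadow lift the night images in Fig. 3(b) receive.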
3.2 Text Detection and Candidate Segmentation
Text detection in natural scenes has been an area of research for some years. Conventional approaches rely on manually designed features such as the Stroke Width Transform (SWT) and MSER-based methods, which identify character candidates using edge detection, extremal-region extraction or contour features. Deep-learning methods outperform them, as the conventional methods fall behind in robustness and accuracy.

Text Detection: Zhou et al. [5] proposed EAST, which predicts text lines without detecting words or characters, based on deep-learning concepts; this approach is suitable for extracting information such as text in multiple fonts. EAST [5] also implements geometry-map generation, which can detect text in arbitrary orientations without losing text information, with the help of a rotating box (RBOX) and a quadrangle (QUAD). However, EAST [5] fails on low-quality images, since the trained model used high-quality images from standard datasets. As mentioned in Section 3.1, the low-quality images are therefore improved before being fed to the text-detection pipeline. The pipeline proposed by Zhou et al. [5] consists of only two stages, a fully convolutional neural network (FCN) and non-maximum suppression (NMS). The FCN internally uses a standard neural-network architecture called PVAnet, which consists of three internal layers: a feature-extraction layer, a feature-merging layer, and geometry-map generation for the bounding box, placed after the output layer as depicted in Fig. 4. The input to this step is the uncorrected low- or high-quality image, which is resized and fed to the neural-network pipeline: the first layer extracts the features relevant to text, and the second merges the features and extracts possible text-region mappings using the geometry-map-generation losses. After this step the NMS layer selects only the maps corresponding to text and marks a bounding box around each, as shown in Fig. 4(c).
(a) Input image (b) Gamma-corrected image (c) Noise- and blur-corrected image
Fig. 3. How the input image is preprocessed before text detection.
Text Segmentation: Since the architecture mentioned above selects individual boxes, it may be difficult to extract location information: two words connected by a '-' (dash) may together represent one location, and EAST alone will detect the two connected words individually, which may create ambiguity in location extraction. Hence we need to process the image and segment out the text-candidate region. The proposed pipeline thus contains three key steps, image correction, text detection, and merging and segmentation, as shown in Fig. 5. The input image is corrected by adjusting intensity, correcting blur and denoising, and is then passed to the existing EAST [5] pipeline of Fig. 4. This FCN takes the corrected image as input and outputs dense per-pixel predictions of text lines, eliminating time-consuming intermediate steps and making it a fast text detector. The post-processing steps include thresholding and NMS on the predicted geometric shapes; NMS is a post-processing algorithm responsible for merging all detections that belong to the same object. The output of this stage is bounding boxes on the text regions, as shown in Fig. 5(d).
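The NMS post-processing step described above can be sketched as a standard IoU-based suppression over axis-aligned boxes; the 0.5 overlap threshold is an assumed value, not one from the paper, and EAST's actual locality-aware NMS also merges rotated geometries:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box of every overlapping cluster."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < thresh for j in keep):
            keep.append(i)
    return keep
```

For example, two heavily overlapping detections collapse to the higher-scoring one, while a distant box survives untouched.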
After the bounding boxes are formed over the image, as shown in Fig. 6(c), the box sizes are increased periodically on both sides until each box overlaps with a neighboring bounding box. In this way we make sure that neighboring textual information is not lost, as represented in Fig. 6(d). After this operation, we observe that the segmented region is not uniform (Fig. 6(d)) and has lost information, which may affect the overall results, since address information is interlinked and the order of the text matters. The proposed approach therefore first binarizes the incorrect segmentation and then applies a polygonal convex hull, as in Fig. 6(f), to correct it; the text information lost due to orientation and box merging is thereby recovered. Finally, we mask out the region from the color input image, as shown in Fig. 6(g) and Fig. 5(e); this becomes the input to the text-recognition stage.
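The periodic box growing and merging described above can be sketched as follows, with boxes as (x1, y1, x2, y2) tuples. The one-pixel growth step per iteration and the horizontal-only growth are our assumptions for illustration:

```python
def overlaps(a, b):
    """True when two axis-aligned boxes (x1, y1, x2, y2) intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def grow_until_overlap(boxes, step=1, max_iter=50):
    """Widen every isolated box until it meets a neighbour, so that
    adjacent words on the same sign end up in one region."""
    boxes = [list(b) for b in boxes]
    for _ in range(max_iter):
        grown = False
        for i, b in enumerate(boxes):
            if not any(overlaps(b, o) for j, o in enumerate(boxes) if j != i):
                b[0] -= step
                b[2] += step
                grown = True
        if not grown:
            break
    return [tuple(b) for b in boxes]

def merge_overlapping(boxes):
    """Greedy single pass that unions each box into the first
    existing region it overlaps, else starts a new region."""
    merged = []
    for b in boxes:
        for m in merged:
            if overlaps(b, m):
                m[0], m[1] = min(m[0], b[0]), min(m[1], b[1])
                m[2], m[3] = max(m[2], b[2]), max(m[3], b[3])
                break
        else:
            merged.append(list(b))
    return [tuple(m) for m in merged]
```

Two word boxes separated by a small gap grow toward each other and merge into a single region, which is then handed to the convex-hull correction.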
(a) Input (b) PVAnet architecture (feature extractor, feature merger, output layer, geometry map, NMS layer) (c) Output
Fig. 4. Architecture of the text-detection pipeline of EAST [5].
(a) Input (b) Corrected (image correction) (c) Existing EAST [5] architecture (d) Detected text (e) Output (segmentation)
Fig. 5. Architecture of the proposed text-detection pipeline.
Now the text-recognition method does not search the entire image for text; instead it searches a smaller region, which makes the method fast and accurate. The whole execution takes less than 15 seconds for images of size 2000 x 1500 on an Intel Core i5-5200U CPU @ 2.20 GHz with 8 GB RAM. Hence the method segments out the text in a time almost equivalent to the time EAST [5] takes to process an image on its own.
3.3 Text Recognition and Key Word Extraction
Detecting text in a natural scene is a daunting task due to the changing backgrounds across images and other external factors. Several works detect text in images using both traditional methods and, more recently, deep-learning methods; in most cases the deep-learning methods outperform the traditional ones because of the complexity and orientation involved. It must be noted, however, that the images used to train and test deep-learning-based methods require a high DPI, so the images must not be of low resolution. Modern approaches use a deep-learning-based text-detection and text-recognition pipeline [5], but these methods fail to detect a candidate region in low-resolution images. The proposed approach can detect text candidates in both high- and low-resolution images better than other existing approaches, and for text recognition we have used the deep-learning-based method proposed by Tian et al. [1], whose connectionist text proposal network detects a text line as a sequence of fine-scale text proposals directly on convolutional feature maps. Unlike OCR, scene-text recognition requires more robust features to yield results comparable to transcription-based OCR solutions; a novel approach combining robust convolutional features with the transcription abilities of recurrent neural networks was introduced by [37]. DTRN [18] uses a special CNN, a MaxOut CNN, to create high-level features, and an RNN to decode the features into word strings. Since the MaxOut CNN is best suited for character classification, we apply it to the entire word image instead of a whole-image approach. We pass the bounding-box text information of the boxes that lie
in the segmented text region shown in Fig. 6(g) to this MaxOut CNN to extract high-level features. The features form a sequence {x1, x2, x3, ..., xn}. These CNN features are fed to the bidirectional LSTM at each time step, producing the output p = {p1, p2, p3, ..., pn}. The length of the input sequence X and the length of the output sequence p differ, hence a CTC layer is used, which follows the equation:

S_w ≈ β(argmax_π P(π|p))    (1)

where β is a projection that removes repeated labels and non-character labels, and π is the approximately optimal path with maximum probability among the LSTM outputs. For example, β(−−ddd−ee−l−hh−−−i−) = delhi; hence the layer produces sequences S_w = {s1, s2, s3, ..., sn} of the same length as the ground-truth input. India, being a vast country, has a diverse spread of languages [13]; since training-data collection for these languages is challenging, for the present we have focused on
(a) Input image (b) Corrected image (c) Bounded-box text image (d) Text-merged image (e) Binarized merged image (f) Convex hull applied (g) Final output image
Fig. 6. Step-by-step process of detecting text in the input image and segmenting it for the text-recognition step.
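The projection β of Eq. (1) can be sketched as a greedy CTC decode: take the best label at every time step, collapse repeated labels, then drop the blank symbol. This is a minimal stand-in assuming '-' as the blank label, as in the paper's delhi example:

```python
def beta(path, blank="-"):
    """CTC projection: collapse repeated labels, then drop blanks.
    Reproduces the paper's example: beta('--ddd-ee-l-hh---i-') == 'delhi'."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return "".join(out)

def greedy_ctc_decode(step_probs, blank="-"):
    """Approximate argmax_pi P(pi | p) one time step at a time,
    then apply beta; step_probs is a list of {label: prob} dicts,
    one per LSTM output p_t."""
    best_path = "".join(max(step, key=step.get) for step in step_probs)
    return beta(best_path, blank)
```

Greedy per-step argmax is only an approximation of the most probable path, which is why Eq. (1) is stated with ≈ rather than equality.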
English, Hindi and Telugu text recognition in this work. For training the CNN part of DTRN [18] we collected individual character images from various datasets [14-17] and trained the MaxOut CNN separately for English, Hindi and Telugu. As training samples, we collected individual characters from [15], [16], [17] for English text, while Hindi and Telugu characters were obtained from [14], [17] and publicly available images on Google. In total we trained characters of multiple fonts; each class has 10 samples of varying width and style, collected from around 2000 images for English, 1100 for Telugu and 1800 for Hindi scripts. For training the RNN part, we took the images from the training samples [14-17], cropped out word images individually, and collected 3700 words for English, 1900 for Telugu and 2900 for Hindi. Although He et al. [18] proposed DTRN for word recognition, we have extended it to sentence recognition with good accuracy. The hyperparameters used are a learning rate fixed at 0.0001 and a momentum of 0.9. The loss function used in the recurrent layers is defined in Equation (2):

L(X, S_w) ≅ − Σ_{t=1}^{T} log(P(π_t|X))    (2)
where π ∈ β⁻¹(S_w). This approximate error is then backpropagated to update the parameters. The raw text is translated from the native language to English using [12]. After this step, the proposed approach uses the Moses tokenizer [19], which is especially good at tokenizing phrases/words without punctuation compared with other existing tokenizers. Each token is then passed to the Google offline location database to check whether it is a location; if it is, we categorize the token as a keyword likely to yield location information, otherwise we reject it. The reason for tokenizing the text is to remove ambiguity for place names that exist in two or more states with different regional languages. For example, "Raniganj Bazar" is a place in the state of Uttar Pradesh, India, but "Raniganj" is also a place in the state of West Bengal, India, where the regional languages are Hindi and Bengali respectively. In such cases we look for secondary information, as discussed in Section 3.4.
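The tokenize-and-filter step can be sketched as below. The regex-based punctuation-stripping tokenizer stands in for the Moses tokenizer [19], and the toy location set stands in for the offline location database; both are assumptions for illustration only:

```python
import re

# Toy stand-in for the offline location database queried per token.
KNOWN_LOCATIONS = {"raniganj", "jodhpur", "luni", "khatavas", "delhi"}

def tokenize(text):
    """Split raw translated text into words, dropping punctuation,
    in the spirit of the Moses tokenizer."""
    return re.findall(r"[A-Za-z]+", text)

def extract_keywords(text, locations=KNOWN_LOCATIONS):
    """Keep only the tokens that look like locations; everything
    else (e.g. 'State', 'Primary', 'School') is rejected."""
    return [t for t in tokenize(text) if t.lower() in locations]
```

On the translated sign of Fig. 2, this keeps "Khatavas", "Luni" and "Jodhpur" and rejects the generic words, mirroring the accepted/rejected split in the framework diagram.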
Fig. 7. How the key location words are extracted from segmented text using DTRN: (a) input natural scene images (Hindi, Telugu and English); (b) text-region segmented images; (c) text-recognition results using DTRN [18] in the native scripts; (d) translation of the native language to English, e.g. "State Primary School, Bhimon Dhanani Khatavas, Pt. Luni, Jodhpur" (Hindi sign), "Jai Kisan Farmers Association G. SriramuluPalli mn. Kamalapar, G., Karimnagar 4031 'Registry no." (Telugu sign), and "DEVGAN MOTORS ... 4822/2, Guru Nanak Auto Market, K.Gate, Delhi ..." (English sign); (e) key-word segmentation from the translated text containing location information, yielding "Luni", "Jodhpur", "Khatavas", "Sriramulupalli", "Kamalapar", "Karimnagar" and "Delhi".
The illustration in Fig. 7 shows the importance of text recognition in key-word extraction from the segmented image; segmentation and text recognition thus play an important role in detecting the regional language. Moses tokenization is particularly important in the conversion from Fig. 7(d) to Fig. 7(e), since any wrong key word may result in the extraction of a different regional language. The step-by-step process from the bounding-box information of the input image to key-word extraction, including the intermediate steps, is illustrated in Fig. 8.
Fig. 8. Pictorial representation of the text-recognition and key-word-extraction pipeline: the bounded-box text information enters the MaxOut CNN (convolution layers with a softmax output), producing the high-level CNN features x1 ... xn; these feed the Bi-LSTM recurrent layers, whose outputs p1 ... pn pass through the CTC layer to give the sequences s1 ... sn; the recognized text then flows through the language detector and translator, the Moses tokenizer, and the Google location module, which extracts the key word (e.g. "Delhi" from the sample sign "DEVGAN MOTORS ... Guru Nanak Auto Market, K.Gate, Delhi ...").
3.4 Regional Language Extraction from Location Information
The key words obtained in the previous section are used to detect the regional language. Each token is queried against the Post Office database collected from the Government of India. We chose the Post Office database because post offices are present in every corner of India, so in targeting regional languages we cover the entire country. The key words are queried in the database whose primary key is the composite key (Division Name, Taluk, Pincode); a Taluk is generally an administrative centre controlling several villages, hence together with the Pincode it determines a region uniquely. This database is referred to as CSDB (City and States Database) [20], shown in Table 1. Since we need location information, we also need another database containing the corresponding GPS information. For that we use the database [21] with attributes City Id, City, State, Longitude and Latitude, with City Id as the primary key; this database is referred to as LLDB (Latitude and Longitude Database), shown in Table 2.
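Under these schemas, the per-keyword lookup can be sketched with SQLite. The rows mirror the Saharsa sample in Tables 1 and 2, the schemas are trimmed to the columns the query touches, and the join condition (Division Name matched to City Name) follows the pseudo code of Fig. 9:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CSDB (Div_Name TEXT, Pincode TEXT, Taluk TEXT, State TEXT)")
conn.execute("CREATE TABLE LLDB (City_Id INTEGER, City_Name TEXT, Latitude REAL, Longitude REAL)")
conn.execute("INSERT INTO CSDB VALUES ('Saharsa', '852201', 'Kahara', 'Bihar')")
conn.execute("INSERT INTO LLDB VALUES (255, 'Saharsa', 25.88, 86.59)")

def lookup(keyword):
    """Fetch every (latitude, longitude, pincode) tuple whose Taluk
    matches one key-word token, joining LLDB to CSDB on the city name."""
    return conn.execute(
        "SELECT t1.Latitude, t1.Longitude, t2.Pincode "
        "FROM LLDB t1 JOIN CSDB t2 ON t2.Div_Name = t1.City_Name "
        "WHERE t2.Taluk = ?", (keyword,)
    ).fetchall()
```

Using a parameterized `?` placeholder rather than string concatenation (as the pseudo code of Fig. 9 does) also avoids SQL-injection and quoting problems with tokens recognized from noisy scene text.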
Table 1. Representing the Post Office Database [20] CSDB with its Schema and Internal Structure
Table 2. Representing the GPS Database [21] LLDB with its Schema and Internal Structure
Using a nested query, as shown in Fig. 9, we fetch all possible (Latitude, Longitude, Pincode) tuples for each key-word token of a particular image, and then take the tuple common to all key words of the image to obtain the approximate location. In cases of location ambiguity, as mentioned in Section 3.3, we first look for a common tuple among the key words; if none is found, we conclude that there is ambiguity and that some information is missing. To solve this we check secondary information, such as the language detected during text recognition; if the ambiguity still cannot be resolved, we combine adjacent tokens pairwise from left to right and re-run the same queries to look for a common tuple. If this also fails, we conclude that the location cannot be found and that further operations or information are needed. After this operation we perform reverse geocoding [29] on the (Longitude, Latitude) pair to obtain the approximate location. Subsequently we query the location in the regional-language database (RLDB), which we created from data collected from Wikipedia. Finally, we obtain the set of regional languages for the particular input image, geotag the image with the language information, and store it in a database.
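The common-tuple step and the offline reverse geocode can be sketched together: intersect the per-keyword result sets, then map the agreed (latitude, longitude) to the nearest entry in a local place table. The haversine nearest-neighbour search is our stand-in for the reverse geocoder [29], not its actual implementation:

```python
import math

def common_tuple(per_keyword_results):
    """Intersect the (lat, lon, pincode) result sets returned for each
    keyword; None when no tuple is shared by all keywords."""
    sets = [set(r) for r in per_keyword_results]
    common = set.intersection(*sets) if sets else set()
    return next(iter(common), None)

def reverse_geocode(lat, lon, places):
    """Nearest place by great-circle (haversine) distance;
    places maps name -> (lat, lon) in degrees."""
    def haversine(a, b):
        lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))  # kilometres
    return min(places, key=lambda name: haversine((lat, lon), places[name]))
```

When `common_tuple` returns None, the pipeline falls into the ambiguity-resolution branch of Fig. 9 instead of geocoding.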
for i in keywords:
    tuples[j++] = sql.execute("SELECT t1.latitude, t1.longitude, t2.pincode
                               FROM LLDB t1 JOIN CSDB t2
                               ON t2.Division_Name = t1.City_Name
                               WHERE t2.Taluk = '" + i + "';")
end
com_loc = common(tuples)
if com_loc.size() > 0:
    place = reverse_geocode(com_loc.latitude, com_loc.longitude)
else:
    solve location ambiguity
end
Fig. 9. Illustrates Pseudo Code for Extraction of Location from Keywords
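The pairwise-merging fallback for ambiguous locations can be sketched as follows. This is a minimal illustration; `query_tuples` and `common` are hypothetical stand-ins for the SQL query and the tuple intersection of Fig. 9:

```python
def resolve_ambiguity(keywords, query_tuples, common):
    """Fallback when no (lat, lon, pincode) tuple is shared by all
    keywords: merge adjacent tokens pairwise, left to right, and
    re-run the same queries on the merged tokens."""
    merged = [keywords[i] + " " + keywords[i + 1]
              for i in range(len(keywords) - 1)]
    results = [query_tuples(kw) for kw in merged]
    com = common(results)
    return com if com else None  # None: location cannot be found
```

For example, the tokens "New" and "Delhi" may individually match many places, while the merged token "New Delhi" pins down a single tuple.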
Div_Name      Pincode   Taluk        Circle        Region        District        State
A-N Islands   744112    Port Blair   West Bengal   Calcutta HQ   South Andaman   Andaman and Nicobar Islands
Saharsa       852201    Kahara       Bihar         Bhagalpur     Saharsa         Bihar
City Id   City_Name    Latitude   Longitude   State
1         Port Blair   11.67° N   92.76° E    Andaman and Nicobar Islands
255       Saharsa      25.88° N   86.59° E    Bihar
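The two schemas can be sketched in SQLite as follows. This is a minimal illustration using one sample row from each table; column names containing spaces are written with underscores, and the parameterised join mirrors the Fig. 9 query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real CSDB/LLDB are persisted offline files
cur = conn.cursor()

# CSDB: the Post Office directory (Table 1 schema)
cur.execute("""CREATE TABLE CSDB (
    Division_Name TEXT, Pincode TEXT, Taluk TEXT, Circle TEXT,
    Region TEXT, District TEXT, State TEXT)""")

# LLDB: the GPS database (Table 2 schema), City_Id as primary key
cur.execute("""CREATE TABLE LLDB (
    City_Id INTEGER PRIMARY KEY, City_Name TEXT,
    Latitude REAL, Longitude REAL, State TEXT)""")

# one sample row from each of Tables 1 and 2
cur.execute("INSERT INTO CSDB VALUES (?,?,?,?,?,?,?)",
            ("Saharsa", "852201", "Kahara", "Bihar",
             "Bhagalpur", "Saharsa", "Bihar"))
cur.execute("INSERT INTO LLDB VALUES (?,?,?,?,?)",
            (255, "Saharsa", 25.88, 86.59, "Bihar"))

# the join of Fig. 9, parameterised instead of string-spliced
rows = cur.execute(
    """SELECT t1.Latitude, t1.Longitude, t2.Pincode
       FROM LLDB t1 JOIN CSDB t2 ON t2.Division_Name = t1.City_Name
       WHERE t2.Taluk = ?""", ("Kahara",)).fetchall()
# rows -> [(25.88, 86.59, '852201')]
```

Using a `?` placeholder instead of concatenating the keyword into the SQL string also avoids injection issues when keywords come from OCR output.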
4. Experimental Results
To evaluate the proposed method we created a composite dataset drawn from standard datasets, namely ICDAR-15, KAIST and NEOCR [14-17]. The input images contain languages of three classes: English, Hindi and Telugu. Each class covers conditions ranging from daylight to night images. The source images were reportedly taken with good-quality cameras; to make the dataset more diverse, images captured with smartphone cameras of both high and low resolution were added. The dataset totals 8500 images, of which 30% are low resolution (< 400 x 400 pixels) and the rest high resolution. It contains 3700 words in English, 1900 in Telugu and 2900 in Hindi. After shuffling, 70% of the data is used for training and the remaining 30% for testing the proposed CNN and RNN models. To measure text-recognition performance we use the standard measures Recall (R) and Precision (P) as defined in [22]; to evaluate text-detection performance we use the standard measures defined in [23-24]. We also compare location accuracy against Google Maps to validate our offline location search. To demonstrate the usefulness of the method for text detection, we compare against standard deep-learning methods: EAST [5], Tian et al. [1] and Shi et al. [4]. To the best of our knowledge these methods work best on daylight, good-resolution images; for an effective and fair comparative study, we pass the input images from the database as-is, including night, low-resolution and high-resolution images. For text recognition we compare our method with the traditional Tesseract OCR [23] and OCRopus [24], passing the input images directly to the OCRs [23-24] for fair comparison. Text detection and text recognition are thus the key steps of this method. We also evaluate the accuracy of the location found by the proposed method.
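For reference, these measures can be computed from true-positive (tp), false-positive (fp) and false-negative (fn) counts; the counts in the usage line below are illustrative only:

```python
def precision_recall_f(tp, fp, fn):
    """Standard detection/recognition measures from raw counts."""
    precision = tp / (tp + fp)          # fraction of detections that are correct
    recall = tp / (tp + fn)             # fraction of ground truth that is found
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score

p, r, f = precision_recall_f(80, 20, 20)  # -> (0.8, 0.8, 0.8)
```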
The following figure shows sample images from the database used to test the efficacy of the proposed method.
(a) Sample images of addresses in Hindi
(b) Sample images of addresses in Telugu
(c) Sample images of addresses in English
(d) Sample images of addresses in mixed Indic scripts
Fig. 10. Sample images from the databases
4.1 Evaluating Text Detection
Qualitative results of the proposed and existing methods for text detection in natural scenes are shown in Fig. 11, where it can be observed that the proposed method performs well in comparison with the existing methods [4, 5]. Although deep learning handles many complex problems such as occlusion and varying font size, it requires a large number of training samples and high-quality images, and it performs badly on low-light images, as depicted in Fig. 11.1(a). The proposed approach first corrects the input image using a gamma filter, as explained earlier, hence its detection rate remains high. The advantage of the proposed method over existing text-detection methods is that it performs exceedingly well on low-resolution and low-light images.
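The gamma filter mentioned above can be sketched as a power-law intensity mapping. This is a minimal pure-Python illustration; the lookup-table approach and the sample gamma value are our assumptions, not the paper's exact parameters:

```python
def gamma_correct(pixels, gamma):
    """Power-law correction on 8-bit intensities:
    out = 255 * (in / 255) ** (1 / gamma).
    With this convention, gamma > 1 lifts dark regions (brightens)."""
    inv = 1.0 / gamma
    # precompute a lookup table over all 256 intensity levels
    table = [round(255 * ((i / 255.0) ** inv)) for i in range(256)]
    return [table[p] for p in pixels]

dark_row = [0, 16, 64, 128, 255]
brightened = gamma_correct(dark_row, gamma=2.0)  # night image -> lifted shadows
```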
Quantitative results of the proposed and existing methods on the three databases are reported in Table 3, where the proposed method achieves the best precision, recall and F-measure among all the existing methods. Since the image database (ICDAR-15, KAIST, NEOCR [14-17], plus mobile-camera images) contains images ranging from low to high resolution, with varying fonts, occlusion and sparse text information, deep-learning-based approaches sometimes fail to detect text regions; after correction, however, these images produce good results. EAST [5] scores second best among the three deep-learning-based methods, and Seglink [4] performs poorly.
1(a) Night Image (b) Proposed Method (c) Seglink [4] (d) EAST [5]
2(a) Telugu Image (b) Proposed Method (c) Seglink [4] (d) EAST [5]
3(a) Hindi Image (b) Proposed Method (c) Seglink [4] (d) EAST [5]
Fig. 11. Text detection results on the image database for the proposed and existing approaches
Table 3. Performance of the proposed and existing methods for text detection
4.2 Evaluating Text Recognition
Text recognition in natural scenes is an interesting problem. Due to orientation, feature extraction sometimes becomes quite difficult. Traditional approaches such as MSER features work well on well-lit, high-resolution images but fail significantly on low-resolution and night images; preprocessing improves accuracy, but not to the desired standard. Since we are extracting a location from the text, complete recognition is essential in this context, and partial text recognition is not reliable. Huang et al. [25] proposed MSER-based text recognition, but it handles only simple text orientations; complex orientations do not give good results with MSER features. Roy et al. [26] proposed a conditional random field model defined on potential character locations and the interactions between them. This works well for individual words but not for region-based word recognition, where words are interlinked, so missing one word leads to loss of information. Many works use Tesseract OCR [23] for text recognition, but since it was originally developed for document-based extraction its performance drops significantly on natural scenes with complex backgrounds. The scenario is similar for OCRopus [24], which was also developed for document-based extraction, although it performs better than Tesseract OCR [23], as shown in Table 4 and Fig. 12, since it applies preprocessing such as deskewing, denoising and demasking before recognition. Almazán et al. [31] proposed a fast text-recognition technique using a combination of label embedding and attribute learning. Zhao et al. [32] proposed a text-recognition pipeline based on a modified VGG-Net architecture for character recognition.
Deep-learning-based methods perform significantly better in such scenarios. Although DTRN [18] was designed for individual words, in our experiments it also performed well on region-based word recognition. DTRN [18] uses a MaxOut CNN to extract high-level features from the region of interest and then uses recurrent layers to connect all the information sequentially; hence it performs better than the existing approaches. The metrics used for performance evaluation are precision and recall, as mentioned earlier.
Methods            Technique          Precision   Recall   F-Score
Proposed Method    Deep Learning      0.81        0.78     0.80
Seglink [4]        Deep Learning      0.73        0.74     0.73
EAST [5]           Deep Learning      0.77        0.75     0.76
Yin et al. [30]    Image Processing   0.68        0.86     0.71

Table 4. Performance of the proposed and existing methods for text recognition

Methods               Precision   Recall
DTRN [18]             0.77        0.81
Tess-OCR [23]         0.32        0.46
OCRopus [24]          0.58        0.73
Almazán et al. [31]   0.67        0.71
Zhao et al. [32]      0.73        0.66
Location accuracy is also important, since we are investigating regional language in this work. Regional language in India does not change within a few kilometres unless the location is close to an international or state border; we treat those cases as exceptions. To measure location, we take the keywords as locations and search for their GPS coordinates; since this approach is offline, there is a chance of error. As ground truth we therefore search Google's online Maps with the keywords and compare against the proposed offline version, computing the distance (d) between the proposed point and the ground-truth point. We run each of the 8500 images individually, calculate d, and finally take the average distance between the proposed and ground-truth points as the metric of accuracy. In our experiments, the average distance between the proposed point and the ground-truth point comes out to 11.87 km, as shown in Fig. 13.
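The distance d between a proposed point and its ground-truth point can be computed with the haversine formula. This is a minimal sketch; the sample coordinates are illustrative, not measured values from our experiments:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# e.g. distance between two nearby illustrative points around Jodhpur
d = haversine_km(26.29, 73.02, 26.24, 73.05)
```

Averaging d over all test images yields the reported accuracy metric.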
[Figure 12 content: each row shows a scene-text image followed by the recognition outputs of DTRN [18], Tess-OCR [23] and OCRopus [24]; row 1 is English text (a Delhi auto-parts shop board), row 2 Telugu text and row 3 Hindi text. The document-based OCR engines [23, 24] produce visibly garbled output on these samples.]
Fig. 12. Text recognition results on Indic scripts for the proposed approach (DTRN [18]) and two existing methods [23, 24]
Fig. 13. The proposed location obtained offline from LLDB and the ground-truth point obtained from Google Street View, shown on a map of India for the location Luni, Jodhpur
5. Conclusion and Future Work
This work presents a novel method for extracting the regional language from an image containing address information; to our knowledge it is the first in which the regional language is extracted from an image without Internet access (offline). The proposed method also defines a unique text-detection approach in which images of low resolution and varying font size are handled easily and the text extracted. However, the proposed method is not robust to occlusion, severe blur or very small fonts, as shown in Fig. 14(c): when the image is captured from a long distance, the address cannot be extracted. The method also fails on advertisements containing the address of an office outside the current state, as in Fig. 14(a), where the picture was taken in Kolkata but the method also captures New Delhi, whose regional language differs from Kolkata's. It likewise fails in cases such as Fig. 14(b), where the first word is a country with a different regional language. Currently, text recognition covers English, Hindi and Telugu scripts; in future the proposed work can be extended to the 22 officially recognised languages of India to uncover rarer regional languages. The work also needs to further address location ambiguity using other information in order to detect languages accurately. The query for selecting the post office can also be optimised using ground-truth data. Applications of this work lie in the identification of unknown places in India and in computer forensics.
(a) Multiple Locations (b) Namesake False Location (c) Too Small Font
Fig. 14. Failure cases of the proposed method
References
1. Z. Tian, W. Huang, T. He, P. He and Y. Qiao, “ Detecting Text in Natural Image with Connectionist Text Proposal Network”, In Proc.
ECCV, 2016, pp 56-72.
2. I. B. Ami, T. Basha and S. Avidan, “Racing bib number recognition”, In Proc. BMVC, 2012.
3. P. Shivakumara, R. Raghavendra, L. Qin, K. B. Raja, T. Lu and U. Pal, “A new multi-modal approach to bib/text detection and recognition in Marathon images”, Pattern Recognition, Vol. 61, 2017, pp 479-491.
4. B. Shi, X. Bai and S. Belongie,"Detecting Oriented Text in Natural Images by Linking Segments", In Proc. CVPR, 2017, pp 3482-3490.
5. X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, and W. He, “EAST: An Efficient and Accurate Scene Text Detector”, In Proc. CVPR, 2017,
pp 2645-2651.
6. H. Lee and C. Kim, “Blurred image region detection and segmentation”, In Proc. ICIP, 2014, pp 4427-4431.
7. H. Zhang, K. Zhao, Y. Z. Song, and J. Guo, “Text extraction from natural scene image: A survey,” Neurocomputing, vol. 122, pp. 310–323,
2013.
8. H. Raj and R. Ghosh, “Devanagari Text Extraction from Natural Scene Images,” Int. Conf. Adv. Comput. Informatics, pp. 513–517, 2014.
9. A. Buades, B. Coll, and J.-M. Morel, “Non-Local Means Denoising,” Image Process. Line, vol. 1, pp. 208–212, 2011.
10. J. Chen, J. Benesty, Y. A. Huang, and S. Doclo, “New Insights Into the Noise Reduction Wiener Filter,” IEEE Trans. Audio. Speech. Lang.
Processing, vol. 14, no. 4, pp. 1218–1234, 2006.
11. Y. Wu, P. Shivakumara, T. Lu, C. L. Tan, M. Blumenstein and G. H. Kumar, “Contour restoration of text components for recognition in
video/scene images”, IEEE Trans. IP, pp 5622-5634, 2016.
12. https://github.com/libindic/indic-trans
13. Regional Language List ( https://en.wikipedia.org/wiki/Regional_language )
14. https://nrsc.gov.in/hackathon2018/
15. http://www.iapr-tc11.org/mediawiki/index.php?title=KAIST_Scene_Text_Database
16. http://www.iapr-tc11.org/mediawiki/index.php?title=NEOCR:_Natural_Environment_OCR_Dataset
17. N. Sharma, R. Mandal, R. Sharma, U. Pal, and M. Blumenstein, “ICDAR2015 Competition on Video Script Identification (CVSI 2015),” Proc. Int. Conf. Doc. Anal. Recognition (ICDAR), 2015, pp. 1196–1200.
18. P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang, “Reading Scene Text in Deep Convolutional Sequences,” 2015.
19. https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl
20. https://data.gov.in/catalog/all-india-pincode-directory
21. https://gist.github.com/gsivaprabu/5336570
22. A. Kumar, “An Efficient Approach for Text Extraction in Images and Video Frames Using Gabor Filter,” Int. J. Comput. Electr. Eng., vol.
6, no. 4, pp. 316–320, 2014.
23. https://github.com/tesseract-ocr/tesseract
24. https://github.com/tmbdev/ocropy
25. Xiaoming Huang, Tao Shen, Run Wang, and Chenqiang Gao, “Text detection and recognition in natural scene images,” Int. Conf. Estimation, Detection and Information Fusion (ICEDIF), pp. 44–49, 2015.
26. U. Roy, A. Mishra, K. Alahari, and C. V. Jawahar, “Scene Text Recognition and Retrieval for Large Lexicons,” In Proc. ACCV, 2014, pp. 7–10.
27. J. A. Hartigan and M. A. Wong, “Algorithm AS 136: A K-Means Clustering Algorithm,” Appl. Stat., vol. 28, no. 1, pp. 100–108, 1979.
28. https://pypi.org/project/goslate/
29. https://pypi.org/project/reverse_geocode
30. X. Yin, X. Yin, K. Huang, and H. Hao, “Robust Text Detection in Natural Scene Images,” IEEE Trans. PAMI, vol. 36, no. 5, pp. 970–983, 2014.
31. J. Almazán, A. Gordo, A. Fornés, and E. Valveny, “Word Spotting and Recognition with Embedded Attributes,” IEEE Trans. PAMI, vol. 36, no. 12, pp. 2552–2566, 2014.
32. H. Zhao, Y. Hu, and J. Zhang, “Character Recognition via a Compact Convolutional Neural Network,” 2017.