Unconstrained Scene Text and Video Text Recognition for Arabic Script

Mohit Jain, Minesh Mathew and C. V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad, India.

Email: [email protected], [email protected], [email protected]

Abstract—We demonstrate the utility of an end-to-end trainable CNN-RNN hybrid architecture in recognizing Arabic text in videos and natural scenes. Our solution is based upon the Convolutional Recurrent Neural Network (CRNN) model, which has proven quite effective for English scene text recognition. The model follows a segmentation-free, sequence-to-sequence transcription approach. The network transcribes a sequence of convolutional features from the input image to a sequence of target labels. This does away with the need for segmenting the input image into constituent characters/glyphs, which is often difficult in the case of Arabic. Further, the ability of RNNs to model contextual dependencies yields superior recognition results. The network is trained on synthetic images rendered from a large vocabulary of Arabic words and phrases. Our solution is evaluated on two publicly available video text datasets, ALIF and ACTIV. For the scene text recognition task, a new dataset, the IIIT Arabic scene text dataset, is created. We report state-of-the-art results on both video text datasets. We also benchmark our results on the IIIT Arabic scene text dataset.

Keywords—Arabic, Scene Text, Video Text, Synthetic Data, Deep Learning, Text Recognition.

I. INTRODUCTION

For many years, the focus of research on text recognition in Arabic has been on printed and handwritten documents [7], [10], [11], [12]. The majority of works in this space dealt with individual character recognition [11], [12]. Since segmenting an Arabic word or line image into its sub-units is challenging, such models did not scale well. In recent years there has been a shift towards segmentation-free text recognition models, mostly based on Hidden Markov Models (HMMs) or Recurrent Neural Networks (RNNs). Such models generally follow a sequence-to-sequence approach wherein the input line/word image is directly transcribed to a sequence of labels. RNN-based models are used in [8], [9] for recognizing printed/handwritten Arabic script. Our approach to recognizing Arabic text in videos (video text recognition) and natural scenes (scene text recognition) follows the same paradigm. Video text recognition in Arabic has gained interest recently [21], [22], [34] and there are two benchmarking datasets available for the task [21], [22]. To the best of our knowledge, there has been no work so far on scene text recognition (recognizing text in natural scene images) in Arabic.

The vision community has experienced a strong revival of neural-network-based solutions in the form of deep learning. This revival was stimulated by the great success of models like Deep Convolutional Neural Networks (DCNNs) in various feature extraction, object detection and classification tasks, as seen in [1], [3].

Fig. 1: Examples of Arabic Scene Text (left) and Video Text (right).

These tasks, however, pertain to a set of problems where the subjects appear in isolation rather than as a sequence. Recognising such sequence-based objects often requires the model to predict a series of description labels instead of a single label. DCNNs are not well suited for such tasks as they generally work well in scenarios where the inputs and outputs are bounded by fixed dimensions. Also, the lengths of these sequence-like objects might vary drastically, which escalates the problem difficulty. Recurrent Neural Networks (RNNs) tackle the problems faced by DCNNs in sequence-based learning by performing a forward pass for each segment of the sequence-like input. Such models often involve a pre-processing/feature extraction step, where the input is first converted into a sequence of feature vectors [4], [5]. This feature extraction stage is independent of the RNN pipeline and hence such models are not end-to-end trainable.

The sudden boom in video sharing on social networking websites and the increasing number of TV channels in today's world reveal videos to be a fundamental source of information. Effectively managing, storing and retrieving such huge amounts of data is not a trivial task. Video text recognition can greatly aid video content analysis and understanding, with the recognized text giving a direct and concise description of the stories being depicted in the videos. In the case of news videos, the superimposed tickers running on the edges of the video frame are generally highly correlated with the people involved or the story being portrayed and hence provide a brief summary of the news event. These text transcriptions can greatly benefit the indexing mechanism of an automated video retrieval system by providing semantic clues about the events being described in the videos.

Reading text in natural scenes is a relatively harder task compared to printed or handwritten text recognition. The problem has been drawing increasing research interest in recent years. This can partially be attributed to the rapid development of wearable and mobile devices such as smart phones, digital cameras, and the latest technology like Google Glass and self-driving cars, where scene text is a key module for a wide range of practical and useful applications. While the recognition of text in printed documents has been studied extensively in the form of Optical Character Recognition (OCR) systems, these methods generally do not generalize very well to a natural scene setting where factors like inconsistent lighting conditions, variable fonts, orientations, background noise and image distortions add to the problem complexity. Even though Arabic is the fourth most spoken language in the world after Chinese, English and Spanish, to the best of our knowledge there has not been any work on Arabic scene text recognition, which makes this area of research ripe for exploration.

A. Intricacies of Arabic Script

Automatic recognition of Arabic is quite a challenging task due to various intricate properties of the script. There are only 28 letters in the script and it is written in a semi-cursive fashion from right to left. Each letter is generally connected by a line at its bottom to the letters preceding and following it, except for 6 letters which can be linked only to their preceding letters. Such an instance in a word creates a PAW (part of an Arabic word). Arabic script has another exceptional characteristic where the shape of a letter changes depending on whether the letter appears in isolation, at the beginning, in the middle or at the end of a word. So, in general, each letter can have one to four possible shapes which might have little or no similarity to one another. Another frequent occurrence in Arabic is that of dots: the addition of a single dot to a letter can change the meaning of the word completely. Arabic also uses ligatures, where two letters combine to form a third, different glyph in such a way that they cannot be separated by a simple baseline (i.e., a complex shape represents this combined letter). These distinctive characteristics make automated Arabic script recognition more challenging than that of most other scripts.

B. Related Work

Though a lot of work has been done in the field of text transcription in natural scenes and videos for the English script [4], [5], [13], [27], the field is still in a nascent state as far as the Arabic script is concerned. Previous attempts to address similar problems in English [27], [28] first detect individual characters, and character-specific DCNN models are then used to recognize the detected characters. The shortcoming of such methods is that they require training a strong character detector for accurately detecting and cropping each character out of the original word. For Arabic, this task becomes even more difficult due to the connectedness intricacies of the script, as discussed earlier. Another approach, by Jaderberg et al. [13], was to treat scene text recognition as an image classification problem instead. To each image, they assign a class label from a lexicon of words spanning the English language (the 90k most frequent words were chosen to bound the span-set). This approach, however, is limited by the size of the lexicon used for its possible unique transcriptions, and the large number of output classes adds to the training complexity. Hence the model does not scale to inflectional languages like Arabic, where the number of unique words is much higher than in English.

Another category of solutions embeds an image and its text label in a common subspace, and retrieval/recognition is performed on the learnt common representations. For example, Almazan et al. [29] embed word images and text strings in a common vector subspace, thus converting the task of word recognition into a retrieval problem. Yao et al. [31] and Gordo et al. [32] used mid-level features for scene text recognition.

The recent works in this field [33], [5] generally follow a transcription approach which transcribes the input image into a sequence of labels. While the earlier works following the transcription model used handcrafted features, [33] put forward a novel approach wherein the recurrent layers are fed by convolutional features extracted from the input image. Our approach borrows this particular architecture, called CRNN, where the hybrid CNN-RNN network with a Connectionist Temporal Classification (CTC) loss is trained end-to-end.

Work on text extraction from videos generally falls into four broad categories: edge detection methods [35], [36], [37], extraction using connected components [38], [39], texture classification methods [40] and correlation-based methods [41], [42]. Previous attempts at solving video text recognition for Arabic used two separate routines: one for extracting relevant image features and another for classifying features into script labels to obtain the target transcription. [34] uses text segmentation and statistical feature extraction followed by fuzzy k-nearest neighbour techniques to obtain transcriptions. Similarly, [20] experiments with two feature extraction routines, CNN-based feature extraction and Deep Belief Network (DBN) based feature extraction, followed by a Bidirectional Long Short-Term Memory (LSTM) layer [15], [16].

The rest of the paper is organized as follows: Section II describes the CRNN architecture in detail, Section III describes the experiments we perform to test the CRNN along with implementation-level details and results, and Section IV concludes with the findings of our work.

II. CRNN ARCHITECTURE

The deep neural network architecture, CRNN, consists of three components: the initial convolutional layers, the middle recurrent layers and a final transcription layer.

The CRNN architecture is essentially a hybrid CNN-RNN model. The role of the convolutional layers is to extract robust feature representations from the input image and feed them into the recurrent layers, which in turn transcribe them to an output sequence, i.e., a sequence of labels or characters. The sequence-to-sequence transcription is achieved by a CTC layer at the output.

The convolutional layers follow a VGG [6] style architecture without the fully-connected layers. All the images are scaled to a fixed height before being fed to the convolutional layers. The convolutional components then create a sequence of feature vectors from the feature maps by splitting them column-wise, which then act as inputs to the recurrent layers (Fig. 2).

Fig. 2: CRNN Architecture. The input image passes through seven convolutional layers (3x3 kernels with stride 1 and padding 1, the last layer using a 2x2 kernel), interleaved with max-pooling and two batch-normalization layers, to produce a column-wise feature sequence; two bidirectional LSTM layers, a softmax and a CTC layer then produce the output label sequence. The symbols 'k', 's' and 'p' stand for kernel size, stride and padding size respectively.

The layers of convolution, max-pooling and element-wise activation functions are translation invariant as they operate on local regions. Hence, each column feature from the feature map corresponds to a rectangular region of the original image. These rectangular regions appear in the same left-to-right order as their corresponding columns on the feature maps, so each column can be considered the image descriptor for its region. This feature descriptor is highly robust and, most importantly, can be trained and adapted to a wide variety of problems [1], [3], [13].

On top of the convolutional layers, the recurrent layers take each frame from the feature sequence generated by the convolutional layers and make predictions. The recurrent layers consist of deep bidirectional LSTM nets. RNNs are an optimal choice for our unconstrained end-to-end system as they have a strong capability of capturing contextual information within a sequence. RNNs can also handle variable-length sequences: the number of parameters in an RNN is independent of the sequence length, and all we need to do is unroll the network as many times as there are time-steps in the input sequence. In our case this enables unconstrained recognition, where the output can be any sequence of labels drawn from the label set. Traditional RNN units (vanilla RNNs) face the problem of vanishing gradients [14], hence we use Long Short-Term Memory (LSTM) units instead, a type of RNN unit specifically designed to tackle the vanishing-gradients problem [15], [16]. The special design of the LSTM cell allows it to learn long-term dependencies and hence it is a good fit for our image-based sequence learning task where such dependencies are frequent. In a text recognition problem, contexts from both directions (left-to-right and right-to-left) are useful and complementary to each other for producing the right transcription. Therefore, we combine a forward and a backward oriented LSTM to create a bidirectional LSTM unit. Multiple such bidirectional LSTM layers can be stacked to make the network deeper and gain higher levels of abstraction over the image sequences, as shown in [17].
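
To make the convolutional-to-recurrent hand-off concrete, the following is a minimal PyTorch-style sketch of the idea described above. It is not the exact configuration of Fig. 2; the layer widths, the class and parameter names (CRNNSketch, num_classes, img_height) and the choice of framework are illustrative assumptions.

import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    # Illustrative hybrid CNN-BiLSTM model; widths and depths are assumptions,
    # not the exact configuration reported in the paper.
    def __init__(self, num_classes, img_height=32):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),  # rectangular pooling keeps the feature map wide
            nn.Conv2d(512, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),
        )
        # two stacked bidirectional LSTM layers over the column-wise feature sequence
        self.rnn = nn.LSTM(512 * (img_height // 16), 256, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 256, num_classes)  # per-time-step class scores

    def forward(self, x):              # x: (batch, 1, height, width), grayscale input
        f = self.cnn(x)                # (batch, channels, h', w')
        f = f.permute(0, 3, 1, 2)      # one time-step per image column: (batch, w', channels, h')
        f = f.flatten(2)               # column feature vectors: (batch, w', channels * h')
        out, _ = self.rnn(f)           # contextual features from both directions
        return self.fc(out)            # (batch, w', num_classes), fed to the CTC layer

With a 32-pixel-high input, the height collapses to 2 after the four pooling stages, so each column feature vector has 512 * 2 = 1024 elements and the sequence length is one quarter of the input width.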

The transcription layer at the top of the CRNN is used to translate the predictions generated by the recurrent layers into label sequences for the target language. The CTC layer's conditional probability is used in the objective function, as shown in [18]. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth. This objective function calculates a cost value directly from an image and its ground-truth label sequence, eliminating the need to manually label all the individual components of the sequence. In other words, the input sequence can be transcribed directly to the output sequence without the need for a target defined at each time-step. The CRNN, though composed of multiple modules of convolutional and recurrent layers, can hence be trained using a single loss function and is end-to-end trainable. The complete network configuration used for the experiments can be seen in Fig. 2.
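
As an illustration of how such an objective can be set up, the sketch below uses PyTorch's CTCLoss together with the CRNNSketch model from the previous sketch, plus a simple best-path decoder; the blank index, function names and batching details are assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)  # label index 0 reserved for blank

def ctc_training_loss(model, images, targets, target_lengths):
    # targets: concatenated ground-truth label indices for the whole batch
    logits = model(images)                     # (batch, T, num_classes)
    log_probs = F.log_softmax(logits, dim=2)   # per-time-step label distributions
    log_probs = log_probs.permute(1, 0, 2)     # CTCLoss expects (T, batch, num_classes)
    T, batch = log_probs.shape[0], log_probs.shape[1]
    input_lengths = torch.full((batch,), T, dtype=torch.long)
    # negative log-likelihood of the ground truth, no per-frame alignment required
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)

def greedy_decode(logits):
    # best-path decoding: take the argmax per time-step, collapse repeats, drop blanks
    best = logits.argmax(dim=2)                # (batch, T)
    decoded = []
    for seq in best.tolist():
        out, prev = [], None
        for label in seq:
            if label != prev and label != 0:
                out.append(label)
            prev = label
        decoded.append(out)
    return decoded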

III. EXPERIMENTS

In this section we demonstrate the efficacy of the above architecture in recognizing Arabic script appearing in video frames and natural scene images. In the case of scene text, our experiments concern only recognition: input images are assumed to be cropped word images from natural scenes, not full scene images. Since there is no previous work on Arabic scene text recognition, we introduce a new dataset, the IIIT Arabic scene text dataset, and benchmark our results on it. For the video text recognition problem, the results are reported on two existing video text datasets, ALIF [20] and ACTIV [22].

Fig. 3: Sample images from the rendered synthetic Arabic scene text dataset. The images closely resemble real-world scene images.

A. Datasets

Models for both the video text and scene text recognition problems are trained using synthetic images rendered from a large vocabulary using freely available Arabic Unicode fonts (Fig. 3). The rendering process is detailed in [19]. Around 2 million video text line images and 4 million scene text word images were used for training the respective models.
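
The actual rendering pipeline is the one described in [19]; purely to illustrate the general idea, the sketch below renders a single word image from a vocabulary entry using a randomly chosen Unicode font. It assumes Pillow plus the arabic_reshaper and python-bidi packages for letter shaping and right-to-left ordering; the font paths and degradations are placeholders.

import random
from PIL import Image, ImageDraw, ImageFont, ImageFilter
import arabic_reshaper                   # joins isolated letters into their contextual forms
from bidi.algorithm import get_display   # reorders the string for right-to-left display

FONT_PATHS = ["fonts/Amiri-Regular.ttf"]  # placeholder list of Arabic Unicode font files

def render_word(word, height=32):
    shaped = get_display(arabic_reshaper.reshape(word))
    font = ImageFont.truetype(random.choice(FONT_PATHS), size=height - 8)
    width = int(font.getlength(shaped)) + 16
    img = Image.new("L", (width, height), color=random.randint(180, 255))  # light background
    ImageDraw.Draw(img).text((8, 2), shaped, font=font, fill=random.randint(0, 60))
    # mild blur so the rendered samples look less synthetic
    return img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.0)))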

The model for the video text recognition task was initially trained on the synthetic video text dataset and then fine-tuned on the train partitions of the real-world datasets, ALIF and ACTIV.

The ALIF dataset consists of 6,532 cropped text line images from 5 popular Arabic news channels. The ACTIV dataset is larger than ALIF and contains 21,520 line images from popular Arabic news channels. This dataset contains video frames in which the bounding boxes of text lines are annotated.

Fig. 4: Sample images from the IIIT Arabic scene text dataset.

The IIIT Arabic scene text dataset was curated by downloading freely available images containing Arabic script from Google Images. The dataset consists of 500 word images of Arabic script occurring in various scenarios like local markets and shops, billboards, navigation signs, graffiti, etc., and spans a large variety of naturally occurring image noise and distortions (Fig. 4). The images were manually annotated by human experts of the script, and both the transcriptions and the image data are being made publicly available so that future research groups can compare against and improve upon the performance on this task.

B. Implementation Details

The CRNN model's convolutional layers follow the VGG architecture [6]. In the 3rd and 4th max-pooling layers the pooling windows are rectangular instead of the usual square windows used in VGG. The advantage of doing this is that the feature maps obtained after the convolutional layers are wider, and hence we obtain longer feature sequences as inputs for the recurrent layers that follow. To enable faster batch learning, all input images are resized to a fixed width and height (32x100 for scene text and 32x504 for video text). We have observed that resizing all images to a fixed width does not affect performance much. The images are horizontally flipped before being fed to the convolutional layers since Arabic is read from right to left.
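
A minimal sketch of this input preprocessing, assuming OpenCV; the function name and normalization are our own additions, and the target width would be 504 instead of 100 for video text lines.

import cv2
import numpy as np

def preprocess(image_gray, target_w=100, target_h=32):
    # resize to a fixed width and height so batches can be formed (32x100 for scene text)
    resized = cv2.resize(image_gray, (target_w, target_h))
    # flip horizontally so the right-to-left Arabic text is presented left to right to the network
    flipped = cv2.flip(resized, 1)
    # scale to [0, 1] and add a channel axis before feeding the convolutional layers
    return (flipped.astype(np.float32) / 255.0)[np.newaxis, :, :]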

To tackle the problems of training such deep convolutional and recurrent layers, we used the batch normalization [24] technique. Two batch-norm layers were inserted after the 5th and 6th convolutional layers respectively, which accelerated the training process greatly.

The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. Precisely, the transcription layer's error differentials are back-propagated with the forward-backward algorithm, as shown in [18], while in the recurrent layers the Back-Propagation Through Time (BPTT) [25] algorithm is applied to calculate the error differentials. The hassle of manually setting the learning-rate parameter is avoided by using ADADELTA optimization [26].
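
Combining the earlier sketches, a training step could look roughly as follows; the use of PyTorch's Adadelta optimizer mirrors the description above, while the constant NUM_ARABIC_LABELS and the batching details are assumptions.

import torch

model = CRNNSketch(num_classes=NUM_ARABIC_LABELS)      # NUM_ARABIC_LABELS: assumed label-set size
optimizer = torch.optim.Adadelta(model.parameters())   # no manual learning-rate schedule needed

def train_step(images, targets, target_lengths):
    model.train()
    optimizer.zero_grad()
    loss = ctc_training_loss(model, images, targets, target_lengths)
    loss.backward()   # error differentials flow back through the CTC layer and the LSTMs (BPTT)
    optimizer.step()
    return loss.item()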

C. Results

The accuracies obtained by the CRNN and by previous works [20], [22] in the field of video text recognition are stated in Table I.

Fig. 5: Qualitative results of scene text recognition. On the left are images which were correctly recognized and on the right are examples which the CRNN failed to recognize.

Since there has been no prior work on Arabic scene text recognition, we compare the results obtained on the IIIT Arabic scene text dataset with those of a popular free OCR engine, Tesseract [43]. The performance has been evaluated using the following metrics: CRR (Character Recognition Rate), WRR (Word Recognition Rate) and LRR (Line Recognition Rate). In the equations below, RT and GT stand for recognized text and ground truth respectively.

CRR = (nCharacters - Σ EditDistance(RT, GT)) / nCharacters

WRR = nWordsCorrectlyRecognized / nWords

LRR = nTextImagesCorrectlyRecognized / nImages
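
These metrics can be computed directly from the recognized and ground-truth strings; below is a small sketch assuming the third-party editdistance package for the Levenshtein distance (LRR is computed exactly like WRR, but over full line images).

import editdistance  # assumed third-party package providing Levenshtein edit distance

def crr(pairs):
    # pairs: list of (recognized_text, ground_truth) string tuples
    n_characters = sum(len(gt) for _, gt in pairs)
    total_distance = sum(editdistance.eval(rt, gt) for rt, gt in pairs)
    return (n_characters - total_distance) / n_characters

def wrr(pairs):
    # fraction of words whose transcription matches the ground truth exactly
    correct = sum(1 for rt, gt in pairs if rt == gt)
    return correct / len(pairs)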

It should be noted that even though the methods compared against for video text recognition use a separate convolutional architecture for feature extraction, unlike our end-to-end trainable CRNN architecture, we obtain better character- and line-level accuracies for the Arabic video text recognition task and set a new state of the art for it.

TABLE I: Accuracy for Video Text

Method               | ALIF Test1        | ALIF Test2        | AcTiV
                     | CRR(%)   LRR(%)   | CRR(%)   LRR(%)   | CRR(%)   LRR(%)
ConvNet-BLSTM [20]   | 94.36    55.03    | 90.71    44.90    | -        -
DBN-BLSTM [20]       | 90.73    39.39    | 87.64    31.54    | -        -
ABBYY [23]           | 83.26    26.91    | 81.51    27.03    | -        -
CRNN                 | 98.17    79.67    | 97.84    77.98    | 97.44    67.08

TABLE II: Accuracy for Scene Text

Method           | IIIT-Arabic
                 | CRR(%)   WRR(%)
Tesseract [43]   | 17.07    5.92
CRNN             | 69.55    39.25

The lower accuracies on the scene text recognition problem testify to its inherent difficulty compared to printed or video text recognition.

IV. CONCLUSION

We demonstrate that state-of-the-art deep learning techniques can be successfully adapted to rather challenging tasks like Arabic text recognition. The newer, script- and language-agnostic approaches are well suited for low-resource languages like Arabic, where traditional methods often involve language-specific modules.

The success of RNNs in sequence learning problems has been instrumental in the recent advances in speech recognition and image-to-text transcription. This came as a boon for languages like Arabic, where segmenting words into sub-word units is often troublesome. The sequence learning approach can directly transcribe the images and also model the context in both forward and backward directions. In the CRNN architecture, this sequence-to-sequence framework is fed by a deep convolutional feature extractor which derives robust feature sequences from the input image.

On video text recognition we report the best results so far on both datasets. Compared to the previous best, where a CNN and an RNN were trained separately and used as a feature extractor and classifier respectively, the architecture we follow can be trained end-to-end.

Scene text recognition is a much harder problem compared to OCR or video text recognition. The variability in terms of lighting, distortions and typography makes the learning problem considerably harder. With better feature representations and learning algorithms now available, we believe the focus should shift to harder problems like scene text recognition. We hope the introduction of the IIIT Arabic scene text dataset and these initial results will spark interest in the Arabic computer vision community.

ACKNOWLEDGMENT

The authors would like to thank Maaz Anwar for helping annotate the IIIT Arabic dataset and Dr. Girish Varma for his timely help and discussions while finalising the network architecture.

REFERENCES

[1] A. Krizhevsky, I. Sutskever and G. E. Hinton, Imagenet classification with deep convolutional neural networks. NIPS 2012.

[2] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.

[3] R. B. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

[4] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke and J. Schmidhuber, A novel connectionist system for unconstrained handwriting recognition. TPAMI, 2009.

[5] B. Su and S. Lu, Accurate scene text recognition based on recurrent neural network. ACCV 2014.

[6] K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition. CoRR 2014.

[7] R. Prasad et al., Improvements in hidden Markov model based Arabic OCR. ICPR 2008.

[8] M. R. Yousefi et al., A comparison of 1D and 2D LSTM architectures for the recognition of handwritten Arabic. SPIE 2015.

[9] A. Ul-Hasan et al., Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks. ICDAR 2013.

[10] R. El-Hajj, L. Likforman-Sulem and C. Mokbel, Arabic handwriting recognition using baseline dependant features and hidden Markov modeling. ICDAR 2005.

[11] A. Amin, H. Al-Sadoun and S. Fischer, Hand-printed Arabic character recognition system using an artificial network. Pattern Recognition 29.4 (1996): 663-675.

[12] A. Amin, Off-line Arabic character recognition: the state of the art. Pattern Recognition 31.5 (1998): 517-530.

[13] M. Jaderberg, K. Simonyan, A. Vedaldi and A. Zisserman, Reading text in the wild with convolutional neural networks. IJCV 2015.

[14] Y. Bengio, P. Y. Simard and P. Frasconi, Learning long-term dependencies with gradient descent is difficult. NN 1994.

[15] F. A. Gers, N. N. Schraudolph and J. Schmidhuber, Learning precise timing with LSTM recurrent networks. JMLR 2002.

[16] S. Hochreiter and J. Schmidhuber, Long short-term memory. Neural Computation 1997.

[17] A. Graves, A. Mohamed and G. E. Hinton, Speech recognition with deep recurrent neural networks. ICASSP 2013.

[18] A. Graves, S. Fernandez, F. J. Gomez and J. Schmidhuber, Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. ICML 2006.

[19] P. Krishnan and C. V. Jawahar, Generating Synthetic Data for Text Recognition. arXiv 2016.

[20] S. Yousfi, S.-A. Berrani and C. Garcia, ALIF: A dataset for Arabic embedded text recognition in TV broadcast. ICDAR 2015.

[21] S. Yousfi, S.-A. Berrani and C. Garcia, Arabic text detection in videos using neural and boosting-based approaches: Application to video indexing. ICIP 2014.

[22] O. Zayene, S. Berrani and C. Garcia, A dataset for Arabic text detection, tracking and recognition in news videos - AcTiV. ICDAR 2015.

[23] K. Wang, B. Babenko and S. Belongie, End-to-end scene text recognition. ICCV 2011.

[24] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[25] P. J. Werbos, Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 1990.

[26] M. D. Zeiler, ADADELTA: an adaptive learning rate method. CoRR 2012.

[27] T. Wang, D. J. Wu, A. Coates and A. Y. Ng, End-to-end text recognition with convolutional neural networks. CVPR 2014.

[28] A. Bissacco, M. Cummins, Y. Netzer and H. Neven, PhotoOCR: Reading text in uncontrolled conditions. ICCV 2013.

[29] J. Almazan, A. Gordo, A. Fornes and E. Valveny, Word spotting and recognition with embedded attributes. TPAMI 2014.

[30] J. A. Rodriguez-Serrano, A. Gordo and F. Perronnin, Label embedding: A frugal baseline for text recognition. IJCV 2015.

[31] C. Yao, X. Bai, B. Shi and W. Liu, Strokelets: A learned multi-scale representation for scene text recognition. ICPR 2012.

[32] A. Gordo, Supervised mid-level features for word image representation. CVPR 2015.

[33] B. Shi, X. Bai and C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. arXiv 2015.

[34] M. B. Halima, H. Karray and A. M. Alimi, Arabic text recognition in video sequences. arXiv 2013.

[35] L. Agnihotri and N. Dimitrova, Text detection for video analysis. CVPR Workshop, 1999.

[36] L. Agnihotri, N. Dimitrova and M. Soletic, Multi-layered Videotext Extraction Method. ICME 2002.

[37] X.-S. Hua and X.-R. Chen, Automatic Location of Text in Video Frames. MIR 2001.

[38] A. K. Jain and B. Yu, Automatic text location in images and video frames. PR 1998.

[39] R. Lienhart and F. Stuber, Automatic text recognition in digital videos. Proceedings of SPIE Image and Video 1996.

[40] H. Karray, M. Ellouze and A. M. Alimi, Indexing Video Summaries for Quick Video Browsing. Computer Communications and Networks 2009.

[41] H. Karray and A. M. Alimi, Detection and Extraction of the Text in a video sequence. ICES 2005.

[42] M. Kherallah, H. Karray, M. Ellouze and A. M. Alimi, Toward an Interactive Device for Quick News Story Browsing. ICPR 2008.

[43] "Tesseract OCR", https://github.com/tesseract-ocr/tesseract.