Text Detection in Natural Scenes with
Stroke Width Transform
Gili Werner
Ben Gurion University, Israel
February, 2013
Abstract
My project aims at detecting text segments in an image of a natural scene, by using an
enhanced version of the Stroke Width Transform [1]. The application receives an RGB image to
search in, and returns a new image where the discovered text segments are marked. Due to the
features of the SWT, the resulting system is able to detect text regardless of its scale, direction,
font and language.
Table of Contents
Introduction
Problem Domain and Assumptions
The Stroke Width Transform
The Application
Experiments and Results
Future Work
References
Appendix – User Manual
Introduction
Detecting text in a natural scene is an important part of many Computer Vision tasks. For example, the
performance of optical character recognition (OCR) algorithms can be greatly improved by first
identifying the regions of text in the image.
Text detection in natural scenes is a highly researched field, and there are numerous approaches to the problem. However, most text detection schemes restrict the user to a specific language, scale and direction of text. In a natural scene, we may not want to make such assumptions and restrict the results accordingly. There is a tradeoff between the number of restrictions we apply and the quality of the result: the more we limit our search, the less noise we encounter.
In this project I attempted to create a powerful and reliable tool for detecting text regions in an image,
by using the Stroke Width Transform (SWT). The SWT approach groups pixels together in an intelligent way, instead of looking for features that separate them. By using the SWT I was able to relax
the assumptions I mentioned above, and still maintain a high quality result. My goal was to implement
and improve the algorithm defined in [1], so most of the text in a natural image will be discovered, with
as little noise as possible. A description of the different methods I attempted for improving the
algorithm can be found in the section ‘The Application’.
Problem Domain and Assumptions
My application expects to receive an image of a natural scene, as opposed to scanned pages, faxes and
business cards. It can receive any image of the JPEG format. The standard size of the tested image is
800x600 (for larger images the application will take a bit longer). The image does not have to be of top
quality (examples will be given in the Experiments and Results section), but the text should not be
blurry.
The text in the image can be of a variety of sizes; however, I assume a minimal height and width of 13 pixels and a maximal height/width of 300 pixels.
Also, the recognition is independent of the language of the text and its direction. Curved text is more difficult to detect, yet it is still possible.
Furthermore, the text detected in an execution can either be light text on a dark background, or dark
text on a light background. This is due to the features of the Stroke Width Transform, as described later
on.
The result displays the recognized characters grouped into regions. A region may have several letters missing; however, as long as the bounding box of the region contains them, the omission is not as problematic.
The Stroke Width Transform
In this section I will describe the Stroke Width Transform algorithm as it is presented in [1], with several
additions and enhancements. These additions will be discussed in further extent in the next section ‘The
Application’.
The algorithm receives an RGB image and returns an image of the same size, where the regions of
suspected text are marked. It has 3 major steps: the stroke width transform, grouping the pixels into
letter candidates based on their stroke width, and finally, grouping letter candidates into regions of text.
The Stroke Width Transform
A stroke in the image is a continuous band of a nearly constant width. An example of a stroke is shown
in figure 1(a). The Stroke Width Transform (SWT) is a local operator which calculates for each pixel the
width of the most likely stroke containing the pixel.
Figure 1
First, all pixels are initialized with ∞ as their stroke width. Then, we calculate the edge map of the image
by using the Canny edge detector. We consider the edges as possible stroke boundaries, and we wish to find the width of such a stroke. If p is an edge pixel, the direction of the gradient is roughly perpendicular
to the orientation of the stroke boundary. Therefore, the next step is to calculate the gradient direction
gp of the edge pixels, and follow the ray r=p+n*gp (n>0) until we find another edge pixel q. If the
gradient direction gq at q is roughly opposite to gp, then each pixel in the ray is assigned the distance between p and q as its stroke width, unless it already holds a lower value. If, however, an edge pixel q is
not found, or gq is not opposite to gp, the ray is discarded. In order to accommodate both bright text on
a dark background and dark text on a bright background, we need to apply the algorithm twice: once
with the ray direction gp and once with –gp.
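The first pass described above can be sketched as follows. This is a pure-NumPy illustration, not the report's MATLAB implementation: the edge map and gradient components are assumed to be precomputed (e.g. with a Canny detector), and the tolerance for "roughly opposite" gradients (about π/6) follows [1].

```python
import numpy as np

def swt_first_pass(edges, gx, gy, max_width=50):
    """One pass of the Stroke Width Transform along the gradient direction.

    edges : boolean edge map (e.g. from a Canny detector).
    gx, gy: gradient components at each pixel.
    Returns a map initialised to infinity, where each pixel on a valid
    ray holds the width of the most likely stroke containing it.
    """
    h, w = edges.shape
    swt = np.full((h, w), np.inf)
    for y, x in zip(*np.nonzero(edges)):
        norm = np.hypot(gx[y, x], gy[y, x])
        if norm == 0:
            continue
        dx, dy = gx[y, x] / norm, gy[y, x] / norm
        ray = [(y, x)]
        # follow the ray r = p + n*gp until another edge pixel is met
        for n in range(1, max_width):
            qy, qx = int(round(y + n * dy)), int(round(x + n * dx))
            if not (0 <= qy < h and 0 <= qx < w):
                break
            if (qy, qx) == ray[-1]:
                continue  # sub-pixel step landed on the same cell
            ray.append((qy, qx))
            if edges[qy, qx]:
                qnorm = np.hypot(gx[qy, qx], gy[qy, qx])
                # accept only if the opposite edge faces back at us
                if qnorm > 0 and (dx * gx[qy, qx] + dy * gy[qy, qx]) / qnorm < -np.cos(np.pi / 6):
                    width = np.hypot(qy - y, qx - x)
                    for ry, rx in ray:
                        swt[ry, rx] = min(swt[ry, rx], width)
                break  # an edge was hit; either way this ray is done
    return swt
```

Running the same function with (-gx, -gy) gives the second pass for the opposite text polarity. In practice the accepted rays are also recorded, since the second (median) pass revisits them.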
After the first pass described above, pixels in complex locations might not hold the true stroke width value (figure 2(b)). For that reason, we pass along each non-discarded ray a second time, and each pixel in the ray receives the minimum of its current value and the median value along that ray. (In the original algorithm the pixels are assigned the median value itself; in my experiments I got better results taking the minimum.)
Figure 2
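The second pass over the rays can be sketched as follows, assuming each accepted ray from the first pass was recorded as a list of pixel coordinates. Taking the minimum of the current value and the median is this report's variant; the original algorithm assigns the median directly.

```python
import numpy as np

def median_pass(swt, rays):
    """Second SWT pass: clamp each ray's pixels by the ray's median width.

    swt  : stroke-width map produced by the first pass.
    rays : list of rays, each a list of (y, x) pixel coordinates.
    """
    for ray in rays:
        med = np.median([swt[p] for p in ray])
        for p in ray:
            # this report's variant: min of current value and ray median
            swt[p] = min(swt[p], med)
    return swt
```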
Removing single lines from the SW-Map
In order to improve the results, I added another step to the algorithm, whose purpose is to improve the
character separation. Many times the letters in the image connect with each other, and the algorithm
recognizes a group of characters as a single component. Since the features of a bundle of characters do
not necessarily conform to the features of a single character, these components might be rejected in the
next phase. When reviewing the steps of the algorithm, I noticed that the SW operator returns many
stray lines that connect letters together. After removing such lines, these letters are no longer
considered part of the same component. Therefore, I added a step where the algorithm goes over each
pixel, and if its neighborhood does not contain enough pixels from its component, that pixel is removed
from the component. The results from this addition will be demonstrated in the section ‘Experiments
and Results’.
Finding letter candidates
We now have a map of the most likely stroke-widths for each pixel in the original image. The next step is
to group these pixels into letter candidates. This will be done by first grouping pixels with similar stroke
width, and then applying several rules to distinguish the letter candidates.
The grouping of the image will be done by using a Connected Component algorithm. In order to allow
smoothly varying stroke widths within a letter, we allow two pixels to be grouped together if their SWT
ratio is less than 3.0.
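This grouping can be sketched as a flood fill whose join test compares neighbouring stroke widths. The BFS below is only an illustration; the actual implementation adapts an open-source 'label' function, as noted in the Implementation Details.

```python
import numpy as np
from collections import deque

def swt_components(swt, ratio=3.0):
    """4-connected labelling of a stroke-width map.

    Neighbours join the same component if the ratio between their
    (positive, finite) stroke widths is below `ratio`. Pixels with
    infinite SWT are background and get label 0.
    """
    h, w = swt.shape
    labels = np.zeros((h, w), int)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if np.isinf(swt[sy, sx]) or labels[sy, sx]:
                continue
            next_label += 1
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny, nx]
                            and not np.isinf(swt[ny, nx])
                            and max(swt[y, x], swt[ny, nx]) / min(swt[y, x], swt[ny, nx]) < ratio):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    return labels
```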
Now we must detect the connected components which can pass as letter candidates, by applying a set of fairly flexible rules. These rules are as follows:
• The variance of the stroke width within a component must not be too big. This helps reject foliage in natural images, which is commonly mistaken for text.
• The aspect ratio of a component must be within a small range of values, in order to reject long and narrow components.
• The ratio between the diameter of the component and its median stroke width must be less than a learned threshold. This also helps reject long and narrow components.
• Components whose size is too large or too small are also ignored. This is done by limiting the length, width, and pixel count of the component.
• In addition to these rules, I added another rule which helped me eliminate much noise. This rule states that the ratio between the pixel count of the component and the number of pixels in its bounding box should be within a bounded range. This rejects components that spread over a large space yet have a small pixel count, as well as components which cover most of their bounding box.
The thresholds used were initially taken from the Stroke Width Transform description [1], and were updated slightly according to the results.
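The rules above can be sketched as a single predicate over one connected component. All numeric thresholds below are illustrative placeholders, not the learned values used by the report.

```python
import numpy as np

def is_letter_candidate(comp, swt):
    """Apply the letter-candidate rules to one connected component.

    comp: (N, 2) array of (y, x) pixel coordinates; swt: stroke-width map.
    Every threshold here is an assumed placeholder value.
    """
    widths = swt[comp[:, 0], comp[:, 1]]
    median_sw = np.median(widths)
    # 1. stroke-width variance must stay small relative to the mean
    if np.var(widths) > 0.5 * np.mean(widths) ** 2:
        return False
    h = np.ptp(comp[:, 0]) + 1
    w = np.ptp(comp[:, 1]) + 1
    # 2. aspect ratio within a small range
    if not 0.1 <= w / h <= 10.0:
        return False
    # 3. diameter vs. median stroke width below a threshold
    if np.hypot(h, w) / median_sw > 10.0:
        return False
    # 4. overall size limits (13..300 px, per the report's assumptions)
    if not (13 <= h <= 300 and 13 <= w <= 300):
        return False
    # 5. occupancy: pixel count vs. bounding-box area (the added rule)
    fill = len(comp) / (h * w)
    return 0.1 <= fill <= 0.9
```

A stroke-like component (e.g. a hollow square) passes, while a thin line fails on the aspect-ratio rule.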
The remaining connected components are considered letter candidates, and are now to be aggregated
into regions of text.
Grouping letter candidates into text regions
Since single letters are not expected to appear in images, we will now attempt to group closely
positioned letter candidates into regions of text. This filters out many falsely-identified letter candidates,
and improves the reliability of the algorithm results.
Again, we will use a small set of rules to group letters together into regions of text. These rules consider pairs of letters, and are as follows:
• Two letter candidates should have a similar stroke width. For this reason we limit the ratio between their median stroke widths to be less than some threshold.
• The ratio between the heights of the letters, and between their widths, must not exceed 2.5. This allows for capital letters next to lower-case letters.
• The distance between letters must not exceed three times the width of the wider one.
• Characters of the same word are expected to have a similar color; therefore we compare the average color of the candidates for pairing.
• Again, I added another rule, which restricts the pixel count of the pair of letters relative to the size of the bounding box of the pair.
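The pairing rules can be sketched as a predicate over two letter candidates. The 2.5 dimension ratio and the 3x distance factor come from the rules above; the stroke-width ratio limit and the colour tolerance are assumed values.

```python
def can_pair(a, b, sw_ratio_max=2.0, dim_ratio_max=2.5, dist_factor=3.0):
    """Pairing test for two letter candidates.

    Each candidate is a dict with 'median_sw', 'height', 'width',
    'x' (horizontal position) and 'color' (mean RGB tuple).
    sw_ratio_max and the colour tolerance (40) are assumptions.
    """
    # similar stroke width
    if max(a['median_sw'], b['median_sw']) / min(a['median_sw'], b['median_sw']) > sw_ratio_max:
        return False
    # comparable height and width (capitals next to lower case)
    for dim in ('height', 'width'):
        if max(a[dim], b[dim]) / min(a[dim], b[dim]) > dim_ratio_max:
            return False
    # distance at most 3x the width of the wider letter
    if abs(a['x'] - b['x']) > dist_factor * max(a['width'], b['width']):
        return False
    # similar average colour
    if max(abs(ca - cb) for ca, cb in zip(a['color'], b['color'])) > 40:
        return False
    return True
```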
When deciding to pair two letters together, we have 2 options: either both letters were not assigned a
region yet, or one of them was already grouped with other letters. If both are unassigned, all we need to
do is to declare a new region and assign them to it. Otherwise, we need to check if adding one letter to
the region of the other is reasonable. In my implementation, a merge is reasonable if the pixel count of the letters in the region, plus the pixel count of the letter to add, divided by the size of the bounding box of all the letters combined, is not below some threshold. This ensures the region of text will not have loose ends, and will form a “box” of text. This approach is a bit different from the approach in the
original algorithm, which gathers letters together into chains. I will discuss this in the next section.
Finally, regions with fewer than 3 letters are discarded.
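The merge test described above can be sketched as follows. The threshold value is an assumption, since the report does not state it.

```python
def merge_is_reasonable(region, letter, min_fill=0.3):
    """Check whether adding `letter` to `region` keeps a dense 'box' of text.

    region and letter each carry 'pixel_count' and a bounding box
    (x0, y0, x1, y1). min_fill is an assumed threshold.
    """
    # bounding box of all the letters combined
    x0 = min(region['x0'], letter['x0'])
    y0 = min(region['y0'], letter['y0'])
    x1 = max(region['x1'], letter['x1'])
    y1 = max(region['y1'], letter['y1'])
    box_area = (x1 - x0) * (y1 - y0)
    # combined pixel count relative to the combined box
    fill = (region['pixel_count'] + letter['pixel_count']) / box_area
    return fill >= min_fill
```

A nearby letter keeps the combined box dense and is merged; a distant one stretches the box and is rejected.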
The flow chart of the algorithm is shown in figure 3. The implementation covers the algorithm up to the text aggregation phase.
Figure 3
The Application
Overview
The SWT Text Detector application is designed to locate and mark the regions of an image that are
suspected to contain text. The application receives an RGB image, and whether the text to search is
light-on-dark or vice-versa. It returns an image of the same size as the input image, where the pixels of
each detected text region are marked. The value of each pixel is the ID number of the region it belongs
to, where 0 means it does not belong to any region. I deliberately wanted to show the letters inside each
region, and not just the bounding box of the region, in order to see the efficiency of the algorithm
better.
The implementation of the application contains several parts, as discussed in the previous section:
• The stroke width transform: edge detection and stroke width calculation.
• Removing stray lines from the SW map.
• Finding letter candidates: finding the connected components and detecting the components with the features of a letter.
• Grouping the letters into regions of text.
Improvement Approaches
My implementation differs from the original algorithm in several aspects:
First of all, I added another rule to the ‘discovery of letter candidates’ phase. This rule, as described in
the previous section, filters out components which spread over too much space compared to the number of pixels in them (their size). For example:
Figure 4 - The component clearly spreads over too much space,
even though other features might fit the features of a character
I added a similar rule to the ‘aggregation of letters’ phase, which restricts the pairing of letters. This is to
ensure that two letters are combined into the same region only if the bounding box surrounding them is
of reasonable size compared to the size of the letters. For example:
Figure 5 - The left image shows two components that will not be grouped, while
the right image shows two components that can be grouped together
In an attempt to improve the results, I added another step to the algorithm. I noticed that many times the SW-map contains lines that connect different components of the image. These lines may connect faraway elements or elements close by. The most damage, as I observed, is done by connecting close
elements. This way, letters that are close to each other will be grouped into one component. This forces
the thresholds for character and region recognition to be less strict, allowing more noise to appear in
the result.
The added step goes over each pixel in the SW-map and examines its neighborhood: if it contains 3 or fewer pixels with the same stroke-width label, the label is removed from that pixel. This way, single lines
and stray pixels are removed, and fewer components will be falsely grouped together. Experimental
results can be seen in the next section.
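This step can be sketched as follows. A 3x3 neighbourhood and a threshold of 3 neighbours are assumptions based on the description above.

```python
import numpy as np

def remove_stray_pixels(labels, min_neighbors=3):
    """Drop pixels whose 3x3 neighbourhood holds too few same-label pixels.

    labels: integer component map, 0 = background. The window size and
    threshold are assumed from the report's description.
    """
    h, w = labels.shape
    out = labels.copy()
    for y in range(h):
        for x in range(w):
            lab = labels[y, x]
            if lab == 0:
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == lab:
                        count += 1
            if count <= min_neighbors:
                out[y, x] = 0  # stray pixel or single-line segment
    return out
```

A one-pixel-wide bridge between two blobs loses its label, while pixels inside the blobs keep theirs, so the blobs are no longer fused into one component.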
Another change I made to the SWT algorithm was relaxing some of the restrictions in the separation step. Instead of separating letter candidates into lines, I separated them into regions of text. The incentive for this change was the desire to detect text in many directions, including curvy text, and not simply assume a certain orientation for all the text in the image.
Experiments and Results
In this section I will show the outcome of the SW Text Detector on a set of images, and compare the
results to the different approaches I discussed previously.
Results
(1)
Figure 6 - The different steps of the algorithm
(2)
(3)
(4)
(5)
(6)
(7)
The strengths of the algorithm:
As you can see from the examples displayed, the SW Detector can detect letters of different languages: English, Hebrew, Arabic, etc. (5). The text can be of varying sizes (2, 3), and of different orientations (including curvy text - 4). Even handwriting can be detected (6, 7).
The weaknesses of the algorithm:
In certain cases, some noise can be detected in the result (1, 4). This usually happens when there is
foliage in the image. The features of foliage resemble those of letters, and might produce a false
detection of letter candidates. This can be seen in the letters of examples 1, 2, and 6.
Also, the text detector does not handle round and curved letters as well. For example, in (6) the cursive
letters were not recognized, as opposed to the print letters. Similarly, curved lines of text produce weak
results (4). This varies according to the level of strictness in the ‘grouping letters into regions’ phase. If
we relax the thresholds, more letters will be grouped together, yet more noise will appear as well.
Another weakness I discovered was that small and close letters tend to be grouped together in the SW
labeling phase. Since a group of letters behaves differently from a single letter, these groups may be
dismissed in the ‘finding letter candidates’ phase. For example, in (5), the word ROADS was not
recognized since the letters ‘ROA’ were labeled as a single component, and their features together differ
from the features of a single letter. Although D and S were recognized, we dismissed them since we
expect a region to contain at least 3 letters.
Figure 7 - The R, O and A were grouped together during the SW labeling phase
In an attempt to avoid such cases, I added the phase where stray pixels are removed from the SW map. In this example, the pixel connecting R and O, and the pixel connecting O and A, should be removed, allowing us to recognize the letters ROA as three separate letters.
Comparing the Results
Next I will compare the results of the original algorithm with the results of the algorithm with the
addition of the phase which removes stray pixels and lines.
Image Original SWT SW Detector version
As you can see from the results, the improvement can be found in the ‘Finding Letter Candidate’ phase
and in the ‘Grouping Letters into Regions’ phase. It results in both discovering letters that were
previously dismissed, and dismissing noise.
Although an improvement can be detected for many images, the effect was not as substantial as I anticipated. For example, for some images the detector detected more noise with my additions than without.
Future Work
In the future I would like to improve the labeling algorithm. After examining the different steps of the algorithm, I realized that its Achilles' heel is the connected-component implementation. A better method of labeling components could improve the detection of characters and would allow us to use stricter thresholds. This way, we could get better results for circular text, which tends to be dismissed as noise due to the grouping of the letters. It would also allow us to identify curvy letters better, such as Arabic fonts or cursive handwriting.
References
[1] B. Epshtein, E. Ofek, Y. Wexler. Detecting Text in Natural Scenes with Stroke Width Transform.
CVPR, 2010.
[2] L. Neumann, J. Matas. A method for text localization and recognition in real-world images. ACCV,
2010.
Implementation Details
For the labeling process I used an open-source labeling function, ‘label’, and updated it to comply with the Stroke Width Transform algorithm. The license is included with the code. The source: http://www.mathworks.com/matlabcentral/fileexchange/26946-label-connected-components-in-2-d-array.
For the rest of the implementation, I used [1] as a reference.
Appendix - User Manual
To run the application, open the MATLAB file ‘runSWTTextDetector’ and execute it (make sure the current folder is the directory of the application).
• Select an image by clicking ‘Browse’. The supported formats are JPEG and PNG.
• Select which type of text you wish to detect: dark text on a light background or vice versa.
• Press ‘Detect text!’ in order to start the process of detection.
The output of the program is two figures: one figure will display the original image, and the other
figure will display the result.
Enjoy ☺