Text Detection in Natural Scenes with
Stroke Width Transform
Gili Werner
Ben Gurion University, Israel
February, 2013
Abstract
My project aims at detecting text segments in an image of a natural scene, by using an
enhanced version of the Stroke Width Transform [1]. The application receives an RGB image to
search in, and returns a new image where the discovered text segments are marked. Due to the
features of the SWT, the resulting system is able to detect text regardless of its scale, direction,
font and language.
Table of Contents
Introduction
Problem Domain and Assumptions
The Stroke Width Transform
The Application
Experiments and Results
Future Work
References
Appendix – User Manual
Introduction
Detecting text in a natural scene is an important part of many Computer Vision tasks. For example, the
performance of optical character recognition (OCR) algorithms can be greatly improved by first
identifying the regions of text in the image.
Text detection in natural scenes is a highly researched field, and there are numerous approaches to the problem. However, most text detection schemes restrict the user to a specific language, scale and direction of text. In a natural scene, we may not want to make such assumptions and restrict the results accordingly. There is a tradeoff between the number of restrictions we apply and the quality of the result: the more we limit our search, the less noise we encounter.
In this project I attempted to create a powerful and reliable tool for detecting text regions in an image,
by using the Stroke Width Transform (SWT). The SWT approach groups pixels together in an intelligent way, instead of looking for features that separate them. By using the SWT I was able to relax
the assumptions I mentioned above, and still maintain a high quality result. My goal was to implement
and improve the algorithm defined in [1], so most of the text in a natural image will be discovered, with
as little noise as possible. A description of the different methods I attempted for improving the
algorithm can be found in the section ‘The Application’.
Problem Domain and Assumptions
My application expects to receive an image of a natural scene, as opposed to scanned pages, faxes and
business cards. It can receive any image of the JPEG format. The standard size of the tested image is
800x600 (for larger images the application will take a bit longer). The image does not have to be of top
quality (examples will be given in the Experiments and Results section), but the text should not be
blurry.
The text in the image can be of a variety of sizes; however, I assume a minimal height and width of 13 pixels and a maximal height/width of 300 pixels.
Also, the recognition is independent of the language of the text and its direction. Curved text is more difficult to detect, yet it is still possible.
Furthermore, the text detected in an execution can either be light text on a dark background, or dark
text on a light background. This is due to the features of the Stroke Width Transform, as described later
on.
The result displays the recognized characters grouped into regions. A region may have several letters missing; however, as long as the bounding box of the region contains them, the omission is not as problematic.
The Stroke Width Transform
In this section I will describe the Stroke Width Transform algorithm as it is presented in [1], with several
additions and enhancements. These additions will be discussed in further extent in the next section ‘The
Application’.
The algorithm receives an RGB image and returns an image of the same size, where the regions of
suspected text are marked. It has 3 major steps: the stroke width transform, grouping the pixels into
letter candidates based on their stroke width, and finally, grouping letter candidates into regions of text.
The Stroke Width Transform
A stroke in the image is a continuous band of a nearly constant width. An example of a stroke is shown
in figure 1(a). The Stroke Width Transform (SWT) is a local operator which calculates for each pixel the
width of the most likely stroke containing the pixel.
Figure 1
First, all pixels are initialized with ∞ as their stroke width. Then, we calculate the edge map of the image
by using the Canny edge detector. We consider the edges as possible stroke boundaries, and we wish to find the width of such a stroke. If p is an edge pixel, the direction of the gradient is roughly perpendicular
to the orientation of the stroke boundary. Therefore, the next step is to calculate the gradient direction
gp of the edge pixels, and follow the ray r=p+n*gp (n>0) until we find another edge pixel q. If the
gradient direction gq at q is roughly opposite to gp, then each pixel in the ray is assigned the distance between p and q as its stroke width, unless it already holds a lower value. If, however, an edge pixel q is
not found, or gq is not opposite to gp, the ray is discarded. In order to accommodate both bright text on
a dark background and dark text on a bright background, we need to apply the algorithm twice: once
with the ray direction gp and once with –gp.
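The first pass described above can be sketched as follows. This is a pure-NumPy illustration, not the report's MATLAB implementation: the edge map and gradient components are assumed to be precomputed (e.g. with a Canny detector), and the tolerance for "roughly opposite" gradients (about π/6) follows [1].

```python
import numpy as np

def swt_first_pass(edges, gx, gy, max_width=50):
    """One pass of the Stroke Width Transform along the gradient direction.

    edges : boolean edge map (e.g. from a Canny detector).
    gx, gy: gradient components at each pixel.
    Returns a map initialised to infinity, where each pixel on a valid
    ray holds the width of the most likely stroke containing it.
    """
    h, w = edges.shape
    swt = np.full((h, w), np.inf)
    for y, x in zip(*np.nonzero(edges)):
        norm = np.hypot(gx[y, x], gy[y, x])
        if norm == 0:
            continue
        dx, dy = gx[y, x] / norm, gy[y, x] / norm
        ray = [(y, x)]
        # follow the ray r = p + n*gp until another edge pixel is met
        for n in range(1, max_width):
            qy, qx = int(round(y + n * dy)), int(round(x + n * dx))
            if not (0 <= qy < h and 0 <= qx < w):
                break
            if (qy, qx) == ray[-1]:
                continue  # sub-pixel step landed on the same cell
            ray.append((qy, qx))
            if edges[qy, qx]:
                qnorm = np.hypot(gx[qy, qx], gy[qy, qx])
                # accept only if the opposite edge faces back at us
                if qnorm > 0 and (dx * gx[qy, qx] + dy * gy[qy, qx]) / qnorm < -np.cos(np.pi / 6):
                    width = np.hypot(qy - y, qx - x)
                    for ry, rx in ray:
                        swt[ry, rx] = min(swt[ry, rx], width)
                break  # an edge was hit; either way this ray is done
    return swt
```

Running the same function with (-gx, -gy) gives the second pass for the opposite text polarity. In practice the accepted rays are also recorded, since the second (median) pass revisits them.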
After the first pass described above, pixels in complex locations might not hold the true stroke width value (figure 2(b)). For that reason, we pass along each non-discarded ray a second time, and each pixel in the ray receives the minimum of its current value and the median value along that ray. (In the original algorithm the pixels are assigned the median value itself; in my experiments I got better results taking the minimum.)
Figure 2
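The second pass over the rays can be sketched as follows, assuming each accepted ray from the first pass was recorded as a list of pixel coordinates. Taking the minimum of the current value and the median is this report's variant; the original algorithm assigns the median directly.

```python
import numpy as np

def median_pass(swt, rays):
    """Second SWT pass: clamp each ray's pixels by the ray's median width.

    swt  : stroke-width map produced by the first pass.
    rays : list of rays, each a list of (y, x) pixel coordinates.
    """
    for ray in rays:
        med = np.median([swt[p] for p in ray])
        for p in ray:
            # this report's variant: min of current value and ray median
            swt[p] = min(swt[p], med)
    return swt
```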
Removing single lines from the SW-Map
In order to improve the results, I added another step to the algorithm, whose purpose is to improve the
character separation. Many times the letters in the image connect with each other, and the algorithm
recognizes a group of characters as a single component. Since the features of a bundle of characters do
not necessarily conform to the features of a single character, these components might be rejected in the
next phase. When reviewing the steps of the algorithm, I noticed that the SW operator returns many
stray lines that connect letters together. After removing such lines, these letters are no longer
considered part of the same component. Therefore, I added a step where the algorithm goes over each
pixel, and if its neighborhood does not contain enough pixels from its component, that pixel is removed
from the component. The results from this addition will be demonstrated in the section ‘Experiments
and Results’.
Finding letter candidates
We now have a map of the most likely stroke-widths for each pixel in the original image. The next step is
to group these pixels into letter candidates. This will be done by first grouping pixels with similar stroke
width, and then applying several rules to distinguish the letter candidates.
The grouping of the image will be done by using a Connected Component algorithm. In order to allow
smoothly varying stroke widths within a letter, we allow two pixels to be grouped together if their SWT
ratio is less than 3.0.
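This grouping can be sketched as a flood fill whose join test compares neighbouring stroke widths. The BFS below is only an illustration; the actual implementation adapts an open-source 'label' function, as noted in the Implementation Details.

```python
import numpy as np
from collections import deque

def swt_components(swt, ratio=3.0):
    """4-connected labelling of a stroke-width map.

    Neighbours join the same component if the ratio between their
    (positive, finite) stroke widths is below `ratio`. Pixels with
    infinite SWT are background and get label 0.
    """
    h, w = swt.shape
    labels = np.zeros((h, w), int)
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if np.isinf(swt[sy, sx]) or labels[sy, sx]:
                continue
            next_label += 1
            labels[sy, sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w and not labels[ny, nx]
                            and not np.isinf(swt[ny, nx])
                            and max(swt[y, x], swt[ny, nx]) / min(swt[y, x], swt[ny, nx]) < ratio):
                        labels[ny, nx] = next_label
                        queue.append((ny, nx))
    return labels
```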
Now we must detect the connected components which can pass as letter candidates, by applying a set of fairly flexible rules. These rules are as follows:
• The variance of the stroke width within a component must not be too big. This helps reject foliage in natural images, which is commonly mistaken for text.
• The aspect ratio of a component must be within a small range of values, in order to reject long and narrow components.
• The ratio between the diameter of the component and its median stroke width must be less than a learned threshold. This also helps reject long and narrow components.
• Components whose size is too large or too small are also ignored. This is done by limiting the length, width, and pixel count of the component.
• In addition to these rules, I added another rule which helped me eliminate much noise. This rule states that the ratio between the pixel count of the component and the number of pixels in its bounding box should be within a bounded range. This rejects components that spread over a large space yet have a small pixel count, as well as components which cover most of their bounding box.
The thresholds used were initially taken from the Stroke Width Transform description [1], and were updated slightly according to the results.
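The rules above can be sketched as a single predicate over one connected component. All numeric thresholds below are illustrative placeholders, not the learned values used by the report.

```python
import numpy as np

def is_letter_candidate(comp, swt):
    """Apply the letter-candidate rules to one connected component.

    comp: (N, 2) array of (y, x) pixel coordinates; swt: stroke-width map.
    Every threshold here is an assumed placeholder value.
    """
    widths = swt[comp[:, 0], comp[:, 1]]
    median_sw = np.median(widths)
    # 1. stroke-width variance must stay small relative to the mean
    if np.var(widths) > 0.5 * np.mean(widths) ** 2:
        return False
    h = np.ptp(comp[:, 0]) + 1
    w = np.ptp(comp[:, 1]) + 1
    # 2. aspect ratio within a small range
    if not 0.1 <= w / h <= 10.0:
        return False
    # 3. diameter vs. median stroke width below a threshold
    if np.hypot(h, w) / median_sw > 10.0:
        return False
    # 4. overall size limits (13..300 px, per the report's assumptions)
    if not (13 <= h <= 300 and 13 <= w <= 300):
        return False
    # 5. occupancy: pixel count vs. bounding-box area (the added rule)
    fill = len(comp) / (h * w)
    return 0.1 <= fill <= 0.9
```

A stroke-like component (e.g. a hollow square) passes, while a thin line fails on the aspect-ratio rule.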
The remaining connected components are considered letter candidates, and are now to be aggregated
into regions of text.
Grouping letter candidates into text regions
Since single letters are not expected to appear in images, we will now attempt to group closely
positioned letter candidates into regions of text. This filters out many falsely-identified letter candidates,
and improves the reliability of the algorithm results.
Again, we will use a small set of rules to group letters together into regions of text. These rules consider pairs of letters, and are as follows:
• Two letter candidates should have a similar stroke width. For this reason we limit the ratio between their median stroke widths to be less than some threshold.
• The ratio between the heights of the letters, and between their widths, must not exceed 2.5. This allows for capital letters next to lower-case letters.
• The distance between letters must not exceed three times the width of the wider one.
• Characters of the same word are expected to have a similar color; therefore we compare the average color of the candidates for pairing.
• Again, I added another rule, which restricts the pixel count of the pair of letters relative to the size of the bounding box of the pair.
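The pairing rules can be sketched as a predicate over two letter candidates. The 2.5 dimension ratio and the 3x distance factor come from the rules above; the stroke-width ratio limit and the colour tolerance are assumed values.

```python
def can_pair(a, b, sw_ratio_max=2.0, dim_ratio_max=2.5, dist_factor=3.0):
    """Pairing test for two letter candidates.

    Each candidate is a dict with 'median_sw', 'height', 'width',
    'x' (horizontal position) and 'color' (mean RGB tuple).
    sw_ratio_max and the colour tolerance (40) are assumptions.
    """
    # similar stroke width
    if max(a['median_sw'], b['median_sw']) / min(a['median_sw'], b['median_sw']) > sw_ratio_max:
        return False
    # comparable height and width (capitals next to lower case)
    for dim in ('height', 'width'):
        if max(a[dim], b[dim]) / min(a[dim], b[dim]) > dim_ratio_max:
            return False
    # distance at most 3x the width of the wider letter
    if abs(a['x'] - b['x']) > dist_factor * max(a['width'], b['width']):
        return False
    # similar average colour
    if max(abs(ca - cb) for ca, cb in zip(a['color'], b['color'])) > 40:
        return False
    return True
```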
When deciding to pair two letters together, we have 2 options: either both letters were not assigned a
region yet, or one of them was already grouped with other letters. If both are unassigned, all we need to
do is to declare a new region and assign them to it. Otherwise, we need to check if adding one letter to
the region of the other is reasonable. In my implementation, a merge is reasonable if the pixel count of the letters in the region, plus the pixel count of the letter to add, divided by the size of the bounding box of all the letters combined, is not below some threshold. This ensures the region of text will not have loose ends, and will form a “box” of text. This approach is a bit different from the approach in the
original algorithm, which gathers letters together into chains. I will discuss this in the next section.
Finally, regions with fewer than 3 letters are discarded.
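The merge test described above can be sketched as follows. The threshold value is an assumption, since the report does not state it.

```python
def merge_is_reasonable(region, letter, min_fill=0.3):
    """Check whether adding `letter` to `region` keeps a dense 'box' of text.

    region and letter each carry 'pixel_count' and a bounding box
    (x0, y0, x1, y1). min_fill is an assumed threshold.
    """
    # bounding box of all the letters combined
    x0 = min(region['x0'], letter['x0'])
    y0 = min(region['y0'], letter['y0'])
    x1 = max(region['x1'], letter['x1'])
    y1 = max(region['y1'], letter['y1'])
    box_area = (x1 - x0) * (y1 - y0)
    # combined pixel count relative to the combined box
    fill = (region['pixel_count'] + letter['pixel_count']) / box_area
    return fill >= min_fill
```

A nearby letter keeps the combined box dense and is merged; a distant one stretches the box and is rejected.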
The flow chart of the algorithm is shown in figure 3. The implementation covers the algorithm up to the text aggregation phase.
Figure 3
The Application
Overview
The SWT Text Detector application is designed to locate and mark the regions of an image that are
suspected to contain text. The application receives an RGB image, and whether the text to search is
light-on-dark or vice-versa. It returns an image of the same size as the input image, where the pixels of
each detected text region are marked. The value of each pixel is the ID number of the region it belongs
to, where 0 means it does not belong to any region. I deliberately wanted to show the letters inside each
region, and not just the bounding box of the region, in order to see the efficiency of the algorithm
better.
The implementation of the application contains several parts, as discussed in the previous section:
• The stroke width transform: edge detection and stroke width calculation.
• Removing stray lines from the SW map.
• Finding letter candidates: finding the connected components and detecting the components with the features of a letter.
• Grouping the letters into regions of text.
Improvement Approaches
My implementation differs from the original algorithm in several aspects:
First of all, I added another rule to the ‘discovery of letter candidates’ phase. This rule, as described in
the previous section, filters out components which spread over too much space compared to the number of pixels in them (their size). For example:
Figure 4 - The component clearly spreads over too much space,
even though other features might fit the features of a character
I added a similar rule to the ‘aggregation of letters’ phase, which restricts the pairing of letters. This is to
ensure that two letters are combined into the same region only if the bounding box surrounding them is
of reasonable size compared to the size of the letters. For example:
Figure 5 - The left image shows two components that will not be grouped, while
the right image shows two components that can be grouped together
In an attempt to improve the results, I added another step to the algorithm. I noticed that many times the SW-map contains lines that connect different components of the image. These lines may connect faraway elements or elements close by. The most damage, as I observed, is done by connecting close
elements. This way, letters that are close to each other will be grouped into one component. This forces
the thresholds for character and region recognition to be less strict, allowing more noise to appear in
the result.
The added step goes over each pixel in the SW-map and examines its neighborhood: if it contains 3 or fewer pixels with the same stroke-width label, the label is removed from that pixel. This way, single lines
and stray pixels are removed, and fewer components will be falsely grouped together. Experimental
results can be seen in the next section.
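This step can be sketched as follows. A 3x3 neighbourhood and a threshold of 3 neighbours are assumptions based on the description above.

```python
import numpy as np

def remove_stray_pixels(labels, min_neighbors=3):
    """Drop pixels whose 3x3 neighbourhood holds too few same-label pixels.

    labels: integer component map, 0 = background. The window size and
    threshold are assumed from the report's description.
    """
    h, w = labels.shape
    out = labels.copy()
    for y in range(h):
        for x in range(w):
            lab = labels[y, x]
            if lab == 0:
                continue
            count = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and labels[ny, nx] == lab:
                        count += 1
            if count <= min_neighbors:
                out[y, x] = 0  # stray pixel or single-line segment
    return out
```

A one-pixel-wide bridge between two blobs loses its label, while pixels inside the blobs keep theirs, so the blobs are no longer fused into one component.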
Another change I made to the SWT algorithm was relaxing some of the restrictions in the separation step. Instead of separating letter candidates into lines, I separated them into regions of text. The incentive for this change was the desire to detect text in many directions, including curvy text, and not simply assume a certain orientation for all the text in the image.
Experiments and Results
In this section I will show the outcome of the SW Text Detector on a set of images, and compare the
results to the different approaches I discussed previously.
Results
(1)
Figure 6 - The different steps of the algorithm
(2)
(3)
(4)
(5)
(6)
(7)
The strengths of the algorithm:
As you can see from the examples displayed, the SW Detector can detect letters of different languages: English, Hebrew, Arabic, etc. (5). The text can be of varying sizes (2, 3), and of different orientations (including curvy text - 4). Even handwriting can be detected (6, 7).
The weaknesses of the algorithm:
In certain cases, some noise can be detected in the result (1, 4). This usually happens when there is
foliage in the image. The features of foliage resemble those of letters, and might produce a false
detection of letter candidates. This can be seen in the letters of examples 1, 2, and 6.
Also, the text detector does not handle round and curved letters as well. For example, in (6) the cursive
letters were not recognized, as opposed to the print letters. Similarly, curved lines of text produce weak
results (4). This varies according to the level of strictness in the ‘grouping letters into regions’ phase. If
we relax the thresholds, more letters will be grouped together, yet more noise will appear as well.
Another weakness I discovered was that small and close letters tend to be grouped together in the SW
labeling phase. Since a group of letters behaves differently from a single letter, these groups may be
dismissed in the ‘finding letter candidates’ phase. For example, in (5), the word ROADS was not
recognized since the letters ‘ROA’ were labeled as a single component, and their features together differ
from the features of a single letter. Although D and S were recognized, we dismissed them since we
expect a region to contain at least 3 letters.
Figure 7 - The R, O and A were grouped together during the SW labeling phase
In an attempt to avoid such cases, I added the phase where stray pixels are removed from the SW map. In this example, the pixel connecting R and O, and the pixel connecting O and A, should be removed, allowing us to recognize the letters ROA as three separate letters.
Comparing the Results
Next I will compare the results of the original algorithm with the results of the algorithm with the
addition of the phase which removes stray pixels and lines.
Image Original SWT SW Detector version
As you can see from the results, the improvement can be found in the ‘Finding Letter Candidate’ phase
and in the ‘Grouping Letters into Regions’ phase. It results in both discovering letters that were
previously dismissed, and dismissing noise.
Although an improvement can be detected for many images, the effect was not as substantial as I anticipated. For example, for some images the detector detected more noise with my additions than without.
Future Work
In the future I would like to improve the labeling algorithm. After examining the different steps of the algorithm, I realized that its Achilles' heel is the connected-component implementation. A better method of labeling components could improve the detection of characters and would allow us to use stricter thresholds. This way, we could get better results for circular text, which tends to be dismissed as noise due to the grouping of the letters. It would also allow us to identify curvy letters better, such as Arabic fonts or cursive handwriting.
References
[1] B. Epshtein, E. Ofek, Y. Wexler. Detecting Text in Natural Scenes with Stroke Width Transform.
CVPR, 2010.
[2] L. Neumann, J. Matas. A method for text localization and recognition in real-world images. ACCV,
2010.
Implementation Details
For the labeling process I used an open-source labeling function, ‘label’, and updated it to comply with the Stroke Width Transform algorithm. The license is included with the code. The source: http://www.mathworks.com/matlabcentral/fileexchange/26946-label-connected-components-in-2-d-array.
For the rest of the implementation, I used [1] as a reference.
Appendix - User Manual
To run the application, open the MATLAB file ‘runSWTTextDetector’ and execute it (make sure the current folder is the directory of the application).
• Select an image by clicking ‘Browse’. The supported formats are JPEG and PNG.
• Select which type of text you wish to detect: dark text on a light background or vice versa.
• Press ‘Detect text!’ in order to start the process of detection.
The output of the program is two figures: one figure will display the original image, and the other
figure will display the result.
Enjoy ☺