Decision Support Systems 37 (2004) 377–396
Unsupervised clustering for nontextual web
document classification
Samuel W.K. Chan*, Mickey W.C. Chong
Department of Decision Sciences and Managerial Economics, The Chinese University of Hong Kong, Hong Kong, China
Received 19 January 2001; accepted 2 July 2002
Available online 26 March 2003
Abstract
While the breadth of vocabulary used in long documents may mislead traditional keyword-based retrieval systems, the demand for techniques for nontextual Web classification and retrieval from large document collections is mounting. Only a few prototype systems have attempted to classify hypertext on the basis of nontextual elements in order to locate unfamiliar documents. As a result, a large portion of Web documents whose information is mainly pictorial is far beyond the reach of most current search engines. In this research, we devise a novel quantitative model of nontextual World Wide Web (WWW) classification based on image information. An intelligent content-sensitive, attribute-rich image classifier is presented. An image similarity measure is used to deduce the likeness among images. Different image feature vectors have been constructed and evaluated. Evaluation shows that images judged to be similar by humans form interesting clusters in our unsupervised learning. Comparison with other clustering techniques, such as Hierarchical Agglomerative Clustering (HAC), demonstrates that our approach is useful in content-based image information retrieval.
© 2003 Elsevier B.V. All rights reserved.
Keywords: Unsupervised clustering; Image classification; Neural networks
1. Introduction
The last decade has witnessed dramatic progress in the area of communication technology. With respect to the well-established information superhighways, researchers have found themselves at the centre of an information revolution ushered in by the Internet age. The World Wide Web provides a fast and efficient channel to disseminate information. With the advent of relatively economical and large online storage
capacities, comprehensive sources of text and image can be stored and made available. It is also hypothe-
sized that images, as one of the major nontextual
media, will have taken up more than 70% of Internet
traffic in the next decade [19]. In order to scale up the
effectiveness of Information Retrieval (IR) in this
cyberspace, the use of heterogeneous knowledge about content to enhance retrieval functionality has been emphasized [2,5]. Current IR systems can be regarded as a combination of traditional techniques and advanced browsing facilities. However, they support neither content-sensitive retrieval nor classification of nontextual information. It is indisputable that the challenges in the classification of nontextual documents and the demands for content-based image retrieval, particularly in Web applications, are both mounting.
In recent years, many content-based image retrieval
systems have been developed, such as QBIC system
[17] and WBIIS system [26], to name a few. The
common technique for content-based image retrieval
systems is to extract a signature for every image based
on its pixel values, color or shape, and to define a
classification mechanism for image comparison. Two
major classes of signatures are commonly found. First,
with the increasing availability of devices supporting
acquisition and visualization of color images, a grow-
ing attention is being focused on color as a key feature
in the signature to characterize the content of archived
images. Color histogram technique is a simple but low-
level method which provides basic indexing and
retrieval tasks [8,25]. However, its major tradeoff is
that spatial features are completely lost and the spatial
relations between parts of an image cannot be thor-
oughly utilized. On the other hand, different from color
histogram techniques, region-based image classifica-
tion systems attempt to overcome the drawbacks in
color-based approach by representing images at object
level. As object shape is one of the important features of
images, a number of shape representations have been
used in many image classification systems. Moment
invariant and other simple features, such as line, edge,
region and area, are used for shape representation [17].
In this paper, we advocate the integration of
various image features into a high-level cognition
classification model using a self-organization map.
We describe a framework that encompasses low-level features, including color, line segments and region fragments, without demanding that any particular image knowledge source be dominant. A new
approach to image feature vector formation and an
image similarity measure which is much more appro-
priate for content-based image clustering are pre-
sented. Different levels of abstraction and high-level
features, such as the image domain knowledge based
on the image templates stored in our image repository,
are employed. In addition, since different users may
perceive image patterns of the same objects differently
and, consequently, may categorize the same objects
into different groups, we provide an interactive user
interface to capture the user preferences on different
image features. These high-level user preferences with
a set of image templates encapsulated with a wide
range of low-level image features are most appropri-
ately viewed as complementary in our domain of
nontextual Web document classification. In the next
section, we first review the related work and compare
our approach with the existing ones. The overview of
system architecture is then discussed in Section 3. The
technical details in extracting image signatures are
described in Section 4. Details on how image pads can
be compared and similarity measures can be deduced
are also explained. In Section 5, we introduce a
clustering algorithm using neural networks. We em-
ploy a self-organization map with associated learning
scheme for image classification. Experimental results
on different compositions of image features are dis-
cussed in Section 6 with a comparison with the
traditional Hierarchical Agglomerative Clustering
(HAC) technique, followed by a conclusion.
2. Related work
Cortelazzo et al. [9] describe a trademark shape
description based on chain-coding and string-matching
technique. In their system, chain codes are not normal-
ized and shape distance is measured using stringmatch-
ing. More recently, a knowledge-based image analyzer
is proposed and implemented using object-oriented
approach [3,4]. Their system captures the visual knowl-
edge and has been applied in diagnosing medical
images in cervical cancer cells. Working toward the
text-based classification, the Kohonen group has created and maintained a WEBSOM server that demonstrates its ability to categorize several thousand Internet newsgroup items [12]. The SOM-generated categories are found to be comparable to those generated by human subjects [5,20]. Although some other innovative approaches have been proposed [10,27], their main concern is the power of image understanding, without regard to the iconic character and the relatively low resolution of Web images. Most of them rely solely on one or a few image attributes. This indicates the necessity of a better image signature and similarity measure for sophisticated image information retrieval.
Zhu and Chen [30] report an aerial photograph
image retrieval system. The system supports three
main functionalities: similarity analysis, region segmentation and image categorization. The
system successfully integrates image-processing tech-
niques, such as the Gabor filter with information
analysis algorithms like the Self-Organization Map
(SOM). The system also enables users to specify their
queries by clicking on images and translates their high-
level queries into low-level features. However, they
also accept that one major drawback in their system is
the insufficiency of image features, since they can only
handle one low-level feature, texture, in their system.
Similarly, Ma and Manjunath [16] propose a system
which includes texture feature extraction, image seg-
mentation, grouping and a texture thesaurus model for
search and indexing. Novel features of the system
include a fast image segmentation scheme based on
texture edge flow and the use of a hybrid neural
network algorithm for developing the image texture
thesaurus. Similar to the work of Zhu and Chen [30],
they have proposed the Gabor texture feature extrac-
tion scheme and provided a similarity measure based
on the Kohonen map. In addition, similar to parsing
text documents using a dictionary or thesaurus, the
information within images can be classified and
indexed via the use of a texture thesaurus. The texture
thesaurus is domain dependent and can be designed to
meet the particular need of a specific image data type
by exploring the training data. Although the system relies on one and only one image feature, texture, it is an undeniable achievement of this work to have brought to light a bold new idea of using a texture thesaurus for browsing domain-specific images.
The QBIC [11] and the Photobook [21] are notable
examples of image content-based retrieval systems
which make use of several image attributes. The key
image features included in their systems are color,
texture, shape and motion of images. In addition, a
data model is proposed in order to distinguish
between scenes and objects. QBIC allows a user to compose a query based on a variety of different visual properties, such as color, shape and texture, which are semi-automatically extracted from images. It partially uses the R*-tree as an image indexing method. In a similar vein, the VisualSeek [23] is
another content-based image query system that ena-
bles querying by color regions and spatial layout.
They devised an image similarity function which
contains both color feature and spatial components.
However, the devised index structures of each image
features were not integrated into a unique content-
based indexing. The VIPER [18] is another image
retrieval system that employs both the color and
spatial information. First, they extract a set of domi-
nant colors and then the spatial information bounded
by the dominant colors is derived. Similarity is meas-
ured in terms of color and spatial information. Two
images can be classified as similar if they have a few
clusters with the same color that fall in the same image
space. Inspired by the above approaches, our system
takes a further step in unsupervised clustering of non-
textual Web document classification. The following
features are the major characteristics of our system.
First, unlike most existing content-based image retrieval (CBIR) systems, which focus solely on accurate and speedy retrieval, we concentrate on images and icons, particularly in Web documents, which are usually smaller in size. It is not difficult to imagine that humans tend to perceive or classify image objects instantly when the images are small. The main reason is that human perception relies on different levels of image resolution. Our concept of image pads, which captures a high level of abstraction supported by a set of low-level image details, enables our system to support efficient similarity search in Web applications. Second,
while there are many image retrieval systems based on low-level image features, users may perceive patterns on the same objects differently and, consequently, may have different expectations for the same queries. In our system, in addition to implementing a more generic displacement-insensitive measure for image comparison with a wide variety of low-level image features, we allow users to incorporate their own preferences for each image attribute so that they can tailor-make their own profiles of image perception. Our system supports three major functionalities: image signature extraction based on a variety of low-level image features, similarity analysis, and categorization under user preferences using a self-organization map.
3. System architecture
In this section, we present four main types of
application objects in order to provide the intelligent
classification capabilities. The four major types include image partition objects, feature extraction objects, classification objects and user interface objects, as shown in Fig. 1.

Fig. 1. System architecture of the image classifier.

Fig. 2. Pyramidal image representation with each image pad having 16 × 16 pixels.
In order to have an efficient image classifier, we first partition all the images extracted from the World Wide Web into a quad-tree image representation in our image partition objects. The quad-tree concept, which is a pyramidal image partition, has been introduced as a spatial decomposition technique, without regard to the representation of gray-scale intensities at each level. It is based on successive hierarchical subdivision of the image into quadrants. Each partitioned region of an image is further subdivided into four subregions. Mathematically speaking, a pyramidal image partition P is the set of images P = {A_1, A_2, A_3, . . ., A_n}, in each of which the image intensity may be denoted as a function with three arguments: a level designator A_i and the two spatial indices of the ith image. That is, image intensity = f(m, n, i), where m and n are spatial indices and i is the level in the pyramid P.
Note that the most obvious spatial information in the images can be found in the image pads at the upper half of the pyramid, i.e., at significantly lower resolution, while significant details are obtained at the
lower half. Using a simple averaging approach, a
sequence of 2-D arrays of varying resolution can be
created, as shown in Fig. 2. In our experiment, we partition all the input images into two levels of resolution, A1 and A2, with a total of 20 image pads, each having 16 × 16 pixels.
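As a rough illustration of this averaging step (the class and method names below are ours, not taken from the paper), each coarser level of the pyramid can be produced by averaging non-overlapping 2 × 2 blocks of the level beneath it:

/** Minimal sketch of pyramidal averaging over gray-level images (illustrative only). */
public final class PyramidSketch {

    /** Halve the resolution of a gray-level image by averaging 2 x 2 blocks. */
    static int[][] downsample(int[][] src) {
        int rows = src.length / 2, cols = src[0].length / 2;
        int[][] dst = new int[rows][cols];
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                int sum = src[2 * r][2 * c] + src[2 * r][2 * c + 1]
                        + src[2 * r + 1][2 * c] + src[2 * r + 1][2 * c + 1];
                dst[r][c] = sum / 4;            // simple average of the four pixels
            }
        }
        return dst;
    }

    public static void main(String[] args) {
        int[][] full = new int[64][64];         // a 64 x 64 target image (all zeros here)
        int[][] half = downsample(full);        // 32 x 32: coarser pyramid level
        int[][] quarter = downsample(half);     // 16 x 16: the size of one image pad
        System.out.println(quarter.length + " x " + quarter[0].length);
    }
}

Repeating the call yields the sequence of progressively lower-resolution arrays from which the 16 × 16 image pads are then cut.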
Before proceeding to the image classification ob-
jects using the Self-Organizing Map (SOM), we restrict
ourselves, among the many possible features that can
be extracted from an image, to ones that are global and
low-level image signatures. The image signatures
involve color, edge, region and texture. A novel algo-
rithm is devised and implemented in order to calculate
the similarity between two image pads. The algorithm is not based on Euclidean distance and hence provides a displacement-insensitive measure for image comparison. The similarity algorithm is applied to various color-based, edge-based and region-based image pads, as shown in Fig. 3.

Fig. 3. Image feature extraction objects.
Fig. 4. Image similarity measure between image pads.

By using the image similarity measure method embedded in the feature extraction objects, each input image pad in the partition P is compared with the corresponding image pad in the n template images, as shown in Fig. 4. The comparison is repeated for every image pad in order to obtain the image similarity measures for the color-based, edge-based and region-based images. As a result, there are n similarity values which reflect the likeness among the images in that particular image pad. Each of these image signatures reflects a different aspect of the
image. The objective of the classification objects is to
cluster the image on the basis of the image signatures
as well as the image feature ranking upon user
selection. The image feature ranking allows users to
modify the composition of the image vectors in order
to fine-tune the relative saliency in the human image
perception through the user interaction objects. By
using a self-organizing feature map (SOM) model
proposed by Kohonen [13], the image signatures
which usually have more than a few hundred dimen-
sions are then clustered in the classification objects.
The classification objects analyze the image signa-
tures and yield connection weights between layers in
the map. According to the properties of the SOM, as thoroughly discussed in Section 5, the connection weights capture the characteristics of the signatures of the input images. Since the SOM preserves topological features and compresses the input signature vectors into connection weights, which greatly reduces the size of the input, it is well suited to dimension reduction, particularly for input vectors of large dimension such as in
image classification. As a result, similar images will
form a subgroup on the map according to their
similarity. Further discussion on the detailed architec-
ture and the algorithm used in the image classifier can
be found in the following sections.
Fig. 5. Java class in calculating the HSV value in each image pad.
4. Image signatures extraction
Classification of images by color generally rests on the fact that color is a salient feature. In order to quantify the specification of color, we employ the concept of color spaces, which are defined as a formal method of representing the visual sensations of
color. A color is represented by a three-dimensional
vector corresponding to a position in a color space.
Research has been done with a number of different
color spaces [6]. The well-known color models are
hardware oriented, such as RGB (red, green, blue)
model for color monitors and a broad class of color
video cameras; the CMY (cyan, magenta, yellow)
model for color printers; and the YIQ model, which
is the standard for color TV broadcast. However, none of these models is well suited to classification and retrieval applications. User-oriented models, such as
HSV (Hue, Saturation, Value) and HSI (Hue, Satura-
tion, Intensity), are the most popular models. They
allow for the manipulation of a color’s features in the
same manner in which humans perceive color. In this
research, the HSV space which is a bijection of the
red–green–blue (RGB) space is adopted. HSV is
considered more appropriate than others since it
separates the color components (HS) from the lumi-
nance component (V) and is less sensitive to illumi-
nation changes. The hue describes the actual wave-
length of the color, such as blue versus green, while
the saturation is a measure of how pure a color is. The
hue and saturation components are intimately related
to the way in which human beings perceive color
while value/intensity component is decoupled from
the color information in the image. At the same time,
distances in HSV space correspond in a more con-
sistent manner to the perceptual differences in color
than in the RGB space. The color information ex-
tracted in the rough classification using color group-
ing exhibits more or less the overall structure of the
image. In our image feature extraction objects, the
first step in extracting the color attributes is to
calculate the RGB values of each pixel in the image
pads, and then convert the values to its HSV compo-
nents. The pseudo-code in Java class for calculating
the HSV values is shown in Fig. 5.
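The paper's own Fig. 5 gives the Java pseudo-code; as an indicative sketch only (the helper class here is ours, and it leans on the standard java.awt.Color.RGBtoHSB call rather than the authors' conversion routine), the average HSV triple of a pad might be computed as follows:

import java.awt.Color;
import java.awt.image.BufferedImage;

/** Sketch of extracting an average HSV triple from one image pad (illustrative only). */
public final class HsvSketch {

    /** Returns {avgHue, avgSaturation, avgValue}, each in [0, 1], over the given pad. */
    static float[] averageHsv(BufferedImage pad) {
        float[] hsv = new float[3];
        double h = 0, s = 0, v = 0;
        int n = pad.getWidth() * pad.getHeight();
        for (int y = 0; y < pad.getHeight(); y++) {
            for (int x = 0; x < pad.getWidth(); x++) {
                int rgb = pad.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                Color.RGBtoHSB(r, g, b, hsv);   // JDK helper: HSB is the same model as HSV
                h += hsv[0];                     // note: a plain mean of hue ignores its circular nature
                s += hsv[1];
                v += hsv[2];
            }
        }
        return new float[] { (float) (h / n), (float) (s / n), (float) (v / n) };
    }

    public static void main(String[] args) {
        BufferedImage pad = new BufferedImage(16, 16, BufferedImage.TYPE_INT_RGB);
        float[] avg = averageHsv(pad);
        System.out.printf("H=%.3f S=%.3f V=%.3f%n", avg[0], avg[1], avg[2]);
    }
}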
Fig. 6. Java class in calculating the DCT coefficients in image pads.

Table 1
Algorithm of image similarity measure

INPUT: Image A and Image B gray-level matrices
DETERMINE: size of window, N
COMPUTE: difference matrix g_B based on A and B
SEARCH: the intensity center (μ_r, μ_s) as described in Eqs. (1) and (2)
INITIALIZE: sum ← 0
DO WHILE pixel (i, j) is within the window
  COMPUTE
    $$f(i, j) = \frac{g_B(i, j)}{1 + k\,\{k^2 (i - \mu_r)^2 + k^2 (j - \mu_s)^2 + g_B^2(i, j)\}^{1/2}}$$
  CALCULATE sum ← sum + f^2(i, j)
ENDDO
OUTPUT
  $$ISM(A, B) = \frac{1}{N}\sqrt{sum}$$
Other than the color information encoded in the
image, Discrete Cosine Transform (DCT) coefficients
are also extracted from each image pad in order to
compare the texture similarity of images. Loosely
speaking, DCT reflects the overall structure of the
images and provides details regarding their spatial
frequency content. The low frequency components
can characterize the coarseness of the image. This is also the reason why only the low-frequency components of the image signal are emphasized in our DCT coefficient extraction, while high-frequency image information, such as edges and lines, is tackled in the later edge extraction process. Our DCT method starts by breaking the images into 16 × 16 or 32 × 32 pixel blocks, where all the pixel blocks are
analyzed separately in the feature extraction objects.
After the transformation, each block is compressed into 4 bytes. Since the DCT is a lossy operation, the resulting output no longer carries exactly the same information as before the transformation. The larger the coefficients, the more alike the transformed image is to the original. As a result, the coefficients reflect the image texture. As described in Fig. 6, the DCT coefficients are calculated for all the pixels in each image pad; the detailed pseudo-code for this texture analysis using the DCT is shown in that figure.
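The following is a hedged sketch of such a texture extraction step, not the authors' Fig. 6 code: it computes a plain 2-D DCT-II of one pad, from which only a few low-frequency coefficients would be kept as the texture signature.

/** Sketch of a direct 2-D DCT-II over one gray-level image pad (illustrative only). */
public final class DctSketch {

    /** Returns the full N x N matrix of DCT-II coefficients of the pad. */
    static double[][] dct2(double[][] pad) {
        int n = pad.length;
        double[][] coeff = new double[n][n];
        for (int u = 0; u < n; u++) {
            for (int v = 0; v < n; v++) {
                double sum = 0.0;
                for (int x = 0; x < n; x++) {
                    for (int y = 0; y < n; y++) {
                        sum += pad[x][y]
                             * Math.cos(((2 * x + 1) * u * Math.PI) / (2.0 * n))
                             * Math.cos(((2 * y + 1) * v * Math.PI) / (2.0 * n));
                    }
                }
                double cu = (u == 0) ? Math.sqrt(1.0 / n) : Math.sqrt(2.0 / n);
                double cv = (v == 0) ? Math.sqrt(1.0 / n) : Math.sqrt(2.0 / n);
                coeff[u][v] = cu * cv * sum;     // orthonormal DCT-II scaling
            }
        }
        return coeff;
    }

    public static void main(String[] args) {
        double[][] pad = new double[16][16];     // 16 x 16 pad of gray levels
        double[][] c = dct2(pad);
        // in practice only a handful of low-frequency coefficients would be retained
        System.out.println("DC term: " + c[0][0]);
    }
}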
In addition, most image information used to
describe an image is closely related to the similarity
measure. There is a lot of on-going work in developing
new measures for comparing images for various
objectives in image retrieval and evaluation [29].
However, most of them rely on the Euclidean distance
between two sets of object pixels as a means of
evaluating the numerical difference between two
images. Unlike other Euclidean distance measures in image comparison, we employ an Image Similarity Measure (ISM), which makes use of the intensity center of an image as the key parameter for computing
the similarity values [7]. The algorithm provides a
displacement insensitive measure in image compari-
son. In order to calculate the ISM, an intensity center
μ = (μ_r, μ_s) of an image is first defined, such that

$$\min\left(\sum_{i=1}^{\mu_r}\sum_{j=1}^{m} g_B(i, j) \;-\; \sum_{i=\mu_r}^{m}\sum_{j=1}^{m} g_B(i, j)\right) \qquad (1)$$

$$\min\left(\sum_{i=1}^{m}\sum_{j=1}^{\mu_s} g_B(i, j) \;-\; \sum_{i=1}^{m}\sum_{j=\mu_s}^{m} g_B(i, j)\right) \qquad (2)$$

where g_B(i, j) = |g_a(i, j) − g_b(i, j)| is the absolute difference between the two images a and b, calculated
pixel by pixel. The algorithm takes two image matrices
and calculates the similarity between two image matri-
ces as shown in Table 1. The input to the algorithm is
the two images that are being compared. The differ-
ence image matrix is calculated for any pixel in image
A relative to the same pixel in image B. Clustering is
performed, based on the difference matrix, by compar-
ing the surrounding pixels within a predefined window
with the intensity center (μ_r, μ_s) in order to get the minimum distance between the compared pixels. This process is repeated until all the pixels of the difference image within the predefined window have been processed. From the average minimal pixel distance, the final dissimilarity for the whole image is calculated.
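Reading Table 1 together with Eqs. (1) and (2), a minimal sketch of the ISM computation might look like the code below. The window size, the constant k and all helper names are our assumptions, and the formula used for f(i, j) is our best reading of the garbled original in Table 1.

/** Sketch of the Image Similarity Measure (ISM) of Table 1 and Eqs. (1)-(2) (illustrative only). */
public final class IsmSketch {

    /** Eqs. (1)/(2): index that best balances the summed differences on either side of it. */
    static int intensityCenter(double[] marginalSums) {
        double total = 0.0;
        for (double s : marginalSums) total += s;
        int best = 0;
        double bestGap = Double.MAX_VALUE;
        double left = 0.0;
        for (int idx = 0; idx < marginalSums.length; idx++) {
            left += marginalSums[idx];
            // both partial sums include the row/column at idx, as in the printed equations
            double gap = Math.abs(left - (total - left + marginalSums[idx]));
            if (gap < bestGap) { bestGap = gap; best = idx; }
        }
        return best;
    }

    /** ISM(A, B) over a window of side n centred at the intensity centre; k is a tuning constant. */
    static double ism(double[][] a, double[][] b, int n, double k) {
        int m = a.length;
        double[][] diff = new double[m][m];
        double[] rowSums = new double[m];
        double[] colSums = new double[m];
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < m; j++) {
                diff[i][j] = Math.abs(a[i][j] - b[i][j]);    // g_B(i, j)
                rowSums[i] += diff[i][j];
                colSums[j] += diff[i][j];
            }
        }
        int mr = intensityCenter(rowSums);                   // mu_r from Eq. (1)
        int ms = intensityCenter(colSums);                   // mu_s from Eq. (2)
        double sum = 0.0;
        int half = n / 2;
        for (int i = Math.max(0, mr - half); i <= Math.min(m - 1, mr + half); i++) {
            for (int j = Math.max(0, ms - half); j <= Math.min(m - 1, ms + half); j++) {
                double g = diff[i][j];
                double denom = 1.0 + k * Math.sqrt(k * k * (i - mr) * (i - mr)
                                                 + k * k * (j - ms) * (j - ms) + g * g);
                double f = g / denom;                        // f(i, j) from Table 1
                sum += f * f;
            }
        }
        return Math.sqrt(sum) / n;                           // ISM(A, B) = sqrt(sum) / N
    }

    public static void main(String[] args) {
        double[][] a = new double[16][16];
        double[][] b = new double[16][16];
        b[8][8] = 200;                                       // the two pads differ in a single pixel
        System.out.println("ISM(a, b) = " + ism(a, b, 16, 0.5));
        System.out.println("ISM(a, a) = " + ism(a, a, 16, 0.5));  // identical pads give 0
    }
}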
In our simulation, two different types of gray-scale
images are used to compare the similarity measures
with a set of predefined image templates. Both edge
images and intensity-based region images are
involved. Edges provide useful information about an
image. Edge detection methods are used as a first step
to identify complex object boundaries by marking
potential edge points corresponding to places in an
image where rapid changes in brightness occur. An
edge detector is a high-pass filter which outputs an
edge image with mainly high frequency information
using differential operators. Differential operators
measure the rate of change in a function, in this case,
the image brightness function. A large change in
image brightness over a short spatial distance indi-
cates the presence of an edge. Convolution masks
have been used with considerable success for the task
of differential operation and is one of the classical
image processing algorithms commonly used for
identifying image features. A convolution mask is a
discrete approximation to a two-dimensional convo-
lution integral. The convolution operation replaces a
pixel’s value with the sum of that pixels value and its
neighbors, each weighted by a factor. The weighting
factors are called the convolution kernel. To convolve
an image area, a sliding kernel matrix operates over
each row of pixels in the matrix image. At each point,
the kernel values multiply the image values under it,
sum the result, and replace the pixel at the center of
the kernel with the value. In this research, we employ
the Sobel kernel, which looks for edges in both
horizontal and vertical directions, and then combine
this information into a single metric. The masks are
shown as follows:
�1 �2 �1
0 0 0
1 2 1
266664
377775
Row Mask
�1 0 1
�2 0 2
�1 0 1
266664
377775
Column Mask
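A small sketch of how these two masks might be applied is given below (our own helper, not the paper's code); it convolves both masks at each interior pixel, combines them into a gradient magnitude, and thresholds the result, matching the Sobel-plus-threshold procedure described next.

/** Sketch of Sobel edge detection with the row and column masks above (illustrative only). */
public final class SobelSketch {

    static final int[][] ROW_MASK = { { -1, -2, -1 }, { 0, 0, 0 }, { 1, 2, 1 } };
    static final int[][] COL_MASK = { { -1, 0, 1 }, { -2, 0, 2 }, { -1, 0, 1 } };

    /** Returns a binary edge image: 1 where the gradient magnitude exceeds the threshold. */
    static int[][] sobelEdges(int[][] gray, double threshold) {
        int h = gray.length, w = gray[0].length;
        int[][] edges = new int[h][w];
        for (int y = 1; y < h - 1; y++) {
            for (int x = 1; x < w - 1; x++) {
                int gx = 0, gy = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int p = gray[y + dy][x + dx];
                        gy += ROW_MASK[dy + 1][dx + 1] * p;   // response in the vertical direction
                        gx += COL_MASK[dy + 1][dx + 1] * p;   // response in the horizontal direction
                    }
                }
                double magnitude = Math.sqrt((double) gx * gx + (double) gy * gy);
                edges[y][x] = magnitude > threshold ? 1 : 0;  // threshold to a binary edge map
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        int[][] img = new int[16][16];
        for (int y = 8; y < 16; y++) java.util.Arrays.fill(img[y], 255);  // bottom half bright
        int[][] e = sobelEdges(img, 128);
        System.out.println("edge at (8,8)? " + e[8][8]);
    }
}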
In the system, the Sobel filter followed by a
threshold operation is applied to generate the edge
images. The Sobel edge detector yields the magnitude and direction of edges by applying the row and column masks. On the other hand, as opposed to edge detection using convolution kernels, we apply regional approaches, which attempt to segment an
image into regions according to regional image data
similarity (or dissimilarity). For example, images con-
taining light objects on a dark background or dark
objects on a light background can be segmented by
means of a simple threshold operation. The following
relationship exists between the input image f(m, n)
and the output image g(m, n):
$$g(m, n) = \begin{cases} I_1, & 0 \le f(m, n) < S \\ I_2, & S \le f(m, n) \le f_{\max} \end{cases} \qquad (3)$$

where I_1 and I_2 are two arbitrary values with I_1 ≠ I_2 (usually I_1 = 0 and I_2 = 1 are selected), and S is the intensity threshold to be used. By choosing a suitable S, pixels with the value I_1 in the output image represent the objects and those with the value I_2 represent the background.
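A one-to-one rendering of Eq. (3) in Java might look like the following short sketch (the method and variable names are ours; I1, I2 and the threshold S are the free parameters):

/** Sketch of the simple threshold segmentation of Eq. (3) (illustrative only). */
public final class ThresholdSketch {

    /** Maps pixels below the threshold S to i1 and the rest to i2, as in Eq. (3). */
    static int[][] threshold(int[][] f, int s, int i1, int i2) {
        int[][] g = new int[f.length][f[0].length];
        for (int m = 0; m < f.length; m++) {
            for (int n = 0; n < f[0].length; n++) {
                g[m][n] = (f[m][n] < s) ? i1 : i2;    // 0 <= f < S -> I1, S <= f <= fmax -> I2
            }
        }
        return g;
    }

    public static void main(String[] args) {
        int[][] image = { { 10, 200 }, { 90, 130 } };
        int[][] segmented = threshold(image, 128, 0, 1);   // usual choice I1 = 0, I2 = 1
        System.out.println(java.util.Arrays.deepToString(segmented));
    }
}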
Moreover, we also employ a region growing technique, called pixel aggregation, to identify the possible regions in the gray-scale images.
Pixel aggregation starts with a set of seed points.
Regions are grown by appending those similar neigh-
boring pixels to the seeds. Seeds are selected from the
gray level histogram with the most predominant
values. Although typical region analysis must be
carried out with a set of descriptors based on intensity
and spatial properties, such as moments, only inten-
sity-based regions are used in our simulations to
compare with the existing templates in order to devise
the region-based similarity measures. In sum, Table 2 details the possible attributes extracted in the formation of our heterogeneous image signatures.

Table 2
Possible image attributes in the heterogeneous image signature database

Image attributes    Property values
Color               Hue, saturation, value
Texture             DCT coefficients
Edge                Similarity with templates
Region              Similarity with templates

It is not difficult to see that the dimension of our image signature vectors can be very large, beyond the capacity of most current clustering algorithms. In the next section, we discuss how the vectors can be classified using a neural network model.
Fig. 7. Self-organization map in which there is one and only one
layer of neurons and all inputs are connected to all nodes in the map.
5. Classification using self-organization map
(SOM)
It has been postulated that the human brain uses spatial mapping to model complex data structures internally.
Much of the cerebral cortex is arranged as a two-
dimensional plane of interconnected neurons but it is
able to deal with concepts in much higher dimensions.
The characterization of topological feature-preserving
maps has received special attention in the literature
[14,15,24]. In particular, Amari [1] studied a contin-
uous-time dynamical version of this map extensively
to investigate the topological relation between the
self-organizing map and the input space governed
by the density p(x), the resolution and stability of
the map and the convergence speed.
The self-organization map, developed by Kohonen
[13], is an unsupervised two-layer neural network
used for classification and dimensional reduction. It
is a biologically motivated method for constructing a
structured representation of data from an often high-
dimensional input space. The goal of the SOM is to
create a mapping between the input stimulus space
and the output space. It is well-known for its unsu-
pervised learning, organizing and visualizing informa-
tion. An advantage of SOM over other clustering
algorithms is its ability to visualize high-dimensional
data using a two-dimensional grid while preserving
similarity between data points as much as possible.
SOM performs data compression, from multidimen-
sional data to a much lower-dimensional space, on the
vectors to be stored in the network using a technique
known as vector quantization. Vector quantization is
the most important technique in competitive learning.
The main idea is to categorize or distribute a given set
of input vectors into classes, and then represent any
vector just by the class into which it falls. The algo-
rithm employs a set of neurons which are arranged in
a network of a certain dimensionality. The imple-
mentations of Kohonen’s algorithm are predomi-
nantly in a two-dimensional plane which is shown
in Fig. 7.
The network shown is a one-layer, two-dimen-
sional Kohonen network. The most obvious point to
note is that the neurons are not arranged in layers as in
a backpropagation network, but rather on a flat grid.
All inputs connect to every node in the network.
Feedback is restricted to lateral interconnections to
immediate neighbouring nodes. The learning algo-
rithm organizes the nodes in the grid into local
neighbourhoods that act as feature classifiers on the
input data. The topographic map is autonomously
organized by a cyclic process of comparing input
patterns to vectors stored at each node. No training
response is specified for any training input.
In our experiment, each image is characterized by
the 20 image pads, having the attributes as shown in
Table 2. Each of them forms a signature vector and is stored in the heterogeneous image signature database. In order to classify the images according to their categories, the vectors are then fed forward to a self-organization map (SOM), as shown in Fig. 7.
The details of the training algorithm are shown in
Table 3.
SOM has two properties found useful in our
classification objects. First, it quantises the space like other vector quantisation methods and thus constitutes an image classifier. The topology-preserving
property of SOM, coupled with our extracted image
attributes, constitutes a powerful and human stimu-
lated image classifier. Second, our attribute-rich and
huge dimension image signature vector can be visual-
Table 3
Learning algorithm of the self-organization neural network
(1) Initialize network
Define wij(t) to be the weight from input i to node j at time t.
Initialize weights from the n inputs to the nodes, as shown in
Fig. 7 with small random values. Set the initial radius of the
neighborhood around node j, Nj(0), to be large.
(2) Present input
Present input x_0(t), x_1(t), . . ., x_{n−1}(t), where x_i(t) is the input to node i at time t.
(3) Calculate distances
Compute the distance dj between the input and each output node
j, given by
$$d_j^2 = \sum_{i=0}^{n-1} \left( x_i(t) - w_{ij}(t) \right)^2$$
(4) Select minimum distance
Designate the output node with minimum dj to be j*
(5) Update weights
Update weights for node j* and its neighbours, defined by the
neighbourhood function N_{j*}(t). New weights are w_{ij}(t + 1) = w_{ij}(t) + g(t)(x_i(t) − w_{ij}(t)) for all j in N_{j*}(t) and 0 ≤ i ≤ n − 1.
The term g(t) is a gain (0 < g(t) < 1) that decreases in time, thus,
slowing the weight adaptation. The neighbourhood Nj*(t)
decreases in size as time goes on, thus, localizing the area of
weight change.
(6) Repeat by going to step (2) for a number of iterations.
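A compact sketch of the learning loop in Table 3 is given below. The grid size, the gain schedule g(t) and the shrinking square neighbourhood are placeholder choices for illustration only, not the settings used in the paper's experiments.

import java.util.Random;

/** Sketch of the SOM learning loop of Table 3 on a square grid (illustrative only). */
public final class SomSketch {

    final int gridSide, inputDim;
    final double[][][] w;                        // w[row][col][i]: weight from input i to node (row, col)
    final Random rnd = new Random(42);

    SomSketch(int gridSide, int inputDim) {
        this.gridSide = gridSide;
        this.inputDim = inputDim;
        this.w = new double[gridSide][gridSide][inputDim];
        for (double[][] row : w)                 // step (1): small random initial weights
            for (double[] node : row)
                for (int i = 0; i < inputDim; i++) node[i] = rnd.nextDouble() * 0.1;
    }

    void train(double[][] data, int epochs) {
        for (int t = 0; t < epochs; t++) {
            double gain = 0.5 * (1.0 - (double) t / epochs);             // g(t) decreases in time
            int radius = (int) Math.round((gridSide / 2.0) * (1.0 - (double) t / epochs));
            for (double[] x : data) {                                    // step (2): present input
                int bestR = 0, bestC = 0;
                double bestD = Double.MAX_VALUE;
                for (int r = 0; r < gridSide; r++) {                     // steps (3)-(4): winner search
                    for (int c = 0; c < gridSide; c++) {
                        double d = 0;
                        for (int i = 0; i < inputDim; i++) {
                            double diff = x[i] - w[r][c][i];
                            d += diff * diff;                            // d_j^2 = sum (x_i - w_ij)^2
                        }
                        if (d < bestD) { bestD = d; bestR = r; bestC = c; }
                    }
                }
                for (int r = Math.max(0, bestR - radius); r <= Math.min(gridSide - 1, bestR + radius); r++)
                    for (int c = Math.max(0, bestC - radius); c <= Math.min(gridSide - 1, bestC + radius); c++)
                        for (int i = 0; i < inputDim; i++)               // step (5): update neighbourhood
                            w[r][c][i] += gain * (x[i] - w[r][c][i]);
            }                                                            // step (6): repeat over epochs
        }
    }

    public static void main(String[] args) {
        SomSketch som = new SomSketch(8, 4);                             // 8 x 8 grid, 4-dimensional input
        double[][] data = { { 0, 0, 0, 0 }, { 1, 1, 1, 1 }, { 1, 0, 1, 0 } };
        som.train(data, 100);
        System.out.println("trained weight w[0][0][0] = " + som.w[0][0][0]);
    }
}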
Fig. 8. State diagram of the image classifier.
6. Experiments and evaluation
To study the performance of the above-described
image classifier, we have implemented a system
having a hundred target images. Each target image downloaded from the WWW is 64 × 64 pixels in size with a maximum gray level of 256, and all are in GIF format.
In our experiment, one target image is selected from
each domain as an image template. While the target
images and templates are stored in Image Graphical
Database, all the image signature vectors generated
will be accumulated in the Image Signature Vector
Repository. Fig. 8 shows the UML state diagram of
the image classifier.
As shown in Fig. 8, each 64 × 64 target image is first partitioned into four A1 image pads, which are then further subdivided into sixteen A2 image pads.
All image primitives of each image pad are extracted
through different methods in various activities. The
activity Calculate Color identifies the average HSV values, the activity Calculate DCT computes the DCT coefficients, while the activity Calculate IS Measure compares the similarity measures between the target image and the templates. The activity Image Convolution is responsible for processing both the vertical and horizontal convolutions of each target image before
determining the similarity. To avoid any bias between different image signatures, normalization is performed before the training in the SOM, in such a way that all values of the image primitives lie in [0, 1].
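A tiny sketch of this normalization step is shown below; the paper does not spell out whether the scaling is done per image primitive across the collection or per vector, so the sketch assumes per-primitive (column-wise) min-max scaling, and the names are ours.

/** Sketch of min-max normalization of image primitives to [0, 1], per feature (illustrative only). */
public final class NormalizeSketch {

    /** Rescales each column (image primitive) of the signature matrix into [0, 1]. */
    static double[][] toUnitRange(double[][] signatures) {
        int n = signatures.length, d = signatures[0].length;
        double[][] out = new double[n][d];
        for (int j = 0; j < d; j++) {
            double min = Double.MAX_VALUE, max = -Double.MAX_VALUE;
            for (double[] row : signatures) { min = Math.min(min, row[j]); max = Math.max(max, row[j]); }
            double span = max - min;
            for (int i = 0; i < n; i++) {
                out[i][j] = (span == 0) ? 0.0 : (signatures[i][j] - min) / span;  // now within [0, 1]
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] sigs = { { 12.0, 0.3 }, { 255.0, 0.9 }, { 64.0, 0.0 } };
        System.out.println(java.util.Arrays.deepToString(toUnitRange(sigs)));
    }
}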
The normalized image vectors are accumulated in our Image Signature Vector Repository, as shown in Fig. 8. Before generating the
topological classification in the SOM, other training
parameters, such as the number of epochs and the
dimension of grid layer in the SOM, have to be
defined. These parameters are adjusted in order to
locate the optimal result. A CSV format file is then
created as an output which is then subject to further
comparison and evaluation.
Different users may perceive image patterns on the
same objects differently and, consequently, may cat-
egorize the same objects into different groups. In our
system, in addition to implementing a more generic
displacement insensitive measure in image compar-
ison with a wide variety of low level image features,
we allow users to incorporate their own preferences for each image attribute so that they can tailor-make their
own profiles in image perception. We provide an
interactive user interface to capture the user prefer-
ences on different image features. Fig. 9 shows the
user interface of the system. Basically, the interface is
divided into three major components: Image browser,
File list panel and Control panel. Users can view all
images through the browser. At the same time, it
allows users to add and remove images in the file list
panel. The amendment will be reflected instantly in
the browser. More importantly, through the control
panel, users can adjust different weights for different
image primitives in the combo boxes as shown in Fig.
9. This is one step towards simulating the performance of image classification in a human-like manner, understanding how users judge similarity within a domain, and then building a system that duplicates human performance. These
high-level user preferences with a set of image tem-
plates encapsulated with a wide range of low level
image features are more usefully viewed as complementary in our domain of nontextual Web document classification.

Fig. 9. User interface of the unsupervised image classifier.
In our experiment, a hundred images from seven categories are retrieved from the Web. The catego-
ries include fruit, scenery, mobile phone, Clinton,
flower, airplane and cartoon as shown in Appendix
A. Images P01, P23, P35, P56, P61, P79 and P93 are
selected as the image templates in our experiment. In
the training of SOM, the weight factors for color and
image similarity are set to three and one, respectively, while the others are set to zero at the low-resolution level. However, the weight factor for color is set to three and the rest are set to one at the high-resolution level, i.e., in the 42 A2 image pads. Without any bias, the contributions from the color and noncolor attributes are equal. The SOM, with 1600 neurons organized on a
two-dimensional lattice with a 40 × 40 grid, is used for the
training. The number of neurons used here is empiri-
cally determined by considering the number of similar
clusters in the training sets. After the adaptation phase
in the training of the SOM in 5750 epochs, the
connection weights are fixed. Each 636-dimension
image signature vector is presented to the network
and the winning unit is labeled with the code of the
corresponding image. After the presentation of all the
100 image signature vectors with 2000 subimages, a
SOM is obtained. The map reflects a certain order
which is a representative of the similarities and the
differences between the images. As shown in Fig. 10,
the images are clustered in a way that emphasizes the
grouping between images. The SOM captures the
family relationships among the images which are
likely to be grouped together under the same category.
Fig. 10. Image clusters formed after the training in SOM on a 40 × 40 grid.
In the image map, it is not difficult to find out that
the images with light background, such as the cate-
gories of mobile phone and cartoon are grouped
together while the opposite regions correspond to
images with dark backgrounds. Due to the different
background in the category fruit, they are further split
into two parts located at the upper left-hand and lower
right-hand side in the map. The subdivision is mainly
due to different texture of the background in these
images. The complexity of scenery-categorized image
signatures makes them fall at different regions in the map. Generally speaking, the images
with brighter background are located at the upper
portion of the map with further subdivision based on
their categories. Fig. 11 shows another map generated
in which color attribute HSV of the 42 A2 image pads
is reduced. In the training, the weight factor of the HSV is set to one, with the others unchanged. The vectors become stable after 6500 epochs on a 40 × 40 grid.
Similarly, images with bright background P33–P48
and P85–P95 are clustered together at the lower right
side of the map. However, the map cannot distinguish the categories fruit, scenery and flower as well as in Fig. 10. Color seems to be the most dominant attribute in
clustering WWW images, which are iconic and have a relatively low resolution.

Fig. 11. Clusters formed in SOM while the weights of the color attribute are reduced.
The human perception of images is subjective and,
in fact, different persons may have different percep-
tion criteria. While it is recognized that images are
best described through color, lines and shapes of
objects in the scene, one of the main objectives in
this paper is to argue that integration of these different
image features via a self-organization map enhances
the human perception in image classification. In order
to justify the performance of our system in utilizing a
wide range of image features, we repeat the experiment for every single image feature. Since line seg-
ment, region fragment and color are the fundamental
aspects of human perception, we compare the SOMs
generated from all these three image features individ-
ually. Fig. 12 shows the SOM generated from a single
input image attribute of line segment. The images
being considered in the same categories under human
perception are grouped under the same clusters. This
classification is done manually in our laboratory. The 100 images are grouped into 43 image clusters, each having one to seven images. It is worth mentioning that more than half of the clusters have only one image. While it is possible in SOM that the same set
of image vectors might come up with slightly different
clusters, our experiment indicates that the classifica-
tion using a single image feature from line segments is
unsatisfactory compared with the result we obtained in Fig. 10.

Fig. 12. Clusters generated in SOM with a sole input image attribute from line segment.
Figs. 13 and 14 illustrate two further experiments which measure the contributions from region fragments and color attributes, respectively. There are 35
and 28 image clusters in Figs. 13 and 14, respec-
tively. While the results look more acceptable than
the result we obtained in Fig. 12, the percentage of
single-image clusters remains high. Images under the
categories mobile phone, fruit and flower are scat-
tered around in the map, even though they all have a
clear background and foreground colors as shown in
Fig. 13. As compared with the SOM generated solely
from line segments and region fragments, the SOM
from color attribute provides a more reasonable re-
sult, with at least the entire group of images under the
category fruit located at the second quadrant of the
map as shown in Fig. 14. Similarly, the images under
the category airplane can also be found situated at the
fourth quadrant of the map. However, while color is
believed to be the most dominant image attribute of all, our approach using more than one image feature is superior in image classification. Fig. 10 shows that images with similar back-
ground, in terms of color, region fragments and line
segments, are grouped together. The clusters gener-
ated are more concise with fewer singletons as
compared with Fig. 14. This comparison shows our
content-sensitive, attribute-rich image classifier pro-
vides an efficient classification, particularly in Web-
based applications. In addition to classifying visually
similar images, a potential advantage of this approach
is that it provides an efficient indexing mechanism to
narrow down the search space through the weights trained in the SOM.

Fig. 13. Clusters generated in SOM with a sole input image attribute from region fragments.

Fig. 14. Clusters generated in SOM with a sole input image attribute from color.
On the other hand, in order to evaluate the classification ability of the self-organization map, we compare the result obtained
from the SOM with a classical Hierarchical Agglom-
erative Clustering (HAC) algorithm. HAC is an iter-
ative procedure in which clusters are merged into one
bigger cluster according to a distance function. Two
clusters with least discrepancy are first identified and
merged together.
In our application, we start with some trivial
clusters in which each cluster represents a single
image. The two closest clusters according to the distance function are first identified. They are merged
into a single cluster. The process repeats until the
desired number of clusters is reached. In fact, there
are many different measures for combining the
clusters. The one that we employed in the experi-
ment is group-average clustering in which the com-
bined dissimilarity of two clusters is the average of
all dissimilarities between members of each cluster
[22,28]. This function measures the cluster dissim-
ilarity using a weighted squared distance between the
two cluster centers μ_i and μ_j:

$$\frac{2 n_i n_j}{n_i + n_j}\,\left\| \mu_i - \mu_j \right\|^2 \qquad (4)$$
where n_i and n_j represent the cardinalities of clusters i and j.
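As a brief illustration of Eq. (4) (our helper names; the surrounding HAC merge loop is omitted), the merge cost between two clusters of signature vectors could be computed as follows:

import java.util.List;

/** Sketch of the weighted squared distance of Eq. (4) between two clusters (illustrative only). */
public final class HacSketch {

    static double[] centroid(List<double[]> cluster) {
        double[] mu = new double[cluster.get(0).length];
        for (double[] v : cluster)
            for (int i = 0; i < mu.length; i++) mu[i] += v[i] / cluster.size();
        return mu;
    }

    /** Eq. (4): 2 * ni * nj / (ni + nj) * || mu_i - mu_j ||^2. */
    static double mergeCost(List<double[]> ci, List<double[]> cj) {
        double[] mi = centroid(ci), mj = centroid(cj);
        double sq = 0;
        for (int k = 0; k < mi.length; k++) {
            double d = mi[k] - mj[k];
            sq += d * d;
        }
        int ni = ci.size(), nj = cj.size();
        return 2.0 * ni * nj / (ni + nj) * sq;
    }

    public static void main(String[] args) {
        List<double[]> a = List.of(new double[] { 0, 0 }, new double[] { 1, 0 });
        List<double[]> b = List.of(new double[] { 4, 0 });
        System.out.println("merge cost = " + mergeCost(a, b));  // the HAC loop would merge the cheapest pair
    }
}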
Fig. 15 shows the dendrogram produced by the HAC algorithm using the between-groups linkage cluster method. In the HAC dendrogram, the images
in categories mobile and cartoon are grouped
together at the upper branch. The main reason
is the bright background in these images. It is also worth noting that the images of Clinton P49,
P50, P51, P52 and P56 are clustered together. How-
ever, images under the category scenery are scattered
into different branches of the tree. While images P17
to P20 having similar image signatures are clustered
together, there is a lot of misclassification using the HAC algorithm. The same phenomenon occurs for the images under the flower category. The situation becomes even worse if the same distance measure is used with another cluster method in the HAC, as shown in Fig. 16.

Fig. 15. Dendrogram using between-groups linkage cluster method.

Fig. 16. Dendrogram using within-groups linkage cluster method.
When the two results obtained in SOM and HAC
are compared, it can be observed that the images
under categories mobile phone and cartoon cannot be
distinguished sharply by either algorithm. When the images in the fruit category are examined, we find that both algorithms separate the images into two
major subgroups by differentiating their image back-
ground. Interestingly, they all show the same cluster-
ing of Clinton images. However, the classification
result in SOM seems to be much better than the one
in HAC. While the HAC is a linear method in which
relative distances between clusters reflect some dis-
similarity among images, the Kohonen algorithm
realizes a projection on a surface spanned by the
network topology. This is completely defined by the
relation of order between the neurons: a simple one-
dimensional relation or a bidimensional one. The
method of self-organizing maps is an original dimen-
sion reduction method. It has no real statistical analog
and, thus, can be very useful for specific applications
where linear methods fail. From a computational
point of view, HAC is a global operation while the
Kohonen algorithm is a sequential process which does not require any global information. If any local modification occurs, the global order between the data will reappear in the map after a small number of iterations.
7. Conclusion
With the proliferation of multimedia computing
into the ubiquitous computers, image data will cer-
tainly be another important medium in future comput-
ing. Though there has been considerable research in
the area, the investigations have been carried out in
restricted domains and addressed only certain aspects of content-based information retrieval. In this
research, we have presented an approach to classify
images based on a self-organization map. The agent
integrates different image primitives into the classi-
fication procedure without demanding any particular
primitive to be dominant. We have evaluated the
approach using a hundred images. The implemented system and the results generated provide discriminative clusters. Evaluation shows that similar images
will fall onto the same region, in such a way that it is
possible to retrieve images under family relationships.
This approach performs particularly well in the clas-
sification for nontextual WWW documents, which
inherently have high-dimensional input spaces.
Acknowledgements
The work described in this paper was fully
supported by a grant from the Research Grants
Council of the Hong Kong Special Administrative
Region, China (Project No: CUHK4171/01E) and a
direct research grant of the Chinese University of
Hong Kong.
Appendix A
References
[1] S.-I. Amari, Field theory of self-organizing neural nets, IEEE
Transactions on Systems, Man and Cybernetics SMC-13
(1983) 741–748.
[2] S.W.K. Chan, Using heterogeneous linguistic knowledge in
local coherence identification for information retrieval, Jour-
nal of Information Science 26 (5) (2000) 313–328.
[3] S.W.K. Chan, K.S. Leung, W.S.F. Wong, An expert system for
the detection of cervical cancer cells using knowledge-based
image analyzer, Artificial Intelligence in Medicine 8 (1996)
67–90 (Elsevier Science).
[4] S.W.K. Chan, K.S. Leung, W.S.F. Wong, Object-oriented
knowledge based system for image diagnosis, Applied Artifi-
cial Intelligence 10 (1996) 407–438 (Taylor & Francis).
[5] S.W.K. Chan, J.L.L. Yeung, M.W.C. Chong, An image clas-
sification agent using hierarchical overlapped self-organiza-
tion map, Proceedings of the Symposium on Computational
Intelligence, ICSC Congress on Intelligent Systems and Ap-
plications (ISA’2000), University of Wollongong, Australia,
2000.
[6] E.J. Charles, A. Finkelstein, D.H. Salesin, Fast multi-resolu-
tion image querying, Proceedings of SIGGRAPH 95 (1995)
277–286.
[7] K.J. Cios, I. Shin, Image recognition neural network: IRNN,
Neurocomputing 7 (1995) 159–185.
[8] J.M. Corridoni, A. Del Bimbo, P. Pala, Image retrieval by
color semantics, Multimedia Systems 7 (1999) 175–183.
[9] G. Cortelazzo, G.A. Mian, G. Vezzi, P. Zamperoni, Trademark
shapes description by string matching techniques, Pattern Rec-
ognition 27 (8) (1994) 1005–1018.
[10] M. Das, R. Manmatha, E.M. Riseman, Indexing flower patent
images using domain knowledge, IEEE Intelligent Systems 14
(5) (1999) 24–33.
[11] M. Flickner, H. Sawhney, W. Niblack, J. Ashley, Query by
image and video content: the QBIC system, IEEE Computer
28 (9) (1995) 23–33.
[12] T. Honkela, S. Kaski, T. Kohonen, K. Lagus, Self-organizing
maps of very large document collections: justification for the
WEBSOM method, Proceedings of the 21st Annual Confer-
ence of the Gesellschaft fur Klassifikation e.V, (Classification,
Data Analysis, and Data Highways), Springer-Verlag, Berlin,
Germany, 1998, pp. 245–252.
[13] T. Kohonen, The self-organization map, Proceedings of the
IEEE 78 (1990) 1464–1480.
[14] T. Kohonen, Physiological interpretation of the self-organizing
map algorithm, Neural Networks 6 (7) (1993) 895–905.
[15] Z.-P. Lo, Y. Yu, B. Bavarian, Analysis of the convergence
properties of topology preserving neural networks, IEEE
Transactions on Neural Networks 4 (2) (1993) 207–220.
[16] W.-Y. Ma, B.S. Manjunath, A texture thesaurus for browsing
large aerial photographs, Journal of the American Society for
Information Science 49 (7) (1998) 633–648.
[17] W. Niblack, R. Barber, W. Equitz, M. Flickner, E. Glasman,
D. Petkovic, P. Yanker, C. Faloutsos, G. Taubin, QBIC pro-
ject: querying images by content using color, texture and
shape, Proceedings of SPIE 1908 (1993) 173–187.
[18] B.C. Ooi, K.L. Tan, T.S. Chua, W. Hsu, Fast image retrieval
using color-spatial information, The VLDB Journal 7 (2)
(1998) 115–128.
[19] A. Ortega, F. Carignano, S. Ayer, M. Vetterli, Soft caching:
web cache management techniques for images, IEEE Signal
Processing Society 1997 Workshop on Multimedia Signal
Processing, IEEE, WY, USA, 1997, pp. 475–480.
[20] R.E. Orwig, H. Chen, J.F. Nunamaker, A graphical, self-or-
ganizing approach to classifying electronic meeting output,
Journal of the American Society for Information Science 48
(2) (1997) 157–170.
[21] A. Pentland, R.W. Picard, S. Sclaroff, Photobook: tools for
content based manipulation of image databases, Proceedings
of the International Society for Optical Engineering 2185
(1994) 34–47.
[22] B.D. Ripley, Pattern Recognition and Neural Networks, Cam-
bridge Univ. Press, Cambridge, UK, 1996.
[23] J.R. Smith, S.-F. Chang, VisualSEEK: a fully automated con-
tent-based image query system, Proceedings of ACM Multimedia (1996) 87–98.
[24] V. Tolat, An analysis of Kohonen’s self-organizing maps using
a system of energy functions, Biological Cybernetics 64
(1990) 155–164.
[25] V. Vinod, H. Murase, Focused retrieval of color images, Pat-
tern Recognition 30 (10) (1997) 1787–1797.
[26] J.Z. Wang, G. Wiederhold, O. Firschein, X.W. Sha, Content-based image indexing and searching using Daubechies' wavelets, International Journal of Digital Libraries 1 (4) (1998) 311–328.
[27] J.Z. Wang, J. Li, D. Chan, G. Wiederhold, Semantics-sensitive retrieval for digital picture libraries, D-Lib Magazine 5 (1999) 11.
[28] J.H. Ward, Hierarchical grouping to optimize an objective
function, Journal of the American Statistical Association 58
(1963) 236–244.
[29] D.L. Wilson, A.J. Baddeley, R.A. Owens, A new metric for
grey-scale image comparison, International Journal of Com-
puter Vision 24 (1) (1997) 1–29.
[30] B. Zhu, H. Chen, Validating a geographical image retrieval
system, Journal of the American Society for Information
Science 51 (7) (2000) 625–634.
Samuel W.K. Chan received his MSc degree in 1986 from the
University of Manchester, UK, his MPhil degree in 1991 from the
Chinese University of Hong Kong and his PhD degree in 1998 from
the University of New South Wales, Australia—all in Computer
Science. Before joining the Chinese University of Hong Kong, he
has been working in computational intelligence since 1989. His
current research interests are in applying machine learning techni-
ques in image and text, data mining, information retrieval and
multimedia information processing with emphasis on image and
text. He has published articles in IEEE Transactions on Neural
Networks, IEEE Transactions on Knowledge and Data Engineering,
IEEE Transactions on Systems, Man and Cybernetics, Journal of
Information Science, Machine Translation, International Journal of
Computer Processing of Oriental Languages, Artificial Intelligence
in Medicine, Applied Artificial Intelligence and others.