issn 2083-2567 - jaiscrjaiscr.eu/issuespdf/jaiscr_vol4_no2_2014.pdfscientific results and methods...

2014ISSN 2083-2567

Volume 4, Number 2

JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH (JAISCR) is a semi-annual periodical published by the University of Social Sciences in Lodz, Poland. PUBLISHING AND EDITORIAL OFFICE: University of Social Sciences (SAN) Information Technology Institute (ITI) Sienkiewicza 9 90-113 Lodz tel.: +48 42 6646654 fax.: +48 42 6366251 e-mail: [email protected] URL: http://jaiscr.eu Print: Mazowieckie Centrum Poligrafii, ul. Duża 1, 05-270 Marki, www.c-p.com.pl, [email protected] Copyright © 2012 Academy of Management (SWSPiZ), Lodz, Poland. All rights reserved. AIMS AND SCOPE:Journal of Artificial Intelligence and Soft Computing Research is a refereed international journal whose focus is on the latest scientific results and methods constituting soft computing. The areas of interest include, but are not limited to: Artificial Intelligence in Modelling and Simulation Artificial Intelligence in Scheduling and Optimization Bioinformatics Computer Vision Data Mining Distributed Intelligent Processing Evolutionary Design Expert Systems Fuzzy Computing with Words Fuzzy Control Fuzzy Logic Fuzzy Optimisation Hardware Implementations Intelligent Database Systems Knowledge Engineering Multi-agent Systems Natural Language Processing Neural Network Theory and Architectures Robotics and Related Fields Rough Sets Theory: Foundations and Applications Speech Understanding Supervised and Unsupervised Learning Theory of Evolutionary Algorithms Various Applications

JOURNAL OF ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING RESEARCH (JAISCR) is a semi-annual periodical published by the University of Social Sciences in Lodz, Poland. PUBLISHING AND EDITORIAL OFFICE: University of Social Sciences (SAN) Information Technology Institute (ITI) Sienkiewicza 9 90-113 Lodz tel.: +48 42 6646654 fax.: +48 42 6366251 e-mail: [email protected] URL: http://jaiscr.eu Print: Mazowieckie Centrum Poligrafii, ul. Duża 1, 05-270 Marki, www.c-p.com.pl, [email protected] Copyright © 2012 Academy of Management (SWSPiZ), Lodz, Poland. All rights reserved. AIMS AND SCOPE:Journal of Artificial Intelligence and Soft Computing Research is a refereed international journal whose focus is on the latest scientific results and methods constituting soft computing. The areas of interest include, but are not limited to: Artificial Intelligence in Modelling and Simulation Artificial Intelligence in Scheduling and Optimization Bioinformatics Computer Vision Data Mining Distributed Intelligent Processing Evolutionary Design Expert Systems Fuzzy Computing with Words Fuzzy Control Fuzzy Logic Fuzzy Optimisation Hardware Implementations Intelligent Database Systems Knowledge Engineering Multi-agent Systems Natural Language Processing Neural Network Theory and Architectures Robotics and Related Fields Rough Sets Theory: Foundations and Applications Speech Understanding Supervised and Unsupervised Learning Theory of Evolutionary Algorithms Various Applications

Copyright © 2014 University of Social Science (SAN), Lodz, Poland. All rights reserved.

Shi Cheng, Yuhui Shi, Quande Qin, Qingyu Zhang and Ruibin BaiPoPULAtioN DiveRSitY MAiNteNANCeiN BRAiN StoRM oPtiMiZAtioN ALgoRithM ……………………………………………83

Po-Ming Lee and tzu-Chien hsiaoAPPLYiNg LCS to AffeCtive iMAge CLASSifiCAtioN iN SPAtiAL-fReQUeNCY DoMAiN …………………………..…………99

felix Jimenez, Masayoshi Kanoh, tomohiro Yoshikawa, takeshi furuhashi and tsuyoshi NakamuraeffeCt of RoBot UtteRANCeS USiNgoNoMAtoPoeiA oN CoLLABoRAtive LeARNiNg ………………………..……………125

Xiaoguang Wang, Xuan Liu, Nathalie Japkowicz and Stan MatwinAUtoMAteD APPRoACh to CLASSifiCAtioN of MiNe-LiKe oBJeCtS USiNg MULtiPLe-ASPeCt SoNAR iMAgeS………………………………………133

Tomasz Bruździński, Adam Krzyżak, Thomas Fevens and Łukasz JeleńWeB–BASeD fRAMeWoRK foR BReASt CANCeR CLASSifiCAtioN ...……...…………..………………………………149

Contents

JAISCR, 2014, Vol. 4, No. 2, pp. 83

POPULATION DIVERSITY MAINTENANCE IN BRAINSTORM OPTIMIZATION ALGORITHM

Shi Cheng1, Yuhui Shi2, Quande Qin3, Qingyu Zhang3 and Ruibin Bai41International Doctoral Innovation Centre, The University of Nottingham Ningbo, China

Division of Computer Science, University of Nottingham Ningbo, China,Ningbo, 315100, Zhejiang, China

2Department of Electrical & Electronic Engineering, Xi’an Jiaotong-Liverpool University,Suzhou, 215123, Jiangsu, China

3Department of Management Science, Shenzhen University, Shenzhen, China,Shenzhen, 518060, Guangdong, China

4Division of Computer Science, University of Nottingham Ningbo, China,Ningbo, 315100, Zhejiang, China

Abstract

The convergence and divergence are two common phenomena in swarm intelligence. Toobtain good search results, the algorithm should have a balance on convergence and diver-gence. The premature convergence happens partially due to the solutions getting clusteredtogether, and not diverging again. The brain storm optimization (BSO), which is a youngand promising algorithm in swarm intelligence, is based on the collective behavior ofhuman being, that is, the brainstorming process. The convergence strategy is utilized inBSO algorithm to exploit search areas may contain good solutions. The new solutionsare generated by divergence strategy to explore new search areas. Premature convergencealso happens in the BSO algorithm. The solutions get clustered after a few iterations,which indicate that the population diversity decreases quickly during the search. A defi-nition of population diversity in BSO algorithm is introduced in this paper to measure thechange of solutions’ distribution. The algorithm’s exploration and exploitation ability canbe measured based on the change of population diversity. Different kinds of partial re-initialization strategies are utilized to improve the population diversity in BSO algorithm.The experimental results show that the performance of the BSO is improved by part ofsolutions re-initialization strategies.

1 Introduction

Optimization, in general, is concerned withfinding the “best available” solution(s) for a givenproblem. Optimization problems can be simplydivided into unimodal problems and multimodalproblems. As indicated by the name, a unimodalproblem has only one optimum solution; on thecontrary, a multimodal problem has several or nu-merous optimum solutions, of which many are lo-cal optimal solutions. Galois theory has proved that

there is no quintic formula, i.e., the fifth and higherdegree equations are not generally solvable by rad-icals. The iterative method is a powerful tool tosolve the fifth and higher degree equations or otherdifficult functions. Based on the simple rules of it-eration, the solution(s) could be improved iterationby iteration, and finally reached to a “good enough”solution. Evolutionary optimization algorithms, orsimply the evolutionary algorithms (EAs), are akind of population-based iterative methods to solvedifficult optimization problems. The weakness of

– 9710.1515/jaiscr-2015-0001

84 Cheng S., Shi Y., Qin Q., Zhang Q. and Ba R.

EAs is generally difficult to find the global optimumsolutions for multimodal problems due to the possi-ble occurrence of the premature convergence [1–3].This balance could be controlled by setting an algo-rithm’s parameters [4].

An optimization problem in R n, or simply anoptimization problem, is a mapping f : R n → R k,where R n is termed as decision space [5] (or param-eter space [6], problem space), and R k is termed asobjective space [7]. Optimization problems can bedivided into two categories according to the valueof k. When k = 1, this kind of problem is calledSingle Objective Problems(SOPs), and when k > 1,this is called Multi-Objective Problems (or ManyObjective Optimization, MOO) [8, 9].

The evaluation function in optimization, f (x),maps decision variables to objective vectors. Eachsolution in decision space is associated with a fit-ness value in objective space. This situation is rep-resented in Fig. 1 for the case n = 3, and k = 2.

Evolutionary computation algorithm is inspiredfrom the natural selection process of the physicalworld, and the swarm intelligence mimics the be-haviors of a population of animals/humans in thereal world. Both evolutionary computation algo-rithms and swarm intelligence algorithms can beseen as decentralized systems, and a populationof interacting individuals searches in the solutionspace to optimize a function or goal based on col-lective adaptation [10].

Swarm intelligence is based on a population ofindividuals [11]. In swarm intelligence, an algo-rithm maintains and successively improves a col-lection of potential solutions until some stoppingcondition is met. The solutions are initialized ran-domly in the search space. The search informationis propagated through the interaction among solu-tions. Based on the solutions convergence and di-vergence, solutions are guided toward the better andbetter areas.

In swarm intelligence algorithms, there are sev-eral solutions which exist at the same time. The pre-mature convergence may happen due to the solutiongetting clustered together too fast. The populationdiversity is a measure of exploration and exploita-tion. Based on the population diversity changingmeasurement, the state of exploration and exploita-tion can be obtained. The population diversity def-

inition is the first step to give an accurate obser-vation of the search state. Many studies of popu-lation diversity in evolutionary computation algo-rithms and swarm intelligence have been proposedin [2, 12–18].

Brain storm optimization (BSO) algorithm is ayoung and promising swarm intelligence algorithm,which mimics the brainstorming process in which agroup of people solves a problem together [19, 20].In a brain storm optimization algorithm, the solu-tions are divided into several clusters. The solutionsbeing divided into several clusters can be seen as thepopulation diverging into separate species, whichare similar to the speciation in the natural selection.The new solutions are generated based on individ-ual(s) in one or two clusters.

BSO algorithm has been utilized to differentkinds of problems, such as multimodal optimiza-tion [21], multi-objective optimization [22,23]. Theparameters in BSO are investigated in [24], the so-lution clustering is analyzed in [25], the populationdiversity management is studied in [26]. Many vari-ants of BSO algorithms are proposed. In [27], to re-duce the algorithm computational burden, a simplegrouping method (SGM) in the grouping operator isintroduced to replace the clustering method.

Brain storm optimization algorithm has beenutilized into several kinds of real-world problems,such as economic dispatch considering wind power[28], closed-loop BSO algorithm on optimal satel-lite formation reconfiguration problem [29], preda-torCprey BSO algorithm for DC Brushless Motor[30], and quantum-behaved BSO algorithm on solv-ing Loney’s Solenoid problem [31].

In this paper, we give a population diversitydefinition of the brain storm optimization algo-rithm, and test several partial re-initializing solu-tions strategies to enhance the population diversityand to help solutions jump out of local optima. Theidea behind the re-initialization is to increase thepossibility for solutions “jumping out” of local op-tima, and to keep the ability for the algorithm to find“good enough” solution.

This paper is organized as follows. Section 2 re-views the basic brain storm optimization algorithm.Section 3 gives the definition of population diver-sity and the diversity maintaining strategies of BSOalgorithm. Experiments on unimodal and multi-

85Cheng S., Shi Y., Qin Q., Zhang Q. and Ba R.

EAs is generally difficult to find the global optimumsolutions for multimodal problems due to the possi-ble occurrence of the premature convergence [1–3].This balance could be controlled by setting an algo-rithm’s parameters [4].

An optimization problem in R n, or simply anoptimization problem, is a mapping f : R n → R k,where R n is termed as decision space [5] (or param-eter space [6], problem space), and R k is termed asobjective space [7]. Optimization problems can bedivided into two categories according to the valueof k. When k = 1, this kind of problem is calledSingle Objective Problems(SOPs), and when k > 1,this is called Multi-Objective Problems (or ManyObjective Optimization, MOO) [8, 9].

The evaluation function in optimization, f (x),maps decision variables to objective vectors. Eachsolution in decision space is associated with a fit-ness value in objective space. This situation is rep-resented in Fig. 1 for the case n = 3, and k = 2.

Evolutionary computation algorithm is inspiredfrom the natural selection process of the physicalworld, and the swarm intelligence mimics the be-haviors of a population of animals/humans in thereal world. Both evolutionary computation algo-rithms and swarm intelligence algorithms can beseen as decentralized systems, and a populationof interacting individuals searches in the solutionspace to optimize a function or goal based on col-lective adaptation [10].

Swarm intelligence is based on a population ofindividuals [11]. In swarm intelligence, an algo-rithm maintains and successively improves a col-lection of potential solutions until some stoppingcondition is met. The solutions are initialized ran-domly in the search space. The search informationis propagated through the interaction among solu-tions. Based on the solutions convergence and di-vergence, solutions are guided toward the better andbetter areas.

In swarm intelligence algorithms, there are sev-eral solutions which exist at the same time. The pre-mature convergence may happen due to the solutiongetting clustered together too fast. The populationdiversity is a measure of exploration and exploita-tion. Based on the population diversity changingmeasurement, the state of exploration and exploita-tion can be obtained. The population diversity def-

inition is the first step to give an accurate obser-vation of the search state. Many studies of popu-lation diversity in evolutionary computation algo-rithms and swarm intelligence have been proposedin [2, 12–18].

Brain storm optimization (BSO) algorithm is ayoung and promising swarm intelligence algorithm,which mimics the brainstorming process in which agroup of people solves a problem together [19, 20].In a brain storm optimization algorithm, the solu-tions are divided into several clusters. The solutionsbeing divided into several clusters can be seen as thepopulation diverging into separate species, whichare similar to the speciation in the natural selection.The new solutions are generated based on individ-ual(s) in one or two clusters.

BSO algorithm has been utilized to differentkinds of problems, such as multimodal optimiza-tion [21], multi-objective optimization [22,23]. Theparameters in BSO are investigated in [24], the so-lution clustering is analyzed in [25], the populationdiversity management is studied in [26]. Many vari-ants of BSO algorithms are proposed. In [27], to re-duce the algorithm computational burden, a simplegrouping method (SGM) in the grouping operator isintroduced to replace the clustering method.

Brain storm optimization algorithm has beenutilized into several kinds of real-world problems,such as economic dispatch considering wind power[28], closed-loop BSO algorithm on optimal satel-lite formation reconfiguration problem [29], preda-torCprey BSO algorithm for DC Brushless Motor[30], and quantum-behaved BSO algorithm on solv-ing Loney’s Solenoid problem [31].

In this paper, we give a population diversitydefinition of the brain storm optimization algo-rithm, and test several partial re-initializing solu-tions strategies to enhance the population diversityand to help solutions jump out of local optima. Theidea behind the re-initialization is to increase thepossibility for solutions “jumping out” of local op-tima, and to keep the ability for the algorithm to find“good enough” solution.

This paper is organized as follows. Section 2 re-views the basic brain storm optimization algorithm.Section 3 gives the definition of population diver-sity and the diversity maintaining strategies of BSOalgorithm. Experiments on unimodal and multi-

POPULATION DIVERSITY MAINTENANCE IN . . .

x1

x2

x3

Ω = {x ∈ Rn}

Solution Space →

f2

f1

Λ = {y ∈ Rk}Objective Space

(a) (b)

Figure 1. The mapping from solution space to objective space.

modal benchmark functions are conducted in Sec-tion 4. The analysis and discussion of the perfor-mance of the BSO algorithm and the population di-versity maintaining are given in Section 5. Finally,Section 6 concludes with some remarks and futureresearch directions.

2 Brain Storm Optimization

The convergence and divergence are two com-mon phenomena in swarm intelligence. The conver-gence and divergence information also can be uti-lized on the search. The framework of divergenceand convergence is shown in Fig. 2. The conver-gence strategy is utilized to explore new possiblesearch region, while the divergence strategy is uti-lized to exploit existing regions may contains goodsolutions.

The brain storm optimization algorithm andfirework algorithm [32, 33] algorithm can be an-alyzed by the convergence and divergence frame-work. In BSO algorithm, the random initialized so-lutions are convergence to different areas. This isa convergence strategy, and the new solutions aregenerated to diverge the search space. The Fireworkalgorithm [32,33] also utilized convergence and di-vergence strategies in optimization. Mimicking theexplosion of fireworks, the solutions are generatedto diverge into large search space. The solutionswith good fitness values are selected, which indi-cated that the solutions are converged to small ar-eas. The convergence and divergence strategies areprocess iteration by iteration. Based on the itera-tions of convergence and divergence, the solutionscould be clustered to small regions finally.

The BSO algorithm, which is a young andpromising algorithm in swarm intelligence, is basedon the collective behavior of human being, that is,the brainstorming process [19, 20, 34]. The specia-tion is a process of natural selection, which meansthat the population diverging into separate species[35, 36]. The solutions in BSO are also diverginginto several clusters. The new solutions are gen-erated based on the mutation of one individual orinteractive of two individuals.

The original BSO algorithm is simple in con-cept and easy in implementation. The main proce-dure is given in Algorithm 1. There are three strate-gies in this algorithm: the solution clustering, newindividual generation, and selection [25].

In a brain storm optimization algorithm, the so-lutions are separated into several clusters. The bestsolutions of each cluster are kept to the next iter-ation. New individual can be generated based onone or two individuals in clusters. The exploita-tion ability is enhanced when the new individual isclose to the best solution so far. While the explo-ration ability is enhanced when the new individualis randomly generated, or generated by individualsin two clusters.

The brain storm optimization algorithm is akind of search space reduction algorithm [37]; allsolutions will get into several clusters eventually.These clusters indicate a problem’s local optima.The information of an area contains solutions withgood fitness values are propagated from one clus-ter to another [38]. This algorithm will explore indecision space at first, and the exploration and ex-ploitation will get into a state of equilibrium afteriterations.


Random Initialization

Divergence Strategy

Exploration of new possibilities

Convergence Strategy

Exploitation of old certainties

Initial Solutions

Chosen SolutionImproved Solution

Figure 2. The framework of divergence and convergence in swarm intelligence.

Algorithm 1: The procedure of the brain storm optimization algorithm

1 Initialization: Randomly generate n potential solutions (individuals), and evaluate the n individuals;2 while have not found “good enough” solution or not reached the pre-determined maximum number

of iterations do3 Clustering: Cluster n individuals into m clusters by a clustering algorithm;4 New individuals’ generation: randomly select one or two cluster(s) to generate new individual;5 Selection: The newly generated individual is compared with the existing individual with the

same individual index, the better one is kept and recorded as the new individual;6 Evaluate the n individuals;

The brain storm optimization algorithm alsocan be extended to solve multiobjective optimiza-tion problems [22, 34]. Unlike the traditional mul-tiobjective optimization methods, the brain stormoptimization algorithm utilized the objective spaceinformation directly. Clusters are generated in theobjective space; and for each objective, individu-als are clustered in each iteration. The individual,which perform better in most of objectives are keptto the next iteration, and other individuals are ran-domly selected to keep the diversity of solutions.

2.1 Solution Clustering

The aim of solution clustering is to converge thesolutions into small regions. Different clusteringalgorithms can be utilized in the brain storm opti-mization algorithm. The clustering strategy can bereplaced by other convergence method, such as sim-ple grouping method (SGM) [27]. In this paper, thebasic k-means clustering algorithm is utilized.

Clustering is the process of grouping similarobjects together. From the perspective of ma-chine learning, the clustering analysis is sometimes

termed as unsupervised learning. There are Npoints in the given input, D = {xi}N

i=1, the usefuland functional patterns can be obtained through thesimilarity calculation among points [39]. Every so-lution in the brain storm optimization algorithm isspread in the search space. The distribution of so-lutions can be utilized to reveal the landscapes of aproblem.

The procedure of solution clustering is given inAlgorithm 2. The clustering strategy divides indi-viduals into several clusters. This strategy could re-fine a search area. After many iterations, all solu-tions may be clustered into a small region. A proba-bility value pclustering is utilized to control the prob-ability of replacing a cluster center by a randomlygenerated solution. This could avoid the prematureconvergence, and help individuals “jump out” of thelocal optima.

2.2 New Individual Generation

The procedure of new individual generation isgiven in Algorithm 3. A new individual can be gen-

Cheng S., Shi Y., Qin Q., Zhang Q. and Ba R.


Divergence Strategy




Initial Solutions


















Divergence Strategy




Initial Solutions


















Divergence Strategy




Initial Solutions

















erated based on one or several individuals or clus-ters. In the original brain storm optimization al-gorithm, a probability value pgeneration is utilized todetermine a new individual being generated by oneor two “old” individuals. Generating an individualfrom one cluster could refine a search region, and itenhances the exploitation ability. On the contrast,an individual, which is generated from two or moreclusters, may be far from these clusters. The explo-ration ability is enhanced in this scenario.

The probability poneCluster and probabilityptwoCluster are utilized to determine the cluster cen-ter or random individual will be chosen in one clus-ter or two clusters generation case, respectively.In one cluster generation case, the new individualfrom center or random individual can control theexploitation region. While in several clusters gen-eration case, the random individuals could increasethe population diversity of swarm.

The new individuals are generated according tothe functions (1) and (2).

xinew = xi

old +ξ(t)× rand() (1)

ξ(t) = logsig(0.5×T − t

c)× rand() (2)

where xinew and xi

old are the ith dimension of xnewand xold; and the value xold is a copy of one indi-vidual or the combination of two individuals. Theparameter T is the maximum number of iterations,t is the current iteration number, c is a coefficient tochange logsig() function’s slope.

2.3 Selection

The selection strategy is utilized to keep goodsolutions in all individuals. A modified step sizeand individual generation was proposed in [40].The step size can be utilized to balance the conver-gence speed of the algorithm. The better solutionsare kept by the selection strategy, while clusteringstrategy and generation strategy add new solutionsinto the swarm to keep the diversity for the wholepopulation.

3 Population Diversity

The most important factor affecting an op-timization algorithm’s performance is its ability

of “exploration” and “exploitation.” Explorationmeans the ability of a search algorithm to exploredifferent areas of the search space in order to havehigh probability to find good promising solutions.Exploitation, on the other hand, means the abilityto concentrate the search around a promising regionin order to refine a candidate solution. A good op-timization algorithm should optimally balance thetwo conflicted objectives [38, 41].

In a brain storm optimization algorithm, the so-lutions are grouped into several clusters. The bestsolutions of each cluster are kept to the next itera-tion due to the selection operation. New individualcan be generated based on one or two individuals inclusters. The exploitation ability is enhanced whenthe new individual is close to the best solution sofar. While the exploration ability is enhanced whenthe new individual is randomly generated, or gener-ated by individuals in two clusters.

Population diversity is useful for measuring anddynamically adjusting an algorithm’s ability of ex-ploration or exploitation accordingly. In the brainstorm optimization algorithm, many solutions areexisted at the same time, and these solutions aregathered into several clusters. The solutions mayget together into a small region after iterations. Theclustering algorithm is difficult to cluster solutionsinto different group when every solution is within asmall region. The algorithm’s exploration ability isdecreased at this time.

It is important to find a metric to measure thepopulation diversity of solutions in the brain stormoptimization algorithm. From the measurement, wecan monitor the search of solutions.

3.1 Population Diversity Definition

Population diversity is a measurement of solu-tions’ distribution. In [20], proposed Dc, Dv, andDe to measure normalized distance for a cluster,inter-cluster diversity, and information entropy forthe population, respectively. Here, in this paper, wedefine the population diversity given below, whichis dimensional-wise and based on the L1 norm.



Divergence Strategy




Initial Solutions

















Algorithm 3: The new individual generation strategy

1 New individual generation: randomly select one or two cluster(s) to generate new individual;2 Randomly generate a value rgeneration in the range [0,1);3 if the value rgeneration is less than a probability pgeneration then4 Randomly select a cluster, and generate a random value roneCluster in the range [0,1);5 if the value roneCluster is smaller than a pre-determined probability poneCluster;6 then7 Select the cluster center and add random values to it to generate new individual;8 else9 Randomly select an individual from this cluster and add random value to the individual to

generate new individual;

10 else11 randomly select two clusters to generate new individual;12 Generate a random value rtowCluster in the range [0,1);13 if the value rtowCluster is less than a pre-determined probability ptwoCluster then14 the two cluster centers are combined and then added with random values to generate new

individual;15 else16 two individuals from each selected cluster are randomly selected to be combined and added

with random values to generate new individual;

17 The newly generated individual is compared with the existing individual with the same individualindex, the better one is kept and recorded as the new individual;

x j =1m

m

∑i=1

xi j

Div j =1m

m

∑i=1

|xi j − x j|

Div =n

∑j=1

w jDiv j

where x j represents the pivot of solutions in di-mension j, and Div j measures solution diversitybased on L1 norm for dimension j. Then we de-fine x = [x1, · · · , x j, · · · , xn], x represents the meanof current solutions on each dimension, and Div =[Div1, · · · ,Div j, · · · ,Divn], which measures solutiondiversity based on L1 norm for each dimension. Divmeasures the whole group’s population diversity.

Without loss of generality, every dimension isconsidered equally. Setting all weights w j =

1n ,

then the dimension-wise population diversity can berewritten as:

Div =n

∑j=1

1n

Div j =1n

n

∑j=1

Div j

3.2 Population Diversity Maintenance

Population diversity is a measurement of popu-lation state of exploration or exploitation. It illus-trates the distribution of solutions. The solutionsdiverging means that the search is in an explorationstate, on the contrary, solutions clustering tightlymeans that the search is in an exploitation state [42].

The solutions get clustered in search space, andit may not be easy to diverge. The population di-versity is decreased when all solutions are clusteredinto one small region. Many strategies are pro-posed to enhance the population diversity in evolu-tionary computation algorithms and swarm intelli-gence. These strategies include inserting randomlygenerated individuals, niching [43,44], solutions re-initialization [37, 42], or reconstructing the fitnessfunction with the consideration of the age of indi-viduals [45] or the entropy of the population [46].

In this paper, the solutions partial re-initialization is utilized to promote diversity of BSOalgorithm. In the brain storm optimization algo-rithm, the new individual is generated by addingone or two individual(s) with the noise based onequation (1). However, every solution will be very


Algorithm 3: The new individual generation strategy

1 New individual generation: randomly select one or two cluster(s) to generate new individual;2 Randomly generate a value rgeneration in the range [0,1);3 if the value rgeneration is less than a probability pgeneration then4 Randomly select a cluster, and generate a random value roneCluster in the range [0,1);5 if the value roneCluster is smaller than a pre-determined probability poneCluster;6 then7 Select the cluster center and add random values to it to generate new individual;8 else9 Randomly select an individual from this cluster and add random value to the individual to

generate new individual;

10 else11 randomly select two clusters to generate new individual;12 Generate a random value rtowCluster in the range [0,1);13 if the value rtowCluster is less than a pre-determined probability ptwoCluster then14 the two cluster centers are combined and then added with random values to generate new

individual;15 else16 two individuals from each selected cluster are randomly selected to be combined and added

with random values to generate new individual;

17 The newly generated individual is compared with the existing individual with the same individualindex, the better one is kept and recorded as the new individual;

x j =1m

m

∑i=1

xi j

Div j =1m

m

∑i=1

|xi j − x j|

Div =n

∑j=1

w jDiv j

where x j represents the pivot of solutions in di-mension j, and Div j measures solution diversitybased on L1 norm for dimension j. Then we de-fine x = [x1, · · · , x j, · · · , xn], x represents the meanof current solutions on each dimension, and Div =[Div1, · · · ,Div j, · · · ,Divn], which measures solutiondiversity based on L1 norm for each dimension. Divmeasures the whole group’s population diversity.

Without loss of generality, every dimension isconsidered equally. Setting all weights w j =

1n ,

then the dimension-wise population diversity can berewritten as:

Div =n

∑j=1

1n

Div j =1n

n

∑j=1

Div j

3.2 Population Diversity Maintenance

Population diversity is a measurement of popu-lation state of exploration or exploitation. It illus-trates the distribution of solutions. The solutionsdiverging means that the search is in an explorationstate, on the contrary, solutions clustering tightlymeans that the search is in an exploitation state [42].

The solutions get clustered in search space, andit may not be easy to diverge. The population di-versity is decreased when all solutions are clusteredinto one small region. Many strategies are pro-posed to enhance the population diversity in evolu-tionary computation algorithms and swarm intelli-gence. These strategies include inserting randomlygenerated individuals, niching [43,44], solutions re-initialization [37, 42], or reconstructing the fitnessfunction with the consideration of the age of indi-viduals [45] or the entropy of the population [46].

In this paper, the solutions partial re-initialization is utilized to promote diversity of BSOalgorithm. In the brain storm optimization algo-rithm, the new individual is generated by addingone or two individual(s) with the noise based onequation (1). However, every solution will be very


similar in each dimension when the solutions getclustered into a small region. The original BSOalgorithm may not be easy to escape from localoptima. The partial re-initialization in the wholesearch space could make many solutions divergeinto large search areas. The idea behind the re-initialization is to increase possibility for solutions“jumping out” of local optima, and to keep the abil-ity for algorithm to find “good enough” solutions.

Algorithm 4 gives the procedure of the BSO al-gorithm with re-initialization strategy. After sev-eral iterations, part of solutions are re-initialized inwhole search space, which increases the possibilityof solutions “jumping out” of local optima. Accord-ing to the number of re-initialized solutions, thisstrategy can be divided into following categories:

– The number of re-initialized solutions is de-creasing during the search process. Morethan half solutions are re-initialized at the be-ginning of search, and the number of re-initialized solutions is linearly decreased at eachre-initialization. This strategy is to focus on theexploration at first, and the exploitation at theend of the search.

– Part of solutions re-initialized after certain iter-ations. The number of re-initialized solutions isfixed during the search process. This approachcan obtain a great ability of exploration due tothe possibility that part of solutions, e.g., half ofsolutions, will have the chance to escape fromlocal optima.

– The number of re-initialized solutions is in-creasing during the search process. Lessthan half solutions are re-initialized at the be-ginning of search, and the number of re-initialized solutions is linearly increased at eachre-initialization. This strategy is to focus on theexploitation at first, and the exploration at theend of the search.

4 Experimental Study

Wolpert and Macerady have proved that un-der certain assumptions no algorithm is better thanother one on average for all problems [47]. The aimof the experiment is not to compare the ability orthe efficacy of the brain storm optimization algo-

rithm with other swarm intelligence algorithms, butthe population diversity property of the brain stormoptimization algorithm.

4.1 Benchmark Test Functions and Pa-rameter Setting

The experiments have been conducted to testthe proposed BSO algorithm on the benchmarkfunctions listed in Table 1. Considering the gen-erality, eleven standard benchmark functions wereselected, which include five unimodal functions andseven multimodal functions [48, 49]. All functionsare run 50 times to ensure a reasonable statisticalresult. There are 1500 iterations for 50 dimensionalproblems in every run. Randomly shifting of the lo-cation of optimum is utilized in each dimension foreach run.

In all experiments, the brain storm optimiza-tion has 200 individuals, and parameters are set asthe following, let pclustering = 0.2, pgeneration = 0.6,poneCluster = 0.4 and ptwoCluster = 0.5. The parameterk in k-means algorithm is 20. The coefficient c is setas 20.0. In the BSO with solution re-initialization,the solutions will be partially re-initialized aftereach 200 iterations. In the decreasing number ofsolution re-initialization case, there are 20 solu-tions are kept at the first time, the number of keptsolutions increase 20 at each re-initialization, and140 solutions are kept at the last time. In the in-creasing number of solution re-initialization case,there are 180 solutions are kept at the first time,the number of kept solutions increase 20 at each re-initialization, and 60 solutions are kept at the lasttime.

4.2 Experimental Results

Several measures of performance are utilized inthis paper. The first is the best fitness value attainedafter a fixed number of iterations. In our case, wereport the best result found after 1500 for 50 dimen-sional problems. The following measures are themedian, the worst and mean value of best fitnessvalues in each run. It is possible that an algorithmwill rapidly reach a relatively good result while be-coming trapped into a local optimum. These threevalues give a measure of algorithms’ reliability androbustness.


Algorithm 4: The procedure of the population diversity promoted BSO algorithm


of iterations do3 Clustering: Cluster n individuals into m clusters by a clustering algorithm;4 New individual generation: randomly select one or two cluster(s) to generate new individual;5 Selection: The newly generated individual is compared with the existing individual with the same

individual index, the better one is kept and recorded as the new individual;6 Re-initialization: partially re-initialize some solutions after certain iterations;7 Evaluate the n individuals;

Table 1. The benchmark functions used in experimental study, where n is the dimension of each problem,z = (x−o), x = [x1,x2, · · · ,xn], oi is an randomly generated number in problem’s search space S and it is

different in each dimension, global optimum x∗ = o, fmin is the minimum value of the function, andS ⊆ R n.

Function Test Function S fmin

Parabolic f0(x) =n∑

i=1z2

i +bias0 [−100,100]n -450.0

Schwefel’s P2.22 f1(x) =n∑

i=1|zi|+∏n

i=1 |zi|+bias1 [−10,10]n -330.0


i=1(

i∑

k=1zk)

2 +bias2 [−100,100]n 450.0

Step f3(x) =n∑

i=1(�zi +0.5�)2 +bias3 [−100,100]n 330.0

Quartic Noise f4(x) =n∑

i=1iz4

i + random[0,1)+bias4 [−1.28,1.28]n -450.0

Rosenbrock f5(x) =n−1∑

i=1[100(zi+1 − z2

i )2 +(zi −1)2]+bias5 [−10,10]n 180.0

Rastrigin f6(x) =n∑

i=1[z2

i −10cos(2πzi)+10]+bias6 [−5.12,5.12]n -330.0

Noncontinuous f7(x) =n∑

i=1[y2

i −10cos(2πyi)+10]+bias7[−5.12,5.12]n 450.0

Rastrigin yi =

{zi |zi|< 1

2round(2zi)

2 |zi| ≥ 12

Ackley f8(x) =−20exp(−0.2

√1n

n∑

i=1z2

i

)[−32,32]n 180.0

−exp(

1n

n∑

i=1cos(2πzi)

)+20+ e+bias8

Griewank f9(x) = 14000

n∑

i=1z2

i −n∏i=1

cos( zi√i)+1+bias9 [−600,600]n 120.0

f10(x) = πn{10sin2(πy1)+

n−1∑

i=1(yi −1)2

[−50,50]n 330.0Generalized ×[1+10sin2(πyi+1)]+(yn −1)2}Penalized +∑n

i=1 u(zi,10,100,4)+bias10yi = 1+ 1

4 (zi +1)

u(zi,a,k,m) =

k(zi −a)m zi > a,0 −a < zi < ak(−zi −a)m zi <−a










i=1z2

i +bias0 [−100,100]n -450.0


i=1|zi|+∏n

i=1 |zi|+bias1 [−10,10]n -330.0


i=1(

i∑

k=1zk)

2 +bias2 [−100,100]n 450.0

Step f3(x) =n∑

i=1(�zi +0.5�)2 +bias3 [−100,100]n 330.0


i=1iz4

i + random[0,1)+bias4 [−1.28,1.28]n -450.0


i=1[100(zi+1 − z2

i )2 +(zi −1)2]+bias5 [−10,10]n 180.0


i=1[z2

i −10cos(2πzi)+10]+bias6 [−5.12,5.12]n -330.0


i=1[y2

i −10cos(2πyi)+10]+bias7[−5.12,5.12]n 450.0

Rastrigin yi =

{zi |zi|< 1

2round(2zi)

2 |zi| ≥ 12


√1n

n∑

i=1z2

i

)[−32,32]n 180.0

−exp(

1n

n∑

i=1cos(2πzi)

)+20+ e+bias8


n∑

i=1z2

i −n∏i=1

cos( zi√i)+1+bias9 [−600,600]n 120.0

f10(x) = πn{10sin2(πy1)+

n−1∑

i=1(yi −1)2


i=1 u(zi,10,100,4)+bias10yi = 1+ 1

4 (zi +1)

u(zi,a,k,m) =











i=1z2

i +bias0 [−100,100]n -450.0


i=1|zi|+∏n

i=1 |zi|+bias1 [−10,10]n -330.0


i=1(

i∑

k=1zk)

2 +bias2 [−100,100]n 450.0

Step f3(x) =n∑

i=1(�zi +0.5�)2 +bias3 [−100,100]n 330.0


i=1iz4

i + random[0,1)+bias4 [−1.28,1.28]n -450.0


i=1[100(zi+1 − z2

i )2 +(zi −1)2]+bias5 [−10,10]n 180.0


i=1[z2

i −10cos(2πzi)+10]+bias6 [−5.12,5.12]n -330.0


i=1[y2

i −10cos(2πyi)+10]+bias7[−5.12,5.12]n 450.0

Rastrigin yi =

{zi |zi|< 1

2round(2zi)

2 |zi| ≥ 12


√1n

n∑

i=1z2

i

)[−32,32]n 180.0

−exp(

1n

n∑

i=1cos(2πzi)

)+20+ e+bias8


n∑

i=1z2

i −n∏i=1

cos( zi√i)+1+bias9 [−600,600]n 120.0

f10(x) = πn{10sin2(πy1)+

n−1∑

i=1(yi −1)2


i=1 u(zi,10,100,4)+bias10yi = 1+ 1

4 (zi +1)

u(zi,a,k,m) =



Table 2 gives results of the brain storm op-timization algorithm solving unimodal and mul-timodal problems. The population diversity en-hanced BSO performs better than the original BSOfor most problems, especially for the unimodalproblems.

For traditional algorithms, the multimodalproblems are difficult to solve than unimodal prob-lems due to that the multimodal problems havemany local optima. However, the brain storm op-timization algorithm may be more suitable for mul-timodal problems. The concept of brain storm op-timization algorithm is not to cluster all solutionsinto one small region, but many regions. From theresults, we can find that the original BSO algorithmperforms well on the multimodal functions, and thepopulation diversity enhanced BSO algorithm havemore improvement in solving unimodal functionsthan multimodal functions.

5 Analysis and Discussion

5.1 Population Diversity Monitor

The simulation results give the convergencecurves of benchmark functions. Fig. 3 displaysthe average performance of BSO algorithms solv-ing five unimodal functions. Fig. 4 displays the av-erage performance of BSO algorithms solving sixmultimodal functions. The brain storm optimiza-tion algorithm has a fast convergence at the begin-ning of search, which indicates that the good searchregions can be located after several solution clus-tering strategies. However, the ability of preventingpremature convergence, and “jumping out” of lo-cal optima should be improved. Keeping the globalsearch ability, and improving the local search abilityshould be investigated in the brain storm optimiza-tion algorithm.

5.2 Population Diversity Analysis

Fig. 5 and Fig. 6 display the population diver-sity changes during the search process. There aremany vibrations of population diversity change inthe original BSO solving unimodal functions. Thepopulation diversity changes smoothly in the orig-inal BSO solving multimodal functions. This maybe caused by the different properties of BSO solv-ing unimodal and multimodal functions.

The population diversity is enhanced throughthe re-initialization strategy. From Fig. 5 and Fig.6, we can see that the population diversity change isrelated to the number of re-initialized solutions. Ingeneral, the larger the number of re-initialized solu-tions is, the smaller the value of population diversityis.

In this experiments, we only tested the re-initialization strategy with fixed number of itera-tions, and the number of re-initialized solutions isfixed or linear changed. To reveal the relation be-tween the algorithm’s performance and the popula-tion diversity change, more investigation should betaken on the mechanism of BSO solving differenttypes of problems. The population diversity main-tained BSO has promoted the population diversityafter certain iterations. The value of population di-versity is kept at a large number during the search,this could help the solutions “jump out” a local op-tima.

6 Conclusion

The convergence and divergence are two com-mon phenomena in swarm intelligence. Based onthe solutions convergence and divergence, solutionsare guided toward the better and better areas. Inswarm intelligence algorithms, premature conver-gence happens partially due to the solutions gettingclustered together, and not diverging again. Thepremature convergence also happens in the brainstorm optimization algorithm. To prevent the pre-mature convergence, algorithm’s exploration abilityand exploitation ability should be balanced duringthe search.

The population diversity is a measure of explo-ration and exploitation. Based on the population di-versity changing measurement, the state of explo-ration and exploitation can be obtained. The pop-ulation diversity definition is the first step to givean accurate observation of the search state. Manyapproaches have been introduced based on the ideathat prevents solutions from clustering too tightlyin one region of the search space to achieve greatpossibility to “jump out” of local optima [50].

In this paper, we introduce a population diver-sity definition of the brain storm optimization algo-rithm, and test several kinds of diversity enhanced


Table 2. Result of brain storm optimization solving unimodal and multimodal benchmark functions. Allalgorithms are run for 50 times, where “best”, “median”, “worst”, and “mean” indicate the best, median,

worst, and mean of the best fitness values for all runs, respectively.

Func. fmin Dim Best Median Worst Mean Std. Dev.

f0 −450.0original -283.7674 69.3182 1011.455 128.2867 268.266

half -413.8930 -292.6541 -73.4257 -282.6678 78.2084decrease -389.3168 -296.8813 89.42494 -279.8117 93.4272increase -401.3942 -222.5723 21.8686 -233.6816 88.4352

f1 −330.0original -329.9999 -329.9987 -329.0885 -329.9615 0.16351

half -329.9999 -329.9975 -329.9844 -329.9968 0.00311decrease -329.9999 -329.9978 -329.9872 -329.9969 0.00297increase -329.9999 -329.9982 -329.9907 -329.9974 0.00227

f2 450.0original 1674.185 3469.0662 6521.9105 3715.998 1155.126

half 1236.369 1715.393 2518.608 1770.651 365.281decrease 1013.162 1734.731 2432.571 1682.709 311.921increase 1302.540 2088.135 2941.036 2078.788 417.398

f3 330.0original 1461 1989 4036 2121.96 450.6883

half 765 1122 1548 1110.22 163.962decrease 785 1095 1536 1086.92 173.0487increase 875 1185 1768 1221.74 205.0707

f4 −450.0original -449.9989 -449.9966 -449.9933 -449.9963 0.00116


f5 180.0original 221.5290 227.5745 288.2411 232.1447 15.4617


f6 −330.0original -300.1512 -265.3277 -224.5344 -264.9098 17.7013


f7 450.0original 482 526 619 528.68 27.3447


f8 180.0original 188.2361 190.7374 192.2957 190.6498 0.84468


f9 120.0original 129.8876 134.4022 142.2110 134.6174 2.88916


f10 330.0original 332.1512 336.5663 344.5289 337.1611 2.89567



Table 2. Result of brain storm optimization solving unimodal and multimodal benchmark functions. Allalgorithms are run for 50 times, where “best”, “median”, “worst”, and “mean” indicate the best, median,

worst, and mean of the best fitness values for all runs, respectively.

Func. fmin Dim Best Median Worst Mean Std. Dev.

f0 −450.0original -283.7674 69.3182 1011.455 128.2867 268.266

half -413.8930 -292.6541 -73.4257 -282.6678 78.2084decrease -389.3168 -296.8813 89.42494 -279.8117 93.4272increase -401.3942 -222.5723 21.8686 -233.6816 88.4352

f1 −330.0original -329.9999 -329.9987 -329.0885 -329.9615 0.16351


f2 450.0original 1674.185 3469.0662 6521.9105 3715.998 1155.126


f3 330.0original 1461 1989 4036 2121.96 450.6883


f4 −450.0original -449.9989 -449.9966 -449.9933 -449.9963 0.00116


f5 180.0original 221.5290 227.5745 288.2411 232.1447 15.4617


f6 −330.0original -300.1512 -265.3277 -224.5344 -264.9098 17.7013


f7 450.0original 482 526 619 528.68 27.3447


f8 180.0original 188.2361 190.7374 192.2957 190.6498 0.84468


f9 120.0original 129.8876 134.4022 142.2110 134.6174 2.88916


f10 330.0original 332.1512 336.5663 344.5289 337.1611 2.89567



0 500 1000 1500−2

0

2

4

6

8

10

12x 104

originalhalfdecreaseincrease

0 500 1000 1500−0.5

0

0.5

1

1.5

2

2.5

3x 1020


0 500 1000 1500103

104

105

106


(a) Parabolic f0 (b) Schwefel’s P2.22 f1 (c) Schwefel’s P1.2 f2

0 500 1000 1500103

104

105

106


0 500 1000 1500

−102.2

−102.3

−102.4

−102.5

−102.6


(d) Step f3 (e) Quartic Noise f4

Figure 3. The average performance of the brain storm optimization algorithm solving unimodal functions.

0 500 1000 1500102

103

104

105

106

107


0 500 1000 1500−300

−200

−100

0

100

200

300

400

500


0 500 1000 1500102

103

104


(a) Rosenbrock f5 (b) Rastrigin f6 (c) Noncontinuous Rastrigin f7

0 500 1000 1500

102.28

102.29

102.3


0 500 1000 1500102

103

104


0 500 1000 1500102

103

104

105

106

107

108

109


(d) Ackley f8 (e) Griewank f9 (f) Generalized Penalized f10

Figure 4. The average performance of the brain storm optimization algorithm solving multimodalfunctions.


0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−3

10−2

10−1

100

101


0 500 1000 150010−2

10−1

100

101

102



0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−3

10−2

10−1

100



Figure 5. The population diversity monitor of the brain storm optimization algorithm solving unimodalfunctions.

0 500 1000 150010−2

10−1

100

101


0 500 1000 150010−3

10−2

10−1

100

101


0 500 1000 150010−3

10−2

10−1

100

101



0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−1

100

101

102

103


0 500 1000 150010−2

10−1

100

101

102



Figure 6. The population diversity monitor of the brain storm optimization algorithm solving multimodalfunctions.


0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−3

10−2

10−1

100

101


0 500 1000 150010−2

10−1

100

101

102



0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−3

10−2

10−1

100



Figure 5. The population diversity monitor of the brain storm optimization algorithm solving unimodalfunctions.

0 500 1000 150010−2

10−1

100

101


0 500 1000 150010−3

10−2

10−1

100

101


0 500 1000 150010−3

10−2

10−1

100

101



0 500 1000 150010−2

10−1

100

101

102


0 500 1000 150010−1

100

101

102

103


0 500 1000 150010−2

10−1

100

101

102



Figure 6. The population diversity monitor of the brain storm optimization algorithm solving multimodalfunctions.


strategies to help solutions jump out of local op-tima. The experimental study shows that the per-formance of optimization is improved by the popu-lation diversity enhancement. The population diver-sity also should be monitored in the brain storm op-timization algorithm solving multiobjective prob-lems. The relationship between the population di-versity changes and the performance of BSO al-gorithm, and the properties of population diversitychanges with different problems also needs moreanalysis. In general, the brain storm optimizationalgorithm is a young and promising algorithm; thereare many fields which are under investigation.

Acknowledgment

This work was carried out at the InternationalDoctoral Innovation Centre (IDIC). The authors ac-knowledge the financial support from Ningbo Ed-ucation Bureau, Ningbo Science and TechnologyBureau, China’s MOST and The University of Not-tingham. This work is also partially supported byNational Natural Science Foundation of China un-der grant No.71240015, 71402103, 61273367; andNingbo Science & Technology Bureau (Science andTechnology Project No.2012B10055). This is anextension from CEC 2014 conference paper “Main-taining Population Diversity in Brain Storm Opti-mization Algorithm” [26].

References[1] K. A. De Jong, “An analysis of the behavior of a

class of genetic adaptive systems,” Ph.D. disserta-tion, Department of Computer and CommunicationSciences, University of Michigan, August 1975.

[2] M. L. Mauldin, “Maintaining diversity in geneticsearch,” in Proceedings of the National Confer-ence on Artificial Intelligence (AAAI 1984), August1984, pp. 247–250.

[3] D. E. Goldberg, Genetic Algorithms in Search, Op-timization and Machine Learning. Boston, MA,USA: Addison-Wesley Longman Publishing Co.,Inc., 1989.

[4] A. E. Eiben, R. Hinterding, and Z. Michalewicz,“Parameter control in evolutionary algorithms,”IEEE Transactions on Evolutionary Computation,vol. 3, no. 2, pp. 124–141, July 1999.

[5] S. F. Adra, T. J. Dodd, I. A. Griffin, and P. J. Flem-ing, “Convergence acceleration operator for mul-

tiobjective optimization,” IEEE Transactions onEvolutionary Computation, vol. 12, no. 4, pp. 825–847, August 2009.

[6] Y. Jin and B. Sendhoff, “A systems approach toevolutionary multiobjective structural optimizationand beyond,” IEEE Computational IntelligenceMagazine, vol. 4, no. 3, pp. 62–76, August 2009.

[7] R. K. Sundaram, A First Course in OptimizationTheory. Cambridge University Press, 1996.

[8] R. C. Purshouse and P. J. Fleming, “On the evo-lutionary optimization of many conflicting objec-tives,” IEEE Transactions on Evolutionary Com-putation, vol. 11, no. 6, pp. 770–784, December2007.

[9] S. F. Adra and P. J. Fleming, “Diversity man-agement in evolutionary many-objective optimiza-tion,” IEEE Transactions on Evolutionary Compu-tation, vol. 15, no. 2, pp. 183–195, April 2011.

[10] A. Engelbrecht, X. Li, M. Middendorf, and L. M.Gambardella, “Editorial special issue: Swarm in-telligence,” IEEE Transactions on EvolutionaryComputation, vol. 13, no. 4, pp. 677–680, August2009.

[11] J. Kennedy, R. Eberhart, and Y. Shi, Swarm Intelli-gence. Morgan Kaufmann Publisher, 2001.

[12] E. K. Burke, S. Gustafson, and G. Kendall, “A sur-vey and analysis of diversity measures in geneticprogramming,” in Proceedings of the Genetic andEvolutionary Computation Conference (GECCO2002). San Francisco, CA, USA: Morgan Kauf-mann Publishers Inc., 2002, pp. 716–723.

[13] Y. Shi and R. Eberhart, “Population diver-sity of particle swarms,” in Proceedings of the2008 Congress on Evolutionary Computation(CEC2008), 2008, pp. 1063–1067.

[14] ——, “Monitoring of particle swarm optimiza-tion,” Frontiers of Computer Science, vol. 3, no. 1,pp. 31–37, March 2009.

[15] S. Cheng and Y. Shi, “Diversity control in particleswarm optimization,” in Proceedings of 2011 IEEESymposium on Swarm Intelligence (SIS 2011),Paris, France, April 2011, pp. 110–118.

[16] S. Cheng, Y. Shi, and Q. Qin, “ExperimentalStudy on Boundary Constraints Handling in Parti-cle Swarm Optimization: From Population Diver-sity Perspective,” International Journal of SwarmIntelligence Research (IJSIR), vol. 2, no. 3, pp. 43–69, July-September 2011.

[17] S. Cheng, “Population diversity in particle swarmoptimization: Definition, observation, control, andapplication,” Ph.D. dissertation, Department of


Electrical Engineering and Electronics, Universityof Liverpool, May 2013.

[18] S. Cheng, Y. Shi, and Q. Qin, “A study of nor-malized population diversity in particle swarm op-timization,” International Journal of Swarm Intel-ligence Research (IJSIR), vol. 4, no. 1, pp. 1–34,January-March 2013.

[19] Y. Shi, “Brain storm optimization algorithm,” inAdvances in Swarm Intelligence, ser. Lecture Notesin Computer Science, Y. Tan, Y. Shi, Y. Chai, andG. Wang, Eds. Springer Berlin/Heidelberg, 2011,vol. 6728, pp. 303–309.

[20] ——, “An optimization algorithm based on brain-storming process,” International Journal of SwarmIntelligence Research (IJSIR), vol. 2, no. 4, pp. 35–62, October-December 2011.

[21] X. Guo, Y. Wu, and L. Xie, “Modified brain stormoptimization algorithm for multimodal optimiza-tion,” in Advances in Swarm Intelligence, ser. Lec-ture Notes in Computer Science, Y. Tan, Y. Shi,and C. A. C. Coello, Eds. Springer InternationalPublishing, 2014, vol. 8795, pp. 340–351.

[22] J. Xue, Y. Wu, Y. Shi, and S. Cheng, “Brainstorm optimization algorithm for multi-objectiveoptimization problems,” in Advances in Swarm In-telligence, ser. Lecture Notes in Computer Science,Y. Tan, Y. Shi, and Z. Ji, Eds. Springer Berlin /Heidelberg, 2012, vol. 7331, pp. 513–519.

[23] L. Xie and Y. Wu, “A modified multi-objective op-timization based on brain storm optimization al-gorithm,” in Advances in Swarm Intelligence, ser.Lecture Notes in Computer Science, Y. Tan, Y. Shi,and C. Coello, Eds. Springer International Pub-lishing, 2014, vol. 8795, pp. 328–339.

[24] Z.-H. Zhan, W.-N. Chen, Y. Lin, Y.-J. Gong,Y. long Li, and J. Zhang, “Parameter investigationin brain storm optimization,” in 2013 IEEE Sympo-sium on Swarm Intelligence (SIS), April 2013, pp.103–110.

[25] S. Cheng, Y. Shi, Q. Qin, and S. Gao, “Solutionclustering analysis in brain storm optimization al-gorithm,” in Proceedings of The 2013 IEEE Sym-posium on Swarm Intelligence, (SIS 2013). Sin-gapore: IEEE, 2013, pp. 111–118.

[26] S. Cheng, Y. Shi, Q. Qin, T. O. Ting, andR. Bai, “Maintaining population diversity in brainstorm optimization algorithm,” in Proceedings of2014 IEEE Congress on Evolutionary Computa-tion, (CEC 2014). Beijing, China: IEEE, 2014,pp. 3230–3237.

[27] Z. hui Zhan, J. Zhang, Y. hui Shi, and H. lin Liu,“A modified brain storm optimization,” in Evolu-tionary Computation (CEC), 2012 IEEE Congresson, June 2012, pp. 1–8.

[28] H. Jadhav, U. Sharma, J. Patel, and R. Roy, “Brainstorm optimization algorithm based economic dis-patch considering wind power,” in 2012 IEEEInternational Conference on Power and Energy(PECon 2012), Kota Kinabalu, Malaysia, Decem-ber 2012, pp. 588–593.

[29] C. Sun, H. Duan, and Y. Shi, “Optimal satellite for-mation reconfiguration based on closed-loop brainstorm optimization,” IEEE Computational Intelli-gence Magazine, vol. 8, no. 4, pp. 39–51, Novem-ber 2013.

[30] H. Duan, S. Li, and Y. Shi, “Predatorcprey brainstorm optimization for dc brushless motor,” IEEETransactions on Magnetics, vol. 49, no. 10, pp.5336–5340, October 2013.

[31] H. Duan and C. Li, “Quantum-behaved brain stormoptimization approch to solving loney’s solenoidproblem,” IEEE Transactions on Magnetics, p. inpress, 2014.

[32] Y. Tan and Y. Zhu, “Fireworks algorithm for opti-mization,” in Advances in Swarm Intelligence, ser.Lecture Notes in Computer Science, Y. Tan, Y. Shi,and K. C. Tan, Eds. Springer Berlin Heidelberg,2010, vol. 6145, pp. 355–364.

[33] S. Zheng, A. Janecek, and Y. Tan, “Enhancedfireworks algorithm,” in 2013 IEEE Congress onEvolutionary Computation (CEC), June 2013, pp.2069–2077.

[34] Y. Shi, J. Xue, and Y. Wu, “Multi-objective op-timization based on brain storm optimization al-gorithm,” International Journal of Swarm Intelli-gence Research (IJSIR), vol. 4, no. 3, pp. 1–21,July-September 2013.

[35] C. Darwin, On the Origin of Species by Means ofNatural Selection, or the Preservation of FavouredRaces in the Struggle for Life, 5th ed. London:John Murray, 1869.

[36] M. Affenzeller, S. Winkler, S. Wagner, and A. Be-ham, Genetic Algorithms and Genetic Program-ming: Modern Concepts and Practical Appli-cations, ser. Numerical Insights, A. Sydow, Ed.Chapman & Hall/CRC Press, 2009, vol. 6.

[37] S. Cheng, Y. Shi, and Q. Qin, “Dynamical exploita-tion space reduction in particle swarm optimizationfor solving large scale problems,” in Proceedingsof 2012 IEEE Congress on Evolutionary Compu-tation, (CEC 2012). Brisbane, Australia: IEEE,2012, pp. 3030–3037.


Electrical Engineering and Electronics, Universityof Liverpool, May 2013.

[18] S. Cheng, Y. Shi, and Q. Qin, “A study of nor-malized population diversity in particle swarm op-timization,” International Journal of Swarm Intel-ligence Research (IJSIR), vol. 4, no. 1, pp. 1–34,January-March 2013.

[19] Y. Shi, “Brain storm optimization algorithm,” inAdvances in Swarm Intelligence, ser. Lecture Notesin Computer Science, Y. Tan, Y. Shi, Y. Chai, andG. Wang, Eds. Springer Berlin/Heidelberg, 2011,vol. 6728, pp. 303–309.

[20] ——, “An optimization algorithm based on brain-storming process,” International Journal of SwarmIntelligence Research (IJSIR), vol. 2, no. 4, pp. 35–62, October-December 2011.

[21] X. Guo, Y. Wu, and L. Xie, “Modified brain stormoptimization algorithm for multimodal optimiza-tion,” in Advances in Swarm Intelligence, ser. Lec-ture Notes in Computer Science, Y. Tan, Y. Shi,and C. A. C. Coello, Eds. Springer InternationalPublishing, 2014, vol. 8795, pp. 340–351.

[22] J. Xue, Y. Wu, Y. Shi, and S. Cheng, “Brainstorm optimization algorithm for multi-objectiveoptimization problems,” in Advances in Swarm In-telligence, ser. Lecture Notes in Computer Science,Y. Tan, Y. Shi, and Z. Ji, Eds. Springer Berlin /Heidelberg, 2012, vol. 7331, pp. 513–519.

[23] L. Xie and Y. Wu, “A modified multi-objective op-timization based on brain storm optimization al-gorithm,” in Advances in Swarm Intelligence, ser.Lecture Notes in Computer Science, Y. Tan, Y. Shi,and C. Coello, Eds. Springer International Pub-lishing, 2014, vol. 8795, pp. 328–339.

[24] Z.-H. Zhan, W.-N. Chen, Y. Lin, Y.-J. Gong,Y. long Li, and J. Zhang, “Parameter investigationin brain storm optimization,” in 2013 IEEE Sympo-sium on Swarm Intelligence (SIS), April 2013, pp.103–110.

[25] S. Cheng, Y. Shi, Q. Qin, and S. Gao, “Solutionclustering analysis in brain storm optimization al-gorithm,” in Proceedings of The 2013 IEEE Sym-posium on Swarm Intelligence, (SIS 2013). Sin-gapore: IEEE, 2013, pp. 111–118.

[26] S. Cheng, Y. Shi, Q. Qin, T. O. Ting, andR. Bai, “Maintaining population diversity in brainstorm optimization algorithm,” in Proceedings of2014 IEEE Congress on Evolutionary Computa-tion, (CEC 2014). Beijing, China: IEEE, 2014,pp. 3230–3237.

[27] Z. hui Zhan, J. Zhang, Y. hui Shi, and H. lin Liu,“A modified brain storm optimization,” in Evolu-tionary Computation (CEC), 2012 IEEE Congresson, June 2012, pp. 1–8.

[28] H. Jadhav, U. Sharma, J. Patel, and R. Roy, “Brainstorm optimization algorithm based economic dis-patch considering wind power,” in 2012 IEEEInternational Conference on Power and Energy(PECon 2012), Kota Kinabalu, Malaysia, Decem-ber 2012, pp. 588–593.

[29] C. Sun, H. Duan, and Y. Shi, “Optimal satellite for-mation reconfiguration based on closed-loop brainstorm optimization,” IEEE Computational Intelli-gence Magazine, vol. 8, no. 4, pp. 39–51, Novem-ber 2013.

[30] H. Duan, S. Li, and Y. Shi, “Predatorcprey brainstorm optimization for dc brushless motor,” IEEETransactions on Magnetics, vol. 49, no. 10, pp.5336–5340, October 2013.

[31] H. Duan and C. Li, “Quantum-behaved brain stormoptimization approch to solving loney’s solenoidproblem,” IEEE Transactions on Magnetics, p. inpress, 2014.

[32] Y. Tan and Y. Zhu, “Fireworks algorithm for opti-mization,” in Advances in Swarm Intelligence, ser.Lecture Notes in Computer Science, Y. Tan, Y. Shi,and K. C. Tan, Eds. Springer Berlin Heidelberg,2010, vol. 6145, pp. 355–364.

[33] S. Zheng, A. Janecek, and Y. Tan, “Enhancedfireworks algorithm,” in 2013 IEEE Congress onEvolutionary Computation (CEC), June 2013, pp.2069–2077.

[34] Y. Shi, J. Xue, and Y. Wu, “Multi-objective op-timization based on brain storm optimization al-gorithm,” International Journal of Swarm Intelli-gence Research (IJSIR), vol. 4, no. 3, pp. 1–21,July-September 2013.

[35] C. Darwin, On the Origin of Species by Means ofNatural Selection, or the Preservation of FavouredRaces in the Struggle for Life, 5th ed. London:John Murray, 1869.

[36] M. Affenzeller, S. Winkler, S. Wagner, and A. Be-ham, Genetic Algorithms and Genetic Program-ming: Modern Concepts and Practical Appli-cations, ser. Numerical Insights, A. Sydow, Ed.Chapman & Hall/CRC Press, 2009, vol. 6.

[37] S. Cheng, Y. Shi, and Q. Qin, “Dynamical exploita-tion space reduction in particle swarm optimizationfor solving large scale problems,” in Proceedingsof 2012 IEEE Congress on Evolutionary Compu-tation, (CEC 2012). Brisbane, Australia: IEEE,2012, pp. 3030–3037.


[38] ——, “Population diversity based study on searchinformation propagation in particle swarm opti-mization,” in Proceedings of 2012 IEEE Congresson Evolutionary Computation, (CEC 2012). Bris-bane, Australia: IEEE, 2012, pp. 1272–1279.

[39] K. P. Murphy, Machine Learning: A Probabilis-tic Perspective, ser. Adaptive computation and ma-chine learning series. Cambridge, Massachusetts:The MIT Press, 2012.

[40] D. Zhou, Y. Shi, and S. Cheng, “Brain storm opti-mization algorithm with modified step-size and in-dividual generation,” in Advances in Swarm Intel-ligence, ser. Lecture Notes in Computer Science,Y. Tan, Y. Shi, and Z. Ji, Eds. Springer Berlin /Heidelberg, 2012, vol. 7331, pp. 243–252.

[41] S. Cheng, Y. Shi, and Q. Qin, “Population di-versity of particle swarm optimizer solving singleand multi-objective problems,” International Jour-nal of Swarm Intelligence Research (IJSIR), vol. 3,no. 4, pp. 23–60, 2012.

[42] ——, “Promoting diversity in particle swarm opti-mization to solve multimodal problems,” in Neu-ral Information Processing, ser. Lecture Notesin Computer Science, B.-L. Lu, L. Zhang, andJ. Kwok, Eds. Springer Berlin / Heidelberg, 2011,vol. 7063, pp. 228–237.

[43] W. Cedeno and V. R. Vemuri, “On the use of nich-ing for dynamic landscapes,” in Proceedings of1997 IEEE Congress on Evolutionary Computa-tion, (CEC 1997). IEEE, 1997, pp. 361–366.

[44] A. Della Cioppa, C. De Stefano, and A. Marcelli,“Where are the niches? dynamic fitness sharing,”IEEE Transactions on Evolutionary Computation,vol. 11, no. 4, pp. 453–465, August 2007.

[45] A. Ghosh, S. Tsutsui, and H. Tanaka, “Func-tion optimization in nonstationary environment us-ing steady state genetic algorithms with agingof individuals,” in Proceedings of 1998 IEEECongress on Evolutionary Computation, (CEC1998). IEEE, 1998, pp. 666–671.

[46] Y. Jin and B. Sendhoff, “Constructing dynamic op-timization test problems using the multi-objectiveoptimization concept,” in Applications of Evolu-tionary Computing, ser. Lecture Notes in Com-puter Science, G. R. Raidl, S. Cagnoni, J. Branke,D. W. Corne, R. Drechsler, Y. Jin, C. G. John-son, P. Machado, E. Marchiori, F. Rothlauf, G. D.Smith, and G. Squillero, Eds. Springer Berlin /Heidelberg, 2004, vol. 3005, pp. 525–536.

[47] D. H. Wolpert and W. G. Macready, “No free lunchtheorems for optimization,” IEEE Transactions onEvolutionary Computation, vol. 1, no. 1, pp. 67–82, April 1997.

[48] X. Yao, Y. Liu, and G. Lin, “Evolutionary program-ming made faster,” IEEE Transactions on Evolu-tionary Computation, vol. 3, no. 2, pp. 82–102,July 1999.

[49] J. J. Liang, A. K. Qin, P. N. Suganthan,and S. Baskar, “Comprehensive learning particleswarm optimizer for global optimization of mul-timodal functions,” IEEE Transactions on Evolu-tionary Computation, vol. 10, no. 3, pp. 281–295,June 2006.

[50] T. Blackwell and P. Bentley, “Don’t push me!collision-avoiding swarms,” in Proceedings of TheFourth Congress on Evolutionary Computation(CEC 2002), May 2002, pp. 1691–1696.

JAISCR, 2014, Vol. 4, No. 2, pp. 99

APPLYING LCS TO AFFECTIVE IMAGECLASSIFICATION IN SPATIAL-FREQUENCY DOMAIN

Po-Ming Lee1 and Tzu-Chien Hsiao2

1Institute of Computer Science and Engineering, Department of Computer Science, National Chiao TungUniversityal. 1001 University Rd., Hsinchu, Taiwan, R.O.C.

2Department of Computer Science, Institute of Biomedical Engineering, and Biomedical Electronics TranslationalResearch Center and Biomimetic Systems Research Center in National Chiao Tung Universityal. 1001 University

Rd., Hsinchu, Taiwan, R.O.C.

Abstract

Recent studies have utilizes color, texture, and composition information of images toachieve affective image classification. However, the features related to spatial-frequencydomain that were proven to be useful for traditional pattern recognition have not beentested in this field yet. Furthermore, the experiments conducted by previous studies arenot internationally-comparable due to the experimental paradigm adopted. In addition,contributed by recent advances in methodology, that are, Hilbert-Huang Transform (HHT)(i.e. Empirical Mode Decomposition (EMD) and Hilbert Transform (HT)), the resolutionof frequency analysis has been improved. Hence, the goal of this research is to achievethe affective image-classification task by adopting a standard experimental paradigm in-troduces by psychologists in order to produce international-comparable and reproducibleresults; and also to explore the affective hidden patterns of images in the spatial-frequencydomain. To accomplish these goals, multiple human-subject experiments were conductedin laboratory. Extended Classifier Systems (XCSs) was used for model building becausethe XCS has been applied to a wide range of classification tasks and proved to be compet-itive in pattern recognition. To exploit the information in the spatial-frequency domain,the traditional EMD has been extended to a two-dimensional version. To summarize, themodel built by using the XCS achieves Area Under Curve (AUC) = 0.91 and accuracy rateover 86%. The result of the XCS was compared with other traditional machine-learningalgorithms (e.g., Radial-Basis Function Network (RBF Network)) that are normally usedfor classification tasks. Contributed by proper selection of features for model building,user-independent findings were obtained. For example, it is found that the horizontal vi-sual stimulations contribute more to the emotion elicitation than the vertical visual stimu-lation. The effect of hue, saturation, and brightness; is also presented.

1 Introduction

1.1 Scope

People experience emotion in their daily life byfeeling happy, angry and various emotions inducedby stimulus and events that are emotionally rele-vant. Because it is human nature to pursue happi-

ness and avoid pain, a research finding related tohuman emotion can be easily transferred to diverseapplications. For example, behavioral economics[1], media studies and advertisement [2, 3]. Someresearches focused on the use of emotional relevantstimulus to attract the attention of subjects, and tomake subjects remember more on the product pre-sented [3]. In the area of image, print advertise-

– 12310.1515/jaiscr-2015-0002

100 Po-Ming Lee and Tzu-Chien Hsiao

ment and the use of affective images for attractingthe attention of subjects during web browsing werereported [2]. Guideline of extracting emotional rel-evant features in a web page is also available [4].

Due to the development of personal computer,software, and World-Wide-Web (WWW), peoplenowadays generate huge amount of content (e.g.,daily news, articles on variety topics and personaldata) and upload them to the internet every day. Toenable end users to explore the content on the in-ternet, Google and Yahoo! such the search engineprovider index these contents. Currently most of theweb content indexing works are done based on text-based technologies. Although text-based indexingtechnologies are suitable for articles, the limitationof the text-based method is obvious when imagesare the indexing target. Traditionally, image searchis done based on the file name of the target imageand perhaps the description (e.g., tags) of the targetimage. In the last decade, image search based oncontent has been provided by Google Picture andYahoo! Image Search. However, there was lack ofattention to the development of the techniques ofindexing the affective characteristics of images, de-spite equipped with such the technique, the industrywould be able to design new applications related tohuman feelings and better user experience. For ex-ample, an application that leads end users to targetimages that may potentially ease their “feelings”.

1.2 Motivation

The affective characteristic of an image is de-fined by the capability of an image in eliciting emo-tional responses. Human beings have the ability torecognize the affective characteristic embedded inan image. Hence, to index the affective characteris-tics of images on the internet, an intuitive approachis to have a large number of people manually rate allthe images and calculate descriptive statistics fromthe obtained ratings. However, this approach maybe impractical due to the cost of manpower and theincreasing speed of images around the world.

On the other hand, using Artificial Intelligence(AI) and Machine Learning (ML) techniques, abroad range of intelligent machines have been de-signed to perform different pattern recognition tasks[5-7]. An intelligent machine that can automaticallyclassify images based on their affective characteris-tics could be built by given a number of instances

with proper selected features. Due to the lack of at-tention on this issue in the literature, this study aimsto build an intelligent machine to perform affectiveimage-classification task.

1.3 Research Objectives

he proposed hypothesis is that a trained intel-ligent machine can classify images based on theirability in eliciting emotions, through the basic prop-erties of these images. Wilson’s Extended Classi-fier System (XCS) [8], a well-tested accuracy-basedLearning Classifier System (LCS) model, is to beused to build the classification models in this re-search. The XCS is proven to be capable of ex-tracting complete, general, and readable rules froma previously unknown dataset, which motivated itssuitability for this research work.

The overall goal of this research is to demon-strate a novel method to classify images based ontheir ability in eliciting emotions. This goal is di-vided into the following two subgoals.

– To develop an intelligent machine that can iden-tify images based on their capability in inducingemotions. To examine the effect of basic proper-ties (i.e. hue, saturation, and brightness) of im-ages on their capability in inducing emotions.

– To develop an intelligent machine that can iden-tify images based on their capability in inducingemotions. To examine the effect of the proper-ties of images in the spatial-frequency domainon the capability of these images in inducingemotions.

All the human-subject experiments conductedin this research, and the manner of using data ob-tained from human subjects were approved (Proto-col No: 100-014-E and NCTU-REC-102-007) bythe Institution Review Board (IRB) of the NationalTaiwan University Hospital Hsinchu Branch andthe IRB of National Chiao-Tung University, respec-tively.

The built models were evaluated using 10-FoldCross Validation (CV) which is a traditional evalu-ation method used in the literature and the resultswere compared with the existing related systems.In addition to the demonstration of the modelingbuilding process, this study also aims at providing

101Po-Ming Lee and Tzu-Chien Hsiao

ment and the use of affective images for attractingthe attention of subjects during web browsing werereported [2]. Guideline of extracting emotional rel-evant features in a web page is also available [4].

Due to the development of personal computer,software, and World-Wide-Web (WWW), peoplenowadays generate huge amount of content (e.g.,daily news, articles on variety topics and personaldata) and upload them to the internet every day. Toenable end users to explore the content on the in-ternet, Google and Yahoo! such the search engineprovider index these contents. Currently most of theweb content indexing works are done based on text-based technologies. Although text-based indexingtechnologies are suitable for articles, the limitationof the text-based method is obvious when imagesare the indexing target. Traditionally, image searchis done based on the file name of the target imageand perhaps the description (e.g., tags) of the targetimage. In the last decade, image search based oncontent has been provided by Google Picture andYahoo! Image Search. However, there was lack ofattention to the development of the techniques ofindexing the affective characteristics of images, de-spite equipped with such the technique, the industrywould be able to design new applications related tohuman feelings and better user experience. For ex-ample, an application that leads end users to targetimages that may potentially ease their “feelings”.

1.2 Motivation

The affective characteristic of an image is de-fined by the capability of an image in eliciting emo-tional responses. Human beings have the ability torecognize the affective characteristic embedded inan image. Hence, to index the affective characteris-tics of images on the internet, an intuitive approachis to have a large number of people manually rate allthe images and calculate descriptive statistics fromthe obtained ratings. However, this approach maybe impractical due to the cost of manpower and theincreasing speed of images around the world.

On the other hand, using Artificial Intelligence(AI) and Machine Learning (ML) techniques, abroad range of intelligent machines have been de-signed to perform different pattern recognition tasks[5-7]. An intelligent machine that can automaticallyclassify images based on their affective characteris-tics could be built by given a number of instances

with proper selected features. Due to the lack of at-tention on this issue in the literature, this study aimsto build an intelligent machine to perform affectiveimage-classification task.

1.3 Research Objectives

he proposed hypothesis is that a trained intel-ligent machine can classify images based on theirability in eliciting emotions, through the basic prop-erties of these images. Wilson’s Extended Classi-fier System (XCS) [8], a well-tested accuracy-basedLearning Classifier System (LCS) model, is to beused to build the classification models in this re-search. The XCS is proven to be capable of ex-tracting complete, general, and readable rules froma previously unknown dataset, which motivated itssuitability for this research work.

The overall goal of this research is to demon-strate a novel method to classify images based ontheir ability in eliciting emotions. This goal is di-vided into the following two subgoals.

– To develop an intelligent machine that can iden-tify images based on their capability in inducingemotions. To examine the effect of basic proper-ties (i.e. hue, saturation, and brightness) of im-ages on their capability in inducing emotions.

– To develop an intelligent machine that can iden-tify images based on their capability in inducingemotions. To examine the effect of the proper-ties of images in the spatial-frequency domainon the capability of these images in inducingemotions.

All the human-subject experiments conductedin this research, and the manner of using data ob-tained from human subjects were approved (Proto-col No: 100-014-E and NCTU-REC-102-007) bythe Institution Review Board (IRB) of the NationalTaiwan University Hospital Hsinchu Branch andthe IRB of National Chiao-Tung University, respec-tively.

The built models were evaluated using 10-FoldCross Validation (CV) which is a traditional evalu-ation method used in the literature and the resultswere compared with the existing related systems.In addition to the demonstration of the modelingbuilding process, this study also aims at providing

APPLYING LCS TO AFFECTIVE IMAGE CLASSIFICATION IN . . .

the examination results of the factors that may in-fluence the affective characteristics of images.

1.4 Paper Contribution

This research led to the following major con-tributions to the field of affective computing [9] ingeneral and specifically to the field of affective im-age classification.

This resarch:(1) Demonstrates a two-dimensional version of the Hilbert-Huang Trans-form (2D-HHT) for extracting features in thespatial-frequency domain from images. (2) Demon-strates the model-building process of an intel-ligent machine for performing affective image-classification task. (3) Examines the influence ofbasic properties and also the properties in regardto the spatial-frequency domain on the affectivecharacteristics of images. (4) All the results wereobtained from and validated by human-subject ex-periments.

1.5 Paper Organization

The remainder of this research is organized asfollows. Chapter 2 describes the research method-ology to be used in this work to achieve the over-all goal. This chapter describes the framework,the emotional stimuli, and the instruments to beused for model building. Chapter 3 and chapter 4present major contributions to fulfill the establishedresearch objectives. Chapter 5 concludes this work.

Chapter 2 describes the research paradigmadopted in this work to achieve the overall goal,and briefly describes the instruments used. Chapter2 also provides a detailed description of the XCSsalong with an overview of related studies.

The details of each literature review of relatedwork and the implemented systems are provided ina separate contribution chapter, that are Chapter 3and Chapter 4. These two chapters also providedetails of the problem domains experimented hereand the experimental setup used for collecting dataset or testing and evaluating the developed systems.Chapter 3 and 4 describe the two affect detectorsthat are successfully built from human-subject ex-periments.

Chapter 5 presents the achieved objectives,main conclusions from each contribution chapter,

and the future work that stems from this researchwork.

2 Method and Materials

2.1 Emotion Theories

One of the difficulties in studying emotion isthat how to define it. Although there is a tendencyfor researchers to intuitively define a set of discretebasic emotions (e.g., happy, surprising, sad, and an-gry [10]), recently the dimensional theory of emo-tion, in replacement of the traditional assumption ofdiscrete emotions, has been proposed and demon-strated to be more suitable than the traditional man-ner of describing emotions in a number of studies[11, 12].

Dimensional theory defines emotions by a twodimensional affective space, of which the two di-mensions are “valence” and “arousal”. The valencerepresents that whether the emotion experienced ispleasant whereas the arousal represents the ampli-tude of the emotion aroused. The philosophy of thetheory adopted is illustrated in Figure 1.

Figure 1. Definition of emotions in atwo-dimension affective space

The dimensional theory of emotion explainshow human emotion is elicited and the roles of emo-tional stimulus plays in emotion elicitation by re-lating the emotion theory to the motivational sys-tem of human. The motivational system guides hu-man to behave in the tendency of “approach” or”avoidance” when presented with emotionally rel-evant stimulus (the “stimulus” can be an object, a

results wsystems.modelingat provifactors characte

1.4PapThis recontribucomputinfield of aThis redimensioTransforthe spatiDemonsintelligenimage-clinfluencepropertiedomain images. and valid

1.5PapThe remfollows. methodothe overframewoinstrumeChapter contribuobjectiveChapter adopted and brused.ChadescriptioverviewThe detawork aprovidedthat are chaptersdomainsexperimeor testisystems.affect dehuman-sChapter main c

were compare.In addition tg building priding the exthat may

ristics of ima

per Contrsearch led

utions to ng [9]in genaffective imaesarch: (1) onal versionrm (2D-HHTial-frequencytrates the mont machine lassification e of basic es in regard

on the af(4) All the rdated by hum

per Organmainder of th

Chapter 2ology to be urall goal. Tork, the ements to be u

3andchaputions to fulfies. Chapter 5

2 describein this work riefly descapter 2 alion of thew of related sails of each land the imd in a sepa

Chapter 3 also provid

s experimeental setup uing and ev. Chapter 3etectorsthat asubject exper

5 presents onclusions

ed with the eto the demonrocess, this sxamination

influence ages.

ribution to the fol

the field neral and speage classifica

Demonstrn of the

T) for extracty domain froodel-building

for performtask. (3) properties

d to the spaffective charresults were man-subject e

nization his research i2 describes used in this wThis chapter motional stimused for m

pter 4 prfill the establ5 concludes tes the resea

to achieve thcribes the lso providee XCSs alostudies. literature rev

mplemented arate contriband Chapter

de details oented hereused for collvaluating th3 and 4 deare successfuriments.

the achievfrom each

existing relatnstration of tstudy also aimresults of tthe affecti

llowing majof affecti

ecifically to tation. rates a twHilbert-Huating features om images. (g process of ming affectiExamines tand also t

atial-frequenracteristics obtained fro

experiments.

is organized the resear

work to achiedescribes t

muli, and tmodel buildin

resent majlished researthis work. arch paradighe overall go

instrumenes a detailong with

view of relatsystems a

bution chaptr 4. Thesetwf the problee and tlecting data she developescribethe twully built fro

ved objectiveh contributi

ted the ms the ive

jor ive the

wo-ang

in (2) an

ive the the ncy

of om .

as rch eve the the ng. jor rch

gm oal, nts led an

ted are er, wo em the set

ped wo om

es, on

chth

2

2Othtesesudiofemtoof[1DdidivaexreThill

F

Thhoembymmthw

hapter, and this research w

2Method

.1 EmotioOne of the dihat how to endency for ret of discreturprising, saimensional thf the tradimotions, has o be more suf describing 11, 12].

Dimensional timensional aimensions aralence reprexperienced epresents the he philosoplustrated in F

igure 1Definit

he dimensioow human emmotional stimy relating

motivational motivational she tendency

when presen

the future wwork.

d and Ma

on Theoriifficulties in define it. A

researchers tte basic emd, and angrheory of emitional assubeen propos

uitable than themotions in

theory defineaffective spare “valence”

esents that wis pleasant amplitudeof

phy of the Figure 1.

tion of emotioaffective s

onal theory motion is elimulus plays i

the emotisystem

system guideof “approa

ntedwith em

work that ste

aterials

ies studying emAlthough thto intuitively

motions (e.g.ry [10]), recotion, in rep

umption of sed and demhe traditiona

n a number o

es emotions ace, of which” and “arouswhether the

whereasthef the emotion

theory ad

ons in a two‐dspace

of emotion icited and thein emotion eion theory

of humanes human to ach” or ”avmotionally

ems from

motion is here is a y define a ., happy, ently the

placement discrete

monstrated al manner of studies

by a two h the two sal”. The

emotion e arousal n aroused.

dopted is

dimension

explains e roles of elicitation

to the n. The behavein

voidance” relevant


scenario, or a type of circumstance)[13]. The rea-son of a stimulus to be emotionally relevant couldbe considered as a result of evolutionary process,that is, can be related to the need of survival. Forexample, the stimulus that stimulates positive emo-tions is found related to food and sex, whereasthe stimulus that stimulates negative emotions wasfound related to danger and death. The umbrellaterm “emotionally relevant” can simply be under-stood as a capability to elicit certain emotions of aperson (either positive or negative emotions) [14].

The dimensional theory of emotion has at-tracted substantial attention in the field of psychol-ogy since proposed, and is commonly adopted inlatest studies [3, 11, 15]. On the other hand, brainscientists focused on biological proof. The path-way, the mechanisms of brain, the autonomic ner-vous system, and the organs, that are accounted foremotion responses have being revealed [16]. Otherresearch reported the experimental results on therelationship between emotion and decision making[17], and also the relationship between emotion andmemory [18].

On the path of research in affective image clas-sification [19-23], the achieved accuracy rates arerelatively low. Furthermore, the use of definition in“discrete” emotions also caused these experimentshard to reproduce in countries other than the UnitedStates. Hence, this study adopts the paradigm of thework related to the dimensional theory of emotion.

2.2 Instruments

2.3 International Affective Picture System(IAPS)

Image is a type of visual stimulus that com-monly used in human emotion study for emotion in-duction. However, in past decades, due to culturaldifference, results obtained from different experi-ments were incomparable. Subsequently, a stan-dard affective picture system named InternationalAffective Picture System (IAPS) was proposed [24]to help emotion research experimenter in providingcomparable experimental results.

The IAPS database is developed and distributedby the NIMH Center for Emotion and Attention(CSEA) at the University of Florida to provide aset of normative emotional stimuli for experimen-tal investigations of emotion and attention and can

be easily obtained through e-mail application. TheIAPS contains various affective pictures selectedbased on the statistics obtained from experimentalresults. These pictures are proved to be capablein inducing diverse emotions in the affective space[12]. The IAPS also describes a protocol that in-cludes the constraint about the number of imagesused in a single experiment and the distribution ofthe emotions induced by the images selected.

The IAPS has attracted attention since proposal;various experiments, for example, the empiricalstudies on psychophysiological signals that are re-lated to emotional responses [25], the experimentsof the effects of emotion on memory [26], andthe experiments for identifying the relationship be-tween motivation and emotion [12], were conductedusing IAPS for emotion induction. The images usedin this research were solely selected from this IAPSpublic database.

2.3.1 Self-Assessment Manikin (SAM)

To assess the two dimensions of the affectivespace, the Self-Assessment Manikin (SAM), an af-fective rating system devised by Lang [27] was usedto acquire the affective ratings. The SAM is a non-verbal pictorial assessment that is designed to as-sess the emotional dimensions (i.e. valence andarousal) directly by means of two sets of graphicalmanikins. The SAM has been extensively tested inconjunction with the IAPS and IADS and used indiverse theoretical studies and applications [3, 11,15]. The SAM takes a very short time to complete(5 to 10 seconds). The SAM was reported to becapable of indexing cross-cultural results [28] andthe results obtained using a Semantic Differentialscale (the verbal scale provided in [29]). For usingthe SAM, there is little chance of confusion withterms as in verbal assessments. The SAM that weused was identical to the 9-point rating scale versionof SAM that was used in [30], in which the SAMranges from a smiling, happy figure to a frowning,unhappy figure when representing the affective va-lence dimension. On the other hand, for the arousaldimension, the SAM ranges from an excited, wide-eyed figure to a relaxed, sleepy figure.

Ratings are scored such that 8 represents a highrating on each dimension (i.e. positive valence, higharousal), and 0 represents a low rating on each di-mension (i.e. negative valence, low arousal).


scenario, or a type of circumstance)[13]. The rea-son of a stimulus to be emotionally relevant couldbe considered as a result of evolutionary process,that is, can be related to the need of survival. Forexample, the stimulus that stimulates positive emo-tions is found related to food and sex, whereasthe stimulus that stimulates negative emotions wasfound related to danger and death. The umbrellaterm “emotionally relevant” can simply be under-stood as a capability to elicit certain emotions of aperson (either positive or negative emotions) [14].

The dimensional theory of emotion has at-tracted substantial attention in the field of psychol-ogy since proposed, and is commonly adopted inlatest studies [3, 11, 15]. On the other hand, brainscientists focused on biological proof. The path-way, the mechanisms of brain, the autonomic ner-vous system, and the organs, that are accounted foremotion responses have being revealed [16]. Otherresearch reported the experimental results on therelationship between emotion and decision making[17], and also the relationship between emotion andmemory [18].

On the path of research in affective image clas-sification [19-23], the achieved accuracy rates arerelatively low. Furthermore, the use of definition in“discrete” emotions also caused these experimentshard to reproduce in countries other than the UnitedStates. Hence, this study adopts the paradigm of thework related to the dimensional theory of emotion.

2.2 Instruments

2.3 International Affective Picture System(IAPS)

Image is a type of visual stimulus that com-monly used in human emotion study for emotion in-duction. However, in past decades, due to culturaldifference, results obtained from different experi-ments were incomparable. Subsequently, a stan-dard affective picture system named InternationalAffective Picture System (IAPS) was proposed [24]to help emotion research experimenter in providingcomparable experimental results.

The IAPS database is developed and distributedby the NIMH Center for Emotion and Attention(CSEA) at the University of Florida to provide aset of normative emotional stimuli for experimen-tal investigations of emotion and attention and can

be easily obtained through e-mail application. TheIAPS contains various affective pictures selectedbased on the statistics obtained from experimentalresults. These pictures are proved to be capablein inducing diverse emotions in the affective space[12]. The IAPS also describes a protocol that in-cludes the constraint about the number of imagesused in a single experiment and the distribution ofthe emotions induced by the images selected.

The IAPS has attracted attention since proposal;various experiments, for example, the empiricalstudies on psychophysiological signals that are re-lated to emotional responses [25], the experimentsof the effects of emotion on memory [26], andthe experiments for identifying the relationship be-tween motivation and emotion [12], were conductedusing IAPS for emotion induction. The images usedin this research were solely selected from this IAPSpublic database.

2.3.1 Self-Assessment Manikin (SAM)

To assess the two dimensions of the affectivespace, the Self-Assessment Manikin (SAM), an af-fective rating system devised by Lang [27] was usedto acquire the affective ratings. The SAM is a non-verbal pictorial assessment that is designed to as-sess the emotional dimensions (i.e. valence andarousal) directly by means of two sets of graphicalmanikins. The SAM has been extensively tested inconjunction with the IAPS and IADS and used indiverse theoretical studies and applications [3, 11,15]. The SAM takes a very short time to complete(5 to 10 seconds). The SAM was reported to becapable of indexing cross-cultural results [28] andthe results obtained using a Semantic Differentialscale (the verbal scale provided in [29]). For usingthe SAM, there is little chance of confusion withterms as in verbal assessments. The SAM that weused was identical to the 9-point rating scale versionof SAM that was used in [30], in which the SAMranges from a smiling, happy figure to a frowning,unhappy figure when representing the affective va-lence dimension. On the other hand, for the arousaldimension, the SAM ranges from an excited, wide-eyed figure to a relaxed, sleepy figure.

Ratings are scored such that 8 represents a highrating on each dimension (i.e. positive valence, higharousal), and 0 represents a low rating on each di-mension (i.e. negative valence, low arousal).


Figure 2. The SAM used in this study, in which the upper row represents valence and the lower rowrepresents arousal

2.4 Extended Classifier Systems (XCSs)

2.4.1 Introduction

John Holland proposed Michigan-style Classi-fier Systems (CSs) in 1975, which is the prototypeof the well-known classifier “LCS” [31]. LCS isa rule-based online learning algorithm, which in-corporates Genetic Algorithm (GA) as a rule dis-covery component. Later on, Zeroth-level Classi-fier System (ZCS) was proposed to increase the un-derstandability and performance [32]. ZCS adoptsQ-learning (QL) like Reinforcement Learning (RL)component and retains the GA component. Finally,a classifier system known as XCS was proposed [8].The XCS retain the QL and GA components in theZCS but the fitness value of the rules in the XCS isreferred to the accuracy of each rule on predictingpayoff. Due to the stable performance and the ca-pability to generalize extracted rules, XCS gainedmore attention from the main stream of researchthan other classifiers since it has been proposed.

On the research path of XCS, several versionsof XCS have been developed to suit different needsrequired for real world applications. While the orig-inal XCS provides only binary string as input string(condition input) and with single discrete valueas output, the XCS with real-value input (XCSR)which allows the XCS to accept continuous in-put have been proposed [33-35]. Lanzi suggestedadding internal memory to XCS (named XCSM) forcoping with complex non-Markovian environments[36], XCSI reduces the size of evolved classifierpopulation [37], DXCS as a parallel version of XCSenhances the scalability of XCS [38]. To have XCS

coping with function approximating tasks, Wilsonhimself also provided his idea of having continuousvalue as output in 2002 [39]. This idea is recentlyextended to a more advanced version of XCS named“Extended Classifier System for Function approxi-mation task (XCSF)” [40].

Due to the development of the internet andWWW protocols, researchers nowadays are able togather huge amount of data from the internet. Toextract information from these data, data warehouseand data mining techniques have become popularresearch area. ML algorithms have been adoptedwidely for data mining tasks. XCS itself as oneof the most important classifier has also been cus-tomized to fulfill the requirement of new tasks suchas knowledge discovery and structure identification[41, 42] (e.g., probabilistic CS [43]). One of themost important feature that XCS provides to datamining tasks is its nature of being an online learn-ing algorithm [44]. That is, XCS is able to adapt todynamic environment, and even adapt to the envi-ronment in some extreme cases [45].

In practice, it is obvious that the characteristicsof most of systems change gradually from time totime. The phenomena exist not only in the “sea-sonal changes” and unexpected structural changesin stock market, but also in the area of biomedi-cal engineering (e.g. Heart Rate Variability (HRV)indexes computed from Electrocardiogram (ECG)that are usually used by psychophysiologists andthe physicians in hospital for estimating subject’sphysiological state) and other fields. The character-istic of a system varies from time to time can makeclassification tasks more difficult for traditional su-


Figure 3. System architecture of XCS (for single-step problem)

pervised learning algorithm that does not provideonline learning mechanism.

The XCS has been applied to wide range ofclassification tasks [46, 47] and is proved to be com-petitive for pattern recognition [6]. A considerableamount of literature describing applications basedon XCS has been published in the area of security[48], finance [49, 50], medical research [51-53] andchip design [54, 55]. In the area of finance, XCSis known for its capability of financial time seriesforecasting [56-58]. The XCS was also used fordeveloping personalized desktop by solving user-context classification tasks [43, 59].

2.4.2 XCS Classifier System

Although the XCS can be applied for bothsingle-step problems and multi-step problems [8],this section focuses on describing only the mecha-nisms of XCS in solving single-step problems (i.e.classification task) instead of multi-step problemsfor simplicity. The flow of a typical XCS learn-ing iteration is presented as follows: first, a detec-tor obtains the environmental input (i.e. a binarystring) at the beginning of a typical iteration, anduses the string for the matching process (see upper

left portion of Figure 3); second, during a classi-fier matching process, the XCS searches for classi-fiers in [P] which the covering condition space thatrepresented by a condition string (0, 1, # for eachbit, # indicates a bit that should be ignored, alsotermed “don’t care” bit) includes the binary stringinput. All of the matched classifiers are placed intoa match set (represents by [M]). If the [M] doesnot meet the predefined criterion [8, 60, 61] (usu-ally related to the level of coverage of the suggestedoutput (action string)), XCS applies a mechanismtermed “cover” to generate new classifiers of whichcondition string matches the input binary string andaction string is chosen at random; third, the XCScalculates the fitness weighted average prediction Pi

from each set of classifiers that suggests a same out-put i (i.e. suggesting a same action) after [M] is gen-erated; fourth, all Pi s are used to form a predictionarray (PA) for output selection process. The action-selection regime is usually set to occasionally pickup an output i, which owns the maximal predictedpayoff (i.e., max (Pi)) in the PA, and in the othertime pick up an output randomly for explorationpurpose; fifth, the XCS performs an action based onthe selected output; and finally, after performing theaction to the environment, a payoff function then

Figure 3System architecture of XCS (for single‐step problem)

biomedical engineering (e.g. Heart Rate Variability (HRV) indexes computed from Electrocardiogram (ECG) that are usually used by psychophysiologists and the physicians in hospital for estimating subject’s physiological state) and other fields. The characteristic of a system varies from time to time can make classification tasks more difficult for traditional supervised learning algorithm that does not provide online learning mechanism. The XCS has been applied to wide range of classification tasks [46, 47] and is proved to be competitive for pattern recognition [6]. A considerable amount of literature describing applications based on XCS has been published in the area of security [48], finance [49, 50], medical research [51-53] and chip design [54, 55]. In the area of finance, XCS is known for its capability of financial time series forecasting [56-58]. The XCS was also used for developing personalized desktop by solving user-context classification tasks [43, 59].

2.3.2XCS Classifier System Although the XCS can be applied for both single-step problems and multi-step problems[8], this section focuses on describing only the mechanisms of XCS in solving single-step problems (i.e. classification task) instead of multi-step problems for simplicity. The flow of a typical XCS learning iteration is presented as follows:first, a detector obtains the environmental input(i.e. a binary string) at the beginning of a typical iteration, and uses the string for the matching process (see upper left portion ofFigure 3);second,during a classifier matching process, the XCS searches for classifiers in [P] which the covering condition space that represented by a condition string (0, 1, # for each bit, # indicates a bit that should be ignored, also termed“don’t care” bit) includes thebinary string input.All of the matched classifiers are placed into a match set (represents by [M]). If the [M] does not meet the predefined criterion[8, 60, 61] (usually related to the level of coverage of the


Figure 3. System architecture of XCS (for single-step problem)

pervised learning algorithm that does not provideonline learning mechanism.

The XCS has been applied to wide range ofclassification tasks [46, 47] and is proved to be com-petitive for pattern recognition [6]. A considerableamount of literature describing applications basedon XCS has been published in the area of security[48], finance [49, 50], medical research [51-53] andchip design [54, 55]. In the area of finance, XCSis known for its capability of financial time seriesforecasting [56-58]. The XCS was also used fordeveloping personalized desktop by solving user-context classification tasks [43, 59].

2.4.2 XCS Classifier System

Although the XCS can be applied for bothsingle-step problems and multi-step problems [8],this section focuses on describing only the mecha-nisms of XCS in solving single-step problems (i.e.classification task) instead of multi-step problemsfor simplicity. The flow of a typical XCS learn-ing iteration is presented as follows: first, a detec-tor obtains the environmental input (i.e. a binarystring) at the beginning of a typical iteration, anduses the string for the matching process (see upper

left portion of Figure 3); second, during a classi-fier matching process, the XCS searches for classi-fiers in [P] which the covering condition space thatrepresented by a condition string (0, 1, # for eachbit, # indicates a bit that should be ignored, alsotermed “don’t care” bit) includes the binary stringinput. All of the matched classifiers are placed intoa match set (represents by [M]). If the [M] doesnot meet the predefined criterion [8, 60, 61] (usu-ally related to the level of coverage of the suggestedoutput (action string)), XCS applies a mechanismtermed “cover” to generate new classifiers of whichcondition string matches the input binary string andaction string is chosen at random; third, the XCScalculates the fitness weighted average prediction Pi

from each set of classifiers that suggests a same out-put i (i.e. suggesting a same action) after [M] is gen-erated; fourth, all Pi s are used to form a predictionarray (PA) for output selection process. The action-selection regime is usually set to occasionally pickup an output i, which owns the maximal predictedpayoff (i.e., max (Pi)) in the PA, and in the othertime pick up an output randomly for explorationpurpose; fifth, the XCS performs an action based onthe selected output; and finally, after performing theaction to the environment, a payoff function then


determines payoff (i.e. a numerical value) for XCSto update the classifiers. The payoff function is afunction predefined by the user of XCS, to interpretthe environmental feedback into numeric form ofpayoff (as a reward or punishment). The XCS usesthe payoff in a RL component to update parameters(p, e, and F) of each classifier. The update processis held in only an action set (represents by [A]). The[A] is a set of classifiers (see bottom portion of Fig-ure 3) that all suggest outputting i. The classifiers in[A] are the classifiers responsible for the payoff (de-termined from the environmental feedback) causedby the performed action. During the update process,the Rule (Classifier) Discovery part of XCS (seeleft bottom portion of Figure 3), that is, the GA istriggered occasionally to search for potentially ac-curate classifiers in the classifier (condition-actionstring representation) space. In addition, XCS per-forms subsumption in both the update process andGA, to enable “Macro-Classifier”s (classifiers thatare more general than others) to subsume other clas-sifiers in order to reduce the number of redundant,overlapped classifiers. The remainder of this chap-ter provides details on the critical components ofXCS.

2.4.3 RL Component

The RL component of XCS applies a QL-styleupdate to the parameters of each classifier in [A]during the update process. First the XCS updatesthe prediction payoff p base on the received pay-off using p ← p+β(R−p), in which R representsthe received payoff and β represents learning rate(0 < β ≤ 1). Second, the prediction error ε is up-dated using ε ← ε+β(|R−p|−ε). Third, the XCSupdates the fitness value F (used for the classifierspace searching done by using the GA). The F inXCS is defined based on the accuracy of a classi-fier. Hence, in calculating F, the XCS calculates anaccuracy value first using

κ=

{1, if ε < ε0

α(

εε0

)−ν, o.w.

Equation 1 Calculation of accuracy value of classi-fiers in the XCS

in which the κ is set as 1 when ε is smaller thanε(ε0 > 0) to tolerate a classifier that contains pre-diction error if the prediction error is below ε0 . Thevalue of κ decreases substantially (depending on the

settings of the parameter α(0 < α < 1) and the ex-ponent ν(ν > 0)) when a classifier’s value of ε in-creases. After the update of κ, the XCS comparesit’s κ to that of other classifier’s (i.e. other classi-fiers in the same [A]) by calculating its classifier’srelative accuracy κ′

=κ/∑x∈[A] κx. Finally, the F ofthe classifier is updated using F ← F+β(κ′−F).

2.4.4 Rule Discovery Component Using GA

The discovery component in XCS searches foraccurate classifiers in the classifier space by gener-ating new classifiers using GA. The XCS applies theGA to an [A] when the average elapsed time sincethe last GA performed on all the classifiers in [A] isgreater than qGA. In GA, XCS selects two parentclassifiers with a probability proportional to theirfitness values. Two offspring classifiers are gen-erated by applying crossover and mutation on thecopies of their parents. Most parameters of the par-ents are inherited by their offspring, except for sev-eral parameters that must be initialized; for exam-ple, the fitness F, to be relatively pessimistic aboutthe quality of the offspring, is multiplied by 0.1.After mutation, the generated offspring are insertedinto the population.

2.4.5 Macroclassifiers

The XCS extracts generalized rules (classifiers)by reducing redundant classifiers. The idea ofmacroclassifier is implemented by using an addi-tional parameter termed numerosity num. A clas-sifier with numerosity num = n is equivalent to nregular classifiers. When XCS generates a new clas-sifier, [P] is scanned to examine whether a macro-classifier exists with the same condition and actionas that of the new classifier. If [P] has a classi-fier with the same condition and action, the valueof num of the existing classifier (i.e. a macroclas-sifier) with the same condition and action is incre-mented by one instead of inserting the new classi-fier into [P]. Otherwise, the new classifier is addedto the population with num set to one. Similarly,when a macroclassifier experiences a deletion, thevalue of num is decremented by one and the macro-classifier with numerosity num = 0 is removed fromthe [P]. The macroclassifier technique reduces re-dundant classifiers and also speeds up the XCS ingenerating [M].


2.4.6 Classifier Deletion and Subsumption

XCS removes classifiers from [P] if the sum ofall num s of the classifiers in [P] exceeds a limit N(i.e. maximum population size predefined by theuser of XCS) when inserting a new classifier into[P] (by either using the cover mechanism or GA).The probability of removing a classifier from [P]is proportional to the estimation of the size of [A]s(i.e. the [A] that the classifier usually appears in).The XCS also increases the probability of deletionof an experienced classifier with the value of F thatis substantially lower than the average value of Fof all classifiers in [P]. Subsumption deletion is amethod to improve the generalization capability ofXCS, and occurs after the update process of [A] andG. Hence, the subsumption is also called action setsubsumption and GA subsumption [60]. During anaction set subsumption, XCS selects an experiencedclassifier G with ε < ε0 first; then G subsumes allthe other classifiers in [A] that are less general thanG, and the num of G is incremented based on thenum s of the classifiers subsumed. The XCS alsooperates GA subsumption when new classifiers (i.e.offspring) are generated through the GA. The off-spring are compared to their parent and subsumed ifthe parent classifier is experienced (defined by thetimes appeared in [A]) and more general.

3 HSV Patterns in the AffectiveImage Classification

3.1 Literature Review

To predict emotions of subject induced by animage, in 2005, Mikels et al. firstly categorizedimages in IAPS into different categories to iden-tify images that are especially excellent in induc-ing emotions of subjects [62]. Later, a pioneeringstudy [19] on affective image classification reportedby Wu et al. applied the Support Vector Machine(SVM) on identifying the relationships between vi-sual features extracted from images, and the seman-tic differential features (i.e. terms that given to sub-jects to describe the onset image, such as beautiful-ugly, dynamic-static, and tense-relaxed). The accu-racy rate obtained in the study was relatively high(i.e. 80%); however, there was only one subject in-volved in the experiment, and the emotional state ofthe subject was implicitly estimated through seman-

tic differential terms. To demonstrate the feasibilityin affective image classification, subsequently, ex-periments with larger sample size (around 15 to 20people) was conducted in [21, 23]; in these studies,emotions were explicitly defined as discrete emo-tional states, such as happy, surprising, sad, andangry. Further examinations on various features inaffective image-classification task was reported byMachajdik in [22]; however, the obtained accuracyrates were relatively low (around 65%) in the be-tween subject analysis in [21-23]. Latest findingsin [11, 12] highlighted the drawback on using dis-crete emotion models, in which definition on emo-tions using “terms” may be vague and inaccurate forthe subjects, and the use of discrete emotion modelis generally application dependent, which may biasthe collected dataset and the performance of clas-sification model built. Furthermore, the use of dis-crete definitions also makes the experiment resultshard to reproduce and hard to compare internation-ally. Hence, this study argues that the affectiveimage classification studies should be conductedbased on dimensional emotion model to reduce thedifficulties in reproducing comparable results.

To clarify the objectives, the affective imageclassification problem is formatted into a systemidentification task (see Figure 4), the aim of theproblem is to identify how human subjects inter-pret the affective characteristics of a given image;for example, to identify the human subject responseby discovering rules, or training intelligent systemsto predict the response (currently most of the worksaimed on the later approach).

To evaluate the emotion elicitation of a subject,despite numerous approaches are available. Forexample, self-report [63], facial expression [64],keystroke dynamics, user data, and psychophysio-logical data [65]. This study decided to utilize self-report as the measurement tool, because self-reportas a ground truth is considered to be more mean-ingful in the proposed problem and also the relatedfuture applications.

This study conducts an experimental study onaffective image classification by adopting dimen-sional emotion model instead of applying discreteemotion models that were typically used in pre-vious studies [22]. The SAM was used in thisstudy to estimate the emotion elicitation of sub-jects in the perspectives of dimensional emotion


2.4.6 Classifier Deletion and Subsumption

XCS removes classifiers from [P] if the sum ofall num s of the classifiers in [P] exceeds a limit N(i.e. maximum population size predefined by theuser of XCS) when inserting a new classifier into[P] (by either using the cover mechanism or GA).The probability of removing a classifier from [P]is proportional to the estimation of the size of [A]s(i.e. the [A] that the classifier usually appears in).The XCS also increases the probability of deletionof an experienced classifier with the value of F thatis substantially lower than the average value of Fof all classifiers in [P]. Subsumption deletion is amethod to improve the generalization capability ofXCS, and occurs after the update process of [A] andG. Hence, the subsumption is also called action setsubsumption and GA subsumption [60]. During anaction set subsumption, XCS selects an experiencedclassifier G with ε < ε0 first; then G subsumes allthe other classifiers in [A] that are less general thanG, and the num of G is incremented based on thenum s of the classifiers subsumed. The XCS alsooperates GA subsumption when new classifiers (i.e.offspring) are generated through the GA. The off-spring are compared to their parent and subsumed ifthe parent classifier is experienced (defined by thetimes appeared in [A]) and more general.

3 HSV Patterns in the AffectiveImage Classification


To predict emotions of subject induced by animage, in 2005, Mikels et al. firstly categorizedimages in IAPS into different categories to iden-tify images that are especially excellent in induc-ing emotions of subjects [62]. Later, a pioneeringstudy [19] on affective image classification reportedby Wu et al. applied the Support Vector Machine(SVM) on identifying the relationships between vi-sual features extracted from images, and the seman-tic differential features (i.e. terms that given to sub-jects to describe the onset image, such as beautiful-ugly, dynamic-static, and tense-relaxed). The accu-racy rate obtained in the study was relatively high(i.e. 80%); however, there was only one subject in-volved in the experiment, and the emotional state ofthe subject was implicitly estimated through seman-

tic differential terms. To demonstrate the feasibilityin affective image classification, subsequently, ex-periments with larger sample size (around 15 to 20people) was conducted in [21, 23]; in these studies,emotions were explicitly defined as discrete emo-tional states, such as happy, surprising, sad, andangry. Further examinations on various features inaffective image-classification task was reported byMachajdik in [22]; however, the obtained accuracyrates were relatively low (around 65%) in the be-tween subject analysis in [21-23]. Latest findingsin [11, 12] highlighted the drawback on using dis-crete emotion models, in which definition on emo-tions using “terms” may be vague and inaccurate forthe subjects, and the use of discrete emotion modelis generally application dependent, which may biasthe collected dataset and the performance of clas-sification model built. Furthermore, the use of dis-crete definitions also makes the experiment resultshard to reproduce and hard to compare internation-ally. Hence, this study argues that the affectiveimage classification studies should be conductedbased on dimensional emotion model to reduce thedifficulties in reproducing comparable results.

To clarify the objectives, the affective imageclassification problem is formatted into a systemidentification task (see Figure 4), the aim of theproblem is to identify how human subjects inter-pret the affective characteristics of a given image;for example, to identify the human subject responseby discovering rules, or training intelligent systemsto predict the response (currently most of the worksaimed on the later approach).

To evaluate the emotion elicitation of a subject,despite numerous approaches are available. Forexample, self-report [63], facial expression [64],keystroke dynamics, user data, and psychophysio-logical data [65]. This study decided to utilize self-report as the measurement tool, because self-reportas a ground truth is considered to be more mean-ingful in the proposed problem and also the relatedfuture applications.

This study conducts an experimental study onaffective image classification by adopting dimen-sional emotion model instead of applying discreteemotion models that were typically used in pre-vious studies [22]. The SAM was used in thisstudy to estimate the emotion elicitation of sub-jects in the perspectives of dimensional emotion


Figure 4. The affective image classification problem as a system identification task

model. The use of the experimental paradigm ofdimensional emotion model, on the other hand, ex-tends the traditional discrete emotion classificationtask into a continuous function approximation task,hence, in this study, the performance of classifierswere judged by Root Mean Square Error (RMSE)for two-dimension affective space prediction, andMean Absolute Error (MAE) for one-dimension af-fective space prediction, instead of accuracy rate.

The objective of the study is to upgrade the per-formance of affective image classification. How-ever, the results of previous studies, that obtainedfrom experimental designs based on discrete emo-tional model, are hard to compare, and even incom-parable in our case (because of the shift of perfor-mance criteria). Hence, this study examines theperformance obtained from the described task, andhopes to provide a baseline for future study instead.

3.2 Experimental Setup

3.2.1 Subjects

There were 16 university subjects participatedin the study (15 subjects is the typical sample sizerequired in the field of affective image classificationstudies [19, 23]), ranging in age between 20 and 28(M = 23.44, SD = 2.19; 10 men, 6 women). All sub-jects reported they were healthy, with no history ofbrain injury, cardiovascular problems, had normalor corrected-to-normal vision, and normal range offinger movement.

3.2.2 Experimental Procedure

To build an intelligent system that could pre-dict the emotions of subjects elicit by image, a hu-

man subject experiment was conducted. The en-tire experiment conducted in this study complies theIAPS protocol of emotion inducement described in[24] to guarantee the effectiveness of the emotioninduction procedure, and the clarity of the experi-mental design for reproduction. During the experi-ment, the subjects were requested to look at a screenwhich sequentially presents images and to corre-spondingly rate these images presented, by usingcomputer-based SAM (through the use of mouse).The duration of the experiment was 10 minutes foreach subject. Each trial (i.e. presentation of an im-age) started by presenting an image and displayed itfor 6 seconds, then presented the SAM on the screenfor the subject to manually rate the affective charac-teristics (i.e. self-report the induced emotion) of thepresented image. The SAM was followed by a 15s delay to ensure the emotional status of subject re-turn to baseline before the start of next trial and areasonable length to keep the subjects involved inthe experiment.

3.2.3 Images Used

This study utilizes 20 images selected fromIAPS [66] database in complying the IAPS imageset selection protocol described in [66]. The im-age ids of the used images are as follows: 1120,1310, 1390, 1710, 1720, 2160, 2220, 2520, 2530,2540, 3160, 3220, 3250, 4300, 4460, 4470, 4660,4750, 5950, 8160, 8200, and 9250. These imagescan be found in the IAPS database [66] using theids listed above. The order of the image presenta-tion was randomized to eliminate the effects due tothe presentation sequence.

To clariclassificasystem aim of tsubjects of a givehuman sor trainiresponseon the laTo evaludespite nexample[64], kepsychopdecided measureground meaningthe relateThis stuaffectivedimensioapplyingtypicallySAM wemotion perspectThe usedimensioextends classificaapproximperformaMean SqaffectiveError (Mspace prThe objperformaHoweve

Figure

fy the objecation probleidentificationthe problem

interpret then image; fosubject respoing intelligee (currently ater approachuate the emonumerous ape, self-reporeystroke dyhysiological

to utilizment tool, truth is c

gful in the ped future appdy conducts

e image clonal emotig discrete emy used in p

was used in telicitation

tives of dime of the eonal emotion

the tradiation task inmation task, ance of classquare Error (e space prediMAE) for ediction, instective of thance of affer, the result

4The affectiv

ctives, the aem is formn task (seeFis to identif

he affective or example, onse by disc

ent systems most of the

h). tion elicitatioproaches are

rt [63], faciynamics, us data [65]ze self-repbecause se

considered proposed proplications. an experimlassification ion model motion modrevious studthis study ton of su

mensional emxperimental

n model, on ttional discnto a contin

hence, in tsifiers were j(RMSE) for tiction, and M

one-dimenstead of accure study is t

ective image s of previou

e image classi

ffective imamatted into Figure 4), tfy how humcharacteristito identify t

covering ruleto predict t

e works aim

on of a subjee available. Fial expressiser data, a. This stu

port as tlf-report as to be mo

oblem and al

mental study by adoptiinstead

dels that wedies [22]. To estimate t

ubjectsin tmotion mod

paradigm the other han

crete emotinuous functithis study, tudged by Rotwo-dimensiMean Absolusion affectiracy rate. to upgrade t

classificatious studies, th

ification prob

age a

the man

ics the es, the

med

ect, For on

and udy the

a ore lso

on ng of

ere The the the

del. of

nd, on on the oot on ute ive

the on. hat

obdianthstthba

3

3.Thinsiclbemwcacoof

3.Topra Thcoinefprdethwcobyuswpran

lem as a syste

btained fromiscrete emotind even incohe shift of ptudy examinehe describedaseline for fu

.2Experim

.2.1 Subjechere were 16

n the study (ze required lassification etween 20 an

men, 6 womwere healthy, ardiovascularorrected-to-nf finger move

.2.2 Experio build anredict the emhuman subj

he entire expomplies thenducement dffectiveness rocedure, anesign for rephe subjects w

which sequenorrespondingy using comse of mouse)

was 10 minuteresentation on image and

em identificati

m experimenional model, omparable inerformance es the perfor

d task, and uture study in

mental Se

cts 6 university

(15 subjects in the fieldstudies [19,

nd 28 (M = 2men). All su

with no hisr problems

normal visionement.

imental Pron intelligent motions of subject experimperiment cone IAPS prescribed in

of the nd the clarityproduction. Dwere requestentially presegly rate thesmputer-based). The duraties for each suof an image)d displayed i

ion task

ntal designs bare hard to

n our case (becriteria). He

rmance obtaihopes to p

nstead.

etup

subjects paris the typica

d of affectiv23]), rangin

23.44, SD =ubjects reporstory of brais, had non, and norm

ocedure system th

ubjects elicit bment was conducted in trotocol of [24] to guaremotion i

y of the expeDuring the exed to look atents imagesse images pd SAM (throion of the exubject. Each) started bypt for 6 secon

based on compare, ecause of ence, this ned from

provide a

rticipated al sample ve image ng in age

= 2.19; 10 rted they in injury, rmal or

mal range

hat could by image, onducted. his study emotion

rantee the induction erimental

xperiment, t a screen s and to presented, ough the

xperiment h trial (i.e.

resenting nds, then


3.2.4 Environment Setting

The images were presented using a general PCwith 32-inch (81.28 centimeters) monitor. The sub-jects were sat in a comfortable bed at a distance ofapproximately 1.5 meters away from the monitor inan EMI shielding room (Acoustic Inc. US) in whicheliminates most of noise interferences and electri-cal noises. The CO2 concentration of the environ-ment was monitored during the entire experiment toguarantee reasonable CO2 concentration (500 ppm˜ 1,300 ppm) to keep subjects sustain their attentionduring the experiment.

3.3 Method

3.3.1 HSV Model

This chapter adopts the approach of feature ex-traction similar to the former studies [22, 23], inwhich only basic features based on colors were ex-traction from the image (HSV model, in our case)instead of applying content based analysis, to elim-inate the individual difference. Texture informa-tion was not used in this study because of thedocumentary-style natural of the IAPS images; im-ages in the IAPS hold similar texture properties,and the related features extracted from IAPS imageswas reported useless in [22].

The HSV model is a cylindrical-coordinate rep-resentation commonly used in the area of computergraphics in replacement of RGB color model to ob-tain more intuitive values. In the HSV model, Hrepresents hue, S represents saturation, and V rep-resents value. Ordinarily, images stored in elec-tronic devices such as personal computer are rep-resented by a M×N matrix, in which the color ofeach element is displayed using RGB color model.The RGB model is a model consists of three co-ordinates as following: R represents red values, Grepresents green values and B represents blue val-ues; red, green and blue are mixed together in acube. For affective features analysis, features ex-tracted from HSV model provide a more perceptu-ally relevant representation on images.

– Hue

Hue is simply the attribute represents visual sensa-tion on various colors similar to red, green, blue, orcombinations of them. The value of hue is in the

interval of 0◦

and 360◦(normalized to interval [0, 1]

in this study). The transformation from RGB to His demonstrated as following: Firstly, normalizes R,G, and B of the target element into the interval [0,1]. Secondly, calculates M, m and C from the nor-malized R, G, B.

M = max(R,G,B) ;m = min(R,G,B) ;C = M−m

Equation 2 The transformation from RGB to H(Step 2)

Thirdly, calculates H′and H.

H′=

0, if C = 0G−B

C mod6, if M = RB−R

C +2, if M = GR−G

C +4, if M = B


H =60◦×H

′


– Saturation

Saturation represents the level of colorfulness rela-tive to its own brightness. The value of saturation isin the interval [0, 1].

S =

{0, if C = 0

CM , o.w.

Equation 5 The calculation of saturation from the Cand the M value

– Value (Brightness)

The Value (brightness) represents the brightnesslevel relative to the brightness of a similarly illu-minated white, defined as the largest component ofthe RGB color of an element (i.e, M, 0 ≤ M ≤ 1)to form a hexagonal pyramid out of the RGB cubeby projecting all three primary colors and the sec-ondary colors such as cyan, yellow, and magentainto the new plane.



The images were presented using a general PCwith 32-inch (81.28 centimeters) monitor. The sub-jects were sat in a comfortable bed at a distance ofapproximately 1.5 meters away from the monitor inan EMI shielding room (Acoustic Inc. US) in whicheliminates most of noise interferences and electri-cal noises. The CO2 concentration of the environ-ment was monitored during the entire experiment toguarantee reasonable CO2 concentration (500 ppm˜ 1,300 ppm) to keep subjects sustain their attentionduring the experiment.

3.3 Method

3.3.1 HSV Model

This chapter adopts the approach of feature ex-traction similar to the former studies [22, 23], inwhich only basic features based on colors were ex-traction from the image (HSV model, in our case)instead of applying content based analysis, to elim-inate the individual difference. Texture informa-tion was not used in this study because of thedocumentary-style natural of the IAPS images; im-ages in the IAPS hold similar texture properties,and the related features extracted from IAPS imageswas reported useless in [22].

The HSV model is a cylindrical-coordinate rep-resentation commonly used in the area of computergraphics in replacement of RGB color model to ob-tain more intuitive values. In the HSV model, Hrepresents hue, S represents saturation, and V rep-resents value. Ordinarily, images stored in elec-tronic devices such as personal computer are rep-resented by a M×N matrix, in which the color ofeach element is displayed using RGB color model.The RGB model is a model consists of three co-ordinates as following: R represents red values, Grepresents green values and B represents blue val-ues; red, green and blue are mixed together in acube. For affective features analysis, features ex-tracted from HSV model provide a more perceptu-ally relevant representation on images.

– Hue

Hue is simply the attribute represents visual sensa-tion on various colors similar to red, green, blue, orcombinations of them. The value of hue is in the

interval of 0◦

and 360◦(normalized to interval [0, 1]

in this study). The transformation from RGB to His demonstrated as following: Firstly, normalizes R,G, and B of the target element into the interval [0,1]. Secondly, calculates M, m and C from the nor-malized R, G, B.

M = max(R,G,B) ;m = min(R,G,B) ;C = M−m


Thirdly, calculates H′and H.

H′=

0, if C = 0G−B

C mod6, if M = RB−R

C +2, if M = GR−G

C +4, if M = B


H =60◦×H

′


– Saturation

Saturation represents the level of colorfulness rela-tive to its own brightness. The value of saturation isin the interval [0, 1].

S =

{0, if C = 0

CM , o.w.

Equation 5 The calculation of saturation from the Cand the M value

– Value (Brightness)

The Value (brightness) represents the brightnesslevel relative to the brightness of a similarly illu-minated white, defined as the largest component ofthe RGB color of an element (i.e, M, 0 ≤ M ≤ 1)to form a hexagonal pyramid out of the RGB cubeby projecting all three primary colors and the sec-ondary colors such as cyan, yellow, and magentainto the new plane.


3.3.2 XCSF

The study applies XCSF to the affective image-classification task to cope with any possible non-linear characteristics contained in the target dataset.The XCSF is an extension of the XCS, a machinelearning system based on Michigan-Style CSs. In2002, XCSF, as a version of XCS used for functionapproximation was proposed [39]. The XCSF al-lows both real value inputs and real value outputs.In addition, the version of XCSF implemented in[67] allows multiple outputs The input accepts realvalue by using rotating hyperrectangle and rotat-ing hyperellipsoid for condition representation [33,68]. On the other hand, instead of selecting a dis-crete value as output according to fitness-weightedprediction value, the classifiers in the XCSF di-rectly map the desire output using the predictionvalue produced by the linear approximation (i.e.h(−→x )

=−→ω −→x in which −→x represents the input vec-tor and −→ω represents weight vector). Each classifierin the XCSF updates its weight vector using Re-cursive Least Squares (RLS) method [68]. For per-forming the RLS, each classifier manage by XCSFupdates its weight vector using

−→ω ←−→ω+−→k[

yt−(−→

x∗−−→m∗

)T−→ω]

Equation 6 Weight vector update of the XCSF basedon the performance of RLS

where yt represents target output, and−→k represents

the gain vector computed by

−→k =

VT(−→

x∗−−→m∗

)

λ+(−→

x∗−−→m∗

)TVT

(−→x∗−

−→m∗

)

Equation 7 The calculation of the gain vector in theXCSF

The l (usually 0 ≤ λ ≤ 1) used in Equation 6and Equation 7 represents the forget rate of RLS.The lower the value of l is the higher the forget rate.The value of l is set to 1.0 for having an infinitememory (mostly used in time invariant problems).The matrix V hold by each classifier updates recur-sively using

VT=λ−1[

I−−→k(−→

x∗−−→m∗

)T]

VT

Equation 8 The update of matrix V in the XCSF

The fitness value used for the GA in the XCSFis the relative classifier accuracy calculates fromsystem error [60]. For further detail, sufficient in-formation about XCS can be found in Butz’s algo-rithmic description of XCS [60], and also the re-cent advances in XCSF [33, 39, 68]. To summarize,the XCSF can be understood as a manager whichmanages a set of classifiers. Each of the classi-fiers maps from a subspace in the feature space tothe landscape-function output using a linear-fittingmethod.

3.3.3 Model Building

The workflow of the preprocessing and modelbuilding are provided in Figure 5 The workflow ofthe built prediction model ; the preprocessing ofthe image data was based on HSV model withoutapplying content based analysis, 6 features wereused for model building in this study, including:average hue, standard deviation of hue, averagesaturation, standard deviation of saturation, aver-age brightness, and standard deviation of bright-ness. The model was built to predict the inducedemotion rated by subjects in terms of valence andarousal through SAM. The prediction of valenceand arousal can be real number herein accordingto the definition of valence and arousal in the di-mensional theory of emotion [11]. To avoid over-fitting problem, a Leave-One-Out-Cross-Validation(LOOCV) on leaving one sample at each time fortesting set and the remain samples for training set,which is the standard practice for analyzing limiteddataset, was used for building the model.

For details on the setting of LR and XCSFfor building the models, the LR analysis was doneby using the Weka implementation of data miningtools [69], in which Akaike criterion was used formodel selection and M5’s method was used for at-tribute selection; all the co-linear attributes were ex-cluded.

The XCSF used in this study was adopted fromthe Java implementation version on XCSF con-tributed by Stalph and Butz (2009) [67]. For pa-rameters setting, a = 1.0; b = 0.1; d = 0.1; l = 1.0;qGA = 50; e0 = 0.5; drls = 1000; qdel = 20; c = 1.0;m = 1.0; qsub = 20; the GA subsumption was turnedon. Although the maximal population size N wasset to 6,400˜10,000 to maximize the performance ofXCSF, the number of classifiers quickly converged


Figure 5. The workflow of the built prediction model

to 5,400 during the model training. To examinethe performance of the system, e0 was set to var-ious values. However, it appears relatively smalleffect on the learning performance in regard to thelearning speed and system error. During the modeltraining, the XCSF was sequentially presented with20,000 instances randomly selected from the train-ing dataset.

3.4 Results and Discussion

3.4.1 Collected Dataset

The collected dataset contains 20 images(1024x768 JPEG) used in the experiment, and theimage affective ratings rated by 16 subjects throughSAM. The experiment totally acquired 320 rows(actually, 318 rows, while two rows were excludeddue to machine mal functioning) of raw data (im-ages, and the affective ratings of the images, 20rows for each subject). Figure 6 presents the distri-bution of the ratings selected by subjects on all im-ages; it is observed that most subjects were arousedwith either unpleasant feelings or pleasant feelingsby the displayed images, no obvious skewed wasobserved in the distribution of valence (histogramwas examined but not shown).

3.4.2 Model Performance and Discussion

The performance in regard to RMSE/MAE andthe standard deviation of MAEs (represents by SD)achieved by LR and XCSF are provided in.

Table 1. The performance achieved by distinctclassifiers

The manner of calculating RMSE and MAE areprovided as following:

RMSE =

√1

N−1

N

∑i=1

[(Vi−VPi)

2+(Ai−APi)2]

MAE =1N

N

∑i=1

|Vi−VPi| or1N

N

∑i=1

|Ai−APi|

Equation 9 The calculation of RMSE and MAE inthis research

in which N represents sample size; Vi and Ai rep-resents the values of the valence and arousal corre-sponds to the i-th sample; and VPi and APi repre-sents the system prediction on the values of valenceand arousal corresponds to the i-th sample. TheMAEs are used here to evaluate the performance ofa built model in predicting valence and arousal. TheRMSE is always adopted when a classifier is usedfor predicting valence and arousal in pairs, and theMAEs are always adopted when a classifier is usedfor predicting valence and arousal separately.

study, deviationdeviationstandardwas builby subjethrough arousal cthe defindimensioover-fittiValidatioeach timsamples practice for buildFor detabuildingby usingmining twas usedwas uselinear attThe XCSthe JavacontribuFor para0.1; = 20; subsumpmaximal6,400~1XCSF, convergeTo exam

including: n of hue, avn of saturatio

d deviation olt to predict ects in termSAM. The

can be real nnition of vaonal theory ing problemon (LOOCVme for testfor training for analyzin

ding the modails on the se the models,g the Wekatools [69], id for model ed for attribtributes wereSF used in tha implemen

uted by Stalpameters settiGA = 5 = 1.0; =

ption was tl population0,000 to maxthe numbe

ed to 5,400 mine the pe

Figure

average verage saturaon, average bof brightnesthe induced

ms of valencprediction o

number hereialence and aof emotion

m, a Leave-O) on leaving ting set an

set, which ing limited datdel. etting of LR , the LR anaa implementin which Akselection and

bute selectioe excluded. his study wasntation versiph and Butzing, = 1.00; 0 = 0.5; = 1.0; sub =turned on. n size N ximize the p

er of classduring the mrformance o

e 5The workfl

hue,standaation, standabrightness, ass. The mod

emotion ratce and arousof valence ain according arousal in t[11]. To avo

One-Out-Crosone sample

d the remais the standataset, was us

and XCSF flysis was dotation of dakaike criterid M5's methn; all the c

s adopted froion on XCSz (2009) [670; = 0.1;rls = 1000; = 20; the GAlthough twas set

performance ifiers quick

model traininof the system

low of the bui

ard ard and del ted sal

and to

the oid ss-at

ain ard sed

for one ata on

hod co-

om SF 7]. = del GA the to of

kly ng. m,

0appeanth20tra

3

3.Th(1thth32wfuafeaofimarplobdiex

3.ThthSDBod

ilt prediction m

0was set toppears relativerformance nd system ehe XCSF w0,000 instanaining datase

.4 Results

.4.1Collectehe collected

1024x768 JPhe image affehrough SAM20 rows (act

were excludunctioning) ffective ratinach subject).f the ratingmages; it is oroused withleasant feelinbvious skeistribution xamined but

.4.2Model Phe performan

he standard dD) achieved łąd! Nie dwołania..

model

o various vvely small ein regard torror. During

was sequentinces randomet.

s and Disc

ed Dataset d dataset c

PEG) used inective ratings. The experitually, 318 rded due of raw dat

ngs of the i.Figure 6presgs selected observed tha

h either unpngs by the d

ewed was of valencenot shown).

Performannce in regard

deviation of Mby LR and X

można

values. Howeffect on theo the learning the model ially presenly selected

cussion

contains 20n the experims rated by 16ment totally ows, while tto machin

ta (images, images, 20 sents the disby subjects

at most subjepleasant feedisplayed im

observed e (histogra

nce and Disd to RMSE/MMAEs (repreXCSF are pro

odnaleźć

wever, it e learning ng speed training,

nted with from the

0 images ment, and 6 subjects

acquired two rows ne mal and the

rows for stribution s on all ects were elings or

mages, no in the

am was

cussion MAE and esents by ovided in

źródła

Figure 6The distribution of the induced emotion of subjects on all images

Table 1The performance achieved by distinct classifiers

Prediction Results Affective Dimension Valence Arousal (Valence, Arousal)

Method MAE SD MAE SD RMSE uniRand 2.569 (N/A) 2.530 (N/A) (N/A)largCount 1.613 (N/A) 1.480 (N/A) (N/A)LR 1.482 1.021 1.481 1.070 2.564XCSF 0.970 0.747 1.460 1.029 2.165

The manner of calculating RMSE and MAE are provided as following:

RMSE

� � 1N � 1� ��V� � V�� A� � A��

�

��

MAE � 1N� |V� � V��|

�

��or 1N� |A�

�

�� A��| Equation 9The calculation of RMSE and MAE in this research

in which N represents sample size; Vi and Ai represents the values of the valence and arousal corresponds to the i-th sample; and VPi and APi represents the system prediction on the values of valence and arousal corresponds to the i-th sample. The MAEs are used here to evaluate the performance of a built model in predicting valence and arousal. The RMSE is always adopted when a classifier is used for predicting valence and arousal in pairs, and the MAEs are always adopted when a classifier is used for predicting valence and arousal separately. While the emotional ratings are not uniformly distributed, the MAE of prediction can be artificially underestimated; hence, two models, 1) uniRand: making predictions in a uniformly random manner, and 2) largCount: making constant predictions based on the weighted-average valence, and weighted-average arousal, based on the ratings, in which average value of valance was nearly 3.931 and average value of arousal was nearly 4.349, were introduced to compare the MAE achieved by LR and XCSF. The performance of distinct classifiers is provided inTable 1. The performance of LR on


Figure 5. The workflow of the built prediction model

to 5,400 during the model training. To examinethe performance of the system, e0 was set to var-ious values. However, it appears relatively smalleffect on the learning performance in regard to thelearning speed and system error. During the modeltraining, the XCSF was sequentially presented with20,000 instances randomly selected from the train-ing dataset.





The performance in regard to RMSE/MAE andthe standard deviation of MAEs (represents by SD)achieved by LR and XCSF are provided in.

Table 1. The performance achieved by distinctclassifiers

The manner of calculating RMSE and MAE areprovided as following:

RMSE =

√1

N−1

N

∑i=1

[(Vi−VPi)

2+(Ai−APi)2]

MAE =1N

N

∑i=1

|Vi−VPi| or1N

N

∑i=1

|Ai−APi|

Equation 9 The calculation of RMSE and MAE inthis research

in which N represents sample size; Vi and Ai rep-resents the values of the valence and arousal corre-sponds to the i-th sample; and VPi and APi repre-sents the system prediction on the values of valenceand arousal corresponds to the i-th sample. TheMAEs are used here to evaluate the performance ofa built model in predicting valence and arousal. TheRMSE is always adopted when a classifier is usedfor predicting valence and arousal in pairs, and theMAEs are always adopted when a classifier is usedfor predicting valence and arousal separately.


Figure 6. The distribution of the induced emotion of subjects on all images

While the emotional ratings are not uniformlydistributed, the MAE of prediction can be artifi-cially underestimated; hence, two models, 1) uni-Rand: making predictions in a uniformly randommanner, and 2) largCount: making constant predic-tions based on the weighted-average valence, andweighted-average arousal, based on the ratings, inwhich average value of valance was nearly 3.931and average value of arousal was nearly 4.349, wereintroduced to compare the MAE achieved by LRand XCSF. The performance of distinct classifiers isprovided in Table 1. The performance of LR on pre-dicting valence in regard to MAE is 1.483, whichis relatively low while the uniRand achieved only2.570 and largCount achieved 1.614. In addition,the MAE value achieved by XCSF on predictingvalence decreased the MAE value achieved by LRfrom 1.483±1.02 to 0.971±0.747 (-0.512), demon-strates the capability of XCSF on mapping func-tions that possibly contain non-linearity by manag-ing a set of linear classifiers. The MAE achieved byXCSF was small and the standard deviation of theMAE is tolerable.

To further examine the performance of XCSF

on this task, the performance of classifiers on pre-dicting valence values in terms of MAE are illus-trated in Figure 7, in which x-axis represents va-lence and y-axis represents MAE. The MAE oneach valence is represented by four bars: the MAEachieved by uniRand, largCount, XCSF, and LR, re-spectively. The MAE achieved by XCSF is smallerthan the MAE of LR, uniRand and largCount atmost of the emotional ratings; the MAE of XCSFis only larger than uniRand at the rating with thelargest count. A skew on MAE was observed forthose ratings that represent for “being pleasant”,that is, 5˜8, possibly due to the sample size of therating, while in Table 1, the numbers of samples ofvalence equals to 5, 6, 7 are larger than the numbersof samples of valence equals to 0 and 1.

Conversely, the MAE of XCSF in valence 0 and8 are also high. The finding suggests that insuf-ficient on sample size of a class may lead to lowperformance of XCSF on approximating the corre-sponding output value even the training instanceswere selected from the training dataset randomlyduring the XCSF iterative training process. How-ever, the MAE of XCSF at valence value equals to

Figure 6The distribution of the induced emotion of subjects on all images

Table 1The performance achieved by distinct classifiers

Prediction Results Affective Dimension Valence Arousal (Valence, Arousal)

Method MAE SD MAE SD RMSE uniRand 2.569 (N/A) 2.530 (N/A) (N/A)largCount 1.613 (N/A) 1.480 (N/A) (N/A)LR 1.482 1.021 1.481 1.070 2.564XCSF 0.970 0.747 1.460 1.029 2.165

The manner of calculating RMSE and MAE are provided as following:

RMSE

� � 1N � 1� ��V� � V�� A� � A��

�

��

MAE � 1N� |V� � V��|

�

��or 1N� |A�

�

�� A��| Equation 9The calculation of RMSE and MAE in this research

in which N represents sample size; Vi and Ai represents the values of the valence and arousal corresponds to the i-th sample; and VPi and APi represents the system prediction on the values of valence and arousal corresponds to the i-th sample. The MAEs are used here to evaluate the performance of a built model in predicting valence and arousal. The RMSE is always adopted when a classifier is used for predicting valence and arousal in pairs, and the MAEs are always adopted when a classifier is used for predicting valence and arousal separately. While the emotional ratings are not uniformly distributed, the MAE of prediction can be artificially underestimated; hence, two models, 1) uniRand: making predictions in a uniformly random manner, and 2) largCount: making constant predictions based on the weighted-average valence, and weighted-average arousal, based on the ratings, in which average value of valance was nearly 3.931 and average value of arousal was nearly 4.349, were introduced to compare the MAE achieved by LR and XCSF. The performance of distinct classifiers is provided inTable 1. The performance of LR on


4.0 is not the lowest, indicating that the larger sam-ple size may only guarantee the efficacy of XCSFin function approximation but not eliminating allexisting errors. We observed that approximately30 samples (which is nearly 10% of the number ofraw data), is sufficient for XCSF to approximate avalence value in the collected dataset. Otherwise,for example, the MAEs of XCSF on predicting thesamples that valence equals to 0 and 8 are relativelyhigh, in which the numbers of samples were small.To explain the extreme cases happened in valence0 and valence 8 in the psychological perspective,we’ve noticed that some of the subjects reportedthat they tend to select ratings in the middle of thescale rather than selecting those ratings representfor extreme cases. Such phenomenon may causethe non-linearity of the distances between the lev-els of valence and arousal. Further clarification isrequired for this issue; a well-designed transforma-tion may be adequate.

Figure 7. The MAE of XCSF on each valencevalue. The standard deviation of MAE made by

XCSF is nearly 0.54˜0.91.

Figure 8. The MAE of XCSF on each arousalvalue. The standard deviation of MAE made by

XCSF is nearly 0.27˜0.91

Figure 8 also presents similar phenomenon.The MAEs achieved by XCSF in each level of

arousal mostly outperform the MAEs of LR. How-ever, in general, the MAEs achieved by XCSF ateach level were increased, and the decreases in er-ror are not sufficiently significant. The results pro-vide that MAE of LR and XCSF did not outperformlargCount in the prediction on arousal. Such obser-vation indicates that the prediction model of arousalbuilt by LR and XCSF did not adequately identifythe problem structure, possibly due to the ineffec-tiveness of SAM on estimating subjects’ arousal,while some of subjects reported that during the ex-periment the definition of “being aroused” can beeasily confused with the definition of the tendencyof valence. The confusion was possibly caused bythe cultural difference, but similar results were nothighlighted previously in the main stream of the re-search.

To further identify the discovered knowledge,the prediction models built by LR are provided inEquation 10 and Equation 11.

Valence = -2.3147 * Avg Saturation + 4.6681* Avg Brightness + 12.8186 * SD Brightness -0.3798 Equation 10 The model of predicting va-lence based on HSV properties Arousal = 3.4657 *SD Saturation - 3.3625 * SD Brightness + 4.3804

Equation 11 The model of predicting arousal basedon HSV properties

From Equation 10, the saturation of an imagetends to lower the valence, whereas the brightnesstends to enhance the pleasant feelings, and the ef-fect of standard deviation of brightness on valenceis even more substantial. On the other hand, the af-fective characteristics of an image making peoplefeel aroused, is negative correlated with the stan-dard deviation of brightness, whereas the standarddeviation of saturation increases the effect.

Gender information was not used for modelbuilding because the effect of gender is eliminatedby the use of IAPS protocol, the use of gender in-formation did not substantially improve the perfor-mance during this study as well. By contrast tothe gender information, within subject analysis wasalso applied to the dataset using LR and XCSF.The result indicates that without using the contentwithin the image, the effects of individual differ-ence is relatively small (not shown here).

In regard to the concerns of the application ofthis study, the RGB model is device-dependent, due

predicting valence in regard to MAE is 1.483, which is relatively low while the uniRand achieved only 2.570 and largCount achieved 1.614. In addition, the MAE value achieved by XCSF on predicting valence decreased the MAE value achieved by LR from 1.483±1.02 to 0.971±0.747 (-0.512), demonstrates the capability of XCSF on mapping functions that possibly contain non-linearity by managing a set of linear classifiers. The MAE achieved by XCSF was small and the standard deviation of the MAE is tolerable. To further examine the performance of XCSF on this task, the performance of classifiers on predicting valence values in terms of MAE are illustrated inFigure 7, in which x-axis represents valence and y-axis represents MAE. The MAE on each valence is represented by four bars: the MAE achieved by uniRand, largCount, XCSF, and LR, respectively. The MAE achieved by XCSF is smaller than the MAE of LR, uniRand and largCount at most of the emotional ratings; the MAE of XCSF is only larger than uniRand at the rating with the largest count. A skew on MAE was observed for those ratings that represent for “being pleasant”, that is, 5~8, possibly due to the sample size of the rating, while inTable 1, the numbers of samples of

Figure 7The MAE of XCSF on each valence value. The standard deviation of MAE made by XCSF is nearly 0.54~0.91. valence equals to 5, 6, 7 are larger than the numbers of samples of valence equals to 0 and 1. Conversely, the MAE of XCSF in valence 0 and 8 are also high. The finding suggests that insufficient on sample size of a class may lead to low performance of XCSF on approximating the corresponding output value even the training instances were selected from the training dataset randomly during the XCSF iterative training process. However, the MAE of XCSF at valence value equals to 4.0 is not

the lowest, indicating that the larger sample size may only guarantee the efficacy of XCSF in function approximation but not eliminating all existing errors. We observed that approximately 30 samples (which is nearly 10% of the number of raw data), is sufficient for XCSF to approximate a valence value in the collected dataset. Otherwise, for example, the MAEs of XCSF on predicting the samples that valence equals to 0 and 8 are relatively high, in which the numbers of samples were small. To explain the extreme cases happened in valence 0 and valence 8 in the psychological perspective, we’ve noticed that some of the subjects reported that they tend to select ratings in the middle of the scale rather than selecting those ratings represent for extreme cases. Such phenomenon may cause the non-linearity of the distances between the levels of valence and arousal. Further clarification is required for this issue; a well-designed transformation may be adequate. Figure 8also presents similar phenomenon. The MAEs achieved by XCSF in each level of arousal mostly outperform the MAEs of LR. However, in general, the MAEs achieved by XCSF at each level were increased, and the decreases in error are not sufficiently significant.

Figure 8The MAE of XCSF on each arousal value. The standard deviation of MAE made by XCSF is nearly 0.27~0.91 The results provide that MAE of LR and XCSF did not outperform largCount in the prediction on arousal. Such observation indicates that the prediction model of arousal built by LR and XCSF did not adequately identify the problem structure, possibly due to the ineffectiveness of SAM on estimating subjects’ arousal, while some of subjects reported that during the experiment the definition of “being aroused” can be easily confused with the definition of the tendency of valence. The confusion was

predicting valence in regard to MAE is 1.483, which is relatively low while the uniRand achieved only 2.570 and largCount achieved 1.614. In addition, the MAE value achieved by XCSF on predicting valence decreased the MAE value achieved by LR from 1.483±1.02 to 0.971±0.747 (-0.512), demonstrates the capability of XCSF on mapping functions that possibly contain non-linearity by managing a set of linear classifiers. The MAE achieved by XCSF was small and the standard deviation of the MAE is tolerable. To further examine the performance of XCSF on this task, the performance of classifiers on predicting valence values in terms of MAE are illustrated inFigure 7, in which x-axis represents valence and y-axis represents MAE. The MAE on each valence is represented by four bars: the MAE achieved by uniRand, largCount, XCSF, and LR, respectively. The MAE achieved by XCSF is smaller than the MAE of LR, uniRand and largCount at most of the emotional ratings; the MAE of XCSF is only larger than uniRand at the rating with the largest count. A skew on MAE was observed for those ratings that represent for “being pleasant”, that is, 5~8, possibly due to the sample size of the rating, while inTable 1, the numbers of samples of

Figure 7The MAE of XCSF on each valence value. The standard deviation of MAE made by XCSF is nearly 0.54~0.91. valence equals to 5, 6, 7 are larger than the numbers of samples of valence equals to 0 and 1. Conversely, the MAE of XCSF in valence 0 and 8 are also high. The finding suggests that insufficient on sample size of a class may lead to low performance of XCSF on approximating the corresponding output value even the training instances were selected from the training dataset randomly during the XCSF iterative training process. However, the MAE of XCSF at valence value equals to 4.0 is not

the lowest, indicating that the larger sample size may only guarantee the efficacy of XCSF in function approximation but not eliminating all existing errors. We observed that approximately 30 samples (which is nearly 10% of the number of raw data), is sufficient for XCSF to approximate a valence value in the collected dataset. Otherwise, for example, the MAEs of XCSF on predicting the samples that valence equals to 0 and 8 are relatively high, in which the numbers of samples were small. To explain the extreme cases happened in valence 0 and valence 8 in the psychological perspective, we’ve noticed that some of the subjects reported that they tend to select ratings in the middle of the scale rather than selecting those ratings represent for extreme cases. Such phenomenon may cause the non-linearity of the distances between the levels of valence and arousal. Further clarification is required for this issue; a well-designed transformation may be adequate. Figure 8also presents similar phenomenon. The MAEs achieved by XCSF in each level of arousal mostly outperform the MAEs of LR. However, in general, the MAEs achieved by XCSF at each level were increased, and the decreases in error are not sufficiently significant.

Figure 8The MAE of XCSF on each arousal value. The standard deviation of MAE made by XCSF is nearly 0.27~0.91 The results provide that MAE of LR and XCSF did not outperform largCount in the prediction on arousal. Such observation indicates that the prediction model of arousal built by LR and XCSF did not adequately identify the problem structure, possibly due to the ineffectiveness of SAM on estimating subjects’ arousal, while some of subjects reported that during the experiment the definition of “being aroused” can be easily confused with the definition of the tendency of valence. The confusion was


4.0 is not the lowest, indicating that the larger sam-ple size may only guarantee the efficacy of XCSFin function approximation but not eliminating allexisting errors. We observed that approximately30 samples (which is nearly 10% of the number ofraw data), is sufficient for XCSF to approximate avalence value in the collected dataset. Otherwise,for example, the MAEs of XCSF on predicting thesamples that valence equals to 0 and 8 are relativelyhigh, in which the numbers of samples were small.To explain the extreme cases happened in valence0 and valence 8 in the psychological perspective,we’ve noticed that some of the subjects reportedthat they tend to select ratings in the middle of thescale rather than selecting those ratings representfor extreme cases. Such phenomenon may causethe non-linearity of the distances between the lev-els of valence and arousal. Further clarification isrequired for this issue; a well-designed transforma-tion may be adequate.

Figure 7. The MAE of XCSF on each valencevalue. The standard deviation of MAE made by

XCSF is nearly 0.54˜0.91.

Figure 8. The MAE of XCSF on each arousalvalue. The standard deviation of MAE made by

XCSF is nearly 0.27˜0.91

Figure 8 also presents similar phenomenon.The MAEs achieved by XCSF in each level of

arousal mostly outperform the MAEs of LR. How-ever, in general, the MAEs achieved by XCSF ateach level were increased, and the decreases in er-ror are not sufficiently significant. The results pro-vide that MAE of LR and XCSF did not outperformlargCount in the prediction on arousal. Such obser-vation indicates that the prediction model of arousalbuilt by LR and XCSF did not adequately identifythe problem structure, possibly due to the ineffec-tiveness of SAM on estimating subjects’ arousal,while some of subjects reported that during the ex-periment the definition of “being aroused” can beeasily confused with the definition of the tendencyof valence. The confusion was possibly caused bythe cultural difference, but similar results were nothighlighted previously in the main stream of the re-search.

To further identify the discovered knowledge,the prediction models built by LR are provided inEquation 10 and Equation 11.

Valence = -2.3147 * Avg Saturation + 4.6681* Avg Brightness + 12.8186 * SD Brightness -0.3798 Equation 10 The model of predicting va-lence based on HSV properties Arousal = 3.4657 *SD Saturation - 3.3625 * SD Brightness + 4.3804

Equation 11 The model of predicting arousal basedon HSV properties

From Equation 10, the saturation of an imagetends to lower the valence, whereas the brightnesstends to enhance the pleasant feelings, and the ef-fect of standard deviation of brightness on valenceis even more substantial. On the other hand, the af-fective characteristics of an image making peoplefeel aroused, is negative correlated with the stan-dard deviation of brightness, whereas the standarddeviation of saturation increases the effect.

Gender information was not used for modelbuilding because the effect of gender is eliminatedby the use of IAPS protocol, the use of gender in-formation did not substantially improve the perfor-mance during this study as well. By contrast tothe gender information, within subject analysis wasalso applied to the dataset using LR and XCSF.The result indicates that without using the contentwithin the image, the effects of individual differ-ence is relatively small (not shown here).

In regard to the concerns of the application ofthis study, the RGB model is device-dependent, due


to the color elements (such as phosphors or dyes)and their response to the individual R, G, and B lev-els vary from manufacturer to manufacturer, differ-ent devices may detect or reproduce a given RGBvalue distinctively, or even in the same device overtime. Such characteristics may cause variations inthe proposed experimental results. However, sim-ilar problem may also occur in other studies thatutilize IAPS, but currently, of our best knowledge,no previous study reports further problem due to it.

4 Spatial-Frequency Patterns inthe Affective Image Classification


To index the affective characteristics of images,an intuitive approach is to have a large number ofpeople manually rate all the images and calculatedescriptive statistics from the ratings. On the otherhand, recent studies utilize color, texture, and com-position information of images; also the applicationof content analysis, to achieve affective image clas-sification [22, 70].

However, the features related to spatial-frequency domain that are proven to be useful forpattern recognition have not been explored yet. Inaddition, contributed by recent advances in method-ology, the resolution in frequency analysis has beenimproved. Hence, this chapter aims to solve the af-fective image-classification task by using the fea-tures related to spatial-frequency domain, and theXCSF [68] (i.e. one of a latest version of LCS [71]).The dataset used for the classification task is col-lected from a human-subject experiment conductedin our laboratory. The performance of the builtintelligent machine in performing affective image-classification task was validated by 10-Fold CV.The proposed method may be applied to other im-ages in real-word.

4.2 Experimental Setup

4.2.1 Subjects

There were 16 university subjects participatedin the study (15 subjects is the typical sample sizerequired in the field of affective image classificationstudies [19, 23]), ranging in age between 20 and 28(M = 23.44, SD = 2.19; 10 men, 6 women). All sub-

jects reported they were healthy, with no history ofbrain injury, cardiovascular problems, had normalor corrected-to-normal vision, and normal range offinger movement.

4.2.2 Experimental Procedure

To build an intelligent system that could pre-dict the emotions of subjects elicit by image, a hu-man subject experiment was conducted. The en-tire experiment conducted in this study complies theIAPS protocol of emotion inducement described in[24] to guarantee the effectiveness of the emotioninduction procedure, and the clarity of the experi-mental design for reproduction. During the experi-ment, the subjects were requested to look at a screenwhich sequentially presents images and to corre-spondingly rate these images presented, by usingcomputer-based SAM (through the use of mouse).The duration of the experiment was 10 minutes foreach subject. Each trial (i.e. presentation of an im-age) started by presenting an image and displayed itfor 6 seconds, then presented the SAM on the screenfor the subject to manually rate the affective charac-teristics (i.e. self-report the induced emotion) of thepresented image. The SAM was followed by a 15s delay to ensure the emotional status of subject re-turn to baseline before the start of next trial and areasonable length to keep the subjects involved inthe experiment.

4.2.3 Images Used

This study utilizes 20 images selected fromIAPS [66] database in complying the IAPS imageset selection protocol described in [66]. The im-age ids of the used images are as follows: 1120,1310, 1390, 1710, 1720, 2160, 2220, 2520, 2530,2540, 3160, 3220, 3250, 4300, 4460, 4470, 4660,4750, 5950, 8160, 8200, and 9250. These imagescan be found in the IAPS database [66] using theids listed above. The order of the image presenta-tion was randomized to eliminate the effects due tothe presentation sequence.


The images were presented using a general PCwith 32-inch (81.28 centimeters) monitor. The sub-jects were sat in a comfortable bed at a distance ofapproximately 1.5 meters away from the monitor inan EMI shielding room (Acoustic Inc. US) in which


eliminates most of noise interferences and electri-cal noises. The CO2 concentration of the environ-ment was monitored during the entire experiment toguarantee reasonable CO2 concentration (500 ppm˜ 1,300 ppm) to keep subjects sustain their attentionduring the experiment.

4.3 Method

4.3.1 Two-Dimensional Hilbert-Huang Trans-form (2D-HHT)

Spatial-frequency analysis on images is one ofthe well-known techniques used in the field of im-age processing and computer vision [72, 73]. Theinformation in frequency domain was found abun-dant by physiologists [74]. It was found that variousspatial-frequencies can lead to distinct characteris-tics of visual stimulations. Moreover, the orienta-tion of visual stimulation can cause different effica-cies in stimulations of cortical receptors [75, 76].

Traditionally, Fast Fourier Transform (FFT) isused to transform an image into frequency domain.However, due to the assumption of that series oftarget data should be at least piecewise stationary,the FFT-based techniques (e.g., spectrogram), isnot suitable for modeling local phenomena or whenhigher resolution is required. Hence, recently HHTwas proposed to obtain higher frequency resolutiontoward Instantaneous Frequency (IF) [77]. Later,the use of such concept in spatial-frequency anal-ysis was also reported [78]. The HHT is a two-phase transformation, which firstly apply an Empir-ical Mode Decomposition (EMD) on the target dataseries to extract Intrinsic Mode Functions (IMFs).Secondly, Hilbert Transform (HT) is applied to eachIMF to obtain required frequency domain informa-tion (i.e., IF). The EMD is a shifting process thatcan be used to extract IMFs from a data series X(s).The IMF is defined as a monocomponent by satis-fying the criterias as following:

1 has the number of zero crossings and extremaone difference at most,

2 symmetric with respect to the local mean, and

3 the X(s) should has at least two extrema.

After the procedure of EMD, n IMFs, namely,IMF1, IMF2, IMF3, . . . , IMFn, and the

residuals(rn), denoted as

X(s)=n

∑j=1

Cj+rn

Equation 12 The decomposition of an input signalbased on the EMD provided in Equation 12, are ex-tracted from X(s). The residuals (rn) is a data serieswhich is the remainder series of target data series af-ter the EMD shifting process removes all the IMFsfrom the original target data series.

The procedure of EMD, different from theFourier and Wavelet Decomposition, is fully data-driven. By being adaptive and unsupervised, theEMD improves the efficiency of signal decompo-sition and can be applied to the non-linear and non-stationary signal (details on the procedure of theEMD please refer to [77]). After the EMD, the HTis then applied to each IMF

Yj (s)=1π

∫ ∞

−∞

Cj (τ)s−τ

dτ

Equation 13 Transforming each IMF j extracted bythe EMD into Y j(s)

Each IMFj can be represented by the conju-gate pair of Y j(s) and C j(s), hence can be repre-sented by an analytical signal Z(s) = C j(s) + iY j(s)= a(s)eiq(s), in which the amplitude

aj (s)=√

Cj (s)2+Yj (s)

2

Equation 14 The computation of amplitudes in HT

and phase q j(s) = arctan(Y j(s)/C j(s)). Based on thedefinition stated above, the IF j can be derived byapplying a derivative on q j(s) (i.e. w j=dq j(s)/ds).Then, an analytical representation of X(s), can bederived

X(s)=n

∑j=1

aj (s) [i∫

ωj (s)ds]

Equation 15 Analytical representation of the inputsignal based on the HHT

Originally, the EMD was proposed to decom-pose one-dimensional data. To construct a 2D-HHT, the concept of EMD was extended to 2D inthis study based on the concept listed as follows:

1 identify the extrema (maxima and minima) ofthe image by sliding a 3-by-3 grid;

2 generate two smooth 2D surfaces to fit the foundmaxima and minima;


eliminates most of noise interferences and electri-cal noises. The CO2 concentration of the environ-ment was monitored during the entire experiment toguarantee reasonable CO2 concentration (500 ppm˜ 1,300 ppm) to keep subjects sustain their attentionduring the experiment.

4.3 Method

4.3.1 Two-Dimensional Hilbert-Huang Trans-form (2D-HHT)

Spatial-frequency analysis on images is one ofthe well-known techniques used in the field of im-age processing and computer vision [72, 73]. Theinformation in frequency domain was found abun-dant by physiologists [74]. It was found that variousspatial-frequencies can lead to distinct characteris-tics of visual stimulations. Moreover, the orienta-tion of visual stimulation can cause different effica-cies in stimulations of cortical receptors [75, 76].

Traditionally, Fast Fourier Transform (FFT) isused to transform an image into frequency domain.However, due to the assumption of that series oftarget data should be at least piecewise stationary,the FFT-based techniques (e.g., spectrogram), isnot suitable for modeling local phenomena or whenhigher resolution is required. Hence, recently HHTwas proposed to obtain higher frequency resolutiontoward Instantaneous Frequency (IF) [77]. Later,the use of such concept in spatial-frequency anal-ysis was also reported [78]. The HHT is a two-phase transformation, which firstly apply an Empir-ical Mode Decomposition (EMD) on the target dataseries to extract Intrinsic Mode Functions (IMFs).Secondly, Hilbert Transform (HT) is applied to eachIMF to obtain required frequency domain informa-tion (i.e., IF). The EMD is a shifting process thatcan be used to extract IMFs from a data series X(s).The IMF is defined as a monocomponent by satis-fying the criterias as following:

1 has the number of zero crossings and extremaone difference at most,

2 symmetric with respect to the local mean, and

3 the X(s) should has at least two extrema.

After the procedure of EMD, n IMFs, namely,IMF1, IMF2, IMF3, . . . , IMFn, and the

residuals(rn), denoted as

X(s)=n

∑j=1

Cj+rn

Equation 12 The decomposition of an input signalbased on the EMD provided in Equation 12, are ex-tracted from X(s). The residuals (rn) is a data serieswhich is the remainder series of target data series af-ter the EMD shifting process removes all the IMFsfrom the original target data series.

The procedure of EMD, different from theFourier and Wavelet Decomposition, is fully data-driven. By being adaptive and unsupervised, theEMD improves the efficiency of signal decompo-sition and can be applied to the non-linear and non-stationary signal (details on the procedure of theEMD please refer to [77]). After the EMD, the HTis then applied to each IMF

Yj (s)=1π

∫ ∞

−∞

Cj (τ)s−τ

dτ

Equation 13 Transforming each IMF j extracted bythe EMD into Y j(s)

Each IMFj can be represented by the conju-gate pair of Y j(s) and C j(s), hence can be repre-sented by an analytical signal Z(s) = C j(s) + iY j(s)= a(s)eiq(s), in which the amplitude

aj (s)=√

Cj (s)2+Yj (s)

2

Equation 14 The computation of amplitudes in HT

and phase q j(s) = arctan(Y j(s)/C j(s)). Based on thedefinition stated above, the IF j can be derived byapplying a derivative on q j(s) (i.e. w j=dq j(s)/ds).Then, an analytical representation of X(s), can bederived

X(s)=n

∑j=1

aj (s) [i∫

ωj (s)ds]

Equation 15 Analytical representation of the inputsignal based on the HHT

Originally, the EMD was proposed to decom-pose one-dimensional data. To construct a 2D-HHT, the concept of EMD was extended to 2D inthis study based on the concept listed as follows:

1 identify the extrema (maxima and minima) ofthe image by sliding a 3-by-3 grid;

2 generate two smooth 2D surfaces to fit the foundmaxima and minima;


Figure 9. The illustration of the data processing in this study, the application of 2D EMD on IAPS picture1120

3 compute the local mean by averaging two sur-faces; and

4 the equation of applying 2D-EMD then can berewrite from Equation 12, to

f(x,y)=∑nj=1 Cj (x,y)+rn (x,y).

Equation 16 The decomposition of a 2D input sig-nal based on the 2D-EMD

The data processing done in this study is illus-trated in Figure 9. The original image (1024x768resolutions) was first down-sampled to 128x128resolutions. The color setting was changed fromRGB color into gray color. Second, the 2D EMD isapplied to the image. For extracting IFs from IMFs,this study applies the concept of partial HT by ap-plying 1D HT to each orientation (i.e. each row andeach column) and unit, in order to extract spatial-frequency features that account for different orien-tations of visual stimulations [75]. The IF analysismethod used in this study was inspired by the workin [79] which provides a show case on estimatingthe changes of IF data series. This study mainlyadopts three indexes as follows: 1) F Q IMF j rep-resents frequency value in the 1st quarter of the his-togram area of IFx; 2) A I IMF j represents the ra-tio between the 1st and the 2nd halves of the his-togram area of IF j; 3) M I IMF j represents the ra-tio between the maxima found in the 1st and 2nd

halves of the histogram area of IF j. This studyapplies totally 12 features listed below as follows:1) The vertical side (the direction of applying 1D

HT) F Q IMF1, the horizontal side (the directionof applying 1D HT) F Q IMF1, the vertical sideA I IMF1, the horizontal side A I IMF1, the ver-tical side M I IMF1, the horizontal side M I IMF1;2) the vertical side F Q IMF2, the horizontal sideF Q IMF2, the vertical side A I IMF2, the horizon-tal side A I IMF2, the vertical side M I IMF2, thehorizontal side M I IMF2.

We totally acquired 318 rows of the feature vec-tor from the collected data set. The method that weuse to build the prediction model is introduced inthe following section.

4.3.2 Model Building

Models were built to predict the emotion rat-ings rated by subjects in terms of valence andarousal through SAM. The prediction of valenceand arousal can be real number herein according tothe definition of valence and arousal in the dimen-sional theory of emotion [11]. Besides the XCSF,this study also applies several well-known machine-learning techniques for comparison purpose. Zero-R is a majority voting learning scheme that predictsthe majority class in any data set. In a classifi-cation task, the Zero-R classifies an instance intothe majority class, whereas in a prediction task, theZero-R predicts the mean value of all the instances.Thus, the performance of the Zero-R can be consid-ered as a baseline performance of the classificationclass, which should be beaten by any algorithm thatlearns decision boundaries from the data set withoutover-fitting. One-layer method such as LR [80] and

Figure 9T providedX(s). Ththe remathe EMIMFs fro

The pFourier data-drivunsupervefficiencapplied signal (dplease rethen app

Equation the EMD

Each IMpair of represen+ iYj(s) =

Equation

and phasthe defiderived

The illustration

d inEquationhe residuals (ainder series

MD shifting om the originprocedure ofand Wavele

ven.By bvised, the cy of signal to the non-

details on thefer to [77]).plied to each

Y��s� � π13Transform

D into Yj(s)

MFj can be ref Yj(s) andnted by an an= a(s)ei(s), in

a��s� � �

n 14The comp

sej(s) = arcinition statedby applying

n of the data p

n 12, are e(rn) is a data s of target da

process remnal target datf EMD, diffeet Decomposbeing ad

EMD imdecompositi-linear and he procedure After the EMIMF

1π�

C��τ�s � τ

�

��

ming each IMFj

epresented byd Cj(s), henalytical signn which the a

�C��s�� Y�

putation of am

ctan(Yj(s)/Cjd above, th

g a derivativ

processing in t

extracted froseries whichata series aftmoves all tta series. ferent from tsition, is ful

daptive amproves tion and can non-stationa

e of the EMMD, the HT

�τ dτ

j extracted by

y the conjugaence can nal Z(s) = Cj(amplitude

��s��

mplitudes in H

j(s)). Based he IFj can ve onj(s) (i

this study, the

om h is fter the

the lly

and the be

ary MD T is

ate be (s)

HT

on be

i.e.

jre

Eqsig

Odea toas 1.of2.fo3.su4.be

Eqba Thill(1sasecoim

e application o

j=dj(s)/ds). epresentation

X�s�

quation 15 Angnal based on

Originally, tecompose on2D-HHT, th

o 2D in this ss follows:

.identify thef the image b.generate twound maxima.compute theurfaces; and .the equatione rewrite from

f�x, y�quation 16Theased on the 2D

he data prolustrated in1024x768 rampled to 1etting was cholor. Secondmage. For e

of 2D EMD on

Then, n of X(s), can

��a��s��

��

nalytical repren the HHT

the EMD ne-dimensionhe concept ofstudy based

extrema (mby sliding a 3

wo smooth 2Da and minimae local mea

n of applyinmEquation 1

� ∑ C��x,��

e decompositiD-EMD

ocessing donFigure 9. Tresolutions) 128x128 reshanged from d, the 2D EMextracting IF

n IAPS pictur

an an be derived

i �ω��s� ds�

esentation of th

was propnal data. To f EMD was on the conc

maxima and 3-by-3 grid; D surfaces ta;

an by averag

ng 2D-EMD 12, to

, y� � ��x, yion of a 2D in

ne in this The origina

was firstsolutions. TRGB color

MD is applieFs from IM

e 1120

analytical

�

he input

osed to construct extended ept listed

minima)

to fit the

ging two

then can

y�. nput signal

study is al image t down-

The color into gray ed to the

MFs, this


multi-layered method with transfer function suchas Radial-Basis-Function (RBF) Network [69] wereused in this study. The LOOCV which leaves onesample out at each time as a testing set and theremaining samples as a training set was used formodel building.

The XCSF used in this study was adopted fromthe Java implementation version on XCSF con-tributed by Stalph and Butz (2009) [67]. For pa-rameters setting, a = 1.0; b = 0.1; d = 0.1; l = 1.0;qGA = 50; e0 = 0.5; drls = 1000; qdel = 20; c = 1.0;m = 1.0; qsub = 20; the GA subsumption was turnedon. Although the maximal population size N wasset to 6,400˜10,000 to maximize the performance ofXCSF, the number of classifiers quickly convergedto 5,400 during the model training. To examinethe performance of the system, e0 was set to var-ious values. However, it appears relatively smalleffect on the learning performance in regard to thelearning speed and system error. During the modeltraining, the XCSF was sequentially presented with20,000 instances randomly selected from the train-ing dataset.





The performance evaluation based on MAE andthe standard deviation (represents by SD) of theMAEs achieved by the methods used is provided inTable 2. The ZeroR represents the Zero-R classifier,LinearReg represents the LR model, and RBFNetrepresents the RBF network. The number of nodes

(clusters) of the RBF network was set to 200 basedon the result of the examination on the performancechanges caused by the number of nodes.

The MAE of a prediction model which predictsat random on the value of valence and arousal is 4.0.Hence, the MAE 1.453±1.076 achieved by the LRseems to be fair.

Table 2. The performance achieved by benchmarkclassifiers and XCSF

The MAE achieved by the RBF network is0.949±0.747, which further shows a reduction ofthe error by 35%. This result indicates the existenceof the non-linearity characteristic of the datasetcollected. The MAE achieved by the XCSF was0.950±0.755. The equivalence in the performanceof RBF network and XCSF indicates the capabil-ity of XCSF on mapping non-linear functions. Themechanism of the XCSF in model building by man-aging a set of linear classifiers seems to be compa-rable to the multi-layered based method with non-linear transfer function. To further examine the per-formance of the XCSF, the MAEs of the XCSF oneach valence and arousal value are also providedin Figure 10 and Figure 11. To compare the MAEachieved by the XCSF, the MAEs that achieved byuniRand, a classifier that makes predictions in a uni-formly random manner are also included in thesefigures.

The performance of the XCSF in predictingeach valence value is illustrated in Figure 10 inwhich x-axis represents the valence value, y-axisrepresents the MAE value. The MAEs that the clas-sifiers achieved on each valence are represented bythree bars. The right most bar represents the MAEachieved by the XCSF. The MAEs of uniRand and

Zero-R classifier, LinearReg represents the LR model, and RBFNet represents the RBF network. The number of nodes (clusters) of the RBF network was set to 200 based on the result of the examination on the performance changes caused by the number of nodes. The MAE of a prediction model which predicts at random on the value of valence and arousal is 4.0. Hence, the MAE 1.453±1.076 achievedby theLR seems to be fair.

Table 2The performance achieved by benchmark classifiers and XCSF

Prediction Results Affective Dimension Valence Arousal

Method Statistics ZeroR MAE 1.617 1.491

SD of MAE 1.110 1.065 LinearReg MAE 1.453 1.427

SD of MAE 1.076 1.083 RBFNet MAE 0.950 1.471

SD of MAE 0.747 1.021 XCSF MAE 0.950 1.461

SD of MAE 0.755 1.011 The MAE achieved by the RBF network is 0.949±0.747, which further shows a reduction of the error by 35%.This result indicates the existence of the non-linearity characteristic of the dataset collected. The MAE achieved by the XCSF was 0.950±0.755. The equivalence in the performance of RBF network and XCSF indicates the capability of XCSF on mapping non-linear functions.The mechanism of the XCSF inmodel building by managing a set of linear classifiers seems to be comparable to the multi-layered based method with non-linear transfer function. To further examine the performance of the XCSF, the MAEs of the XCSF on each valence and arousal value are also provided in Figure 10andFigure 11. To compare the MAE achieved by the XCSF, the MAEs that achieved by uniRand, a classifier that makes predictions in a uniformly random manner are also included in these figures. The performance of the XCSF in predicting each valence value is illustrated in Figure 10 in which x-axis represents the valence value, y-axis represents the MAE value. The MAEs that the classifiers achieved on each valence are represented by three bars. The right most bar represents the MAE achieved by the XCSF. The MAEs of uniRand and ZeroR are represented by the first and the second bar.

Figure 10The MAE of XCSF on each valence value. The standard deviation of MAE made by XCSF is nearly 0.58~0.91. Based on the SAM ratings, the maximal MAE is 8 and minimal MAE is 0.

Figure 11The MAE of XCSF on each arousal value. The standard deviation of MAE made by XCSF is nearly 0.26~0.69. Based on the SAM ratings, the maximal MAE is 8 and minimal MAE is 0. The MAE achieved by the XCSF is smaller than the MAE achieved by uniRand and ZeroR at most ratings (i.e. the value of valence and arousal). The MAE of XCSF is only larger than the MAE of ZeroR at the ratings near the mean values. A skew on the value of MAEs is observed for the largest and lowest valence values (i.e. 0~1 and 8). This is possibly due to the sample size of these ratings, since the numbers of samples of valence equals to 0, 1, and 8 are smaller. This finding suggests that insufficient sample size of a class (e.g., valence = 8) may lead to bad performance of XCSF in predicting the corresponding output value. However, the MAE of XCSF at valence 4.0 was not the lowest, which indicates that the larger sample size only guarantees the efficacy of XCSF in function approximation instead of eliminating all exist errors. In our observation, approximately 30 samples (which is nearly 10% of the number in our collected data set) is sufficient for the XCSF to build a model to

012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF

012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF


multi-layered method with transfer function suchas Radial-Basis-Function (RBF) Network [69] wereused in this study. The LOOCV which leaves onesample out at each time as a testing set and theremaining samples as a training set was used formodel building.

The XCSF used in this study was adopted fromthe Java implementation version on XCSF con-tributed by Stalph and Butz (2009) [67]. For pa-rameters setting, a = 1.0; b = 0.1; d = 0.1; l = 1.0;qGA = 50; e0 = 0.5; drls = 1000; qdel = 20; c = 1.0;m = 1.0; qsub = 20; the GA subsumption was turnedon. Although the maximal population size N wasset to 6,400˜10,000 to maximize the performance ofXCSF, the number of classifiers quickly convergedto 5,400 during the model training. To examinethe performance of the system, e0 was set to var-ious values. However, it appears relatively smalleffect on the learning performance in regard to thelearning speed and system error. During the modeltraining, the XCSF was sequentially presented with20,000 instances randomly selected from the train-ing dataset.





The performance evaluation based on MAE andthe standard deviation (represents by SD) of theMAEs achieved by the methods used is provided inTable 2. The ZeroR represents the Zero-R classifier,LinearReg represents the LR model, and RBFNetrepresents the RBF network. The number of nodes

(clusters) of the RBF network was set to 200 basedon the result of the examination on the performancechanges caused by the number of nodes.

The MAE of a prediction model which predictsat random on the value of valence and arousal is 4.0.Hence, the MAE 1.453±1.076 achieved by the LRseems to be fair.

Table 2. The performance achieved by benchmarkclassifiers and XCSF

The MAE achieved by the RBF network is0.949±0.747, which further shows a reduction ofthe error by 35%. This result indicates the existenceof the non-linearity characteristic of the datasetcollected. The MAE achieved by the XCSF was0.950±0.755. The equivalence in the performanceof RBF network and XCSF indicates the capabil-ity of XCSF on mapping non-linear functions. Themechanism of the XCSF in model building by man-aging a set of linear classifiers seems to be compa-rable to the multi-layered based method with non-linear transfer function. To further examine the per-formance of the XCSF, the MAEs of the XCSF oneach valence and arousal value are also providedin Figure 10 and Figure 11. To compare the MAEachieved by the XCSF, the MAEs that achieved byuniRand, a classifier that makes predictions in a uni-formly random manner are also included in thesefigures.

The performance of the XCSF in predictingeach valence value is illustrated in Figure 10 inwhich x-axis represents the valence value, y-axisrepresents the MAE value. The MAEs that the clas-sifiers achieved on each valence are represented bythree bars. The right most bar represents the MAEachieved by the XCSF. The MAEs of uniRand and


ZeroR are represented by the first and the secondbar.

Figure 10. The MAE of XCSF on each valencevalue. The standard deviation of MAE made byXCSF is nearly 0.58˜0.91. Based on the SAM

ratings, the maximal MAE is 8 and minimal MAEis 0.

Figure 11. The MAE of XCSF on each arousalvalue. The standard deviation of MAE made byXCSF is nearly 0.26˜0.69. Based on the SAM

ratings, the maximal MAE is 8 and minimal MAEis 0.

The MAE achieved by the XCSF is smaller thanthe MAE achieved by uniRand and ZeroR at mostratings (i.e. the value of valence and arousal). TheMAE of XCSF is only larger than the MAE of Ze-roR at the ratings near the mean values. A skew onthe value of MAEs is observed for the largest andlowest valence values (i.e. 0˜1 and 8). This is pos-sibly due to the sample size of these ratings, sincethe numbers of samples of valence equals to 0, 1,and 8 are smaller. This finding suggests that insuf-ficient sample size of a class (e.g., valence = 8) maylead to bad performance of XCSF in predicting thecorresponding output value. However, the MAE ofXCSF at valence 4.0 was not the lowest, which in-dicates that the larger sample size only guarantees

the efficacy of XCSF in function approximation in-stead of eliminating all exist errors. In our obser-vation, approximately 30 samples (which is nearly10% of the number in our collected data set) is suf-ficient for the XCSF to build a model to predict avalence value in our collected dataset. On the otherhand, this phenomenon happened in valence 0 and8 could also be explained by a psychological ap-proach. That is, some of the subjects reported thatthey tended to rate the values in the middle of thescale rather than those values that represent extremeemotional experiences. This may cause the non-linear characteristics of the distances between thelevels of valence and arousal. Further clarificationis required for this issue; an appropriate transfor-mation may be applied to the data to improve theresult.

Similar results can be found in Figure 11,in which for the ZeroR, the prediction was setto 3.937. The MAE achieved by the XCSF ineach level of arousal substantially outperformed theMAE of uniRand and ZeroR. However, the Figure11 shows the increase of the MAEs achieved bythe XCSF at each level of arousal. This could beexplained by that most subjects reported that dur-ing the experiment, they confused the definition of“being aroused” with “the tendency of valence”.The reaction of the subjects is possibly caused bythe cultural difference, but similar results were nothighlighted previously in the research communitythat applies the IAPS and SAM.

Gender information was not used for modelbuilding, but the within-subject analysis was con-ducted. However, we found that without the ap-plying content analysis to the image, the effect ofindividual difference is relatively small. To furtherexamine the performance of the XCSF in this task,the ROC curve of the







SD of MAE 0.747 1.021 XCSF MAE 0.950 1.461




012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF

012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF







SD of MAE 0.747 1.021 XCSF MAE 0.950 1.461




012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF

012345

0 1 2 3 4 5 6 7 8

uniRand ZeroR XCSF


Figure 12. The ROC curve made by XCSF onprediction of the event when valence of the image

is rated smaller than 4 (neutral).


is rated larger than 4 (neutral)

XCSF in predicting the value of valence of animage being rated smaller or larger than 4 (i.e. 4 isthe value of valence that represents a neutral state)are provided in Figure 12 and Figure 13. The re-sult shows that the AUC achieved by the XCSF issignificantly (p-value < .000) larger than the ran-dom classifier the AUC of predicting valence < 4 is0.913, in which the AUC of predicting valence > 4is 0.914). However, the AUC achieved by the XCSFin predicting the value of arousal being smaller orlarger than 4 is relative small (p-value < .005). TheAUC of predicting arousal < 4 is 0.594 and theAUC of predicting arousal > 4 is 0.573. The accu-racy rates achieved by the XCSF on the cut-off pointare: 84.3% (for predicting valence < 4), 86.8% (va-lence > 4), and 53.1% (arousal < 4), and 56.0%

(arousal > 4). In addition, the ROC curve of RBFnetwork was also examined because the MAE madeby RBF network was favorable in comparison withthe MAE made by XCSF. The results are providedin Table 3.

Table 3. The results of ROC Curve Produced byXCSF and RBF network

To further identify the extracted knowledge, theprediction models built by the LR are provided inEquation 17 and Equation 18.

Valence = 3.2893 * F Q IMF1 col + 0.3651 *F Q IMF1 row + 2.6606 * A I IMF1 row + 0.4394* F Q IMF2 row - 0.2629 * A I IMF2 col + 2.335* A I IMF2 row- 1.0522

Equation 17 The model of predicting valence basedon spatial-frequency properties

Arousal = -2.4297 * F Q IMF1 col - 0.9253 *F Q IMF1 row + 0.1896 * A I IMF1 col - 1.2495* A I IMF1 row + 0.4578 * M I IMF1 col - 0.3721* A I IMF2 col + 0.1047 * M I IMF2 col + 7.0513

Equation 18 The model of predicting arousal basedon spatial-frequency properties

For building models by the LR, Akaike crite-rion was used for model selection and the M5’smethod was used for attribute selection, in whichall the co-linear attributes were excluded. The equa-tions show that F Q IMF1 col, F Q IMF1 row, andA I IMF2 row were the main factors that affectthe affective ratings during the experiment. TheF Q IMF1 col, F Q IMF1 row, and A I IMF2 rowshow positive relationship to the rating of valence.These results indicate that the stimulations fromhorizontal side are more effective than the stimu-lations from vertical side. The horizontal side ofthe image may contain abundant information. Con-versely, the affective characteristics of an image in

predict a valence value in our collected dataset. On the other hand, this phenomenon happened in valence 0 and 8 could also be explained by a psychological approach. That is, some of the subjects reported that they tended to rate the values in the middle of the scale rather than those values that represent extreme emotional experiences. This maycause the non-linear characteristics of the distances between the levels of valence and arousal. Further clarification is required for this issue; an appropriate transformationmay be applied to the data to improve the result. Similar results can be found inFigure 11, in which for the ZeroR, the prediction was set to 3.937. The MAE achieved by the XCSF in each level of arousal substantially outperformed the MAE of uniRand and ZeroR. However, the Figure 11shows the increase of the MAEs achieved by the XCSF at each level of arousal. This could be explained by that most subjects reported that during the experiment, they confused the definition of “being aroused” with “the tendency of valence”. The reaction of the subjects is possibly caused by the cultural difference, but similar results were not highlighted previously in the research community that applies the IAPS and SAM. Gender information was not used for model building, but the within-subject analysis was conducted. However, we found that without the applying content analysis to the image, the effect of individual difference is relatively small. To further examine the performance of the XCSF in this task, the ROC curve of the

Figure 12 The ROC curve made by XCSF on prediction of the event when valence of the image is rated smaller than 4 (neutral).

Figure 13The ROC curve made by XCSF on prediction of the event when valence of the image is rated larger than 4 (neutral)

XCSF in predicting the value of valence of an image being rated smaller or larger than 4 (i.e. 4 is the value of valence that represents a neutral state) are provided inFigure 12 andFigure 13. The result shows that the AUCachieved by the XCSF is significantly (p-value < .000) larger than the random classifier the AUC of predicting valence < 4 is 0.913, in which the AUC of predicting valence > 4 is 0.914). However, the AUCachieved bythe XCSF in predicting the value of arousal being smaller or larger than 4 isrelative small(p-value < .005).The AUC of predicting arousal < 4 is 0.594 and the AUC of predicting arousal > 4 is 0.573. The accuracy rates achieved by the XCSF on the cut-off point are: 84.3% (for predicting valence < 4), 86.8% (valence > 4), and 53.1% (arousal < 4), and 56.0% (arousal > 4). In addition, the ROC curve of RBF network was also examined because the MAE made by RBF network was favorable in comparison with the MAE made by XCSF. The results are provided inTable 3. Table 3The results of ROC Curve Produced by XCSF and RBF network

Prediction Target V< 4 V> 4 A< 4 A> 4 Method EstimationXCSF AUC 0.913 0.914 0.594 0.573

Accuracy 84.30% 86.80% 53.10% 56.00%RBFnet AUC 0.682 0.673 0.495 0.505

Accuracy 70.50% 65.20% 50.10% 63.30%V: Valence, A: Arousal.

To further identify the extracted knowledge,



















is rated smaller than 4 (neutral).


is rated larger than 4 (neutral)

XCSF in predicting the value of valence of animage being rated smaller or larger than 4 (i.e. 4 isthe value of valence that represents a neutral state)are provided in Figure 12 and Figure 13. The re-sult shows that the AUC achieved by the XCSF issignificantly (p-value < .000) larger than the ran-dom classifier the AUC of predicting valence < 4 is0.913, in which the AUC of predicting valence > 4is 0.914). However, the AUC achieved by the XCSFin predicting the value of arousal being smaller orlarger than 4 is relative small (p-value < .005). TheAUC of predicting arousal < 4 is 0.594 and theAUC of predicting arousal > 4 is 0.573. The accu-racy rates achieved by the XCSF on the cut-off pointare: 84.3% (for predicting valence < 4), 86.8% (va-lence > 4), and 53.1% (arousal < 4), and 56.0%

(arousal > 4). In addition, the ROC curve of RBFnetwork was also examined because the MAE madeby RBF network was favorable in comparison withthe MAE made by XCSF. The results are providedin Table 3.

Table 3. The results of ROC Curve Produced byXCSF and RBF network

To further identify the extracted knowledge, theprediction models built by the LR are provided inEquation 17 and Equation 18.

Valence = 3.2893 * F Q IMF1 col + 0.3651 *F Q IMF1 row + 2.6606 * A I IMF1 row + 0.4394* F Q IMF2 row - 0.2629 * A I IMF2 col + 2.335* A I IMF2 row- 1.0522

Equation 17 The model of predicting valence basedon spatial-frequency properties

Arousal = -2.4297 * F Q IMF1 col - 0.9253 *F Q IMF1 row + 0.1896 * A I IMF1 col - 1.2495* A I IMF1 row + 0.4578 * M I IMF1 col - 0.3721* A I IMF2 col + 0.1047 * M I IMF2 col + 7.0513

Equation 18 The model of predicting arousal basedon spatial-frequency properties

For building models by the LR, Akaike crite-rion was used for model selection and the M5’smethod was used for attribute selection, in whichall the co-linear attributes were excluded. The equa-tions show that F Q IMF1 col, F Q IMF1 row, andA I IMF2 row were the main factors that affectthe affective ratings during the experiment. TheF Q IMF1 col, F Q IMF1 row, and A I IMF2 rowshow positive relationship to the rating of valence.These results indicate that the stimulations fromhorizontal side are more effective than the stimu-lations from vertical side. The horizontal side ofthe image may contain abundant information. Con-versely, the affective characteristics of an image in


regard to making people feel aroused, is negativecorrelated with F Q IMF1 col and A I IMF1 row.In addition, the offset of the equation 3 is +7.0513.These results indicate that the effects of activationin motivational system due to a visual stimulus areinfluenced by the asymmetric of cortical receptorsresponsible for distinctive directions of the spatial-frequency visual stimulations.

5 Conclusion

The overall goal of this study was to build an intelli-gent machine that can classify images based on theiraffective characteristics, especially the classifica-tion based on features extracted from the spatial-frequency domain. To achieve this goal, two novelaffect detectors, two speed-up techniques for theXCS, and a novel 2D approach of applying the HHTmethod were developed. The developed systemswere built and validated using multiple human-subject experiments and compared with the existingrelated systems. The rest of this chapter presents theachieved objectives, main conclusions from eachcontribution chapter, and the future work that stemsfrom this research work.

5.1 Achieved Objectives

The following research objectives have beenfulfilled by this work to achieve the overall researchgoal.

– For the first time, high resolution spatial-frequency features were extracted from thegiven images and used in building an affectiveclassification model. By utilizing the proposednovel 2D feature-extraction method, the devel-oped algorithm readily demonstrated spatial-frequency calculation at a resolution that exist-ing 2D-FFT, wavelet-based methods, and HHTcannot.

– For the first time, conducted controlled experi-ments on this issue that adopt standard instru-ments which make the results cross-cultural andcomparable to future studies which follow thesame standard. The use of the dimensional the-ory of emotions in this study enabled the useof rich standard methodology and paradigms forconducting experiments. When without the use

of these methods, the obtained results were hardto compare and reproduce, leading the demon-strated techniques controversial.

In addition to achieving the above establishedresearch objectives, this work provided a detailedinvestigation and analysis of the models built forclassifying images. This analysis revealed that nomatter which the images are and who the subjectsare, the strength of a specific frequency band in aspecific directions on the influence of the affectivecharacteristics could be the same (i.e. image inde-pendent and user independent). Further, the intro-duced 2D-HHT method are not simply another wayof obtaining spatial-frequency features as the pro-posed methods fundamentally change the way thatan ordinary 2D-FFT functions can do for improvingresolution (e.g., windowing). Also that a standard2D-FFT does not use all available amplitude infor-mation and do not decompose the given image toextract frequency domain information, whereas thedeveloped HHT based 2D method effectively ex-ploit the hidden information in an image by usingEMD.

5.2 Main Conclusions

This section presents the main conclusions andhighlights from the two major contribution chapters(Chapter 3 and Chapter 4).

On the patterns in the affective image classifica-tion, two models were built and validated based onmultiple human-subject experiments. These mod-els were built based on HSV properties of imagesand the features extracted from spatial-frequencydomain through a 2D HHT method. By using theproposed 2D HHT method, this study obtains highresolution information in the spatial-frequency do-main. The result indicates that both the modelsshow comparable results to the results that were re-ported in the previous studies.

The XCSF was demonstrated to adequatelyapproximate the output landscape of the affec-tive image-classification problem in the collecteddatasets. The relationships between the used fea-tures and the affective characteristics of imageswere examined and shown in Equation 10, Equa-tion 11, Equation 17, and Equation 18. The prop-erty of images in the HSV and the spatial-frequencydomain is proved to be influential to the affective


characteristics of images. These user-independentresults are favorable because less object-detectiontechnique is required for model building. Moreover,there should be less possible interference due to in-dividual difference which should have been elimi-nated during the model-building process.

5.3 Future Work

For suggested future work, applications on in-dexing affective characteristics of the images on theinternet, or providing feedbacks to the users to im-prove the quality of life; for example, to ease peo-ple, or to excite people; are possible future work. Infuture, the results produced by this study should bereplicated using images obtain from sources otherthan the IAPS database, to validate the experimentalresults in regard to generality. Further examinationon the mechanisms and pathway between the affec-tive information contained in the spatial-frequencydomain and the cortical receptors in human eye, isalso suggested.

5.4 Closing Remarks

This research work has shown that images canbe indexed based on their affective characteristics.The 2D nature and the complexity in the imagesadded additional difficulties to the task. The useof HSV models and high-resolution 2D spatial-frequency feature-extraction method such as the2D-HHT can led the systems to build accurate,maximally general and compact models in index-ing various IAPS images as well as pictures in real-world. Effectively exploiting the combined powerof 2D-HHT and the XCSF, various real-value mod-eling and function approximation problems couldbe solved in a simple and straight forward manner.Understanding of the affective image-classificationtask reveals that the standard spatial-frequency fea-ture extraction methods do not exploit all availableinformation embedded in the amplitude matrix ofimages, whereas the developed 2D-HHT based sys-tems effectively exploit the embedded-features andspatial information during the model-building pro-cess. This study has shown a new perspective ofindexing images based on their affective character-istics, not just a new feature extraction method forindexing images, which is needed to replicate hu-man capabilities and should lead to various novelapplications.

6 Acknowledgement

This work was fully supported by the Tai-wan National Science Council under grant numbersMOST 103-2221-E-009 -139. This work was alsosupported in part by the ”Aim for the Top Univer-sity Plan” of the National Chiao-Tung Universityand Ministry of Education, Taiwan, R.O.C..

References[1] Kuo, W.J., et al., Intuition and Deliberation: Two

Systems for Strategizing in the Brain. Science,2009. 324(5926): p. 519-522.

[2] Chowdhury, R.M.M.I., G.D. Olsen, and J.W.Pracejus, Affective Responses to Images In PrintAdvertising: Affect Integration in a Simultane-ous Presentation Context. Journal of Advertising,2008. 37(3): p. 7-18.

[3] Chang, C., The Impacts of Emotion Elicited ByPrint Political Advertising on Candidate Evalua-tion. Media Psychology, 2001. 3(2): p. 91-118.

[4] Kyung-Sun, K., Effects of emotion control and taskon Web searching behavior. Information Process-ing & Management, 2008. 44(1): p. 373-385.

[5] Mitchell, T.M., Machine learning. 1997. BurrRidge, IL: McGraw Hill, 1997. 45.

[6] Orriols-Puig, A., J. Casillas, and E. Bernad-Mansilla, Genetic-based machine learning systemsare competitive for pattern recognition. Evolution-ary Intelligence, 2008. 1(3): p. 209-232.

[7] Russell, S. and P. Norvig, Artificial Intelligence:A Modern Approach (2nd Edition)2002: PrenticeHall.

[8] Wilson, S.W., Classifier fitness based on accuracy.Evol. Comput., 1995. 3(2): p. 149-175.

[9] Picard, R.W., Affective Computing2000, Cam-bridge MA: The MIT Press.

[10] Ortony, A. and T. Turner, What’s basic about basicemotions. Psychological review, 1990.

[11] Bradley, M.M., Emotional memory: a dimensionalanalysis, in Emotions: Essays on emotion theory,S.H.M.v. Goozen, N.E.v.d. Poll, and J.A. Sergeant,Editors. 1994, Lawrence Erlbaum: Hillsdale, NJ.p. 97-134.

[12] Bradley, M.M. and P.J. Lang, Emotion and moti-vation, in Handbook of Psychophysiology, J.T. Ca-cioppo, L.G. Tassinary, and G. Berntson, Editors.2007, Cambridge University Press: New York, NY.p. 581-607.


characteristics of images. These user-independentresults are favorable because less object-detectiontechnique is required for model building. Moreover,there should be less possible interference due to in-dividual difference which should have been elimi-nated during the model-building process.

5.3 Future Work

For suggested future work, applications on in-dexing affective characteristics of the images on theinternet, or providing feedbacks to the users to im-prove the quality of life; for example, to ease peo-ple, or to excite people; are possible future work. Infuture, the results produced by this study should bereplicated using images obtain from sources otherthan the IAPS database, to validate the experimentalresults in regard to generality. Further examinationon the mechanisms and pathway between the affec-tive information contained in the spatial-frequencydomain and the cortical receptors in human eye, isalso suggested.

5.4 Closing Remarks

This research work has shown that images canbe indexed based on their affective characteristics.The 2D nature and the complexity in the imagesadded additional difficulties to the task. The useof HSV models and high-resolution 2D spatial-frequency feature-extraction method such as the2D-HHT can led the systems to build accurate,maximally general and compact models in index-ing various IAPS images as well as pictures in real-world. Effectively exploiting the combined powerof 2D-HHT and the XCSF, various real-value mod-eling and function approximation problems couldbe solved in a simple and straight forward manner.Understanding of the affective image-classificationtask reveals that the standard spatial-frequency fea-ture extraction methods do not exploit all availableinformation embedded in the amplitude matrix ofimages, whereas the developed 2D-HHT based sys-tems effectively exploit the embedded-features andspatial information during the model-building pro-cess. This study has shown a new perspective ofindexing images based on their affective character-istics, not just a new feature extraction method forindexing images, which is needed to replicate hu-man capabilities and should lead to various novelapplications.

6 Acknowledgement

This work was fully supported by the Tai-wan National Science Council under grant numbersMOST 103-2221-E-009 -139. This work was alsosupported in part by the ”Aim for the Top Univer-sity Plan” of the National Chiao-Tung Universityand Ministry of Education, Taiwan, R.O.C..

References[1] Kuo, W.J., et al., Intuition and Deliberation: Two

Systems for Strategizing in the Brain. Science,2009. 324(5926): p. 519-522.

[2] Chowdhury, R.M.M.I., G.D. Olsen, and J.W.Pracejus, Affective Responses to Images In PrintAdvertising: Affect Integration in a Simultane-ous Presentation Context. Journal of Advertising,2008. 37(3): p. 7-18.

[3] Chang, C., The Impacts of Emotion Elicited ByPrint Political Advertising on Candidate Evalua-tion. Media Psychology, 2001. 3(2): p. 91-118.

[4] Kyung-Sun, K., Effects of emotion control and taskon Web searching behavior. Information Process-ing & Management, 2008. 44(1): p. 373-385.

[5] Mitchell, T.M., Machine learning. 1997. BurrRidge, IL: McGraw Hill, 1997. 45.

[6] Orriols-Puig, A., J. Casillas, and E. Bernad-Mansilla, Genetic-based machine learning systemsare competitive for pattern recognition. Evolution-ary Intelligence, 2008. 1(3): p. 209-232.

[7] Russell, S. and P. Norvig, Artificial Intelligence:A Modern Approach (2nd Edition)2002: PrenticeHall.

[8] Wilson, S.W., Classifier fitness based on accuracy.Evol. Comput., 1995. 3(2): p. 149-175.

[9] Picard, R.W., Affective Computing2000, Cam-bridge MA: The MIT Press.

[10] Ortony, A. and T. Turner, What’s basic about basicemotions. Psychological review, 1990.

[11] Bradley, M.M., Emotional memory: a dimensionalanalysis, in Emotions: Essays on emotion theory,S.H.M.v. Goozen, N.E.v.d. Poll, and J.A. Sergeant,Editors. 1994, Lawrence Erlbaum: Hillsdale, NJ.p. 97-134.

[12] Bradley, M.M. and P.J. Lang, Emotion and moti-vation, in Handbook of Psychophysiology, J.T. Ca-cioppo, L.G. Tassinary, and G. Berntson, Editors.2007, Cambridge University Press: New York, NY.p. 581-607.


[13] Lang, P.J., The motivational organization of emo-tion: Affect-reflex connections, in Emotions: Es-says on emotion theory1994, Lawrence Erlbaum:Hillsdale, NJ. p. 61-93.

[14] Lang, P.J., The Emotion Probe - Studies of Motiva-tion and Attention. American Psychologist, 1995.50(5): p. 372-385.

[15] Bolls, P.D., A. Lang, and R.F. Potter, The Effects ofMessage Valence and Listener Arousal on Atten-tion, Memory, and Facial Muscular Responses toRadio Advertisements. Communication Research,2001. 28: p. 627-651.

[16] Antonio R, D., Emotion in the perspective of an in-tegrated nervous system. Brain Research Reviews,1998. 26(2–3): p. 83-86.

[17] Bechara, A., The role of emotion in decision-making: Evidence from neurological patients withorbitofrontal damage. Brain and cognition, 2004.55(1): p. 30-40.

[18] LaBar, K.S. and R. Cabeza, Cognitive neuro-science of emotional memory. Nat Rev Neurosci,2006. 7(1): p. 54-64.

[19] Wu, Q., C. Zhou, and C. Wang, Content-BasedAffective Image Classification and Retrieval UsingSupport Vector Machines, in Affective Computingand Intelligent Interaction, J. Tao, T. Tan, and R.Picard, Editors. 2005, Springer Berlin / Heidelberg.p. 239-247.

[20] Joshi, D., et al., Aesthetics and Emotions in Im-ages. Signal Processing Magazine, IEEE, 2011.28(5): p. 94-115.

[21] Liu, N., et al., Associating Textual Features withVisual Ones to Improve Affective Image Classi-fication, in Affective Computing and IntelligentInteraction, S. D’Mello, et al., Editors. 2011,Springer Berlin / Heidelberg. p. 195-204.

[22] Machajdik, J. and A. Hanbury, Affective imageclassification using features inspired by psychol-ogy and art theory, in Proceedings of the inter-national conference on Multimedia2010, ACM:Firenze, Italy. p. 83-92.

[23] Zhang, H., et al., Analyzing Emotional Seman-tics of Abstract Art Using Low-Level Image Fea-tures, in Advances in Intelligent Data Analysis X,J. Gama, E. Bradley, and J. Hollmn, Editors. 2011,Springer Berlin / Heidelberg. p. 413-423.

[24] Lang, P.J., M.M. Bradley, and B.N. Cuthbert, Inter-national affective picture system (IAPS): Affectiveratings of pictures and instruction manual, 2008:University of Florida, Gainesville, FL.

[25] Sanchez-Navarro, J., et al., Psychophysiological,behavioral, and cognitive indices of the emotionalresponse: A factor-analytic study. Spanish Journalof Psychology, 2008. 11(1): p. 16-25.

[26] Kensinger, E.A., R.J. Garoff-Eaton, and D.L.Schacter, Effects of emotion on memory speci-ficity: Memory trade-offs elicited by negative visu-ally arousing stimuli. Journal of Memory and Lan-guage, 2007. 56(4): p. 575-591.

[27] Lang, P.J., Behavioral treatment and bio-behavioral assessment: Computer applications,in Technology in Mental Health Care DeliverySystems, J. Sidowski, J. Johnson, and T. Williams,Editors. 1980, Ablex Pub. Corp.: Norwood, NJ. p.119-137.

[28] Morris, J.D., Observations: SAM: the Self-Assessment Manikin; an efficient cross-culturalmeasurement of emotional response. Journal of ad-vertising research, 1995. 35(6): p. 63-68.

[29] Mehrabian, A. and J.A. Russell, An approach toenvironmental psychology1974, Cambridge, MA:the MIT Press.

[30] Bradley, M.M. and P.J. Lang, The International Af-fective Digitized Sounds (2nd Edition; IADS-2):Affective ratings of sounds and instruction manual.University of Florida, Gainesville, FL, Tech. Rep.B-3, 2007.

[31] Holland, J.H., Adaptation in Natural and ArtificialSystem1992, Cambridge, MA, USA: MIT Press.

[32] Wilson, S.W., ZCS: A Zeroth Level Classifier Sys-tem. Evolutionary Computation, 1994. 2(1): p. 1-18.

[33] Wilson, S.W., Get Real! XCS with Continuous-Valued Inputs. Learning Classifier Systems, 2000.1813: p. 209-219.

[34] Stone, C. and L. Bull, For Real! XCS withContinuous-Valued Inputs. Evolutionary Computa-tion, 2003. 11(3): p. 299-336.

[35] Dam, H.H., H.A. Abbass, and C. Lokan, Be real!XCS with continuous-valued inputs, in Proceed-ings of the 2005 workshops on Genetic and evo-lutionary computation2005, ACM: Washington,D.C. p. 85-87.

[36] Lanzi, P.L. Adding memory to XCS. in IEEEWorld Congress on Computational Intelligence.1998.

[37] Wilson, S.W., Compact Rulesets from XCSI, inAdvances in Learning Classifier Systems, P. Lanzi,W. Stolzmann, and S. Wilson, Editors. 2002,Springer Berlin / Heidelberg. p. 65-92.


[38] Dam, H.H., H.A. Abbass, and C. Lokan, DXCS:an XCS system for distributed data mining, in Pro-ceedings of the 2005 conference on Genetic andevolutionary computation2005, ACM: WashingtonDC, USA. p. 1883-1890.

[39] Wilson, S.W., Classifiers that approximate func-tions. Natural Computing, 2002. 1(2): p. 211-234.

[40] Lanzi, P.L., et al., Generalization in the XCSFClassifier System: Analysis, Improvement, andExtension. Evol. Comput., 2007. 15(2): p. 133-168.

[41] Bull, L., E. Bernad-Mansilla, and J. Holmes,Learning Classifier Systems in Data Mining: AnIntroduction, in Learning Classifier Systems inData Mining, L. Bull, E. Bernad-Mansilla, and J.Holmes, Editors. 2008, Springer Berlin / Heidel-berg. p. 1-15.

[42] Butz, M., et al., Knowledge Extraction and Prob-lem Structure Identification in XCS, in ParallelProblem Solving from Nature - PPSN VIII, X. Yao,et al., Editors. 2004, Springer Berlin / Heidelberg.p. 1051-1060.

[43] Muruzbal, J., A probabilistic classifier system andits application in data mining. Evol. Comput.,2006. 14(2): p. 183-221.

[44] Orriols-Puig, A., J. Casillas, and E. Bernad-Mansilla, First approach toward on-line evolutionof association rules with learning classifier sys-tems, in Proceedings of the 2008 GECCO confer-ence companion on Genetic and evolutionary com-putation2008, ACM: Atlanta, GA, USA. p. 2031-2038.

[45] Dam, H., C. Lokan, and H. Abbass, EvolutionaryOnline Data Mining: An Investigation in a Dy-namic Environment, in Evolutionary Computationin Dynamic and Uncertain Environments, S. Yang,Y.-S. Ong, and Y. Jin, Editors. 2007, SpringerBerlin / Heidelberg. p. 153-178.

[46] Quirin, A., et al. Analysis and evaluation of learn-ing classifier systems applied to hyperspectral im-age classification. in Intelligent Systems Designand Applications, 2005. ISDA ’05. Proceedings.5th International Conference on. 2005.

[47] Butz, M., et al., Effective and Reliable Online Clas-sification Combining XCS with EDA Mechanisms,in Scalable Optimization via Probabilistic Model-ing, M. Pelikan, K. Sastry, and E. CantPaz, Editors.2006, Springer Berlin / Heidelberg. p. 249-273.

[48] Akbar, M.A. and M. Farooq, Application of evolu-tionary algorithms in detection of SIP based flood-ing attacks, in Proceedings of the 11th Annualconference on Genetic and evolutionary computa-tion2009, ACM: Montreal, Canada. p. 1419-1426.

[49] Armano, G., A. Murru, and F. Roli, Stock Mar-ket Prediction by a Mixture of Genetic-Neural Ex-perts. International Journal of Pattern Recognitionand Artificial Intelligence (IJPRAI), 2002. 16(5):p. 501-526.

[50] Tsai, W.-C. and A.-P. Chen. Global Asset Alloca-tion Using XCS Experts in Country-Specific ETFs.in Convergence and Hybrid Information Technol-ogy, 2008. ICCIT ’08. Third International Confer-ence on. 2008.

[51] Sprogar, M., M. Sprogar, and M. Colnaric, Au-tonomous evolutionary algorithm in medical dataanalysis. Computer Methods and Programs inBiomedicine, 2005. 80(Supplement 1): p. S29-S38.

[52] Passaro, A., F. Baronti, and V. Maggini, Explor-ing relationships between genotype and oral can-cer development through XCS, in Proceedings ofthe 2005 workshops on Genetic and evolutionarycomputation2005, ACM: Washington, D.C. p. 147-151.

[53] Baronti, F., et al., Machine learning contribution tosolve prognostic medical, in Outcome prediction incancer, A. Taktak and A.C. Fisher, Editors. 2007,Elsevier.

[54] Bernauer, A., et al., Combining Software andHardware LCS for Lightweight On-chip Learning,in Organic Computing — A Paradigm Shift forComplex Systems, C. Mller-Schloer, H. Schmeck,and T. Ungerer, Editors. 2011, Springer Basel. p.253-265.

[55] Bernauer, A., et al., Autonomous multi-processor-SoC optimization with distributed learning classi-fier systems XCS, in Proceedings of the 8th ACMinternational conference on Autonomic comput-ing2011, ACM: Karlsruhe, Germany. p. 213-216.

[56] Armano, G., NXCS Experts for Financial Time Se-ries Forecasting, in Applications of Learning Clas-sifier Systems, L. Bull, Editor 2004, Springer. p.68-91.

[57] Chen, A.-P. and Y.-H. Chang. Using extended clas-sifier system to forecast S&P futures based on con-trary sentiment indicators. in Evolutionary Compu-tation, 2005. The 2005 IEEE Congress on. 2005.

[58] Chen, A.-P., et al., Applying the Extended Classi-fier System to Trade Interest Rate Futures Basedon Technical Analysis, in Proceedings of the 2008Eighth International Conference on Intelligent Sys-tems Design and Applications - Volume 032008,IEEE Computer Society. p. 598-603.

[59] Shankar, A. and S.J. Louis, XCS for Personaliz-ing Desktop Interfaces. Evolutionary Computation,IEEE Transactions on, 2010. 14(4): p. 547-560.


[38] Dam, H.H., H.A. Abbass, and C. Lokan, DXCS:an XCS system for distributed data mining, in Pro-ceedings of the 2005 conference on Genetic andevolutionary computation2005, ACM: WashingtonDC, USA. p. 1883-1890.

[39] Wilson, S.W., Classifiers that approximate func-tions. Natural Computing, 2002. 1(2): p. 211-234.

[40] Lanzi, P.L., et al., Generalization in the XCSFClassifier System: Analysis, Improvement, andExtension. Evol. Comput., 2007. 15(2): p. 133-168.

[41] Bull, L., E. Bernad-Mansilla, and J. Holmes,Learning Classifier Systems in Data Mining: AnIntroduction, in Learning Classifier Systems inData Mining, L. Bull, E. Bernad-Mansilla, and J.Holmes, Editors. 2008, Springer Berlin / Heidel-berg. p. 1-15.

[42] Butz, M., et al., Knowledge Extraction and Prob-lem Structure Identification in XCS, in ParallelProblem Solving from Nature - PPSN VIII, X. Yao,et al., Editors. 2004, Springer Berlin / Heidelberg.p. 1051-1060.

[43] Muruzbal, J., A probabilistic classifier system andits application in data mining. Evol. Comput.,2006. 14(2): p. 183-221.

[44] Orriols-Puig, A., J. Casillas, and E. Bernad-Mansilla, First approach toward on-line evolutionof association rules with learning classifier sys-tems, in Proceedings of the 2008 GECCO confer-ence companion on Genetic and evolutionary com-putation2008, ACM: Atlanta, GA, USA. p. 2031-2038.

[45] Dam, H., C. Lokan, and H. Abbass, EvolutionaryOnline Data Mining: An Investigation in a Dy-namic Environment, in Evolutionary Computationin Dynamic and Uncertain Environments, S. Yang,Y.-S. Ong, and Y. Jin, Editors. 2007, SpringerBerlin / Heidelberg. p. 153-178.

[46] Quirin, A., et al. Analysis and evaluation of learn-ing classifier systems applied to hyperspectral im-age classification. in Intelligent Systems Designand Applications, 2005. ISDA ’05. Proceedings.5th International Conference on. 2005.

[47] Butz, M., et al., Effective and Reliable Online Clas-sification Combining XCS with EDA Mechanisms,in Scalable Optimization via Probabilistic Model-ing, M. Pelikan, K. Sastry, and E. CantPaz, Editors.2006, Springer Berlin / Heidelberg. p. 249-273.

[48] Akbar, M.A. and M. Farooq, Application of evolu-tionary algorithms in detection of SIP based flood-ing attacks, in Proceedings of the 11th Annualconference on Genetic and evolutionary computa-tion2009, ACM: Montreal, Canada. p. 1419-1426.

[49] Armano, G., A. Murru, and F. Roli, Stock Mar-ket Prediction by a Mixture of Genetic-Neural Ex-perts. International Journal of Pattern Recognitionand Artificial Intelligence (IJPRAI), 2002. 16(5):p. 501-526.

[50] Tsai, W.-C. and A.-P. Chen. Global Asset Alloca-tion Using XCS Experts in Country-Specific ETFs.in Convergence and Hybrid Information Technol-ogy, 2008. ICCIT ’08. Third International Confer-ence on. 2008.

[51] Sprogar, M., M. Sprogar, and M. Colnaric, Au-tonomous evolutionary algorithm in medical dataanalysis. Computer Methods and Programs inBiomedicine, 2005. 80(Supplement 1): p. S29-S38.

[52] Passaro, A., F. Baronti, and V. Maggini, Explor-ing relationships between genotype and oral can-cer development through XCS, in Proceedings ofthe 2005 workshops on Genetic and evolutionarycomputation2005, ACM: Washington, D.C. p. 147-151.

[53] Baronti, F., et al., Machine learning contribution tosolve prognostic medical, in Outcome prediction incancer, A. Taktak and A.C. Fisher, Editors. 2007,Elsevier.

[54] Bernauer, A., et al., Combining Software andHardware LCS for Lightweight On-chip Learning,in Organic Computing — A Paradigm Shift forComplex Systems, C. Mller-Schloer, H. Schmeck,and T. Ungerer, Editors. 2011, Springer Basel. p.253-265.

[55] Bernauer, A., et al., Autonomous multi-processor-SoC optimization with distributed learning classi-fier systems XCS, in Proceedings of the 8th ACMinternational conference on Autonomic comput-ing2011, ACM: Karlsruhe, Germany. p. 213-216.

[56] Armano, G., NXCS Experts for Financial Time Se-ries Forecasting, in Applications of Learning Clas-sifier Systems, L. Bull, Editor 2004, Springer. p.68-91.

[57] Chen, A.-P. and Y.-H. Chang. Using extended clas-sifier system to forecast S&P futures based on con-trary sentiment indicators. in Evolutionary Compu-tation, 2005. The 2005 IEEE Congress on. 2005.

[58] Chen, A.-P., et al., Applying the Extended Classi-fier System to Trade Interest Rate Futures Basedon Technical Analysis, in Proceedings of the 2008Eighth International Conference on Intelligent Sys-tems Design and Applications - Volume 032008,IEEE Computer Society. p. 598-603.

[59] Shankar, A. and S.J. Louis, XCS for Personaliz-ing Desktop Interfaces. Evolutionary Computation,IEEE Transactions on, 2010. 14(4): p. 547-560.


[60] Butz, M. and S. Wilson, An Algorithmic Descrip-tion of XCS, in Advances in Learning ClassifierSystems, P. Luca Lanzi, W. Stolzmann, and S. Wil-son, Editors. 2001, Springer Berlin / Heidelberg. p.267-274. Wilson, S.W., Generalization in the XCSClassifier System, 1998.

[61] Mikels, J., et al., Emotional category data on im-ages from the international affective picture sys-tem. Behavior Research Methods, 2005. 37(4): p.626-630.

[62] Bradley, M.M. and P.J. Lang, Measuring emotion:the self-assessment manikin and the semantic dif-ferential. Journal of Behavior Therapy and Experi-mental Psychiatry, 1994. 25: p. 49-59.

[63] Cohen, I., et al., Facial expression recognition fromvideo sequences: temporal and static modeling.Computer Vision and Image Understanding, 2003.91(1–2): p. 160-187.

[64] Kim, K.H., S.W. Bang, and S.R. Kim, Emotionrecognition system using short-term monitoring ofphysiological signals. Medical & Biological Engi-neering & Computing, 2004. 42(3): p. 419-427.

[65] Lang, P.J., M.M. Bradley, and B.N. Cuthbert, Inter-national Affective Picture System (IAPS), in Techi-nal Manual and Affective Ratings1999, The Cen-ter for Research in Psychophysiology, Universityof Florida: Gainesville, FL.

[66] Stalph, P.O. and M.V. Butz, Documentation ofJavaXCSF, 2009: Retrieved from University ofWurzburg, Cognitive Bodyspaces: Learning andBehavior website.

[67] Butz, M.V., P.L. Lanzi, and S.W. Wilson, FunctionApproximation With XCS: Hyperellipsoidal Con-ditions, Recursive Least Squares, and Compaction.Evolutionary Computation, IEEE Transactions on,2008. 12(3): p. 355-376.

[68] Hall, M., et al., The WEKA data mining software:an update. SIGKDD Explor. Newsl., 2009. 11(1):p. 10-18.

[69] Lee, P.-M., Y. Teng, and T.-C. Hsiao. XCSF forprediction on emotion induced by image based on

dimensional theory of emotion. in Proceedings ofthe fourteenth international conference on Geneticand evolutionary computation conference compan-ion. 2012. ACM.

[70] Holland, J.H. and J.S. Reitman, Cognitive sys-tems based on adaptive algorithms. SIGART Bull.,1977(63): p. 49-49.

[71] Li, S. and B. Yang, Multifocus image fusion usingregion segmentation and spatial frequency. Imageand Vision Computing, 2008. 26(7): p. 971-979.

[72] Leonard, H., et al., Brief Report: DevelopingSpatial Frequency Biases for Face Recognition inAutism and Williams Syndrome. Journal of Autismand Developmental Disorders, 2011. 41(7): p. 968-973.

[73] John G, D., Two-dimensional spectral analysis ofcortical receptive field profiles. Vision Research,1980. 20(10): p. 847-856.

[74] Webster, M.A. and R.L. De Valois, Relationshipbetween spatial-frequency and orientation tuningof striate-cortex cells. J. Opt. Soc. Am. A, 1985.2(7): p. 1124-1132.

[75] Kobayashi, K., et al., Head and body sway in re-sponse to vertical visual stimulation. Acta Oto-laryngologica, 2005. 125(8): p. 858-862.

[76] Huang, N., et al., The empirical mode decompo-sition and the Hilbert spectrum for nonlinear andnon-stationary time series analysis. Proc. Roy. Soc.Lond. A, 1998. 454: p. 903-995.

[77] Tay, P.C. AM-FM Image Analysis Using theHilbert Huang Transform. in Image Analysis andInterpretation, 2008. SSIAI 2008. IEEE SouthwestSymposium on. 2008.

[78] Caseiro, P., R. Fonseca-Pinto, and A. An-drade, Screening of obstructive sleep apnea us-ing Hilbert–Huang decomposition of oronasal air-way pressure recordings. Medical Engineering &Physics, 2010. 32(6): p. 561-568.

[79] Kaw, A. and E. Kalu, Numerical Methods with Ap-plications2010: autarkaw.

JAISCR, 2014, Vol. 4, No. 2, pp. 125

EFFECT OF ROBOT UTTERANCES USINGONOMATOPOEIA ON COLLABORATIVE LEARNING

Felix Jimenez1, Masayoshi Kanoh2,Tomohiro Yoshikawa1, Takeshi Furuhashi1 and Tsuyoshi Nakamura3

1Graduate School of Engineering, Nagoya UniversityFuro-cho, Chikusa, Nagoya, 464-8603, Aichi, Japan

2School of Engineering, Chukyo University101-2 Yagoto Honmachi, Showa-ku, Nagoya, 466-8666, Aichi, Japan

3Graduate School of Engineering, Nagoya Institute of TechnologyGokiso-cho, Showa-ku, Nagoya, 466-8555, Aichi, Japan

Abstract

We investigated the effect of robot’s utterances using onomatopoeia in collaborative learn-ing. The robot was designed to provide encouragement using onomatopoeia when stu-dents are given problems to be solved issued by a learning system. Eight college studentsused a mathematics learning system with a robot for three weeks and then took exams.The results indicated that the robot using utterances with onomatopoeia could comfortlearners more than the robot without onomatopoeia. It suggests that the robot that praisesor comforts using onomatopoeia helps learners maintain their motivation in collaborativelearning.

1 Introduction

With the growth in robot technology, morerobots are now supporting learning. For example,one robot supports the learning of students as a peertutor [1] whereas in another study, a robot helpsstudents improve their English [2]. Interaction be-tween robots and humans promotes a more realis-tic learning experience, which could lead to mak-ing students more interested in learning [3]. More-over, a robot’s recommendations are taken moreseriously than those displayed on a screen agent.For example, Shinozawa and co-workers [4] experi-mentally confirmed through quantitative evaluationthat the degree of recommendation effect firmly de-pends on the interaction environment. There resultsshowed that a three-dimensional body has an advan-tage when the interaction environment is a three-dimensional space. This suggests when a robot de-scribes an object that exists in real space to a human.

In addition, Bainbridge [5] explored how arobot’s physical or virtual presence affects uncon-scious human perception of the robot as a socialpartner. Participants collaborated on simple book-moving tasks with either a physically-present hu-manoid robot or a video-displayed robot. Each taskexamined a single aspect of interaction, i.e., greet-ings, cooperation, trust, and personal space. Partici-pants readily greeted and cooperated with the robotin both the situations. However, participants weremore likely to fulfill an unusual instruction and af-ford greater personal space to the robot in the phys-ical situation than in the video-displayed situation.Therefore, a robot’s physical presence has a benefi-cial effect on learning and problem solving.

Most studies have focused on different robot be-haviors and investigating the effects. For example,Koizumi [6] used a series of Lego-block buildingclasses run by a robot to promote spontaneous col-laboration among children. Robots not only man-

– 13110.1515/jaiscr-2015-0003

126 Jimenez F., Kanoh M., Yoshikawa T., Furuhashi T., Nakamura T.

(a) Study item (b) Study page

(c) Judgment

Answer

Next

Result

Study again

Study mistaken questions

Explanation

Question

Number of question

(d) Study result

Figure 1. Learning System

age collaborative learning between children but alsohave positive social relationships with children bypraising their efforts. These experimental resultssuggest that robots promote spontaneous collabo-ration among children and improve their enthusi-asm for learning. Moreover, Tanaka [7] reportedon a robot that can promote learning by teachingchildren. He conducted an experiment at an En-glish language school for Japanese children (4-8years old). He introduced a small humanoid robotin situations where children completed tasks issuedby their teacher. While children were completingthe task, the robot intentionally made a mistake.However, because only few studies have focused onrobot utterances, we do not know how they affectlearning and motivation.

Education studies focusing on teacher utter-ances have reported that teacher utterances affectlearners. For example, if a teacher encourages alearner faced with completing a task, the teachercan prompt the learner to increase their motiva-tion [8]. Teacher utterances using onomatopoeiahas recently gained attention. Onomatopoeia isa sensuous representation of an object, sound, orstate. It can express an object that has a clear,realistic sensation [9]. Physical education stud-

ies have suggested that teachers who instruct us-ing onomatopoeia prompt learners to learn contentand increase their motivation [10]. A study thatanalyzed teacher utterances in a nursing school re-ported that a teacher uses onomatopoeia when ex-plaining instructional content. This suggests thatonomatopoeia can stress teacher’s utterances andincrease learner motivation [11][12]. Therefore,we believe that utterances with onomatopoeia aremore effective in learning than those without ono-matopoeia. We also believe that onomatopoeia canbe used for robot utterances.

Here, we investigated the effect of a robot’sutterances with onomatopoeia on learners in col-laborative learning. We compared such utteranceswith normal utterances. The robot was designed toprovide encouragement using onomatopoeia whenlearners are faced with solving a problem issuedby a learning system. For example, when learnerssuccessfully solve a problem, the robot praises thelearner’s success by uttering, “You’re gungun (re-ally) improving.” On the other hand, when learnerscannot solve the problem, the robot comforts thelearners by uttering, “Keep up the kibikibi (good)work.”

127Jimenez F., Kanoh M., Yoshikawa T., Furuhashi T., Nakamura T.

(a) Study item (b) Study page

(c) Judgment

Answer

Next

Result

Study again

Study mistaken questions

Explanation

Question

Number of question

(d) Study result

Figure 1. Learning System

age collaborative learning between children but alsohave positive social relationships with children bypraising their efforts. These experimental resultssuggest that robots promote spontaneous collabo-ration among children and improve their enthusi-asm for learning. Moreover, Tanaka [7] reportedon a robot that can promote learning by teachingchildren. He conducted an experiment at an En-glish language school for Japanese children (4-8years old). He introduced a small humanoid robotin situations where children completed tasks issuedby their teacher. While children were completingthe task, the robot intentionally made a mistake.However, because only few studies have focused onrobot utterances, we do not know how they affectlearning and motivation.

Education studies focusing on teacher utter-ances have reported that teacher utterances affectlearners. For example, if a teacher encourages alearner faced with completing a task, the teachercan prompt the learner to increase their motiva-tion [8]. Teacher utterances using onomatopoeiahas recently gained attention. Onomatopoeia isa sensuous representation of an object, sound, orstate. It can express an object that has a clear,realistic sensation [9]. Physical education stud-

ies have suggested that teachers who instruct us-ing onomatopoeia prompt learners to learn contentand increase their motivation [10]. A study thatanalyzed teacher utterances in a nursing school re-ported that a teacher uses onomatopoeia when ex-plaining instructional content. This suggests thatonomatopoeia can stress teacher’s utterances andincrease learner motivation [11][12]. Therefore,we believe that utterances with onomatopoeia aremore effective in learning than those without ono-matopoeia. We also believe that onomatopoeia canbe used for robot utterances.

Here, we investigated the effect of a robot’sutterances with onomatopoeia on learners in col-laborative learning. We compared such utteranceswith normal utterances. The robot was designed toprovide encouragement using onomatopoeia whenlearners are faced with solving a problem issuedby a learning system. For example, when learnerssuccessfully solve a problem, the robot praises thelearner’s success by uttering, “You’re gungun (re-ally) improving.” On the other hand, when learnerscannot solve the problem, the robot comforts thelearners by uttering, “Keep up the kibikibi (good)work.”

EFFECT OF ROBOT UTTERANCES USING ONOMATOPOEIA . . .

This paper consists of five sections. The sec-ond section explains the learning system used bythe robot and learners. The third section describesthe robot used in this study. The fourth section eval-uates the involvement of the robot after describingits effect on learning, and the final section is the dis-cussion.

2 Onomatopoeia

Onomatopoeia is a generic term for an “echoicword” or “imitative word.” If you utilize Japaneseverbs including onomatopoeia, you can easily ex-press what you would like to communicate. For ex-ample, “quickly walking” or “trotting” can be ex-pressed as “sakusaku” in Japanese and “plodding”can be expressed as “tobotobo.” Such examples ofonomatopoeia use sounds that are independent oflinguistic meaning and are known as sound sym-bolism. The advantages of sound symbolism ina learning environment are that it transcends lan-guages and creates a richer impression on learn-ers than words alone. Therefore, onomatopoeia canmore fully express reality than general vocabulary.

3 Overview of learning system

We used a learning system (Fig. 1) for mathe-matical problems called “Synthetic Personality In-ventory 2 (SPI2),” which is used as a recruit-ment test for employment. The mathematical prob-lems are junior high school level such as profitor loss calculations and payment of fees. There-fore, college students did not require additionalknowledge to solve the problems. The problemsin the learning system were created by consultingthe “2014 SyuSyokukatudou no Kamisama no SPI2mondaisyu (in Japanese) [13].”

Figure 2. Apperance of Ifbot

First, learners enter their account number to login. A menu of study items is shown on the sys-tem (Fig. 1(a)). The study items are mathematicalproblems. The column from which the number ofproblems is chosen is shown under the study items.When the learner selects “20,” 20 problems are dis-played at random. When “20” is selected again, 20different problems are displayed. This is contin-ued until all the problems are completed (100 prob-lems). This enables learners to solve the problemswithin the selected study item. When the learner se-lects the study item and the number of problems, thelearning screen (Fig. 1(b)) appears and the learningprocess starts. The learner provides an answer to theproblem from the selection list. After the answer isentered, the system displays whether it is correct, asshown in Fig. 1(c). When the learner selects “Next”(Fig. 1(c)), the system moves on to the next prob-lem. When the learner selects “Result” (Fig. 1(c))or solves all the problems, the system moves on tothe results page (Fig. 1(d)). This page presents thenumber of correct and incorrect answers. When thelearner selects “Study again,” a menu of learningitems is displayed (Fig. 1(a)). When the learnerselects “Study mistaken problems,” the study pagepresents problems that were answered incorrectly(Fig. 1(b)).


(e) Happy (f) Happyexpression 1 expression 2

Figure 3. Examples of happy expression with Ifbot

(g) Unhappy (h) Unhappyexpression 1 expression 2

Figure 4. Examples of unhappy expressions withIfbot

4 Overview of robot

4.1 Robot

We used Ifbot (Fig. 2), which is a conversa-tion robot. Ifbot can be used as an English learningrobot to promote more effective learning [14]. It canalso express various expressions. We implementedthe learning system inside Ifbot and configured thelearning environment so that Ifbot and the studentcould face the monitor and learn together.

Table 1. Example of Ifbot’s utterances

Normal utterance Onomatopoeia utteranceYou’re improving. You’re gungun improving.

Praise That’s an improvement. That’s patto an improvement.You certainly did today. You certainly did balibali today.Keep up the work. Keep up the kibikibi work.

Encouragement Let’s do our best Let’s do our dondon bestKeep working on it. Keep gangan working on it.

4.2 Robot’s utterances

We examined whether learners can learn fromrobot’s utterances in collaborative learning. There-fore, the robot did not use functions that enabledits direct interaction with humans such as voicerecognition. The robot acted in accordance with thescreen of the learning system. Recent studies havereported that teacher encouragement affects learn-

ing motivation when learners solve problems [15].Moreover, an agent’s sympathy has been reportedto improve the motivation of learners [16]. There-fore, Ifbot was designed to display a happy or un-happy expression and utter phrases of encourage-ment when learners solved a problem (Fig. 1(b))and display the results (Fig. 1(c)). When learnerscould not solve the problem, Ifbot expressed sad-ness. Utterances included onomatopoeia and werecreated by consulting recent education studies [10][17].

(1) Praising motionWhen learners correctly solve a problem, therobot displays a happy expression, as shown inFigs. 3 (a) and (b), and utters, “You’re gungun(really) improving (Table 1(right)).”

(2) Encouraging motionWhen learners cannot solve a problem, the robotdisplays an unhappy expression by beginning toshed tears, as shown in Figs. 3 (a) and (b), andutters, “Let’s do our dondon (more) best (Table1(right)).”

These two motions are performed when thelearning screens (Fig. 1(c)) are shown.

Figure 5. SPI test

5 Examination

We conducted two examinations. One was toinvestigate the effect of Ifbot’s utterances usingonomatopoeia on learning. Another was to evaluateif Ifbot’s actions were able to interest the learners inlearning.


(e) Happy (f) Happyexpression 1 expression 2

Figure 3. Examples of happy expression with Ifbot

(g) Unhappy (h) Unhappyexpression 1 expression 2

Figure 4. Examples of unhappy expressions withIfbot

4 Overview of robot

4.1 Robot

We used Ifbot (Fig. 2), which is a conversa-tion robot. Ifbot can be used as an English learningrobot to promote more effective learning [14]. It canalso express various expressions. We implementedthe learning system inside Ifbot and configured thelearning environment so that Ifbot and the studentcould face the monitor and learn together.

Table 1. Example of Ifbot’s utterances

Normal utterance Onomatopoeia utteranceYou’re improving. You’re gungun improving.

Praise That’s an improvement. That’s patto an improvement.You certainly did today. You certainly did balibali today.Keep up the work. Keep up the kibikibi work.

Encouragement Let’s do our best Let’s do our dondon bestKeep working on it. Keep gangan working on it.

4.2 Robot’s utterances

We examined whether learners can learn fromrobot’s utterances in collaborative learning. There-fore, the robot did not use functions that enabledits direct interaction with humans such as voicerecognition. The robot acted in accordance with thescreen of the learning system. Recent studies havereported that teacher encouragement affects learn-

ing motivation when learners solve problems [15].Moreover, an agent’s sympathy has been reportedto improve the motivation of learners [16]. There-fore, Ifbot was designed to display a happy or un-happy expression and utter phrases of encourage-ment when learners solved a problem (Fig. 1(b))and display the results (Fig. 1(c)). When learnerscould not solve the problem, Ifbot expressed sad-ness. Utterances included onomatopoeia and werecreated by consulting recent education studies [10][17].

(1) Praising motionWhen learners correctly solve a problem, therobot displays a happy expression, as shown inFigs. 3 (a) and (b), and utters, “You’re gungun(really) improving (Table 1(right)).”

(2) Encouraging motionWhen learners cannot solve a problem, the robotdisplays an unhappy expression by beginning toshed tears, as shown in Figs. 3 (a) and (b), andutters, “Let’s do our dondon (more) best (Table1(right)).”

These two motions are performed when thelearning screens (Fig. 1(c)) are shown.

Figure 5. SPI test

5 Examination

We conducted two examinations. One was toinvestigate the effect of Ifbot’s utterances usingonomatopoeia on learning. Another was to evaluateif Ifbot’s actions were able to interest the learners inlearning.


43.0

64.6

44.6

63.4

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

90.0

100.0Pre-test

Post-test

Score

Onomatopoeia Group Normal Group

Figure 6. Average scores for Pre and Post-test ofeach group

21.6 18.8

0.0

5.0

10.0

15.0

20.0

25.0

30.0

35.0

40.0

45.0

Onomatopoeia Group Normal Group

Score

n.s.

Figure 7. Average learning gains of each group

5.1 Investigating effect on learning

5.1.1 Method

This experiment was conducted to determinethe effect of Ifbot’s utterances with onomatopoeiaon learning in two groups of learners. In both thegroups, learners learned with Ifbot. However, inone group, the robot praised and comforted withonomatopoeia. This group was called the Ono-matopoeia Group. In the other group, the robotpraised or comforted without onomatopoeia. Thisgroup was called the Normal Group. Sixteen col-lege students participated in the experiment. Boththe groups consisted of eight learners. The learnerslearned mathematics on the learning system for 40minutes, three times a week for three weeks for atotal of 9 times.

5.1.2 Evaluation

The aim of the evaluation was to determinethe difference in learning gains between the Ono-

matopoeia Group and Normal Group. The learn-ing gains were calculated by subtracting the pre-testscores from the post-test scores. Each pre-test andpost-test was presented as an SPI test, as shown inFig. 5. The SPI test was based on problems in thelearning system and consisted of 95 problems. Theanalysis method involved a t-test. A significant dif-ference is permitted if the p value is under the sig-nificance level of 5%.

5.1.3 Results

The average pre-test and post-test scores areshown in Fig. 6. The average learning gains scoresare shown in7. Both Figs. 6 and 7 show the scoresof the Onomatopoeia Group on the left and thoseof the Normal Group on the right. The scores oflearners in the Onomatopoeia Group were betterthan those in the Normal Group. We also conducteda t-test to determine how effectively learners learnthe questions using the learning gains scores of eachgroup, as shown in Fig. 7. The results indicate thatthere was no significant difference (t = 0.3,d f =14, p = 0.37). Therefore, there was no differencein the effect on learning between the OnomatopoeiaGroup and Normal Group.

5.2 Examination to evaluate robot’s action

5.2.1 Method

The robot’s action was evaluated using the se-mantic differential scale method (SD method) [18].The SD method is used to evaluate the meaningof objects and concepts. Recently, the SD methodhas been used in robotics. For example, Ogata[19] used the SD method for evaluating the interac-tion between robots and humans. Kanda [20] usedthe SD method involving 28 adjectives for psycho-logical evaluation experiments on robotic interac-tion. We used the SD method involving the follow-ing four adjectives, “approach?able,” “sociable,”“fulfilling,” and “pleasurable.” The SD method isshown in Fig. 8. The evaluation values are de-fined in the top left part as “-3” and increase by oneas they progress to the right. We used the Mann-Whitney U-test. A significant difference is permit-ted if the p value is under the significance level of5%.


unpleasant pleasant

good veryratherrather neither goodvery

Figure 8. SD method used in this study

Table 2. SD method results

Adjective Onomatopoeia Group Normal Groupunpleasant pleasant 1.38(±1.1) 0.13(±0.8)

stuffy sociable 1.25(±1.1) 0.25(±1.5)depression fulfilling 0.63(±0.5) −0.75(±1.2)

unapproachable approachable 0.75(±1.1) 1.75(±1.0)

5.2.2 Results

The average evaluation values of each group arelisted in Table 2, and the analysis results are listedin Table 3. The results indicate that the valuesof learners in the Onomatopoeia Group were bet-ter than those in the Normal Group for “sociable,”“pleasurable,” and “fulfilling,” whereas the valuesof “approachable” for the learners in the NormalGroup were better than those in the OnomatopoeiaGroup. The Mann-Whitney U-test results indicatethat there was a significant difference between theOnomatopoeia Group and Normal Group in the cri-teria of “pleasurable” and “fulfilling.” Therefore,the learners in the Onomatopoeia Group were morefulfilled than those in the Normal Group.

Table 3. Result of analysis

Adjective U p valueunpleasant pleasant 17 0.02

stuffy sociable 20 0.19depression fulfilling 10 0.01

unapproachable approachable 11.5 0.09

6 Discussion

The results suggest that our robot encourageslearners. However, there was no difference in learn-ing between utterances using onomatopoeia andnormal utterances.

Recent education studies in which teachers usedonomatopoeia have suggested that onomatopoeiacan help stress teacher’s utterances [11] [12]. Webelieve that the same result is possible with robots.

The learning period in our study was short, only

three weeks, which is one possible reason that therewas no difference in the effect on learning betweenutterances using onomatopoeia and normal utter-ances. Recent education studies have shown thatit takes time for increase in motivation to be re-flected in the learning of students [21]. However,we found that the learning gains of learners in theOnomatopoeia Group were greater than those in theNormal Group, as shown in Fig. 7.

7 Conclusion

We investigated the effect of robot’s utterancesusing onomatopoeia on learners in collaborativelearning. We evaluated the effect of utterances us-ing onomatopoeia by comparing them with nor-mal utterances. The robot was designed to praiseor comfort with onomatopoeia when learners werefaced with solving problems issued by a learn-ing system. For example, when learners correctlysolved a problem, the robot praised the learnersby uttering, “You’re Gungun (really) improving.”When learners could not solve a problem, the robotcomforted the learners by uttering, “Keep up theKibikibi (good) work.”

These results suggest that the robot encouragedlearners. However, there was no difference in theeffect on learning between groups where utterancesusing onomatopoeia and normal utterances wereused.

We are currently developing a robot that praisesor comforts using adjectives and adverbs for com-paring the effect on learning between utteranceswith and without onomatopoeia. We also plan toconduct a long-term experiment.

References[1] T. Kanda, T. Hirano, D. Eaton and H. Ishiguro :

“Interactive robots as social partners and peer tu-tors for children: A field trial,” Hum-Comout. In-teract., vol.10, No.1, pp.61-84, 2004.

[2] O.H. Kwon, S.Y. Koo, Y.G. Kim and D.S. Kwon:“Telepresence robot system for english tutoring,”IEEE Workshop on Advanced Robotics and its So-cial Impacts, pp.152-155, 2010.

[3] T. Kand: “How a communication Robot Can Con-tribute to Education (in Japanese),” Journal of


unpleasant pleasant

good veryratherrather neither goodvery

Figure 8. SD method used in this study

Table 2. SD method results

Adjective Onomatopoeia Group Normal Groupunpleasant pleasant 1.38(±1.1) 0.13(±0.8)

stuffy sociable 1.25(±1.1) 0.25(±1.5)depression fulfilling 0.63(±0.5) −0.75(±1.2)

unapproachable approachable 0.75(±1.1) 1.75(±1.0)

5.2.2 Results

The average evaluation values of each group arelisted in Table 2, and the analysis results are listedin Table 3. The results indicate that the valuesof learners in the Onomatopoeia Group were bet-ter than those in the Normal Group for “sociable,”“pleasurable,” and “fulfilling,” whereas the valuesof “approachable” for the learners in the NormalGroup were better than those in the OnomatopoeiaGroup. The Mann-Whitney U-test results indicatethat there was a significant difference between theOnomatopoeia Group and Normal Group in the cri-teria of “pleasurable” and “fulfilling.” Therefore,the learners in the Onomatopoeia Group were morefulfilled than those in the Normal Group.

Table 3. Result of analysis

Adjective U p valueunpleasant pleasant 17 0.02

stuffy sociable 20 0.19depression fulfilling 10 0.01

unapproachable approachable 11.5 0.09

6 Discussion

The results suggest that our robot encourageslearners. However, there was no difference in learn-ing between utterances using onomatopoeia andnormal utterances.

Recent education studies in which teachers usedonomatopoeia have suggested that onomatopoeiacan help stress teacher’s utterances [11] [12]. Webelieve that the same result is possible with robots.

The learning period in our study was short, only

three weeks, which is one possible reason that therewas no difference in the effect on learning betweenutterances using onomatopoeia and normal utter-ances. Recent education studies have shown thatit takes time for increase in motivation to be re-flected in the learning of students [21]. However,we found that the learning gains of learners in theOnomatopoeia Group were greater than those in theNormal Group, as shown in Fig. 7.

7 Conclusion

We investigated the effect of robot’s utterancesusing onomatopoeia on learners in collaborativelearning. We evaluated the effect of utterances us-ing onomatopoeia by comparing them with nor-mal utterances. The robot was designed to praiseor comfort with onomatopoeia when learners werefaced with solving problems issued by a learn-ing system. For example, when learners correctlysolved a problem, the robot praised the learnersby uttering, “You’re Gungun (really) improving.”When learners could not solve a problem, the robotcomforted the learners by uttering, “Keep up theKibikibi (good) work.”

These results suggest that the robot encouragedlearners. However, there was no difference in theeffect on learning between groups where utterancesusing onomatopoeia and normal utterances wereused.

We are currently developing a robot that praisesor comforts using adjectives and adverbs for com-paring the effect on learning between utteranceswith and without onomatopoeia. We also plan toconduct a long-term experiment.

References[1] T. Kanda, T. Hirano, D. Eaton and H. Ishiguro :

“Interactive robots as social partners and peer tu-tors for children: A field trial,” Hum-Comout. In-teract., vol.10, No.1, pp.61-84, 2004.

[2] O.H. Kwon, S.Y. Koo, Y.G. Kim and D.S. Kwon:“Telepresence robot system for english tutoring,”IEEE Workshop on Advanced Robotics and its So-cial Impacts, pp.152-155, 2010.

[3] T. Kand: “How a communication Robot Can Con-tribute to Education (in Japanese),” Journal of


Japanese Society for artificial Intelligence, Vol.23,No.2, pp.229-236, 2008.

[4] K.Shinozawa, F.Naya, J.Yamato and K.Kogure:“Differences in effect of robot and screen agentrecommendations on human decision-making,” In-ternational Journal of Human-Computer Studies,Vol.62, No.2, pp.267-279, 2005.

[5] W.A. Bainbridge, J. Hart, E.S. Kim and B. Scassel-lati: “The effect of presence on human-robot inter-action,” IEEE International Symposium on Robotand Human Interactive Communication, pp.701-706, 2008.

[6] S. Koizumi, T. Kanda and T. Miyashita: “Col-laborative learning experiment with social robot(in Japanese),” Journal of the Robotics Society ofJapan,Vol.29, No.10, pp.902-906, 2011.

[7] F. Tanaka: “Social robotics research and its appli-cation at early childhood education (in Japanese),”Journal of the Robotics Society of Japan,Vol.29,No.1, pp.19-22, 2011.

[8] H. Namiki: “Kotobakake no kouka no zikkenkekkakara (in Japanese),” Child psychology, Vol.47,No.5, pp.474-477, 1993.

[9] T. Komatsu and H. Akiyama: “Expression systemof onomatopoeias for assisting users’ intuitive ex-pressions (in Japanese),” The Journal of the In-stitute of Electronics, Information and Communi-cation Engineers A,Vol.J92-A, No.11, pp.752-763,2009.

[10] Y.Fujino, M.Kikkawa and Y.Sagisaka: “A collec-tion of onomatopeias in japan sports (in Japanese),”Proc Oriental Cocosda, pp.160-164, 2003.

[11] M. Takano and M. Udo: “Onomatopoeia inthe utterances of teachers at special schools (inJapanese),” The Japanese Association of SpecialEducation, Vol.48, No.2, pp.75-84, 2010.

[12] M. Takano and M. Udo: “Contribution of ono-matopoeia to educational support for children withsevere mental retardation (in Japanese),” The Jour-nal of School Education,Vol.19, pp.27-37,2007.

[13] H. Shinagawa: “2014 SyuSyokukatudou noKamisama no SPI2 mondaisyu (in Japanese),” U-CAN-pub, 2012.

[14] F.Jimenez and M.Kanoh: “Robot that can promotelearning by observing in collaborative learning,”IEEE International Conference on Systems, Man,and Cybernetics, 2013.

[15] E.B.Hurlock: “An evaluation of certain incentivesused in school work,” Journal of Education Psy-chology, Vol.16, pp.145-150, 1925.

[16] H. Nakajima, Y. Moroshima, R. Yamada, S.Kawaji, S. Brave, H. Maldonado and C. Nass:“Social Intelligence in a Human-Machine Col-laboration System : Social Responses of Agentswith Mind Model and Personality (in Japanese),”Journal of Japanese Society for Artificial Intelli-gence,Vol.19, No.3, pp.184-196, 2004.

[17] M.Ono: “Giongo Gitaigo 4500 Nihongo Ono-matope Ziten (in Japanese),” Syougakkan, 2007.

[18] C.E.Osgood: “Studies on the generality of af-fective meaning systems,” American Psychologist,Vol.17, pp.10-28, 1962.

[19] T. Ogata and S. Sugano: “Experimental Evaluationof the Emotional Communication between Robotsand Humans,” ISCIE Journal ’Systems, Controland Information,Vol.13, No.12, pp.566-574, 2000.

[20] T. Kanda, H. Ishiguro and T. Ishida: “Psychologi-cal Evaluation on Interactions between People andRobot (in Japanese),” Journal of the Robotics So-ciety of Japan, Vol.19, No.3, pp.362-371, 2001.

[21] Y. Miyake and N. Miyake : “Pedagogical psychol-ogy (in Japanese),” Foundation for the Promotionof The Open University of Japan, 2012.

JAISCR, 2014, Vol. 4, No. 2, pp. 133

AUTOMATED APPROACH TO CLASSIFICATION OFMINE-LIKE OBJECTS USING MULTIPLE-ASPECT

SONAR IMAGES

Xiaoguang Wang1, Xuan Liu1, Nathalie Japkowicz2 and Stan Matwin3

1Faculty of Computer Science Dalhousie University, Canadae-mail:{x.wang; xuan.liu}@dal.ca

2School of Electrical Engineering & Computer Science University of Ottawa, [email protected]

3Faculty of Computer Science, Dalhousie University, CanadaInstitute of Computer Science, Polish Academy of Sciences, Poland

e-mail: [email protected]

Abstract

In this paper, the detection of mines or other objects on the seabed from multiple side-scansonar views is considered. Two frameworks are provided for this kind of classification.The first framework is based upon the Dempster–Shafer (DS) concept of fusion from asingle-view kernel-based classifier and the second framework is based upon the conceptsof multi-instance classifiers. Moreover, we consider the class imbalance problem whichis always presents in sonar image recognition. Our experimental results show that bothof the presented frameworks can be used in mine-like object classification and the pre-sented methods for multi-instance class imbalanced problem are also effective in suchclassification.

1 Introduction

To acquire high-resolution sonar imagery forthe detection of mine like objects (MLO) and otherobjects of interest on the seabed, side-scan sonarequipped vehicles such as Autonomous Underwa-ter Vehicles (AUVs) are frequently used by militaryforces or commercial organizations. For this pur-pose, Automatic Target Recognition (ATR) meth-ods have been successfully applied to detect possi-ble objects or regions of interest in sonar imagery[1]-[9]. Since many of the sonar images are of thesame object from different sonar passes, there aremultiple views of the same object at different rangesand aspect of the sonar. It is anticipated that the ad-ditional information obtained from additional viewsat an object should improve the classification per-formance over single-aspect classification. Recent

researches [2][4][5] prove this anticipation by ex-perimental result and find that although it is possi-ble to obtain an accurate classification based upona single image of an object, misclassifications canbe reduced if the detection is based upon multipleviews of the object.

In this paper, we use two methods combinedwith data fusion methods and multi-instance classi-fication methods to deal with the class imbalancedproblem in multi-view MLO classification. The firstmethod is the cost-sensitive boosting algorithm [48]and the second is a classifier-independent method:over-sampling of multi-views of the minority class.

The remainder of the paper is organized asfollows: Section II discusses previous work thathas been done both in multi-view based classifica-tion and on the class imbalance problem. Section

– 14810.1515/jaiscr-2015-0004

134 Wang X., Liu X., Japkowicz N. and Matwin S.

III presents the fusion methodologies. Section IVpresents the data preprocessing method used in ourresearch. In section V, we consider the classificationperformance of all admissible multiple aspects in-cluding double and triple aspects for different typesof mine like objects by studying the correct clas-sification rates and the error rates as functions ofthe angular difference between aspects. In sectionVI, the class imbalance problem for single aspectmine countermeasure missions (MCM) datasets anda novel solution are presented. In section VII, theclass imbalance problem for multi aspects MCMdatasets and related concepts are presented and wepresent a novel cost-sensitive AdaBoost algorithmfor this problem. This section also illustrates theefficiency of our algorithm as determined by exper-imentation, and offers some final remarks. Finally,Section VIII presents the conclusion, followed bythe references.

2 Previous work

In these works, B. Zerr et al. [1] [3] firstly de-scribed a method to estimate the three-dimensionalaspects of underwater objects using a sequence ofsonar images. The sonar images are segmented intothree kinds of regions: echo, shadow and back-ground. A study they [2] conducted using sonarimages of various objects and height profiles asfeatures showed that the highest classification per-formance when imaging an object twice can beachieved with an angular increment of 90 degreesbetween the two images. M. Couillard et al. [4]extended this study and considered the classifica-tion performance of all admissible secondary as-pects for different types of mine like objects bystudying the correct classification rates and the errorrates as functions of the angular difference betweenaspects. In their work, two different approacheshave been used to combine multiple images of anobject. The first one creates a new object for clas-sification by combining the features of the two im-ages to single vector. The second approach is sim-ply to fuse the single aspect classification probabil-ities obtained from the classifier according to thedesired angular increment between the images.

J. Fawcett et al. [5] investigated two approachesfor fusing multiple views: fuse-feature and fuse-classification. In the first approach the two fea-

ture sets taken at different aspects were combinedto form a large feature vector. Then a kernel basedclassifier was trained with this feature vector. Inthe second approach, they fused two individual-aspect classifications of two feature vectors usingthe Dempster-Shafer (DS) theory, which has fre-quently been used as an alternative to Bayesian the-ory and fuzzy logic for data fusion.

S. Reed et al. [6] [7] have also investigated theclassification of a target by fusing several views us-ing DS theory. They present a model to extend thestandard mine/not-mine classification procedure toprovide both shape and size information on the ob-ject. The difference between their work and oth-ers is that they generated the mass functions usinga fuzzy functions membership algorithm based onfuzzy logic.

V. Myers and D. P. Williams [8] [9] introduceda model for classifying targets in sonar imagesfrom multiple views by using a partially observableMarkov decision process (POMDP). This POMDPmodel allows one to adaptively determine which ad-ditional views of an object would be most beneficialin reducing the classification uncertainty.

In other related work, G. Dobeck fused multi-ple images from different frequency bands [11], J.Tucker et al. [12] fused multiple images from dif-ferent platforms, and M. Azimi-Sadjadi et al. [13]fused multiple images from multi-aspect target echoclassification.

These works have one common point in thatthey all use fusion methods to combine differentviews for classification. Although using fusionmethods such as Dempster-Shafer fusion of singleaspect classification results was shown to be effec-tive in some cases [2][4][5], we can still anticipate anumber of challenges and limitations in some ATRapplication using fusion methods[15]. It is thus nec-essary to develop other methods to combine dif-ferent information from multiple views in this re-search.

In this paper, including the data fusion method-ology, we present two frameworks for multi-aspectclassification on side scan sonar images. The firstone uses the Dempster-Shafer (DS) theory on multi-ple views of target which is not quite different fromthe methods mentioned [2][4][5]. In the secondframework we use multi-instance method which is

135Wang X., Liu X., Japkowicz N. and Matwin S.

III presents the fusion methodologies. Section IVpresents the data preprocessing method used in ourresearch. In section V, we consider the classificationperformance of all admissible multiple aspects in-cluding double and triple aspects for different typesof mine like objects by studying the correct clas-sification rates and the error rates as functions ofthe angular difference between aspects. In sectionVI, the class imbalance problem for single aspectmine countermeasure missions (MCM) datasets anda novel solution are presented. In section VII, theclass imbalance problem for multi aspects MCMdatasets and related concepts are presented and wepresent a novel cost-sensitive AdaBoost algorithmfor this problem. This section also illustrates theefficiency of our algorithm as determined by exper-imentation, and offers some final remarks. Finally,Section VIII presents the conclusion, followed bythe references.

2 Previous work

In these works, B. Zerr et al. [1] [3] firstly de-scribed a method to estimate the three-dimensionalaspects of underwater objects using a sequence ofsonar images. The sonar images are segmented intothree kinds of regions: echo, shadow and back-ground. A study they [2] conducted using sonarimages of various objects and height profiles asfeatures showed that the highest classification per-formance when imaging an object twice can beachieved with an angular increment of 90 degreesbetween the two images. M. Couillard et al. [4]extended this study and considered the classifica-tion performance of all admissible secondary as-pects for different types of mine like objects bystudying the correct classification rates and the errorrates as functions of the angular difference betweenaspects. In their work, two different approacheshave been used to combine multiple images of anobject. The first one creates a new object for clas-sification by combining the features of the two im-ages to single vector. The second approach is sim-ply to fuse the single aspect classification probabil-ities obtained from the classifier according to thedesired angular increment between the images.

J. Fawcett et al. [5] investigated two approachesfor fusing multiple views: fuse-feature and fuse-classification. In the first approach the two fea-

ture sets taken at different aspects were combinedto form a large feature vector. Then a kernel basedclassifier was trained with this feature vector. Inthe second approach, they fused two individual-aspect classifications of two feature vectors usingthe Dempster-Shafer (DS) theory, which has fre-quently been used as an alternative to Bayesian the-ory and fuzzy logic for data fusion.

S. Reed et al. [6] [7] have also investigated theclassification of a target by fusing several views us-ing DS theory. They present a model to extend thestandard mine/not-mine classification procedure toprovide both shape and size information on the ob-ject. The difference between their work and oth-ers is that they generated the mass functions usinga fuzzy functions membership algorithm based onfuzzy logic.

V. Myers and D. P. Williams [8] [9] introduceda model for classifying targets in sonar imagesfrom multiple views by using a partially observableMarkov decision process (POMDP). This POMDPmodel allows one to adaptively determine which ad-ditional views of an object would be most beneficialin reducing the classification uncertainty.

In other related work, G. Dobeck fused multi-ple images from different frequency bands [11], J.Tucker et al. [12] fused multiple images from dif-ferent platforms, and M. Azimi-Sadjadi et al. [13]fused multiple images from multi-aspect target echoclassification.

These works have one common point in thatthey all use fusion methods to combine differentviews for classification. Although using fusionmethods such as Dempster-Shafer fusion of singleaspect classification results was shown to be effec-tive in some cases [2][4][5], we can still anticipate anumber of challenges and limitations in some ATRapplication using fusion methods[15]. It is thus nec-essary to develop other methods to combine dif-ferent information from multiple views in this re-search.

In this paper, including the data fusion method-ology, we present two frameworks for multi-aspectclassification on side scan sonar images. The firstone uses the Dempster-Shafer (DS) theory on multi-ple views of target which is not quite different fromthe methods mentioned [2][4][5]. In the secondframework we use multi-instance method which is

AUTOMATED APPROACH TO CLASSIFICATION OF MINE-LIKE . . .

a methodology for a combination of the informationof multiple views of target.

On the other hand, when applying ATR meth-ods to detect possible MLOs, the number of natu-rally occurring clutter objects (such as rocks, ship-wrecks or fish) that are detected always typicallyfar outweighs the relatively rare event of detectinga mine. This means that the number of non-minelike objects is always much greater than the numberof mine like objects. In this situation, the datasetis “imbalanced”. A dataset is imbalanced if theclasses are not approximately equally represented.In imbalanced datasets, the number of one class isoften much higher than the number of classes anda default classifier always predicts “the majorityclass”. For MLO classification, no matter whetherwe make the classification based upon a single im-age of an object or multiple images of an object,the training data sets are always class imbalanced.Our research shows that in both the cases of learn-ing from single-view or multi-views of the objects,the performance of classifiers always suffered fromthe class imbalance problem.

For Automatic Target Recognition (ATR) meth-ods used on MCM data sets, D. Williams et al.[10] used infinitely imbalanced logistic regressionto solve the class imbalanced problem. That is theonly work related to the class imbalanced problemof MLOs classification, especially in the case ofmulti aspects class imbalanced problem of MLOsclassification.

3 Fusion methodologies

Data fusion is a technology which collates in-formation from different sources considering thesame scene in an attempt to provide a more com-plete description. When we try to combine multi-aspects sonar images for classification, the mostcommon numerical fusion techniques used areBayesian probability theory, Fuzzy systems andDempster-Shafer theory.

Fuzzy systems contain a wealth of possible fu-sion operators. However, many of the operators arenon-associative and the choice of operators is casedependent, which means the order in which the in-formation is fused has an impact on the final re-sult. Bayesian and Dempster-Shafer models have

both been successfully applied but Dempster-Shafertheory provides some features that Bayesian theorydoes not. One of the most significant features is thatDempster-Shafer theory can consider the union ofclasses. This feature is used to improve the separa-bility of different classes. Therefore the Dempster-Shafer (DS) method is a popular data fusion methodwhich has been used by other authors for side scansonar image classification.

The Dempster-Shafer method is based on twoideas: obtaining degrees of belief for one questionfrom subjective probabilities for a related question,and Dempster’s rule for combining such degrees ofbelief when they are based on independent items ofevidence.

The Dempster’s rule of combination is a purelyconjunctive operation (AND). The combinationrule results in a belief function based on conjunc-tive pooled evidence. This rule can also be used formulti aspect classification.

In DS theory, each unique class makes upa set called the frame of discernment θ ={ω1,ω2, . . . ,ωM}. Belief is attributed to hypothe-ses within the power set through a basic probabilityassignment, called the mass function m(A).

Suppose that we have two views of target S1 andS2 and the mass functions m1(S)1 and m2(S)2. Baseon the Despster’s rule, the mass after fusion for theset A is:

m12 (A) =∑S1∩S2=A m1(S1)m2(S2)

1−∑S1∩S2= /0 m1(S1)m2(S2)(1)

The classification rule for this case is

g(x1,x2) = argmaxi

m12(ωi) (2)

In our research, as many authors, we useDempster-Shafer theory as a choice for multi-aspectclassification. In our algorithm, we use a trainingdataset for the single-aspect classifier and then savethe predicted class labels from the testing data. Us-ing T cross validation we can get a T ×M outputmatrix. Let βi (k) ,k = 1,2, . . . ,T , correspond to theith column of the prediction vector for the kth test-ing feature vector.

For n output vectors βi (k) , i = 1,2, . . .n ob-tained from n single-aspect classifications, the n


sets of masses are finally fused using Demsper’srule and the final decision is given by the classifi-cation rule g(x1,x2,..,xn).

In training datasets of MLO classification, eachobject has more than one view and each view issaved as an instance in the dataset. Therefore eachobject has a group of instances which has the samelabel. We call this group of instances a “bag”. Bagis a term originally used in multi-instance learningwhich will be discussed in next section. In this pa-per, the n sets of masses, which are also n bags, arefused using Demsper’s rule to get the final decision.

4 Multi-instance methodologies

Multi-instance learning (MIL) is another frame-work choice for multi-aspect classification. MIL isconcerned with supervised learning but differs fromnormal supervised learning in two points: (1) it hasmultiple instances in an example, and (2) only oneclass label is observable for all the instances in anexample.

The multiple instances learning problem can de-fined as:

Given:

– a set of bags Bi, i = 1, . . . ,N, their classificationc(B)i{0,1}, and the instances ei j( j = 1, . . . ,ni)belonging to each bag.

– the existence of an unknown function f thatclassifies individual instances as 1 or 0, andfor which it holds that c(Bi) = 1 if and onlyif there exists ei jBi : f (ei j) = 1 (multi-instanceconstraint, MIC)

In our experiment we choose two popular multi-instance learning algorithms: the decision tree andthe logistic regression methods.

4.1 Multi-instance Tree

Similar to a single-instance decision tree (likeC4.5), the multi-instance tree is based on the infor-mation gain of a feature of the instance, the dif-ference of the multi-decision tree and the single-decision tree is that instead of using the feature ofone instance to develop the information gain, thegrowing of a multi-instance tree is based on the in-formation gain of a feature to set of instances. The

concept of information gain and entropy are ex-tended to bags of instances in the MIL framework.Suppose S is a collection of instances which be-long to p(S) positive bags and n(S) negative bags,F is the feature being considered as the splitting cri-terion and Sn is the collection of instances whosevalue of feature F is n. The extended informationgain and entropy are defined as (3) and (4):

In this paper we use the multi-instance tree in-ducer (MITI) proposed by Blockeel et al. [16]. Itimplements the top-down decision tree learning ap-proach known from propositional tree inducers suchas C4.5 [17], with two key modifications: (a) nodesare expanded in best-first order guided by a heuris-tic that aims to identify pure positive leaf nodes asquickly as possible, and (b) whenever a pure posi-tive leaf node is created, all positive bags containinginstances in this leaf node are deactivated.

4.2 Multi-instance Logistic Regression(MILR)

For single-instance classification, Logistic Re-gression [49] assumes a parametric form for the dis-tribution Pr (Y |X) , then directly estimates its pa-rameters from the training data. The parametricmodel assumed by Logistic Regression in the casewhere Y is a boolean is:

Pr (Y = 1 | X) =1

1+exp(ω0+∑ni=1 ωiXi)

(5)

and

Pr (Y = 0 | X) =exp(ω0+∑n

i=1 ωiXi)


(6)

However, the standard logistic regression model[49] does not apply to multi-instance data becausethe instances’ class labels are masked by the “col-lective” class label of a bag. X. Xu and E. Frank[14] use a two-stage framework to upgrade linearlogistic regression and boosting to MI data.

The instance-level class probabilities are givenby

Pr (y = 1 | x) =1

1+ exp(−βx)

andPr (y = 0 | x) =

11+ exp(βx)


sets of masses are finally fused using Demsper’srule and the final decision is given by the classifi-cation rule g(x1,x2,..,xn).

In training datasets of MLO classification, eachobject has more than one view and each view issaved as an instance in the dataset. Therefore eachobject has a group of instances which has the samelabel. We call this group of instances a “bag”. Bagis a term originally used in multi-instance learningwhich will be discussed in next section. In this pa-per, the n sets of masses, which are also n bags, arefused using Demsper’s rule to get the final decision.

4 Multi-instance methodologies

Multi-instance learning (MIL) is another frame-work choice for multi-aspect classification. MIL isconcerned with supervised learning but differs fromnormal supervised learning in two points: (1) it hasmultiple instances in an example, and (2) only oneclass label is observable for all the instances in anexample.

The multiple instances learning problem can de-fined as:

Given:

– a set of bags Bi, i = 1, . . . ,N, their classificationc(B)i{0,1}, and the instances ei j( j = 1, . . . ,ni)belonging to each bag.

– the existence of an unknown function f thatclassifies individual instances as 1 or 0, andfor which it holds that c(Bi) = 1 if and onlyif there exists ei jBi : f (ei j) = 1 (multi-instanceconstraint, MIC)

In our experiment we choose two popular multi-instance learning algorithms: the decision tree andthe logistic regression methods.

4.1 Multi-instance Tree

Similar to a single-instance decision tree (likeC4.5), the multi-instance tree is based on the infor-mation gain of a feature of the instance, the dif-ference of the multi-decision tree and the single-decision tree is that instead of using the feature ofone instance to develop the information gain, thegrowing of a multi-instance tree is based on the in-formation gain of a feature to set of instances. The

concept of information gain and entropy are ex-tended to bags of instances in the MIL framework.Suppose S is a collection of instances which be-long to p(S) positive bags and n(S) negative bags,F is the feature being considered as the splitting cri-terion and Sn is the collection of instances whosevalue of feature F is n. The extended informationgain and entropy are defined as (3) and (4):

In this paper we use the multi-instance tree in-ducer (MITI) proposed by Blockeel et al. [16]. Itimplements the top-down decision tree learning ap-proach known from propositional tree inducers suchas C4.5 [17], with two key modifications: (a) nodesare expanded in best-first order guided by a heuris-tic that aims to identify pure positive leaf nodes asquickly as possible, and (b) whenever a pure posi-tive leaf node is created, all positive bags containinginstances in this leaf node are deactivated.

4.2 Multi-instance Logistic Regression(MILR)

For single-instance classification, Logistic Re-gression [49] assumes a parametric form for the dis-tribution Pr (Y |X) , then directly estimates its pa-rameters from the training data. The parametricmodel assumed by Logistic Regression in the casewhere Y is a boolean is:

Pr (Y = 1 | X) =1


(5)

and

Pr (Y = 0 | X) =exp(ω0+∑n

i=1 ωiXi)


(6)

However, the standard logistic regression model[49] does not apply to multi-instance data becausethe instances’ class labels are masked by the “col-lective” class label of a bag. X. Xu and E. Frank[14] use a two-stage framework to upgrade linearlogistic regression and boosting to MI data.

The instance-level class probabilities are givenby

Pr (y = 1 | x) =1

1+ exp(−βx)

andPr (y = 0 | x) =

11+ exp(βx)


Entropymulti (S) =− p(S)p(s)+n(S)

× log2

(p(S)

p(s)+n(S)

)− n(S)

p(s)+n(S)× log2

(n(S)

p(s)+n(S)

)(3)

In f oGainmulti (S,F) = Entropymulti (S)− ∑n∈Values(F)

p(sn)+n(sn)

p(s)+n(S)×Entropymulti (sn) (4)

respectively, where β is the parameter vector to beestimated.

Given a bag b with n instances xi ∈ b, we as-sume that the bag-level class probability is eithergiven by

Pr (Y | b) =1n

n

∑i=1

Pr (y | xi) (7)

or by

logPr (y = 1 | b)Pr (y = 0 | b)

=1n

n

∑i=1

logPr (y = 1 | xi)

Pr (y = 0 | xi)(8)

From (6) we can get (9) and (10).

Based on (9) and (12) we can estimate the pa-rameter vector β by maximizing the bag-level bi-nomial log-likelihood function (11) where N is thenumber of bags.

As usual, the maximization of the log-likelihood function is carried out via numeric op-timization because there is no direct analytical so-lution. The optimization problem can be solvedvery efficiently because we are working with a lin-ear model.

5 Class imbalance problem inMulti-Views MLO classification

For classification on single-views of mine likeobject (MLO) detection, we can apply many exist-ing approaches such as sampling methods [27] [31]or cost-sensitive classification methods [29] [34][36]. For classification on multi-views of MLO de-tection, to our knowledge there are very few discus-sions related to the multi-instance class imbalancedproblems.

For the single-instance data imbalance problem,the machine learning community has addressed theissue of class imbalances in two different ways tosolve the skewed vector space problem. The first

method, which is classifier-independent, is to bal-ance the distributions by considering the represen-tative proportions of class examples in the distri-bution of the original data. The simplest way tobalance a dataset is to under-sample or over-sample(randomly or selectively) the majority class, whilemaintaining the original minority class population[34]. One of the most common pre-processingmethods to balance a dataset, Synthetic Minor-ity Over-sampling Technique (SMOTE) [31], over-samples the minority class by taking each minor-ity class sample and introducing synthetic exam-ples along the line segments joining any or all ofthe k minority class nearest neighbors. Evidenceshows that synthetic sampling methods are effec-tive when dealing with learning from imbalanceddata [27] [31] [34].

Working with classifiers to adapt datasets isanother way to deal with the single-instance im-balanced data problem. The theoretical foun-dation and algorithms of cost-sensitive methodsnaturally apply to imbalanced learning problems[29][30]. Thus, for imbalanced learning domains,cost-sensitive techniques provide a viable alter-native to sampling methods. Recent research[27][29][34] suggests that assigning distinct coststo the training examples is a fundamental approachof this type, and various experimental studies of this[23][25][36] have been performed using differentkinds of classifiers.

The work of [48] provides a cost-sensitiveboosting algorithm for imbalanced multi-instanceclassification. This algorithm makes modificationsbased on the original Adaboost algorithm [19] forimbalanced multi-instance datasets.

The original AdaBoost [19] iteratively updatesthe distribution function over the training data. Thismeans that for every iteration t = 1, . . . ,T , where Tis a given number of the total number of iterations,the distribution function Dt is updated sequentially,and used to train a new hypothesis:


Pr (y = 1 | b) =[∏n

i Pr (y = 1 | xi)]1n

[∏ni Pr (y = 1 | xi)]

1n +[∏n

i Pr (y = 0 | xi)]1n=

exp(1

n β∑i xi)

1+ exp(1

n β∑i xi) (9)

Pr (y = 0 | b) =[∏n

i Pr (y = 0 | xi)]1n

[∏ni Pr (y = 1 | xi)]

1n +[∏n

i Pr (y = 0 | xi)]1n=

11+ exp

(1n β∑i xi

) (10)

LL =N

∑i=1

[yilogPr (y = 1 | b)+(1− yi)logPr (y = 0 | b)] (11)

Dt+1 (i) =Dt (i) exp(−αtyiht(xi))

Zt(12)

whereαt =12 ln

(1−αt

αt

)is the weight updating

parameter, ht(xi) is the prediction output of hy-pothesis ht on the instance xi, εt is the error ofhypothesis ht over the training data, and Zt is a nor-malization factor. Here each xi is an n-tuple of at-tribute values belonging to a certain domain or in-stance space X, and yi is a label in a label set Y.

Schapire and Singer [24] used a generalizedversion of Adaboost. As shown in [24], the train-ing error of the final classifier is bounded as:

1m|{i : H(xi) = yi}| ≤ ∏

tZt (13)

where

Zt = ∑i

Dt (i)exp(−αtyiht (xi))

≤ ∑i

Dt (i)(

1+ yiht (xi)

2e−α +

1+ yiht (xi)

2eα)

(14)Minimizing Zt on each round, αt is induced as:

αt =12

ln

(∑i,yi=ht(xi) Dt (i)

∑i,yi =ht(xi) Dt (i)

)(15)

The weighting strategy of AdaBoost identifies sam-ples on their classification outputs as correctly clas-sified or misclassified. However, it treats samples ofdifferent classes equally. The weights of misclassi-fied samples from different classes are increased byan identical ratio, and the weights of correctly clas-sified samples from different classes are decreasedby an identical ratio.

Figure 1. Cost-sensitive Adaboost forMulti-Instance Learning Algorithm

Since boosting is suitable for cost-sensitiveadaption, motivated by [6]’s analysis and methodsfor choosing αt, and several cost-sensitive boost-ing methods [30] [36] [29] for imbalanced sin-gle instance learning have been proposed in recentyears. The work of [48] applied cost-minimizingtechniques to the combination schemes of ensem-ble methods for imbalanced multi-instance datasets.This learning objective expects that the weightingstrategy of a boosting algorithm will preserve aconsiderable weighted sample size of the minorityclass. A preferred boosting strategy is one that candistinguish different types of samples, and boostmore weights on those samples associated withhigher identification importance.

To denote the different identification impor-tance among bags, each bag is associated with a costitem. For an imbalanced multi-instance dataset,there are many more bags with class label y =−1than bags with class label y =+1. Using the same

4

given: A multi-instance training dataset with a set of bags χi, i = 1, … , N, where each bag can consist of an arbitrary number of instances and a given label: χi = �xi1, xi2, … , xi

ni; yi�, i = 1, … , N, yi ∈ {−1, +1}, and each instance xi

ni is an M-tuple of attribute values belonging to a certain domain or instance space ℝ.

initialize 𝐷𝐷1(𝑖𝑖) = 1/𝑚𝑚 .

for t = 1, … , T && the constraint condition 𝜂𝜂 is satisfied

train a weak learner using distribution 𝐷𝐷𝑡𝑡.

get a weak hypothesis ℎ𝑡𝑡:𝜒𝜒 → ℝ.

Choose 𝛼𝛼𝑡𝑡 ∈ ℝ.

Update: 𝐷𝐷𝑡𝑡+1(𝑖𝑖) = 𝐷𝐷𝑡𝑡(𝑖𝑖)𝐾𝐾𝑡𝑡(𝜒𝜒𝑖𝑖,𝑦𝑦𝑖𝑖)𝑍𝑍𝑡𝑡

(14)

where 𝑍𝑍𝑡𝑡 is a normalization factor (chosen so that 𝐷𝐷𝑡𝑡+1will be a distribution).

output the final hypothesis:

𝐻𝐻(𝜒𝜒) = 𝑠𝑠𝑖𝑖𝑠𝑠𝑠𝑠(∑ 𝛼𝛼𝑡𝑡ℎ𝑡𝑡(𝜒𝜒𝑇𝑇𝑡𝑡=1 )) (15)

imbalances in two different ways to solve the skewed vector space problem. the first method, which is classifier-independent, is to balance the distributions by considering the representative proportions of class examples in the distribution of the original data. the simplest way to balance a dataset is to under-sample or over-sample (randomly or selectively) the majority class, while maintaining the original minority class population [34]. one of the most common pre-processing methods to balance a dataset, Synthetic Minority over-sampling technique (SMote) [31],over-samples the minority class by taking each minority class sample and introducing synthetic examples along the line segments joining any or all of the k minority class nearest neighbors. evidence shows that synthetic sampling methods are effective when dealing with learning from imbalanced data [27] [31] [34].

Working with classifiers to adapt datasets is another way todeal with the single-instance imbalanced data problem. the theoretical foundation and algorithms of cost-sensitive methods naturally apply to imbalanced learning problems [29][30].thus, for imbalanced learning domains, cost-sensitive techniques provide a viable alternative to sampling methods. Recent research ([27][29][34]) suggests that assigning distinct costs to the training examples is a fundamental approach of this type, and various experimental studies of this ([23][25][36])have been performed using different kinds of classifiers.

the work of [48] provides a cost-sensitive boosting algorithmfor imbalanced multi-instance classification. this algorithm makes modifications based on the original Adaboost algorithm[19] for imbalanced multi-instance datasets.

the original AdaBoost [19] iteratively updates the distribution function over the training data. this means that for every iteration t = 1, … , T, where T is a given number of the total number of iterations, the distribution function 𝐷𝐷𝑡𝑡 is updated sequentially, and used to train a new hypothesis:

𝐷𝐷𝑡𝑡+1(𝑖𝑖) = 𝐷𝐷𝑡𝑡(𝑖𝑖) 𝑒𝑒𝑒𝑒𝑒𝑒 (−𝛼𝛼𝑡𝑡𝑦𝑦𝑖𝑖ℎ𝑡𝑡(𝑒𝑒𝑖𝑖))𝑍𝑍𝑡𝑡

(10)

where 𝛼𝛼𝑡𝑡 = 12𝑙𝑙𝑠𝑠 �1−𝜀𝜀𝑡𝑡

𝜀𝜀𝑡𝑡� is the weight updating parameter,

ht(xi) is the prediction output of hypothesis ht on the instance xi , εt is the error of hypothesis ht over the training data, and Zt is a normalization factor. here each xi is an n-tuple of attribute values belonging to a certain domain or instance space X, and yi is a label in a label set Y.

Schapire and Singer [24] used a generalized version of Adaboost. As shown in [24], the training error of the final classifier is bounded as:

1𝑚𝑚

|{𝑖𝑖:𝐻𝐻(𝑥𝑥𝑖𝑖) ≠ 𝑦𝑦𝑖𝑖}| ≤ ∏ 𝑍𝑍𝑡𝑡𝑡𝑡 (11)where

Zt = �Dt(i)i

exp�−αtyiht(xi)�

≤ ∑ 𝐷𝐷𝑡𝑡(𝑖𝑖)𝑖𝑖 �1+𝑦𝑦𝑖𝑖ℎ𝑡𝑡(𝑒𝑒𝑖𝑖)2

𝑒𝑒−𝛼𝛼 + 1+𝑦𝑦𝑖𝑖ℎ𝑡𝑡(𝑒𝑒𝑖𝑖)2

𝑒𝑒𝛼𝛼� (12)Minimizing Zt on each round, αt is induced as:

αt = 12

ln �∑ Dt(i)i,yi=ht�xi�

∑ Dt(i)i,yi≠ht�xi�� (13)

the weighting strategy of AdaBoost identifies samples on

their classification outputs as correctly classified or misclassified. however, it treats samples of different classes equally. the weights of misclassified samples from different classes are increased by an identical ratio, and the weights of correctly classified samples from different classes are decreased by an identical ratio.

fig. 1. Cost-sensitive Adaboost for Multi-instance Learning AlgorithmSince boosting is suitable for cost-sensitive adaption,

motivated by [6]’s analysis and methods for choosing αt, and several cost-sensitive boosting methods [30] [36] [29] for imbalanced single instance learning have been proposed in recent years. the work of [48] applied cost-minimizing techniques to the combination schemes of ensemble methodsfor imbalanced multi-instance datasets. this learning objective expects that the weighting strategy of a boosting algorithm will preserve a considerable weighted sample size of the minorityclass. A preferred boosting strategy is one that can distinguish different types of samples, and boost more weights on those samples associated with higher identification importance.

To denote the different identification importance among bags,each bag is associated with a cost item. for an imbalanced multi-instance dataset, there are many more bags with class label y = −1 than bags with class label y = +1 . Using the same learning framework as AdaBoost, the cost items can be fed into the weight update formula of AdaBoost (eq. (1)) to bias the weighting strategy. the proposed methods are similarto those proposed in Ref. [18]. fig. 1 shows the proposed algorithms.

in the original adaboost, 𝐾𝐾𝑡𝑡(𝜒𝜒𝑖𝑖 ,𝑦𝑦𝑖𝑖) is given as exp (−𝛼𝛼𝑡𝑡𝑦𝑦𝑖𝑖ℎ𝑡𝑡(𝜒𝜒𝑖𝑖)) . in Cost-sensitive Adaboost for Multi-instance Learning Algorithm, the modifications of 𝐾𝐾𝑡𝑡(𝜒𝜒𝑖𝑖 ,𝑦𝑦𝑖𝑖) are then given by:

Ab1: 𝐾𝐾𝑡𝑡(𝜒𝜒𝑖𝑖 ,𝑦𝑦𝑖𝑖) = exp (−𝐶𝐶𝑖𝑖𝛼𝛼𝑡𝑡𝑦𝑦𝑖𝑖ℎ𝑡𝑡(χ𝑖𝑖)) (16)Ab2: 𝐾𝐾𝑡𝑡(𝜒𝜒𝑖𝑖 ,𝑦𝑦𝑖𝑖) = 𝐶𝐶𝑖𝑖exp (−𝛼𝛼𝑡𝑡𝑦𝑦𝑖𝑖ℎ𝑡𝑡(χ𝑖𝑖)) (17)


Pr (y = 1 | b) =[∏n

i Pr (y = 1 | xi)]1n

[∏ni Pr (y = 1 | xi)]

1n +[∏n

i Pr (y = 0 | xi)]1n=

exp(1

n β∑i xi)

1+ exp(1

n β∑i xi) (9)

Pr (y = 0 | b) =[∏n

i Pr (y = 0 | xi)]1n

[∏ni Pr (y = 1 | xi)]

1n +[∏n

i Pr (y = 0 | xi)]1n=

11+ exp

(1n β∑i xi

) (10)

LL =N

∑i=1

[yilogPr (y = 1 | b)+(1− yi)logPr (y = 0 | b)] (11)

Dt+1 (i) =Dt (i) exp(−αtyiht(xi))

Zt(12)

whereαt =12 ln

(1−αt

αt

)is the weight updating

parameter, ht(xi) is the prediction output of hy-pothesis ht on the instance xi, εt is the error ofhypothesis ht over the training data, and Zt is a nor-malization factor. Here each xi is an n-tuple of at-tribute values belonging to a certain domain or in-stance space X, and yi is a label in a label set Y.

Schapire and Singer [24] used a generalizedversion of Adaboost. As shown in [24], the train-ing error of the final classifier is bounded as:

1m|{i : H(xi) = yi}| ≤ ∏

tZt (13)

where

Zt = ∑i

Dt (i)exp(−αtyiht (xi))

≤ ∑i

Dt (i)(

1+ yiht (xi)

2e−α +

1+ yiht (xi)

2eα)

(14)Minimizing Zt on each round, αt is induced as:

αt =12

ln

(∑i,yi=ht(xi) Dt (i)

∑i,yi =ht(xi) Dt (i)

)(15)

The weighting strategy of AdaBoost identifies sam-ples on their classification outputs as correctly clas-sified or misclassified. However, it treats samples ofdifferent classes equally. The weights of misclassi-fied samples from different classes are increased byan identical ratio, and the weights of correctly clas-sified samples from different classes are decreasedby an identical ratio.

Figure 1. Cost-sensitive Adaboost forMulti-Instance Learning Algorithm

Since boosting is suitable for cost-sensitiveadaption, motivated by [6]’s analysis and methodsfor choosing αt, and several cost-sensitive boost-ing methods [30] [36] [29] for imbalanced sin-gle instance learning have been proposed in recentyears. The work of [48] applied cost-minimizingtechniques to the combination schemes of ensem-ble methods for imbalanced multi-instance datasets.This learning objective expects that the weightingstrategy of a boosting algorithm will preserve aconsiderable weighted sample size of the minorityclass. A preferred boosting strategy is one that candistinguish different types of samples, and boostmore weights on those samples associated withhigher identification importance.

To denote the different identification impor-tance among bags, each bag is associated with a costitem. For an imbalanced multi-instance dataset,there are many more bags with class label y =−1than bags with class label y =+1. Using the same


learning framework as AdaBoost, the cost items canbe fed into the weight update formula of AdaBoost(Eq. (1)) to bias the weighting strategy. The pro-posed methods are similar to those proposed in Ref.[18]. Fig. 1 shows the proposed algorithms.

In the original adaboost, Kt (χi,yi) is given asexpα(−αtyiht(χi)). In Cost-sensitive Adaboost forMulti-Instance Learning Algorithm, the modifica-tions of Kt (χi,yi) are then given by:

Ab1:

Kt (χi,yi)= exp(−Ciαtyiht(χi)) (16)

Ab2:

Kt (χi,yi)= Ciexp(−αtyiht(χi)) (17)

Ab3:

Kt (χi,yi)= Ciexp(−Ciαtyiht(χi)) (18)

Ab4:

Kt (χi,yi)= C2i exp(−C2

i αtyiht(χi)) (19)

Respectively, for αt and η, from [48] we can get(20)-(27)

On the other hand, similar to single-viewMLO classification, we can also apply bag over-sampling, a classifier-independent method on im-balanced multi-views MLO classification. TheBag Over Sampling is a bag level over-samplingapproach in which the minority class is over-sampled with replacement.

We have presented two approaches for the classimbalance problem in Multi-views MLO classifi-cation. One advantage of these approaches is thatboth of them are learner independent. Thereforethese two approaches can be applied for the multi-instance learning and the DS fusion methods whichare presented in previous two sections.

6 Data preprocessing

The first step of this classification task is thesegmentation of the sonar images into three distinctregions: highlight or target echo (sound scatteredby the target by active sonar), shadow (regions oflow acoustic energy created by an object or seabed

feature blocking the sound propagation) and back-ground or seabed.

Figure 2. Example of an image processing resulton an image provided by the Ocean System Lab,

Heriot-Watt University

In mine countermeasure missions (MCM),sonar images collected by AUVs will convey impor-tant information about the underwater conditions.How to properly process the sonar images will havea significant impact on the subsequent MLOs detec-tion and classification stages.

In the MCMs, a large part of sonar images col-lected by AUVs represent the background—seabed.In MLOs detection and classification, we are moreinterested in the object that lies on the seabed ratherthan the background. The areas from the imageswith only background information can be simplydiscarded. Image segmentation is a widely usedimage processing technique to detect target objectsand segment the original images into small piecesthat contain the target objects. The foreground ob-jects are assumed to have a more complex texturethan the seabed. Thus, the foreground object areasare obtained by using local range and standard de-viation filters.

Instead of dealing with the whole sonar image,image segmentation allows us to only process thesmaller pieces, reducing the future computationalload. In this step, our goal is to delete image datathat contain only background information and re-duce the amount of data to be processed. Thereforewhether the size, shape and location of the targetobject are accurately found is not a main concern inthis step.

The objective of the image processing proce-dures at this point is data reduction rather thanMLOs detection. Thus, a relatively high false alarmrate is acceptable.

Fig. 2 illustrates the extraction of foregroundobjects from a sonar image which was providedby the Ocean Systems Lab, Heriot-Watt University.


f a f a (20)

ηAb1 : ∑i,yi=ht(χi)

CiDt (i)> ∑i,yi =ht(χi)

CiDt (i) (21)

αt Ab2=12

ln

(∑i,yi=ht(χi)CiDt (i)

∑i,yi =ht(χi)CiDt (i)

)(22)



CiDt (i) (23)

ηt Ab3=12

ln

(∑iCiDt (i)+∑i,yi=ht(χi)C

2i Dt (i)−∑i,yi =ht(χi)C

2i Dt (i)

∑iCiDt (i)−∑i,yi=ht(χi)C2i Dt (i)+∑i,yi =ht(χi)C

2i Dt (i)

)(24)


C2i Dt (i)> ∑

i,yi =ht(χi)

C2i Dt (i) (25)

αt Ab4=12

ln

(∑iC2

i Dt (i)+∑i,yi=ht(χi)C4i Dt (i)−∑i,yi =ht(χi)C

4i Dt (i)

∑iC2i Dt (i)−∑i,yi=ht(χi)C

4i Dt (i)+∑i,yi =ht(χi)C

4i Dt (i)

)(26)


C4i Dt (i)> ∑

i,yi =ht(χi)

C4i Dt (i) (27)

Areas that do not have a reasonable size will be ig-nored.

For object detection tasks, an object should bedetected through a single view, no matter where andhow it lies on the seabed. Therefore, the featuresused should be robust to the location and orienta-tion of the object. The grayscale histogram, a sim-ple but informative statistical feature, is considered.In many image recognition systems, many complexfeatures are used, but such features will inevitablyincrease the computational complexity, impedingthe real time detection. The histogram is easy tocalculate and robust to rotation. The distribution ofthe grayscale value can be well described by thisfeature.

In our experiment, the grayscale value (0-255)is divided into 16 bins with width 16. The grayscalehistograms are normalized to the frequency that apixel value falls into each bin. The MLOs are la-beled as the positive examples.

7 Experimental Results Of Multi-aspects Images Classification

7.1 Classification on multi-views of object

In the experiments, we study the classificationperformances as a function of the number of aspectsand compare the experimental result using DS andMulti-instance classifiers.

The binary dataset used in this empirical studyis described in TABLE I. which has binary class.The negative examples denote the non MLOs andthe positive examples denote the MLOs. In thisexperiment each object has three views so we canstudy the classification performances as a functionof the number of views. ROC curves are chosenas the measure technique for the classification. Theexperimental results are shown in Figure 3 and Fig-ure 4.

Figure 3 shows the ROC curves as a function ofthe number of aspects using MITI as the classifierand Figure 4 shows the ROC curves using DS withdecision tree as the classifier. We can find that forboth classifiers, with more views used for classifi-cation, the performance is better.

Wang X., Liu X., Japkowicz N. and Stan Matwin

αt Ab1=12 ln

(1+∑i,yi=ht(χi)

CiDt(i)−∑i,yi =ht(χi)CiDt(i)

1−∑i,yi=ht(χi)CiDt(i)+∑i,yi =ht(χi)

CiDt(i)

)(20)



CiDt (i) (21)

αt Ab2=12

ln



)(22)



CiDt (i) (23)

ηt Ab3=12

ln



2i Dt (i)


2i Dt (i)

)(24)


C2i Dt (i)> ∑

i,yi =ht(χi)

C2i Dt (i) (25)

αt Ab4=12

ln

(∑iC2


4i Dt (i)



4i Dt (i)

)(26)


C4i Dt (i)> ∑

i,yi =ht(χi)

C4i Dt (i) (27)

Fig. 2 illustrates the extraction of foregroundobjects from a sonar image which was providedby the Ocean Systems Lab, Heriot-Watt University.Areas that do not have a reasonable size will be ig-nored.









f a f a (20)



CiDt (i) (21)

αt Ab2=12

ln



)(22)



CiDt (i) (23)

ηt Ab3=12

ln



2i Dt (i)


2i Dt (i)

)(24)


C2i Dt (i)> ∑

i,yi =ht(χi)

C2i Dt (i) (25)

αt Ab4=12

ln

(∑iC2


4i Dt (i)



4i Dt (i)

)(26)


C4i Dt (i)> ∑

i,yi =ht(χi)

C4i Dt (i) (27)

Areas that do not have a reasonable size will be ig-nored.









Table 1. Multi-views mlo datasets

Datasets # objects # attribute # positive examples # negative examplesMLOa # 360 # 16 # 180 180

Table 2. Multi-instance class imbalanced datasets

Datasets # objects # attribute # min objects % min objects # min instances % min instancesMLO1 561 16 58 10.34 116 10.34MLO2 555 16 64 11.53 144 12.18MLO3 425 16 65 15.29 158 17.67

7.2 Classification on class imbalancedmulti-views of object

The datasets utilized in our empirical study aredescribed in TABLE II. The percentage of minor-ity bags varies from 8.27% to 15.29%. All datasetshave a binary class. All of these datasets have morethan one “view” on an object.

Figure 3. Classification performances as afunction of the number of aspects using MITI as

the classifier

To manage the significant number of possiblecombinations of images for multiple views, two fu-sion approaches are used to fuse the output proba-bilities.

The first approach is to use a multi-instancelearning method to study the classification perfor-mances as a function of the number of aspects andthe Multi-instance logistic regression classifier ischosen as the multi-aspect classifier. The secondapproach is fusing the output probabilities from the

single aspect classifier. The Dempster-Shafer (DS)method is used to fuse the results as a decision fu-sion method and the logistic regression classifier ischosen as the single aspect classifier.

Figure 4. Classification performances as afunction of the number of aspects using DS on

decision tree as the classifier

Since in learning from extremely imbalanceddata, a trivial classifier that predicts every case asthe majority class can still achieve very high accu-racy, the overall classification accuracy is often notan appropriate measure of performance. We chooseGmean [2] and F-measure as the measures for ouralgorithm and experiment. The definition of Gmeanis listed in Table III.

Specificity: true Negative Rate

Acc− =T N

T N +FP(28)

Sensitivity: true Positive Rate


Table 3. Confusion matrix

Predicted Positive Class Predicted Negative ClassActual Positive class TP (True Positive) FN (False Negative)Actual Negative class FP (False Positive) TN (True Negative)


Datasets # objects # attribute # cylinder % manta # weddingcakeMLOb 279 16 93 93 93

Acc+ =T P

T P+FN(29)

Gmean = (Acc−×Acc+)1/2 (30)

TABLE XII and TABLE XIII show the exper-imental results of this study. Comparing the as-pect classification rates with the two multi-aspectapproaches, we see that collecting multiple viewsproduces a significant increase in Gmean and F-measure for classification. Moreover, the multi-instance learning method gets better classifica-tion performance than the Dempster-Shafer (DS)method with single aspect classifier on all shapeson the same number of aspects combined.

7.3 Classification on MLOs with multi-views

We have three different shapes of MLOs whichare cylinder, manta and wedding cake shapes. Aftermaking a classification of MLOs and non MLOs,we can keep on making a classification on whatkind of shape the MLO belongs to. TABLE IVshows the details of this dataset. TABLE V toTABLE VII show the confusion matrices result-ing from single-aspect classification using decisiontree, multi-aspects classification using MITI andmulti-aspects classification using DS with decisiontree respectively.

TABLE VIII to TABLE X give the confusionmatrices resulting from single-aspect classificationusing Logistic Regression, multi-aspects classifica-tion using MILR and multi-aspects classificationusing DS with Logistic Regression respectively.

From these classification results we can see thatthe classification performance, both on using themulti-instance framework and data fusion frame-work, were improved by using more “views” in the

classification.

7.4 Statistical test method

As Friedman’s test [40] is a non-parametric sta-tistical test for multiple classifiers and multiple do-mains, we performed it on the results in TABLE XIIand TABLE XIII. The null hypothesis for this test isthat all the classifiers perform equally, and rejectionof the null hypothesis means that there is at least onepair of classifiers with significantly different perfor-mance. This test is performed on the multiplicationresults of Gmean and F-measure.

Friedman’s test result is shown in the TABLEXI.

Since Friedman’s test shows that these clas-sifiers perform differently, we then applied Ne-menyi’s post-hoc test [40] to determine which clas-sifier has better performance than others. By com-paring their q values [40] with the critical valueqC = 3.22, we can determine if one classifier is bet-ter than the other one: positive and bigger than qC

–lose; negative and the absolute value larger thanqC –win; other cases –equal.

The scores of all the classifiers in TABLE XIIand TABLE XIII are presented in TABLE XIV. Theresult of 4-2-0 for Ab1 means that this classifierwins 4 times, ties 2 times, and loses zero times.If we set the scores as win=1, equal=0 and lose=-1, the score of each classifier can be calculated.The total score of these classifiers using MITI andDS with decision tree as base learners can also becalculated. From the result we can find that Ab1,Ab3 and Ab4 show better performance in theseclassifiers dealing with class imbalanced multipleviews classification. On the other hand, combinedwith MITI, cost-sensitive boosting method has thechance to get the best performance in all presentedclassifiers.


Table 3. Confusion matrix

Predicted Positive Class Predicted Negative ClassActual Positive class TP (True Positive) FN (False Negative)Actual Negative class FP (False Positive) TN (True Negative)


Datasets # objects # attribute # cylinder % manta # weddingcakeMLOb 279 16 93 93 93

Acc+ =T P

T P+FN(29)

Gmean = (Acc−×Acc+)1/2 (30)

TABLE XII and TABLE XIII show the exper-imental results of this study. Comparing the as-pect classification rates with the two multi-aspectapproaches, we see that collecting multiple viewsproduces a significant increase in Gmean and F-measure for classification. Moreover, the multi-instance learning method gets better classifica-tion performance than the Dempster-Shafer (DS)method with single aspect classifier on all shapeson the same number of aspects combined.

7.3 Classification on MLOs with multi-views

We have three different shapes of MLOs whichare cylinder, manta and wedding cake shapes. Aftermaking a classification of MLOs and non MLOs,we can keep on making a classification on whatkind of shape the MLO belongs to. TABLE IVshows the details of this dataset. TABLE V toTABLE VII show the confusion matrices result-ing from single-aspect classification using decisiontree, multi-aspects classification using MITI andmulti-aspects classification using DS with decisiontree respectively.

TABLE VIII to TABLE X give the confusionmatrices resulting from single-aspect classificationusing Logistic Regression, multi-aspects classifica-tion using MILR and multi-aspects classificationusing DS with Logistic Regression respectively.

From these classification results we can see thatthe classification performance, both on using themulti-instance framework and data fusion frame-work, were improved by using more “views” in the

classification.

7.4 Statistical test method

As Friedman’s test [40] is a non-parametric sta-tistical test for multiple classifiers and multiple do-mains, we performed it on the results in TABLE XIIand TABLE XIII. The null hypothesis for this test isthat all the classifiers perform equally, and rejectionof the null hypothesis means that there is at least onepair of classifiers with significantly different perfor-mance. This test is performed on the multiplicationresults of Gmean and F-measure.

Friedman’s test result is shown in the TABLEXI.

Since Friedman’s test shows that these clas-sifiers perform differently, we then applied Ne-menyi’s post-hoc test [40] to determine which clas-sifier has better performance than others. By com-paring their q values [40] with the critical valueqC = 3.22, we can determine if one classifier is bet-ter than the other one: positive and bigger than qC

–lose; negative and the absolute value larger thanqC –win; other cases –equal.

The scores of all the classifiers in TABLE XIIand TABLE XIII are presented in TABLE XIV. Theresult of 4-2-0 for Ab1 means that this classifierwins 4 times, ties 2 times, and loses zero times.If we set the scores as win=1, equal=0 and lose=-1, the score of each classifier can be calculated.The total score of these classifiers using MITI andDS with decision tree as base learners can also becalculated. From the result we can find that Ab1,Ab3 and Ab4 show better performance in theseclassifiers dealing with class imbalanced multipleviews classification. On the other hand, combinedwith MITI, cost-sensitive boosting method has thechance to get the best performance in all presentedclassifiers.


Table 5. The Confusion Matrices Resulting From Single-Aspect Classification Using Decision Tree

(a) Single AspectCylinder Manta Wedding Cake

Cylinder 58.9 24.9 16.2Manta 11.9 73.5 14.6

Wedding Cake 11.4 16.2 72.4

Table 6. The Confusion Matrices Resulting From Multi-Aspects Classification Using MITI

(b) Multi AspectsCylinder Manta Wedding Cake

Cylinder 80.6 11.8 7.5Manta 8.6 83.3 8.1


Table 7. The Confusion Matrices Resulting From Multi-Aspects Classification Using DS Fusion WithDecision Tree

(c) Multi AspectsCylinder Manta Wedding Cake

Cylinder 73.6 17.2 9.2Manta 4.8 84.4 10.8


Table 8. The Confusion Matrices Resulting From Single-Aspect Classification Using Logisitic Regreesion

(a) Single AspectCylinder Manta Wedding Cake

Cylinder 58.9 22.7 18.4Manta 30.8 55.1 14.1


Table 9. The Confusion Matrices Resulting From Multi-Aspects Classification Using MILR

(b) Multi AspectsCylinder Manta Wedding Cake

Cylinder 72.6 16.7 10.7Manta 10.3 76.3 5.4


Table 10. The Confusion Matrices Resulting From Multi-Aspects Classification Using DS Fusion WithLogisitic Regreesion

(c) Multi AspectsCylinder Manta Wedding Cake

Cylinder 69.4 19.9 10.7Manta 16.7 75.3 8.0



Table 11. Friedman’s test result of TABLE XII

Friedman χ2 df p-value Critical χ2

22.7143 6 0.000898 12.5922.7143 > 12.59, hypothesis rejected

Table 12. Friedman’s test result of TABLE XIII



8 Conclusions

In this paper, we have considered the improvingin the classification of sidescan sonar images ob-tained by using feature sets corresponding to multi-ple sonar views of the same object. There are twobasic ways in which the multiple feature sets canbe utilized. The first approach consists of fusingthe multiple individual classification of the multiplefeature vectors with the DS method. The secondapproach uses multi-instance classification methodsto classify multiple feature vectors. Tree methodsand Logistic Regression methods were chosen asthe base learners for these two approaches in ourexperiments.

Moreover, class imbalanced problem in MLOclassification was also considered in this paper. Wepresented two frameworks to deal with the multipleviews class imbalanced problem in MLO classifica-tion. The first framework is a classifier-independentapproach which uses bag over-sampling method toincrease the minority instance numbers. The secondframework is the Cost-sensitive boosting methodfor multiple views classification.

Our experimental results show that for MLOclassification, given multiple views of an object,knowledge of the classification performance ofmultiple views is needed as by revisiting some ofthe contacts at suboptimal aspects, the overall sur-vey time can be reduced. Using the multi-aspectside scan sonar images of various mine-like objectshapes and non mine-like objects, we constructedsecondary view classification curves to be used inconjunction with a path planning algorithm.

We have also studied the classification perfor-mances as a function of the number of aspects.Comparing the aspect classification rates with twomulti-aspect approaches on different shapes, we

see that collecting multiple views produces a sig-nificant increase in hit rate and a significant de-crease in error rate for all mine shapes. More-over, the multi-instance learning method gets bet-ter classification performance than the Dempster-Shafer (DS) method with the single aspect classifieron all shapes on the same number of aspects com-bined.

For the multi-views class imbalance prob-lem, we have provided two novel frameworks forthis problem: using data generated method or acost-sensitive boosting method. Based on thesemethods, we have presented experimental analy-sis using different learning algorithms with MLOdatasets. Experimental evidence derived from stan-dard datasets was presented to support the cost-sensitive optimality of the proposed algorithms. Wefound that the cost-sensitive boosting with MILconsistently and significantly outperformed all theother methods tested.

References[1] B. Zerr, B. Stage. Three-dimensional reconstruc-

tion of underwater objects from a sequence ofsonar images, Proceedings of the IEEE Interna-tional Conference on Image Processing, pp. 927–930, (1996).

[2] B. Zerr, B. Stage and A. Guerrero, Automatic Tar-get Classification Using Multiple Sidescan SonarImages of Different Orientations, SACLANT CENMemorandum SM-309 (1997).

[3] B. Zerr, E. Bovio, B. Stage, Automatic mine clas-sification approach based on AUV manoeuverabil-ity and cots side scan sonar, Proceedings of Goats2001 Conference, La Spezia, Italy, (2001).

[4] M. Couillard, J. Fawcett, M. Davison and V. My-ers, Optimizing time-limited multi-aspect classifi-


Table 11. Friedman’s test result of TABLE XII



Table 12. Friedman’s test result of TABLE XIII



8 Conclusions

In this paper, we have considered the improvingin the classification of sidescan sonar images ob-tained by using feature sets corresponding to multi-ple sonar views of the same object. There are twobasic ways in which the multiple feature sets canbe utilized. The first approach consists of fusingthe multiple individual classification of the multiplefeature vectors with the DS method. The secondapproach uses multi-instance classification methodsto classify multiple feature vectors. Tree methodsand Logistic Regression methods were chosen asthe base learners for these two approaches in ourexperiments.

Moreover, class imbalanced problem in MLOclassification was also considered in this paper. Wepresented two frameworks to deal with the multipleviews class imbalanced problem in MLO classifica-tion. The first framework is a classifier-independentapproach which uses bag over-sampling method toincrease the minority instance numbers. The secondframework is the Cost-sensitive boosting methodfor multiple views classification.

Our experimental results show that for MLOclassification, given multiple views of an object,knowledge of the classification performance ofmultiple views is needed as by revisiting some ofthe contacts at suboptimal aspects, the overall sur-vey time can be reduced. Using the multi-aspectside scan sonar images of various mine-like objectshapes and non mine-like objects, we constructedsecondary view classification curves to be used inconjunction with a path planning algorithm.

We have also studied the classification perfor-mances as a function of the number of aspects.Comparing the aspect classification rates with twomulti-aspect approaches on different shapes, we

see that collecting multiple views produces a sig-nificant increase in hit rate and a significant de-crease in error rate for all mine shapes. More-over, the multi-instance learning method gets bet-ter classification performance than the Dempster-Shafer (DS) method with the single aspect classifieron all shapes on the same number of aspects com-bined.

For the multi-views class imbalance prob-lem, we have provided two novel frameworks forthis problem: using data generated method or acost-sensitive boosting method. Based on thesemethods, we have presented experimental analy-sis using different learning algorithms with MLOdatasets. Experimental evidence derived from stan-dard datasets was presented to support the cost-sensitive optimality of the proposed algorithms. Wefound that the cost-sensitive boosting with MILconsistently and significantly outperformed all theother methods tested.

References[1] B. Zerr, B. Stage. Three-dimensional reconstruc-

tion of underwater objects from a sequence ofsonar images, Proceedings of the IEEE Interna-tional Conference on Image Processing, pp. 927–930, (1996).

[2] B. Zerr, B. Stage and A. Guerrero, Automatic Tar-get Classification Using Multiple Sidescan SonarImages of Different Orientations, SACLANT CENMemorandum SM-309 (1997).

[3] B. Zerr, E. Bovio, B. Stage, Automatic mine clas-sification approach based on AUV manoeuverabil-ity and cots side scan sonar, Proceedings of Goats2001 Conference, La Spezia, Italy, (2001).

[4] M. Couillard, J. Fawcett, M. Davison and V. My-ers, Optimizing time-limited multi-aspect classifi-


cation, Proceedings of the Institute of Acoustics29(6), 89-96 (2007).

[5] J. Fawcett, V. Myers, D. Hopkin, A. Crawford,M. Couillard, B. Zerr. Multiaspect classification ofsidescan sonar images: Four different approachesto fusing single-aspect information, Oceanic Engi-neering, IEEE Journal of 35(4): 863 –876 (2010).

[6] S. Reed, Y. Petillot, J. Bell, Model-based ap-proach to the detection and classification of minesin side scan sonar, Applied Optics 43(2): 237– 246.(2004).

[7] S. Reed, Y. Petillot, J. Bell, Automated approachto classification of mine-like features in sidescansonar using highlight and shadow information, IEEProc. Radar, Sonar & Navigation 151 (No.1), 48-56, (2004).

[8] V. Myers, D. P. Williams, A POMDP for multi-view target classification with an autonomous un-derwater vehicle, OCEANS, pp. 1-5, (2010).

[9] V. Myers, D. P. Williams, Adaptive Multiview Tar-get Classification in Synthetic Aperture Sonar Im-ages Using a Partially Observable Markov Deci-sion Process, Oceanic Engineering, IEEE Journalof, On page(s): 45 - 55, Volume: 37 Issue: 1, Jan.(2012)

[10] D. Williams, V. Myers, and M. Silvious, ”MineClassification with Imbalanced Data,” IEEE Geo-science and Remote Sensing Letters, Vol. 6, No. 3,pp. 528-532, July 2009.

[11] G. Dobeck, “Fusing sonar images for mine detec-tion and classification,” Proc. SPIE—Int. Soc. Opt.Eng., vol. 3710, 1999, DOI: 10.1117/12.357082.

[12] J. Tucker, N. Klausner, and M. Azimi-Sadjadi,“Target detection in M-disparate sonar platformsusing multichannel hypothesis testing,” in Proc.OCEANS Conf., Quebec City, QC, Canada, 2008,DOI: 10.1109/ OCEANS.2008.5151818.

[13] M. Azimi-Sadjadi, A. Jamshidi, and G. Dobeck,“Adaptive underwater target classification withmulti-aspect decision feedback,” Proc. SPIE—Int. Soc. Opt. Eng., vol. 4394, 2001, DOI:10.1117/12.445444.

[14] X. Xu and E. Frank. Logistic regression and boost-ing for labeled bags of instances. In Lecture Notesin Computer Science, volume 3056, pages 272–281, April 2004.

[15] Hall, D. L. and Steinberg, A.. Dirty Secrets in Mul-tisensor Data Fusion, http://www.dtic.mil. (2001).

[16] Blockeel, H., Page, D., Srinivasan, A.: Multi-instance tree learning. In: ICML (2005).

[17] Quinlan, J. R. C4.5: Programs for Machine Learn-ing. Morgan Kaufmann Publishers, (1993).

[18] Hosmer, David W.; Lemeshow, Stanley. AppliedLogistic Regression (2nd ed.). Wiley. (2000).

[19] Freund,Y., Schapire,R.E.: Experiments with a newboosting algorithm. In: Machine Learning: Pro-ceedings of the Thirteenth International Confer-ence, pp. 148–156 (1996)

[20] Kubat, M., & Matwin, S.: Addressing the curse ofimbalanced training sets: One-sided selection. In:Proceddings of the Fourteenth International Con-ference on Machine Learning, 179-186 (1997)

[21] Dietterich, T., Lathrop, R., Lozano-Perez, T.: Solv-ing the multiple instance problem with the axis-parallel rectangles. In: Artificial Intelligence, 89(1-2), 31–71 (1997)

[22] Maron, O., Lozano-Pz T.: A framework for multi-ple instance learning. In: Proc. of the 1997 Conf.on Advances in Neural Information ProcessingSystems 10, p.570-576 (1998)

[23] Fan, W., Stolfo,S.J., Zhang,J., Chan,P.K.: Ada-Cost: Misclassification Cost-Sensitive Boosting.In: Proc. Int’l Conf. Machine Learning, pp. 97-105(1999)

[24] Schapire,R.E., Singer,Y.: Improved boosting algo-rithms using confidence-rated predictions. In: Ma-chine Learning, 37 (3) 297–336 (1999)

[25] Ting, K.M.: A Comparative Study of Cost-Sensitive Boosting Algorithms. In: Proc. Int’lConf. Machine Learning, pp. 983-990 (2000)

[26] Wang, J., Zucker, J.D.: Solving the multiple-instance problem: A lazy learning approach. In:ICML (2000)

[27] Japkowicz, N.: Learning from Imbalanced DataSets: A Comparison of Various Strategies. In:Proc. Am. Assoc. for Artificial Intelligence(AAAI)Workshop Learning from Imbalanced Data Sets,pp. 10-15. (Technical Report WS-00-05) (2000)

[28] Zhang, Q., Goldman, S. A.: EM-DD: An improvedmultiple instance learning technique. In: Neural In-formation Processing Systems 14 (2001)

[29] Elkan, C.: The Foundations of Cost-SensitiveLearning. In: Proc. Int’l Joint Conf. Artificial In-telligence, pp. 973-978 (2001)

[30] Ting, K.M.: An Instance-Weighting Method to In-duce Cost-Sensitive Trees. In: IEEE Trans. Knowl-edge and Data Eng., vol. 14, no. 3, pp. 659-665(2002)


[31] Chawla,N. V., Bowyer,K. W., Hall,L. O.,Kegelmeyer, W. P.: SMOTE: Synthetic Mi-nority Over-sampling Technique. In: Journal ofArtificial Intelligence Research, 16: 321-357(2002)

[32] Zhang, M.L., Goldman, S.: Em-dd: An improvedmulti-instance learning technique. In: NIPS (2002)

[33] Andrews, S., Tsochandaridis, I., Hofman, T.: Sup-port vector machines for multiple instance learn-ing. In: Adv. Neural. Inf. Process. Syst. 15, 561–568 (2003)

[34] Batista, G.E.A.P.A., Prati,R.C., Monard,M.C.: AStudy of the Behavior of Several Methods for Bal-ancing Machine Learning Training Data. In: ACMSIGKDD Explorations Newsletter, vol. 6, no. 1,pp. 20-29 (2004)

[35] Blockeel, H., Page, D., Srinivasan, A.: Multi-instance tree learning. In: ICML (2005)

[36] Sun,Y., Kamel,M.S., Wong, A.K.C., Wang, Y.:Cost-Sensitive Boosting for Classification of Im-balanced Data. In: Pattern Recognition, vol. 40, no.12, pp. 3358-3378 (2007)

[37] Foulds, J., Frank, E.: Revisiting multiple-instancelearning via embedded instance selection. In: W.Wobcke & M. Zhang(Eds), 21st Australasian JointConference on Artificial Intelligence Auckland,New Zealand, (pp. 300-310) (2008)

[38] Leistner, C., Saffari, A., and Bischof, H.: MI-Forests: Multiple Instance Learning with Random-ized Trees. In: Proc. ECCV (2010)

[39] Bjerring, L., Frank, E.: Beyond trees: AdoptingMITI to learn rules and ensemble classifiers formulti-instance data. In: D. Wang & M. Reynolds(Eds.), AI 2011, LNAI 7106 (pp. 41-50) (2011)

[40] Japkowicz, N., Shah, M.: Evaluating Learning Al-gorithms: A Classification Perspective. CambridgeUniversity Press (2011)

[41] Shawe-Taylor, J. and Cristianini, N.: Further re-sults on the margin distribution. In: Proceedingsof the 12th Conference on Computational Learn-ing Theory, 278-285 (1999)

[42] Morik, K., Brockhausen, P., Joachims, T.: Com-bining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive CareMonitoring. In: ICML: 268-277 (1999)

[43] Veropoulos, K., Campbell, C., & Cristianini,N.: Controlling the sensitivity of support vec-tor machines. In: Proceedings of the InternationalJoint Conference on Artificial Intelligence, 55–60.(1999)

[44] Chih-Chung Chang and Chih-Jen Lin.: LIBSVM: a library for support vector machines. In: ACMTransactions on Intelligent Systems and Technol-ogy, 2:27:1-27:27, (2011)

[45] Chin-Wei Hsu, Chih-Chung Chang and Chih-JenLin.: A practical guide to support vector classifi-cation. In: Technical Report, National Taiwan Uni-versity. (2010)

[46] Bergstra, James; Bengio, Yoshua. :Random Searchfor Hyper-Parameter Optimization. In: J. MachineLearning Research 13: 281-305. (2012)

[47] Wang, X., Shao, H., Japkowicz, N., Matwin,S., Liu, X., Bourque, A., Nguyen, B.: UsingSVM with Adaptively Asymmetric Misclassifica-tion Costs for Mine-Like Objects Detection. In:ICMLA (2012)

[48] Wang, X., Matwin, S., Japkowicz, N., Liu, X.:Cost-Sensitive Boosting Algorithms for Imbal-anced Multi-instance Datasets. In: Canadian Con-ference on AI (2013)

[49] Hosmer, David W. Lemeshow, Stanley: AppliedLogistic Regression. Wiley. ISBN 0-471-35632-8.(2000)


[31] Chawla,N. V., Bowyer,K. W., Hall,L. O.,Kegelmeyer, W. P.: SMOTE: Synthetic Mi-nority Over-sampling Technique. In: Journal ofArtificial Intelligence Research, 16: 321-357(2002)

[32] Zhang, M.L., Goldman, S.: Em-dd: An improvedmulti-instance learning technique. In: NIPS (2002)

[33] Andrews, S., Tsochandaridis, I., Hofman, T.: Sup-port vector machines for multiple instance learn-ing. In: Adv. Neural. Inf. Process. Syst. 15, 561–568 (2003)

[34] Batista, G.E.A.P.A., Prati,R.C., Monard,M.C.: AStudy of the Behavior of Several Methods for Bal-ancing Machine Learning Training Data. In: ACMSIGKDD Explorations Newsletter, vol. 6, no. 1,pp. 20-29 (2004)

[35] Blockeel, H., Page, D., Srinivasan, A.: Multi-instance tree learning. In: ICML (2005)

[36] Sun,Y., Kamel,M.S., Wong, A.K.C., Wang, Y.:Cost-Sensitive Boosting for Classification of Im-balanced Data. In: Pattern Recognition, vol. 40, no.12, pp. 3358-3378 (2007)

[37] Foulds, J., Frank, E.: Revisiting multiple-instancelearning via embedded instance selection. In: W.Wobcke & M. Zhang(Eds), 21st Australasian JointConference on Artificial Intelligence Auckland,New Zealand, (pp. 300-310) (2008)

[38] Leistner, C., Saffari, A., and Bischof, H.: MI-Forests: Multiple Instance Learning with Random-ized Trees. In: Proc. ECCV (2010)

[39] Bjerring, L., Frank, E.: Beyond trees: AdoptingMITI to learn rules and ensemble classifiers formulti-instance data. In: D. Wang & M. Reynolds(Eds.), AI 2011, LNAI 7106 (pp. 41-50) (2011)

[40] Japkowicz, N., Shah, M.: Evaluating Learning Al-gorithms: A Classification Perspective. CambridgeUniversity Press (2011)

[41] Shawe-Taylor, J. and Cristianini, N.: Further re-sults on the margin distribution. In: Proceedingsof the 12th Conference on Computational Learn-ing Theory, 278-285 (1999)

[42] Morik, K., Brockhausen, P., Joachims, T.: Com-bining Statistical Learning with a Knowledge-Based Approach - A Case Study in Intensive CareMonitoring. In: ICML: 268-277 (1999)

[43] Veropoulos, K., Campbell, C., & Cristianini,N.: Controlling the sensitivity of support vec-tor machines. In: Proceedings of the InternationalJoint Conference on Artificial Intelligence, 55–60.(1999)

[44] Chih-Chung Chang and Chih-Jen Lin.: LIBSVM: a library for support vector machines. In: ACMTransactions on Intelligent Systems and Technol-ogy, 2:27:1-27:27, (2011)

[45] Chin-Wei Hsu, Chih-Chung Chang and Chih-JenLin.: A practical guide to support vector classifi-cation. In: Technical Report, National Taiwan Uni-versity. (2010)

[46] Bergstra, James; Bengio, Yoshua. :Random Searchfor Hyper-Parameter Optimization. In: J. MachineLearning Research 13: 281-305. (2012)

[47] Wang, X., Shao, H., Japkowicz, N., Matwin,S., Liu, X., Bourque, A., Nguyen, B.: UsingSVM with Adaptively Asymmetric Misclassifica-tion Costs for Mine-Like Objects Detection. In:ICMLA (2012)

[48] Wang, X., Matwin, S., Japkowicz, N., Liu, X.:Cost-Sensitive Boosting Algorithms for Imbal-anced Multi-instance Datasets. In: Canadian Con-ference on AI (2013)

[49] Hosmer, David W. Lemeshow, Stanley: AppliedLogistic Regression. Wiley. ISBN 0-471-35632-8.(2000)


Table 13. Comparison of all presented algorithms for Class Imbalanced problem with MITI

DatasetsMethods TPRmin TNRmin Gmean Precision Recall F-measureMLO 1Base Learner 7.8±2.0 97.8±0.2 27.5±3.6 76.6±5.0 41.4±6.9 53.7±7.1Bag Over-Sampling 18.4±4.6 90.8±1.3 40.9±5.1 65.6±5.9 64.7±7.2 65.2±6.5

Adaboost 9.8±2.9 97.2±0.2 30.8±4.6 75.9±6.2 47.1±8.3 58.1±8.1Ab1 22.4±3.4 93.1±0.4 45.7±3.6 75.7±3.0 70.6±4.6 73.1±3.8Ab2 12.1±2.9 94.9±0.4 33.8±3.9 69.1±4.3 53.2±6.6 60.1±5.6Ab3 54.6±2.3 75.8±1.4 64.3±1.8 69.3±2.1 91.2±0.7 78.7±1.6Ab4 54.9±2.7 75.6±1.9 64.4±1.5 69.2±1.3 91.3±0.9 78.7±0.9

MLO 2Base Learner 43.9±5.9 96.7±0.3 65.2±4.3 92.9±0.8 85.5±3.1 89.1±2.1Bag Over-Sampling 47.7±4.7 94.7±0.2 67.2±3.4 89.9±1.2 87.3±2.2 88.5±1.7







Table 13. Comparison of all presented algorithms for Class Imbalanced problem with MITI

DatasetsMethods TPRmin TNRmin Gmean Precision Recall F-measureMLO 1Base Learner 7.8±2.0 97.8±0.2 27.5±3.6 76.6±5.0 41.4±6.9 53.7±7.1Bag Over-Sampling 18.4±4.6 90.8±1.3 40.9±5.1 65.6±5.9 64.7±7.2 65.2±6.5








9

[37] foulds, J., frank, e.: Revisiting multiple-instance learning via embedded instance selection. in: W. Wobcke & M. Zhang(eds), 21st Australasian Joint Conference on Artificial intelligence Auckland, New Zealand, (pp. 300-310) (2008)

[38] Leistner, C., Saffari, A., and Bischof, h.: Miforests: Multiple instance Learning with Randomized trees. in: Proc. eCCv (2010)

[39] Bjerring, L., frank, e.: Beyond trees: Adopting Miti to learn rules and ensemble classifiers for multi-instance data. in: D. Wang & M. Reynolds (eds.), Ai 2011, LNAi 7106 (pp. 41-50) (2011)

[40] Japkowicz, N., Shah, M.: evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press (2011)

[41] Shawe-taylor, J. and Cristianini, N.: further results on the margin distribution. in: Proceedings of the 12th Conference on Computational Learning theory, 278-285 (1999)

[42] Morik, K., Brockhausen, P., Joachims, t.: Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in intensive Care Monitoring. in: iCML: 268-277 (1999)

[43] veropoulos, K., Campbell, C., & Cristianini, N.: Controlling the sensitivity of support vector machines. in: Proceedings of the international Joint Conference on Artificial intelligence, 55–60. (1999)

[44] Chih-Chung Chang and Chih-Jen Lin.: LiBSvM : a library for support vector machines. in: ACM transactions on intelligent Systems and technology, 2:27:1-27:27, (2011)

[45] Chin-Wei hsu, Chih-Chung Chang and Chih-Jen Lin.: A practical guide to support vector classification. in: technical Report, National taiwan University. (2010)

[46] Bergstra, James; Bengio, Yoshua. :Random Search for hyper-Parameter optimization. in: J. Machine Learning Research 13: 281-305. (2012)

[47] Wang, X., Shao, h., Japkowicz, N., Matwin, S., Liu, X., Bourque, A., Nguyen, B.: Using SvM with Adaptively Asymmetric Misclassification Costs for Mine-Like objects Detection. in: iCMLA (2012)

[48] Wang, X., Matwin, S., Japkowicz, N., Liu, X.: Cost-Sensitive Boosting Algorithms for imbalanced Multi-instance Datasets. in: Canadian Conference on Ai (2013)

[49] hosmer, David W. Lemeshow, Stanley: Applied Logistic Regression.Wiley. iSBN 0-471-35632-8.(2000)

tABLe XiiCoMPARiSoN of ALL PReSeNteD ALgoRithMS foR CLASS iMBALANCeD PRoBLeM With Miti

Datasets Methods 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 Gmean Precision Recall F-measure

MLo_1

Base_Learner 7.8±2.0 97.8±0.2 27.5±3.6 76.6±5.0 41.4±6.9 53.7±7.1Bag_over-Sampling 18.4±4.6 90.8±1.3 40.9±5.1 65.6±5.9 64.7±7.2 65.2±6.5


MLo_2



MLo_3



MLo_4



tABLe XiiiCoMPARiSoN of ALL PReSeNteD ALgoRithMS foR CLASS iMBALANCeD PRoBLeM With DS oN DeCiSioN tRee


MLo_1



MLo_2



MLo_3Base_Learner 60.0±2.1 91.1±0.3 73.9±0.6 87.1±0.3 89.3±0.2 88.2±0.2

Bag_over-Sampling 63.1±1.6 88.3±0.9 74.6±0.4 84.4±0.2 90.4±0.4 87.3±0.1Adaboost 58.5±3.2 92.8±0.2 73.6±2.4 89.0±1.0 88.6±1.1 88.8±0.9

9

[37] foulds, J., frank, e.: Revisiting multiple-instance learning via embedded instance selection. in: W. Wobcke & M. Zhang(eds), 21st Australasian Joint Conference on Artificial intelligence Auckland, New Zealand, (pp. 300-310) (2008)

[38] Leistner, C., Saffari, A., and Bischof, h.: Miforests: Multiple instance Learning with Randomized trees. in: Proc. eCCv (2010)

[39] Bjerring, L., frank, e.: Beyond trees: Adopting Miti to learn rules and ensemble classifiers for multi-instance data. in: D. Wang & M. Reynolds (eds.), Ai 2011, LNAi 7106 (pp. 41-50) (2011)

[40] Japkowicz, N., Shah, M.: evaluating Learning Algorithms: A Classification Perspective. Cambridge University Press (2011)

[41] Shawe-taylor, J. and Cristianini, N.: further results on the margin distribution. in: Proceedings of the 12th Conference on Computational Learning theory, 278-285 (1999)

[42] Morik, K., Brockhausen, P., Joachims, t.: Combining Statistical Learning with a Knowledge-Based Approach - A Case Study in intensive Care Monitoring. in: iCML: 268-277 (1999)

[43] veropoulos, K., Campbell, C., & Cristianini, N.: Controlling the sensitivity of support vector machines. in: Proceedings of the international Joint Conference on Artificial intelligence, 55–60. (1999)

[44] Chih-Chung Chang and Chih-Jen Lin.: LiBSvM : a library for support vector machines. in: ACM transactions on intelligent Systems and technology, 2:27:1-27:27, (2011)

[45] Chin-Wei hsu, Chih-Chung Chang and Chih-Jen Lin.: A practical guide to support vector classification. in: technical Report, National taiwan University. (2010)

[46] Bergstra, James; Bengio, Yoshua. :Random Search for hyper-Parameter optimization. in: J. Machine Learning Research 13: 281-305. (2012)

[47] Wang, X., Shao, h., Japkowicz, N., Matwin, S., Liu, X., Bourque, A., Nguyen, B.: Using SvM with Adaptively Asymmetric Misclassification Costs for Mine-Like objects Detection. in: iCMLA (2012)

[48] Wang, X., Matwin, S., Japkowicz, N., Liu, X.: Cost-Sensitive Boosting Algorithms for imbalanced Multi-instance Datasets. in: Canadian Conference on Ai (2013)

[49] hosmer, David W. Lemeshow, Stanley: Applied Logistic Regression.Wiley. iSBN 0-471-35632-8.(2000)

tABLe XiiCoMPARiSoN of ALL PReSeNteD ALgoRithMS foR CLASS iMBALANCeD PRoBLeM With Miti


MLo_1



MLo_2



MLo_3



MLo_4



tABLe XiiiCoMPARiSoN of ALL PReSeNteD ALgoRithMS foR CLASS iMBALANCeD PRoBLeM With DS oN DeCiSioN tRee


MLo_1



MLo_2



MLo_3Base_Learner 60.0±2.1 91.1±0.3 73.9±0.6 87.1±0.3 89.3±0.2 88.2±0.2

Bag_over-Sampling 63.1±1.6 88.3±0.9 74.6±0.4 84.4±0.2 90.4±0.4 87.3±0.1Adaboost 58.5±3.2 92.8±0.2 73.6±2.4 89.0±1.0 88.6±1.1 88.8±0.9

Wang X., Liu X., Japkowicz N. and Matwin S.

Table 14. Comparison of all presented algorithms for class imbalanced problem with DS on decision Tree

Datasets Methods TPRmin TNRmin Gmean Precision Recall F-measureMLO 1Base Learner 34.5±1.2 89.5±0.6 55.5±3.6 76.6±1.7 82.0±1.9 79.2±2.2Bag Over-Sampling 41.4±3.1 87.5±0.9 60.2±0.8 76.8±0.8 86.0±0.2 81.1±1.1








Table 15. Comparison using the statistical test method (sorted by score from high to low)

10

Datasets Methods 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 Gmean Precision Recall F-measureAb1 72.3±3.1 89.7±0.7 80.5±2.1 87.6±0.9 93.5±0.7 90.4±0.9Ab2 69.2±2.1 88.1±1.3 78.1±2.0 85.3±0.3 92.6±0.2 88.8±0.3Ab3 66.2±1.6 88.3±0.9 76.4±1.6 85.0±1.0 91.5±0.1 88.2±0.6Ab4 67.7±1.5 86.7±1.2 76.6±1.1 83.5±0.6 92.1±0.3 87.6±0.4

MLo_4



tABLe XivCoMPARiSoN USiNg the StAtiStiCAL teSt MethoD (SoRteD BY SCoRe fRoM high to LoW)

Base_Learner

Bag_Over-Sampling Adaboost Ab1 Ab2 Ab3 Ab4

Miti gmean×(f-measure) 0-0-6 1-2-3 1-2-3 4-2-0 1-2-3 4-2-0 4-2-0

Score -6 -2 -2 4 -2 4 4

DS+Decision tree gmean×(f-measure) 0-2-4 0-2-4 0-2-4 3-3-0 3-3-0 3-3-0 3-3-0Score -4 -4 -4 3 3 3 3

total score -10 -6 -6 7 1 7 7























10

Datasets Methods 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 𝑻𝑻𝑻𝑻𝑻𝑻𝒎𝒎𝒎𝒎𝒎𝒎 Gmean Precision Recall F-measureAb1 72.3±3.1 89.7±0.7 80.5±2.1 87.6±0.9 93.5±0.7 90.4±0.9Ab2 69.2±2.1 88.1±1.3 78.1±2.0 85.3±0.3 92.6±0.2 88.8±0.3Ab3 66.2±1.6 88.3±0.9 76.4±1.6 85.0±1.0 91.5±0.1 88.2±0.6Ab4 67.7±1.5 86.7±1.2 76.6±1.1 83.5±0.6 92.1±0.3 87.6±0.4

MLo_4



tABLe XivCoMPARiSoN USiNg the StAtiStiCAL teSt MethoD (SoRteD BY SCoRe fRoM high to LoW)

Base_Learner

Bag_Over-Sampling Adaboost Ab1 Ab2 Ab3 Ab4

Miti gmean×(f-measure) 0-0-6 1-2-3 1-2-3 4-2-0 1-2-3 4-2-0 4-2-0

Score -6 -2 -2 4 -2 4 4

DS+Decision tree gmean×(f-measure) 0-2-4 0-2-4 0-2-4 3-3-0 3-3-0 3-3-0 3-3-0Score -4 -4 -4 3 3 3 3

total score -10 -6 -6 7 1 7 7

JAISCR, 2014, Vol. 4, No. 2, pp. 149Wang X., Liu X., Japkowicz N. and Matwin S.






















WEB–BASED FRAMEWORK FOR BREAST CANCERCLASSIFICATION

Tomasz Bruzdzinski1, Adam Krzyzak 2, Thomas Fevens 2 and Łukasz Jelen3

1Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology,Wybrzeze Wyspianskiego 27, 50-370 Wrocław, Poland

2Department of Computer Science and Software Engineering, Concordia University,1455 De Maisonneuve Blvd. West, Montreal, Quebec, Canada H3G 1M8

3Institute of Computer Engineering, Control and Robotics, Wrocław University of Technology,Wybrzeze Wyspianskiego 27, 50-370 Wrocław, Poland

Abstract

The aim of this work is to create a web-based system that will assist its users in the cancerdiagnosis process by means of automatic classification of cytological images obtainedduring fine needle aspiration biopsy. This paper contains a description of the study on thequality of the various algorithms used for the segmentation and classification of breastcancer malignancy. The object of the study is to classify the degree of malignancy ofbreast cancer cases from fine needle aspiration biopsy images into one of the two classesof malignancy, high or intermediate. For that purpose we have compared 3 segmentationmethods: k-means, fuzzy c-means and watershed, and based on these segmentations wehave constructed a 25–element feature vector. The feature vector was introduced as aninput to 8 classifiers and their accuracy was checked.

The results show that the highest classification accuracy of 89.02 % was recorded forthe multilayer perceptron. Fuzzy c–means proved to be the most accurate segmentationalgorithm, but at the same time it is the most computationally intensive among the threestudied segmentation methods.

1 Introduction

Nowadays the mammary gland cancer is one themost common cancers present in the world [10]. InPoland alone the number of diagnosed cases basedon data delivered by the National Cancer Registryfor both male and female breast cancer for year2011 was 16643 [2]. Diagnosing cancer before itstarts to produce symptoms is an important matter.Mostly because cancers that are found when theyare already causing symptoms tend to be larger andare more likely to have already spread beyond thebreast. Therefore the treatment options are limitedsince such cancers are less responsive to any kindof therapy. In contrast, breast cancers which are di-agnosed earlier are more likely to be smaller, still

confined to the breast with many efficient treatmentoptions available.

The size and spread range are some of the mostimportant factors in predicting the outlook of a pa-tient’s survival. Nowadays there aren’t any fullyreliable, inexpensive nor non–invasive diagnosticmethods for the identification of breast pathology.

The most common diagnostic methods include:self-examination (palpation), mammography or ul-trasound imaging and fine needle aspiration biopsy(FNA), each of them differs in a degree of sensitiv-ity and invasiveness. FNA, being the most invasiveand the most accurate method, requires collectinga tissue material directly from a tumor for micro-scopic verification and examination in order to ex-

– 16210.1515/jaiscr-2015-0005

150 Bruzdzinski T., Krzyzak A., Fevens T. and Jelen Ł.

clude or confirm the presence of cancerous cells [2].

To propose a proper treatment there is a needfor an estimation of cancer stage and its malignancygrade. Cancer staging is a process of determiningthe size and metastasis of the cancer that associatesa stage to a case. The most commonly used stag-ing system for breast cancer nowadays is TNM (Tu-mor, Nodes, Metastasis) [3]. Besides staging, whenpredicting the progression of the cancer, it is essen-tial to estimate its malignancy grade. In this pa-per the scale proposed by Bloom and Richardson in1957 [7] is used to determine the malignancy grade.In this system tumor is assigned a low, intermedi-ate or high malignancy grade. In order to obtainthe resulting virulence the cells polymorphy, abilityto reform histoformative structures and mitotic in-dex needs to be evaluated. The evaluation processproposed by the Bloom-Richardson scheme utilizesthree factors that use a point based scale for assess-ing previously mentioned features. The malignancygrade is then assigned based on the value calculatedby summation of all points awarded for each fac-tor. This is a very difficult procedure that requiresextensive knowledge and experience of the cytolo-gist making a diagnosis. It is well known that usu-ally the human is the weakest link of any processas he tends to make mistakes, so the diagnosis isonly as good as the pathologist making it. In orderto minimize the human factor an automatic com-puter framework can be introduced which can as-sist doctors during the diagnostic process. Due tothe importance of a proper and accurate determina-tion of the breast cancer diagnosis many approachescan be found in the literature that tackle this prob-lem. One of them includes a firefly method for nu-clei detection [22] or even an approach that involvesthe analysis of thermograms [20]. In this paper wedeal with the classification of breast cancer basedon the fine needle aspiration biopsy. To the bestof our knowledge, the computerized breast cytol-ogy classification problem was first investigated byWolberg et al. in 1990 [32]. The authors describedan application of a multi-surface pattern separationmethod to cancer diagnosis. The proposed algo-rithm was able to distinguish between a 169 ma-lignant and 201 benign cases with 6.5% and 4.1%error rates, respectively depending on the size of thetraining set. When 50% of samples were used fortraining, the method returned a larger error. Using67% of sample images reduced the error to 4.1%.

The same authors introduced a widely used data-base of pre-extracted features of breast cancer nu-clei obtained from fine needle aspiration biopsy im-ages [24] (available as the Wisconsin Breast Can-cer Database (WBCD) at the UCI Machine Learn-ing Repository [1]). Later, in 1993, Street et al. [31]used an active contour algorithm, called ’snake’ forprecise nuclei shape representation. The authorsalso described 10 features of a nucleus used forclassification. They achieved a 97.3% classificationrate using a multi-surface method for classification.

Xiong et al. [33] used partial least squares re-gression was used to classify that WBCD databasewith 699 (241 malignant, 458 benign) cases witha 96.57% classification rate. Numerous other re-searchers have worked with the WBCD database(see [25] and reference therein) with resulting clas-sification rates ranging from 94.74% to 99.54%.

Malek et al. [23] used active contours to seg-ment nuclei and classified 200 (80 malignant, 120benign) cases using a fuzzy c-means classifierachieving a 95% classification rate.

Niwas et al. [27] presented a feature extractionmethod based on the analysis of nuclei Chromatintexture using a complex wavelet transform. Thesefeatures were used with a k-nearest neighbor clas-sifier where using a data set of 20 malignant and25 benign cases they achieved a classification rateof 93.33%. Filipczuk et al. [11] used a circu-lar Hough transform to detect cell nuclei, whichare subsequently classified as correct or not by anSVM. Using a k-nearest neighbor, naive Bayes, oran SVM classifier on selected features sets, using67 (42 malignant, 25 benign) cases, a classifica-tion rate of 98.51% was achieved. George et al.[14]) used a circular Hough transform to detect cellnuclei, confirming these nuclei using thresholdingand fuzzy c-means clustering. Twelve features werethen passed to several neural network architecturesusing 92 (47 malignant, 45 benign) cases (and theWBCD database for comparison) with the best re-sult being the probabilistic neural network with sen-sitivity of 95.49% and specificity of 83.16%.

It is important to notice that the above men-tioned approaches have concentrated on classify-ing FNA slides as benign or malignant and are alsocalled malignancy diagnosis. The system presentedin the current study classifies a malignancy stageof cancer, called malignancy grading. The biopsy

151Bruzdzinski T., Krzyzak A., Fevens T. and Jelen Ł.

clude or confirm the presence of cancerous cells [2].

To propose a proper treatment there is a needfor an estimation of cancer stage and its malignancygrade. Cancer staging is a process of determiningthe size and metastasis of the cancer that associatesa stage to a case. The most commonly used stag-ing system for breast cancer nowadays is TNM (Tu-mor, Nodes, Metastasis) [3]. Besides staging, whenpredicting the progression of the cancer, it is essen-tial to estimate its malignancy grade. In this pa-per the scale proposed by Bloom and Richardson in1957 [7] is used to determine the malignancy grade.In this system tumor is assigned a low, intermedi-ate or high malignancy grade. In order to obtainthe resulting virulence the cells polymorphy, abilityto reform histoformative structures and mitotic in-dex needs to be evaluated. The evaluation processproposed by the Bloom-Richardson scheme utilizesthree factors that use a point based scale for assess-ing previously mentioned features. The malignancygrade is then assigned based on the value calculatedby summation of all points awarded for each fac-tor. This is a very difficult procedure that requiresextensive knowledge and experience of the cytolo-gist making a diagnosis. It is well known that usu-ally the human is the weakest link of any processas he tends to make mistakes, so the diagnosis isonly as good as the pathologist making it. In orderto minimize the human factor an automatic com-puter framework can be introduced which can as-sist doctors during the diagnostic process. Due tothe importance of a proper and accurate determina-tion of the breast cancer diagnosis many approachescan be found in the literature that tackle this prob-lem. One of them includes a firefly method for nu-clei detection [22] or even an approach that involvesthe analysis of thermograms [20]. In this paper wedeal with the classification of breast cancer basedon the fine needle aspiration biopsy. To the bestof our knowledge, the computerized breast cytol-ogy classification problem was first investigated byWolberg et al. in 1990 [32]. The authors describedan application of a multi-surface pattern separationmethod to cancer diagnosis. The proposed algo-rithm was able to distinguish between a 169 ma-lignant and 201 benign cases with 6.5% and 4.1%error rates, respectively depending on the size of thetraining set. When 50% of samples were used fortraining, the method returned a larger error. Using67% of sample images reduced the error to 4.1%.

The same authors introduced a widely used data-base of pre-extracted features of breast cancer nu-clei obtained from fine needle aspiration biopsy im-ages [24] (available as the Wisconsin Breast Can-cer Database (WBCD) at the UCI Machine Learn-ing Repository [1]). Later, in 1993, Street et al. [31]used an active contour algorithm, called ’snake’ forprecise nuclei shape representation. The authorsalso described 10 features of a nucleus used forclassification. They achieved a 97.3% classificationrate using a multi-surface method for classification.

Xiong et al. [33] used partial least squares re-gression was used to classify that WBCD databasewith 699 (241 malignant, 458 benign) cases witha 96.57% classification rate. Numerous other re-searchers have worked with the WBCD database(see [25] and reference therein) with resulting clas-sification rates ranging from 94.74% to 99.54%.

Malek et al. [23] used active contours to seg-ment nuclei and classified 200 (80 malignant, 120benign) cases using a fuzzy c-means classifierachieving a 95% classification rate.

Niwas et al. [27] presented a feature extractionmethod based on the analysis of nuclei Chromatintexture using a complex wavelet transform. Thesefeatures were used with a k-nearest neighbor clas-sifier where using a data set of 20 malignant and25 benign cases they achieved a classification rateof 93.33%. Filipczuk et al. [11] used a circu-lar Hough transform to detect cell nuclei, whichare subsequently classified as correct or not by anSVM. Using a k-nearest neighbor, naive Bayes, oran SVM classifier on selected features sets, using67 (42 malignant, 25 benign) cases, a classifica-tion rate of 98.51% was achieved. George et al.[14]) used a circular Hough transform to detect cellnuclei, confirming these nuclei using thresholdingand fuzzy c-means clustering. Twelve features werethen passed to several neural network architecturesusing 92 (47 malignant, 45 benign) cases (and theWBCD database for comparison) with the best re-sult being the probabilistic neural network with sen-sitivity of 95.49% and specificity of 83.16%.

It is important to notice that the above men-tioned approaches have concentrated on classify-ing FNA slides as benign or malignant and are alsocalled malignancy diagnosis. The system presentedin the current study classifies a malignancy stageof cancer, called malignancy grading. The biopsy

WEB–BASED FRAMEWORK FOR . . .

being classified is nearly always malignant due tothe pre-screening process before taking an FNA.Henceforth, in this paper, we are studying malig-nancy grading, not malignancy diagnosis.

2 Proposed Framework

In Fig. 1 the general concept of the proposedsystem is presented. It can be divided into two parts,Browser part and server part.

– Browser part – Its main task is to provide userwith a set of operations which allow him to up-load images to be processed and review resultsof classification. Its secondary task is to senddata to a server in form of unprocessed imagesand retrieve processed results. It may be seen asa presentation layer of the system. The idea isthat it is user friendly and intuitive.

– Server part – It is a core layer of the system. Itperforms all computational tasks including nec-essary calculations and features extraction. Ithandles all data structures essential for properclassification process. It is also easily extend-able by other algorithms with visible separationbetween presentation, business and delegate lay-ers.

Figure 1. General concept of the automaticweb–based classification system.

The proposed web–based framework is dividedinto three stages. The first is FNA cytological im-

ages segmentation followed by the feature extrac-tion of the meaningful and indispensable featuresdescribing segmented nuclei. The output vector ofextracted features is then transferred to the last part,a classifier which classifies an image into one of thetwo possible malignancy classes.

3 Segmentation

In this paper the focus is put on two imageclustering segmentation algorithms and one regiongrowing technique supported by histogram thresh-olding. The algorithms that were applied for themalignancy classification include a fuzzy c–meansand k–means clustering as well as a watershed seg-mentation.

3.1 K–means clustering

One of the simplest unsupervised learning al-gorithms that solves clustering problem. The algo-rithm’s input parameter is only a number of inputclusters k which needs to be known before cluster-ing process can begin. The main idea is to definek centroids, one for each cluster, which should beplaced in cunning way since k means is a heuristicalgorithm and there is no guarantee that it will con-verge to global optimum. Next step is to take eachpoint belonging to a given data set and associateit to the nearest centroid. When all of the pointsare assigned, the first step of the algorithm is com-pleted and an initial grouping is done. Followingprocedure is to re-calculate k new centroids as cen-ters of the groups calculated initially. After that wehave new k centroids and the association procedurefor all of the data needs to be repeated [18]. Theconsecutive steps of a generated loop make the kcentroids change their location until convergence isreached. The aim of this algorithm is to minimizean objective function which is a squared error func-tion:

J =k

∑j=1

n

∑i=1

||x ji − c j||2 (1)

where ||x ji − c j||2 is a chosen distance measure

between a data point x ji and the cluster centre, c j is

an indicator of the distance of the n data points fromtheir respective cluster centers.

Figure 1: General concept of the automatic web–based classification system.

classification include a fuzzy c–means and k–meansclustering as well as a watershed segmentation.

3.1 K–means clustering

One of the simplest unsupervised learning algo-rithms that solves clustering problem. The algo-rithm’s input parameter is only a number of inputclusters k which needs to be known before cluster-ing process can begin. The main idea is to definek centroids, one for each cluster, which should beplaced in cunning way since k means is a heuristicalgorithm and there is no guarantee that it will con-verge to global optimum. Next step is to take eachpoint belonging to a given data set and associate itto the nearest centroid. When all of the points areassigned, the first step of the algorithm is completedand an initial grouping is done. Following procedureis to re-calculate k new centroids as centers of thegroups calculated initially. After that we have new kcentroids and the association procedure for all of the

data needs to be repeated [18]. The consecutive stepsof a generated loop make the k centroids change theirlocation until convergence is reached. The aim ofthis algorithm is to minimize an objective functionwhich is a squared error function:

J =k

∑j=1

n

∑i=1

||x ji − c j||2 (1)

where ||x ji − c j||2 is a chosen distance measure be-

tween a data point x ji and the cluster centre, c j is an

indicator of the distance of the n data points fromtheir respective cluster centers.

Here we use the RGB color distance between pixeland a mean cluster RGB color as a measure of dis-tance. The initial clusters centroids are picked in ran-dom fashion. The segmentation result is picked as acluster whose mean RGB value is the highest. Af-ter empirically testing different setting of clusters theconclusion was reached that the optimal number ofclusters is 3. Higher value of k resulted in dispatch-ing of a meaningful data, sometimes even creatingholes inside nuclei or jagged groups. When usingonly 2 clusters, too much meaningless data was in-troduced into a segmented image and the result wasnot satisfactory. Three clusters is a trade off betweenprocessing meaningless data and discarding poten-tially important information.

3.2 Fuzzy c–means clustering

Similarly to k–means, a fuzzy c–means is a methodof clustering but it allows one piece of data to be-long to two or more clusters in a fuzzy logic fashion.In this algorithm each point has a degree of clustermembership rather than completely belonging to justone cluster as in the k-means segmentation. Becauseof that, it is possible that the points on the edge ofthe cluster belongs to the cluster in a lesser degreethan those in the centre of it [4]. The method was

4


Here we use the RGB color distance betweenpixel and a mean cluster RGB color as a measure ofdistance. The initial clusters centroids are picked inrandom fashion. The segmentation result is pickedas a cluster whose mean RGB value is the highest.After empirically testing different setting of clustersthe conclusion was reached that the optimal numberof clusters is 3. Higher value of k resulted in dis-patching of a meaningful data, sometimes even cre-ating holes inside nuclei or jagged groups. Whenusing only 2 clusters, too much meaningless datawas introduced into a segmented image and the re-sult was not satisfactory. Three clusters is a tradeoff between processing meaningless data and dis-carding potentially important information.


Similarly to k–means, a fuzzy c–means is amethod of clustering but it allows one piece of datato belong to two or more clusters in a fuzzy logicfashion. In this algorithm each point has a degree ofcluster membership rather than completely belong-ing to just one cluster as in the k-means segmenta-tion. Because of that, it is possible that the pointson the edge of the cluster belongs to the cluster in alesser degree than those in the centre of it [4]. Themethod was developed by J. C. Dunn [8] and im-proved by J. C. Bezdek [5] and frequently used inpattern recognition. The objective of this algorithmis to minimize the following objective function:

J =N

∑i=1

C

∑j=1

umi j||xi − c j||2 (2)

where m is any real number greater than 1, ui j

is the degree of membership of xi in cluster j, xi isthe i–th dimension of the d–dimensional measureddata, c j is the d–dimension centre of the cluster and|| ∗ || is any norm expressing the similarity betweenany measured data and the centre.

Fuzzy partitioning is carried out by iterative op-timization of the objective function with update ofmembership ui j and the cluster centers c j. The iter-ations stops when the error of the result is lowerthan set accuracy, or the number of iterations al-ready computed is higher than maximum number ofiterations set. The parameters of the algorithm are:desired accuracy, maximum number of iterations,number of clusters, and m fuzzy parameter which

controls how much weight is given to the closestcentre and must be greater or equal to 1.

3.3 Watershed segmentation

The watershed algorithm exploits the propertiesof grey-level images in a way that they may be seenas topographic relief. The grey level of a pixel is in-terpreted as an altitude in the relief. High intensitydenotes peaks and hills while low intensity denotesvalleys. The main idea is that each isolated valley(local minima) of the image is filled with differentwater color (label). When the water level rises con-necting nearby peaks (gradients) it will merge withwater of other color. In order to prevent that, thebarriers are created in those locations where watermerges. The process of flooding and constructingof barriers continues until all of the peaks are underwater. Finally, the barriers created with the algo-rithm are a result of the segmentation process. Dueto a noise or local irregularities in the gradient im-ages it is common to over-segment an image [30].

In case of over-segmentation, which is a verycommon problem with watershed algorithm, resultswould not be very meaningful for a problem ofnuclei segmentation. For this purpose a variationof the watershed method, called marker-controlledwatershed, was applied. The principle here is thesame but instead of flooding from local minima, aset of markers which will most certainly belong to aforeground is used as points of origin. That way theover-segmentation is prevented. The input markersfor marker–controlled watershed are calculated ac-cording to the following:

1 Regions which will most certainly belong to aforeground are specified and labelled

2 Regions which will most certainly belong to aforeground or non-objects are specified and la-belled

3 Remaining regions which we are uncertain arelabelled

A process starts with the RGB image convertedto a greyscale using Otsu’s binarization. From theresult of Otsu’s binarization two images are created.The first is an image to which erosion was appliedin order to remove the boundary pixels. After that,in order to isolate a foreground region, the distance


Here we use the RGB color distance betweenpixel and a mean cluster RGB color as a measure ofdistance. The initial clusters centroids are picked inrandom fashion. The segmentation result is pickedas a cluster whose mean RGB value is the highest.After empirically testing different setting of clustersthe conclusion was reached that the optimal numberof clusters is 3. Higher value of k resulted in dis-patching of a meaningful data, sometimes even cre-ating holes inside nuclei or jagged groups. Whenusing only 2 clusters, too much meaningless datawas introduced into a segmented image and the re-sult was not satisfactory. Three clusters is a tradeoff between processing meaningless data and dis-carding potentially important information.


Similarly to k–means, a fuzzy c–means is amethod of clustering but it allows one piece of datato belong to two or more clusters in a fuzzy logicfashion. In this algorithm each point has a degree ofcluster membership rather than completely belong-ing to just one cluster as in the k-means segmenta-tion. Because of that, it is possible that the pointson the edge of the cluster belongs to the cluster in alesser degree than those in the centre of it [4]. Themethod was developed by J. C. Dunn [8] and im-proved by J. C. Bezdek [5] and frequently used inpattern recognition. The objective of this algorithmis to minimize the following objective function:

J =N

∑i=1

C

∑j=1

umi j||xi − c j||2 (2)

where m is any real number greater than 1, ui j

is the degree of membership of xi in cluster j, xi isthe i–th dimension of the d–dimensional measureddata, c j is the d–dimension centre of the cluster and|| ∗ || is any norm expressing the similarity betweenany measured data and the centre.

Fuzzy partitioning is carried out by iterative op-timization of the objective function with update ofmembership ui j and the cluster centers c j. The iter-ations stops when the error of the result is lowerthan set accuracy, or the number of iterations al-ready computed is higher than maximum number ofiterations set. The parameters of the algorithm are:desired accuracy, maximum number of iterations,number of clusters, and m fuzzy parameter which

controls how much weight is given to the closestcentre and must be greater or equal to 1.

3.3 Watershed segmentation

The watershed algorithm exploits the propertiesof grey-level images in a way that they may be seenas topographic relief. The grey level of a pixel is in-terpreted as an altitude in the relief. High intensitydenotes peaks and hills while low intensity denotesvalleys. The main idea is that each isolated valley(local minima) of the image is filled with differentwater color (label). When the water level rises con-necting nearby peaks (gradients) it will merge withwater of other color. In order to prevent that, thebarriers are created in those locations where watermerges. The process of flooding and constructingof barriers continues until all of the peaks are underwater. Finally, the barriers created with the algo-rithm are a result of the segmentation process. Dueto a noise or local irregularities in the gradient im-ages it is common to over-segment an image [30].

In case of over-segmentation, which is a verycommon problem with watershed algorithm, resultswould not be very meaningful for a problem ofnuclei segmentation. For this purpose a variationof the watershed method, called marker-controlledwatershed, was applied. The principle here is thesame but instead of flooding from local minima, aset of markers which will most certainly belong to aforeground is used as points of origin. That way theover-segmentation is prevented. The input markersfor marker–controlled watershed are calculated ac-cording to the following:

1 Regions which will most certainly belong to aforeground are specified and labelled

2 Regions which will most certainly belong to aforeground or non-objects are specified and la-belled

3 Remaining regions which we are uncertain arelabelled

A process starts with the RGB image convertedto a greyscale using Otsu’s binarization. From theresult of Otsu’s binarization two images are created.The first is an image to which erosion was appliedin order to remove the boundary pixels. After that,in order to isolate a foreground region, the distance


transform with a proper threshold was applied. Thesecond image created from Otsu’s thresholding out-put is a result of image dilation. With these twoimages, the points 1) and 2) above are completedand we can calculate remaining regions that can-not be associated to foreground nor background.The watershed algorithm is a solution to find them.These areas are normally around the boundaries of aforeground and background where the images meet.It can be obtained by subtracting these two areas.When the region labeling is done, the marker imageis ready and can be passed to the watershed algo-rithm along with original image for segmentation.

4 Classification

Data classification is a process of identifying towhich set of categories (sub-populations or classes)new observation belongs to. In image analysis itwould be an operation of assigning one set of cat-egories to a new image based on classification of afeature vector that was previously extracted basedon the segmentation results [9].

Classification is a process based on a trainingset of data containing observations (or instances)whose category membership are known [15]. Thismeans that in order to properly classify a new in-stance the classifier has to have some set of previ-ously made observations (for instances in form ofa database or a flat file) with instances that are abase of prediction. These individual observationsare analyzed into a set of quantifiable properties.The properties can be variously categorized, for ex-ample as NAO, OBO, NABO or N0O for blood type.Depending on the application, more common typeslike integer-values or real-valued can also be as-signed.

An algorithm that implements classification, es-pecially in a concrete implementation, is known asa classifier. The term ”classifier” sometimes alsorefers to the mathematical function, implementedby a classification algorithm that maps input datato a category.

In this study we have compared several classi-fiers which include Naıve Bayes classifier, Logisticregression, Decision Trees, Decision table and neu-ral networks.

4.1 Naıve Bayes classifier

Naıve Bayes classifier belongs to a family ofsimple probabilistic classifiers that are based on aBayes theorem assuming that there is a very strong(naive) independence between features. In otherwords, a naıve Bayes classifier assumes that a valueof one of the features is unrelated to the presenceor absence of any other feature, in scope of a classvariable. For instance, an orange may be consideredto be an orange if its color is orange, it is round,and about 3” in diameter. However this classifierconsiders each of these features to contribute in-dependently to the probability that this fruit is anorange, not taking into consideration the presenceor absence of other features. The advantage of thisapproach is that used in this way it only requiresa small amount of training data to estimate the pa-rameters necessary for classification. This is relatedto independency of variables that the algorithm as-sumes and we need to determine only the variancesof the variables for each class and not the entire co-variance matrix [26].

4.2 Logistic regression

This method is otherwise known as a logit re-gression which is a type of probabilistic statisticalclassification model used when categorical depen-dent variable (for instance a class label) can be onlyone of the two values (on dichotomous scale). Usu-ally values of features describing some observationare based on occurrence or absence of some eventthat is the topic of prediction. In such a case by us-ing logistic regression it is possible to calculate aprobability of such an event. Formally, logistic re-gression model is a general case of a linear modelin which the logit was used as a bounding function[26].

4.2.1 Logistic model trees

Logistic model trees are a type of a classifica-tion trees with logistic regression functions at theleaves. The only changed parameter, during gener-ation of the LMT, is a minimal number of instancesat which a node is considered for splitting and is setto 15.


4.2.2 Multinomial logistic regression

This is another classifier that uses logistic re-gression, a multinomial logistic regression in par-ticular, with a ridge estimator. It is it allowed toperform an unlimited number of iterations and thelog-likelihood value of ridge is set to 10−8.

4.3 Decision trees

Decision trees are a type of a predictive modelwhich maps observations of an item to conclusionsabout the item’s target value. In this approach forthe classification process a tree structure is usedwhere leaves represent class labels and branchesrepresent conjunctions of features that lead to thoseclass labels [19].

This tool allows for visual and explicit repre-sentation of decisions and decision making process(see fig. 2). An insight into what features are takeninto account during classification process and inwhat way can be made. Opposed to neural networksthe decision trees are human readable and the pro-cess of classification can be understood without anyproblems and neural networks are more like blackboxes.

Figure 2. Example of a decision tree. Taken fromWikipedia.

Here we applied a C4.5, PART and a decisionstump variants of a decision trees.

C4.5

This is an algorithm used to generate a decisiontree developed by Ross Quinlan [28]. The C4.5builds trees using the concept of information en-tropy. At each node of the tree, algorithm choosesthe attribute of the data that most effectively splits

its set of samples into subsets of each class. The cri-terion of splitting is the difference of entropy. Theattribute with the highest entropy is chosen to makea decision. The C4.5 then recurs on the smaller sub-sets of data. In this work the following parameterswere used for this algorithm:

– Confidence factor for pruning is equal to 0.25,

– Minimum number of instances per leaf is equalto 2,

– Number of data for reduced error pruning isequal to 3.

PART

PART is a decision list algorithm which usesdivide-and-conquer technique. Builds a partial C4.5decision tree in each iteration and makes the bestleaf into a rule. The decision list is a representationof Boolean functions [29]. The parameters for gen-erating the component decision trees of PART is thesame as for the C4.5 algorithm.

Decision stump

This ia a machine learning model consisting ofone-level decision tree. It makes predictions basedon the value of a single input feature [17].

Depending on the input feature there are twopossibilities for creating the stump:

– Creating a leaf for each possible feature value,

– Creating a leaf that corresponds to the one cho-sen category and the other leaf to all other cate-gories.

4.4 Decision table

Decision tables are a precise and compact wayof modelling complicated logic. They are similar toflowcharts and if-then-else set of statements whichassociate conditions with actions to be performed.Each decision corresponds to a variable, relation orpredicate whose possible values are listed amongthe condition alternatives. Decision table is a hier-archical breakdown of the data with two attributes ateach level of hierarchy. Decisions are made by theinducer the same way as in the decision tree, but theattributes are evaluated across the entire level of thetree rather than on a specific sub-tree. The result of

4.2.1 Logistic model trees

Logistic model trees are a type of a classificationtrees with logistic regression functions at the leaves.The only changed parameter, during generation ofthe LMT, is a minimal number of instances at whicha node is considered for splitting and is set to 15.


This is another classifier that uses logistic regression,a multinomial logistic regression in particular, with aridge estimator. It is it allowed to perform an unlim-ited number of iterations and the log-likelihood valueof ridge is set to 10−8.

4.3 Decision trees

Decision trees are a type of a predictive model whichmaps observations of an item to conclusions aboutthe item’s target value. In this approach for theclassification process a tree structure is used whereleaves represent class labels and branches representconjunctions of features that lead to those class la-bels [19].

This tool allows for visual and explicit represen-tation of decisions and decision making process (seefig. 2). An insight into what features are taken intoaccount during classification process and in whatway can be made. Opposed to neural networks thedecision trees are human readable and the process ofclassification can be understood without any prob-lems and neural networks are more like black boxes.Here we applied a C4.5, PART and a decision stumpvariants of a decision trees.

C4.5

This is an algorithm used to generate a decision treedeveloped by Ross Quinlan [28]. The C4.5 buildstrees using the concept of information entropy. At

Figure 2: Example of a decision tree. Taken fromWikipedia.

each node of the tree, algorithm chooses the attributeof the data that most effectively splits its set of sam-ples into subsets of each class. The criterion of split-ting is the difference of entropy. The attribute withthe highest entropy is chosen to make a decision. TheC4.5 then recurs on the smaller subsets of data. Inthis work the following parameters were used for thisalgorithm:

• Confidence factor for pruning is equal to 0.25,

• Minimum number of instances per leaf is equal to2,

• Number of data for reduced error pruning is equalto 3.

PART

PART is a decision list algorithm which uses divide-and-conquer technique. Builds a partial C4.5 deci-sion tree in each iteration and makes the best leafinto a rule. The decision list is a representation ofBoolean functions [29]. The parameters for gener-ating the component decision trees of PART is thesame as for the C4.5 algorithm.

7



This is another classifier that uses logistic re-gression, a multinomial logistic regression in par-ticular, with a ridge estimator. It is it allowed toperform an unlimited number of iterations and thelog-likelihood value of ridge is set to 10−8.

4.3 Decision trees

Decision trees are a type of a predictive modelwhich maps observations of an item to conclusionsabout the item’s target value. In this approach forthe classification process a tree structure is usedwhere leaves represent class labels and branchesrepresent conjunctions of features that lead to thoseclass labels [19].

This tool allows for visual and explicit repre-sentation of decisions and decision making process(see fig. 2). An insight into what features are takeninto account during classification process and inwhat way can be made. Opposed to neural networksthe decision trees are human readable and the pro-cess of classification can be understood without anyproblems and neural networks are more like blackboxes.

Figure 2. Example of a decision tree. Taken fromWikipedia.

Here we applied a C4.5, PART and a decisionstump variants of a decision trees.

C4.5

This is an algorithm used to generate a decisiontree developed by Ross Quinlan [28]. The C4.5builds trees using the concept of information en-tropy. At each node of the tree, algorithm choosesthe attribute of the data that most effectively splits

its set of samples into subsets of each class. The cri-terion of splitting is the difference of entropy. Theattribute with the highest entropy is chosen to makea decision. The C4.5 then recurs on the smaller sub-sets of data. In this work the following parameterswere used for this algorithm:

– Confidence factor for pruning is equal to 0.25,

– Minimum number of instances per leaf is equalto 2,

– Number of data for reduced error pruning isequal to 3.

PART

PART is a decision list algorithm which usesdivide-and-conquer technique. Builds a partial C4.5decision tree in each iteration and makes the bestleaf into a rule. The decision list is a representationof Boolean functions [29]. The parameters for gen-erating the component decision trees of PART is thesame as for the C4.5 algorithm.

Decision stump

This ia a machine learning model consisting ofone-level decision tree. It makes predictions basedon the value of a single input feature [17].

Depending on the input feature there are twopossibilities for creating the stump:

– Creating a leaf for each possible feature value,

– Creating a leaf that corresponds to the one cho-sen category and the other leaf to all other cate-gories.

4.4 Decision table

Decision tables are a precise and compact wayof modelling complicated logic. They are similar toflowcharts and if-then-else set of statements whichassociate conditions with actions to be performed.Each decision corresponds to a variable, relation orpredicate whose possible values are listed amongthe condition alternatives. Decision table is a hier-archical breakdown of the data with two attributes ateach level of hierarchy. Decisions are made by theinducer the same way as in the decision tree, but theattributes are evaluated across the entire level of thetree rather than on a specific sub-tree. The result of


course is presented as a hierarchical table instead ofa tree [13]. Parameters used for this algorithm areas follows:

– Number of folds for cross validation is equal to10,

– The method used for finding good attribute com-binations is BestFirst (greedy hillclimbing aug-mented with backtracking facility).

4.5 Neural networks

Here we have implemented a multilayer percep-tron which is a feedforward artificial neural networkmodel mapping sets of input data into a set of appro-priate outputs. Perceptron is a function that mapsinput feature vector (real–values) to an output valuef(x) (binary–values) [6]:

f (x) ={

1 if w∗ x+b > 00 otherwise

(3)

where w is a vector of real–valued weights samesize as the input features vector, w ∗ x is the dotproduct (weighted sum) and b is a bias – constantterm independent from any input value.

Multilayer perceptron utilizes a supervisedlearning technique called backpropagation for train-ing the network. Moreover it is a modification of astandard linear perceptron which is able to distin-guish data that is not linearly separable [16].

The parameters for the applied MLP are follow-ing:

– 3 hidden layers,

– Learning rate, which is the amount the weightsare updated, equals to 0.3,

– The momentum parameter, which is applied tothe weights during the update, equals to 0.2.

5 Data set and feature set

5.1 Data set

The database used in this paper consists of 346FNA images used for the breast cancer diagnosiswith known malignancy grade. All of the imagesare stained with the HE technique (Haematoxylinand Eosin) which stains nuclei with purple and

black color, cytoplasm with shades of pink and usesorange and red colors for red blood cells. At thispoint it has to be mentioned that the focus of thisstudy was to classify the malignancy of the breastcancer. This is due to the fact that tissue collectedduring the FNA examination is always cancerous.Therefore, there is no need to check if the case isbenign or malignant. It is more important to deter-mine cancer’s malignancy.

All of the images were digitalized with Olym-pus BX 50 microscope with mounted CCD-IRIScamera. The digitalization process was conductedat the Department of Pathology of the WrocławMedical University, Poland with a help of a PCclass computer with MultiScan Base 08.98 soft-ware. The images are recorded with a resolutionof 764x571 pixels with a printing density of 96 dpi.Because at the Wrocław Medical University, Polandthere was no low malignancy cases recorded since2004, our database consist only of images with high(G3) and medium (G2) malignancy samples, there-fore the classification is considering only these twocases.

5.2 Feature set

In order to obtain meaningful classification re-sults, a set of features needs to be calculated fromthe segmented images. In this section a list of ex-tracted parameters is discussed. To assure that theprocess of malignancy classification is performedonly on important and necessary features, a vectorof 25 features was built. The vector consists of bothlow and high magnification features (based on lowand high magnification images). In this study theextracted features chosen for classification processare a mixture of features introduced by authors of[21] and [12] as an attempt to create larger vectorutilizing advantages of both sets. In the end the fol-lowing set of features was extracted:

1 Low magnification features:

– Area of groups – average number of nu-clei pixels. This feature provides representa-tion of the tendency of groups to create largegroups. When this feature is large there isone couple of big groups present in the im-age.


– Number of groups – feature which de-termines the number of groups that werenot discarded during the image segmentationprocess. High value of this feature suggests alarge number of small groups in the image.

– Dispersion – statistical variation of clusterareas. Small values of this feature representsgroups of similar size present in the image.

2 High magnification features:

– Nuclei area – same as area of groups but forhigh magnification images.

– Perimeter of a nucleus –length of a nuclearenvelope. Computed as an average numberof pixels in the group which has at least oneneighboring pixel which is not a part of thatgroup.

– Convexity – ratio of the nucleus area andits convex hull (minimal area of the convexpolygon containing the nucleus).

– X–centroid – alias major axis length. Theaverage of longest diameter of a nuclei. Thelength of the nuclei along the x axis.

– Y–centroid – alias minor axis length. Theaverage of shortest diameter of a nuclei. Thelength of the nuclei along the y axis.

– Orientation – calculated from the binaryrepresentation of the nucleus and image mo-mentum.

– Vertical projection – average sum of all seg-mented pixels on y axis in horizontal direc-tion.

– Horizontal projection – the average sum ofall segmented pixels on x axis in vertical di-rection.

– Luminance mean – average luminance of allsegmented nuclei groups in the image.

– Luminance variance – statistical variationof luminance for each group.

– Eccentricity – measure of how much the nu-clei deviates from the circle. Calculated fromimage moments.

– Distance from weight centroid – for theneed of this feature a segmented image bi-nary centroid coordinates are calculated. Us-ing the coordinated a distance to each nu-clei is calculated as na average Euclidean dis-tance between the centroid and the nucleus.

– Distance from color centroid – calculatedas an average of the distance between thecolor cluster meant used during the segmen-tation and the subsequent groups of averagecolor.

3 Original image features:

– Histogram mean – set of three features ex-tracted as a histogram mean of a Red, Greenand Blue channels.

– Histogram energy – set of three featureswhere the histogram energy for each RGBchannel is calculated.

– Histogram variance – statistical variation ofhistogram mean. Calculated for each channelseparately.

6 Results

In this section we will present the results ob-tained in this study. The first set of results is devotedto segmentation. There are a couple of things thatcan be noticed based on segmentation results (seeFig. 3 and 4). First observation is that the water-shed segmentation algorithm, even with the mark-ers approach, is an algorithm with the least preci-sion for the task at hand. It discards whole mean-ingful elements in the image and even makes holesin the properly detected nuclei. Its usefulness forsegmenting low magnification images is question-able, however the output for the high magnificationimage is better but not ideal.

Another observation is that k–means and fuzzyc–means algorithms provided similar results. Thereason behind that is that they are based on the sameprinciple. However when taking a closer look atthe results it became obvious that fuzzy c–meansis slightly better and more accurate. It providesa less jagged borders than in the other two meth-ods. It is also better at recognizing similar partsin the original image where other two algorithmstend to classify background data as nuclei. This ismostly visible when low magnification results arecompared. K–means is classifying some parts of theimage which are a little darker than the surroundingpixels (but not being nuclei) while fuzzy c–meansproperly recognizes them as background and dis-cards them.


– Number of groups – feature which de-termines the number of groups that werenot discarded during the image segmentationprocess. High value of this feature suggests alarge number of small groups in the image.

– Dispersion – statistical variation of clusterareas. Small values of this feature representsgroups of similar size present in the image.

2 High magnification features:

– Nuclei area – same as area of groups but forhigh magnification images.

– Perimeter of a nucleus –length of a nuclearenvelope. Computed as an average numberof pixels in the group which has at least oneneighboring pixel which is not a part of thatgroup.

– Convexity – ratio of the nucleus area andits convex hull (minimal area of the convexpolygon containing the nucleus).

– X–centroid – alias major axis length. Theaverage of longest diameter of a nuclei. Thelength of the nuclei along the x axis.

– Y–centroid – alias minor axis length. Theaverage of shortest diameter of a nuclei. Thelength of the nuclei along the y axis.

– Orientation – calculated from the binaryrepresentation of the nucleus and image mo-mentum.

– Vertical projection – average sum of all seg-mented pixels on y axis in horizontal direc-tion.

– Horizontal projection – the average sum ofall segmented pixels on x axis in vertical di-rection.

– Luminance mean – average luminance of allsegmented nuclei groups in the image.

– Luminance variance – statistical variationof luminance for each group.

– Eccentricity – measure of how much the nu-clei deviates from the circle. Calculated fromimage moments.

– Distance from weight centroid – for theneed of this feature a segmented image bi-nary centroid coordinates are calculated. Us-ing the coordinated a distance to each nu-clei is calculated as na average Euclidean dis-tance between the centroid and the nucleus.

– Distance from color centroid – calculatedas an average of the distance between thecolor cluster meant used during the segmen-tation and the subsequent groups of averagecolor.

3 Original image features:

– Histogram mean – set of three features ex-tracted as a histogram mean of a Red, Greenand Blue channels.

– Histogram energy – set of three featureswhere the histogram energy for each RGBchannel is calculated.

– Histogram variance – statistical variation ofhistogram mean. Calculated for each channelseparately.

6 Results

In this section we will present the results ob-tained in this study. The first set of results is devotedto segmentation. There are a couple of things thatcan be noticed based on segmentation results (seeFig. 3 and 4). First observation is that the water-shed segmentation algorithm, even with the mark-ers approach, is an algorithm with the least preci-sion for the task at hand. It discards whole mean-ingful elements in the image and even makes holesin the properly detected nuclei. Its usefulness forsegmenting low magnification images is question-able, however the output for the high magnificationimage is better but not ideal.

Another observation is that k–means and fuzzyc–means algorithms provided similar results. Thereason behind that is that they are based on the sameprinciple. However when taking a closer look atthe results it became obvious that fuzzy c–meansis slightly better and more accurate. It providesa less jagged borders than in the other two meth-ods. It is also better at recognizing similar partsin the original image where other two algorithmstend to classify background data as nuclei. This ismostly visible when low magnification results arecompared. K–means is classifying some parts of theimage which are a little darker than the surroundingpixels (but not being nuclei) while fuzzy c–meansproperly recognizes them as background and dis-cards them.


What is also worth noticing is that the fuzzy c–means, despite being the best algorithm for segmen-tation out of the three investigated methods, is alsothe one which has the highest computational times.Somewhere around 45 seconds for an image of asize 764x571 pixels is really high compared to theother algorithms which take no longer than 5 sec-onds. For watershed segmentation we noted a timenot longer than 1 second. The difference in qual-ity between k–means and fuzzy c–means is reallysmall, but the gain on the performance when us-ing the k-means is really high. Here we checked ifclassification results support this conclusion. Thisis why we constructed a feature vector based onthe achieved segmentations. The feature that whereused to build the feature vector was described insection 5.2. In table 1 an example of such a vec-tor is presented for all the segmentation algorithms.In this case the same segmentation algorithm wasused to calculate both the low and high magnifica-tion features. From that table we can notice that wa-tershed algorithm is not the best choice for the taskof automatic segmentation. In comparison with theremaining two algorithms is significant. Anotherfact is that the fuzzy c–means is better at pickingclusters because the average distance from the RGBcentroid of particular groups in the segmented im-age and luminance variance of those groups is lessthan 10−3.

The last set of results to be presented in thiswork is the comparison of the accuracy of appliedclassification algorithms. Table 2 contains perfor-mance results of used classifiers for different com-binations of segmentation algorithms. The resultswere obtained by using the 10–fold cross validationtechnique which assesses how the results of statisti-cal analysis will generalize to new and independentdata set. Its purpose is to check the model againstthe overfitting problem which occurs when modelis too dependent on the training data set.

7 Conclusion

In this work three segmentation algorithms andclassifiers were compared for the problem of cre-ation of a web-based decision supporting systemfor automatic breast cancer malignancy grading.Whole process of decision making starting with im-age acquisition moving into image segmentation

and feature extraction, ending with classificationstep was described. Algorithms chosen for compar-ison are a result of scrupulous literature review andshowed to be very precise in the described appli-cation. The suggested feature vector obtained fromsegmented images allows for a high quality classi-fication of FNA breast cancer images.

The main conclusion is that the described ap-proach provided promising results. The error ratearound 15 % for most of the cases indicates that theproblem of automatic classification of breast cancerFNA images can be resolved by the proposed solu-tion but it still needs some improvements to be moreaccurate. Not all of the combinations and classifiershowever are as good as the others. The best twoare the multilayer perceptron and logistic regres-sion. Both are in a scope of minimizing the errorrate and providing the best prediction for G3 casesbeing classified as G3. Other algorithms were goodfor minimizing the error rate but due to the fact thatthe data was severely unbalanced (136 samples ofG2 and only 37 samples of G3) they mostly failedwith correct classification of G3 samples. The opti-mistic results were also obtained with the C4.5 de-cision tree algorithm which shows a room for futureimprovement.

For the segmentation task, the following com-binations of algorithms showed the best nuclei rep-resentation:

– Watershed for low magnification and k–meansfor high magnification (MLP - 89.02 %; Logis-tic regression 83.81 %)

– Fuzzy c–means for low magnification and k–means for high magnification (MLP - 88.44 %;Logistic regression 83.24 %)

– K–means for low magnification and fuzzy c–means for high magnification (MLP - 87.28 %;Logistic regression 84.97 %)

– Only k–means (MLP - 88.44 %; Logistic regres-sion 83.81 %)

– Only fuzzy c–means (MLP - 88.44 %; Logisticregression 86.70 %)

Another conclusion is that the increase of recogni-tion rate of minority class in almost all cases leadsto decreased accuracy for majority class objects.However as stated before the early detection of high


malignancy breast cancer is vital for the life of pa-tients and provides some means of efficient treat-ment. Therefore that trade-of is worth its cost.

The accuracy of the obtained results is verypromising despite the fact that the proposed meth-ods are not able to handle properly all of the testcases. The detection of overlapping nuclei in theimages could be improved as well as the bright-ness of the images. The segmentation quality couldalso improve by the introduction of pre-processingmethods to the input images which should resolvemost of the mentioned problems. Also a recogni-tion rate could be improved by the introduction ofmore attributes with high classification power to thefeature vector that were not tested in this study.

References[1] UCI machine learning repository.

[2] National Cancer Registry, The Maria Skłodowska– Curie memorial Cancer Center, Department ofEpidemiology and Cancer Prevetion, December2013.

[3] TNM breast cancer staging, December 2014.

[4] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A.Farag, and T. Moriarty. A modified fuzzy c-meansalgorithm for bias field estimation and segmenta-tion of mri data. IEEE Transactions on MedicalImaging, 21:193–199, 2002.

[5] J.C. Bezdek. Pattern Recognition with Fuzzy Ob-jective Function Algorithms. Plenum Press, NewYork, 1981.

[6] C.M. Bishop. Pattern Recognition and MachineLearning. Springer, 2006.

[7] H.J.G. Bloom and W.W. Richardson. Histologi-cal grading and prognosis in breast cancer. BritishJournal of Cancer, 11:359–377, 1957.

[8] J.C. Dunn. A fuzzy relative of the isodata processand its use in detecting compact well-separatedclusters. Journal of Cybernetics, 3:32–57, 1973.

[9] A. Ethem. Introduction to Machine Learning. MITPress, Boston, 2010.

[10] J. Ferlay, I. Soerjomataram, M. Ervik, R. Dik-shit, S. Eser, C. Mathers, M. Rebelo, D.M. Parkin,D. Forman, and F. Bray. Cancer incidence andmortality worldwide. IARC Cancer Base, No. 11,2012.

[11] P. Filipczuk, T. Fevens, A. Krzyzak, and R. Mon-czak. Computer-aided breast cancer diagnosis

based on the analysis of cytological images of fineneedle biopsies. IEEE Transactions on MedicalImaging, PP(99):1–1, 2013.

[12] P. Filipczuk, M. Kowal, and A. Obuchowicz.Fuzzy clustering and adaptive thresholding basedsegmentation method for breast cancer diagno-sis. Computer Recognition Systems, 4(5):613–622,2011.

[13] D.L. Fisher. Data, documentation and decision ta-bles. Comm ACM, 9(1):26–31, 1966.

[14] Y.M. George, H.H. Zayed, M.I. Roushdy, and B.M.Elbagoury. Remote computer-aided breast cancerdetection and diagnosis system based on cytolog-ical images. IEEE Systems Journal, PP(99):1–16,2013.

[15] T. Hastie, R. Tibshirani, and J. Friedman. The ele-ments of statistical learning, 2nd. edition. Springer,New York, 2009.

[16] S. Haykin. Neural Networks: A ComprehensiveFoundation. Prentice Hall, 1998.

[17] R.C. Holte. Very simple classification rules per-form well on most commonly used datasets. Ma-chine Learning, 11(1):63–90, 1993.

[18] T. Kanungo, D. M. Mount, N. Netanyahu, C. Pi-atko, R. Silverman, and A. Y. Wu. An efficientk-means clustering algorithm: Analysis and imple-mentation. In Proc. IEEE Conf. Computer Visionand Pattern Recognition, pages 881–892, 2002.

[19] S.B. Kotsiantis. Supervised machine learning: Areview of classification techniques. Informatica,pages 249–268, 2007.

[20] B. Krawczyk and P. Filipczuk. Cytological imageanalysis with firefly nuclei detection and hybridone–class classification decomposition. Engineer-ing Applications of Artificial Intelligence, 31:126–135, 2014.

[21] B. Krawczyk, Ł. Jelen, A. Krzyzak, and T. Fevens.Oversampling methods for classification of im-balanced breast cancer malignancy data. LectureNotes in Computer Science (LNCS), 7594:483–490, 2012.

[22] B. Krawczyk and G. Schaefer. A hybrid classi-fier committee for analysing asymmetry featuresin breast thermograms. Applied Soft Computing,20:112–118, 2014.

[23] Jihene Malek, Abderrahim Sebri, Souhir Mabrouk,Kholdoun Torki, and Rached Tourki. Automatedbreast cancer diagnosis based on gvf-snake seg-mentation, wavelet features extraction and fuzzyclassification. Journal of Signal Processing Sys-tems, 55(1-3):49–66, 2009.


malignancy breast cancer is vital for the life of pa-tients and provides some means of efficient treat-ment. Therefore that trade-of is worth its cost.

The accuracy of the obtained results is verypromising despite the fact that the proposed meth-ods are not able to handle properly all of the testcases. The detection of overlapping nuclei in theimages could be improved as well as the bright-ness of the images. The segmentation quality couldalso improve by the introduction of pre-processingmethods to the input images which should resolvemost of the mentioned problems. Also a recogni-tion rate could be improved by the introduction ofmore attributes with high classification power to thefeature vector that were not tested in this study.

References[1] UCI machine learning repository.

[2] National Cancer Registry, The Maria Skłodowska– Curie memorial Cancer Center, Department ofEpidemiology and Cancer Prevetion, December2013.

[3] TNM breast cancer staging, December 2014.

[4] M.N. Ahmed, S.M. Yamany, N. Mohamed, A.A.Farag, and T. Moriarty. A modified fuzzy c-meansalgorithm for bias field estimation and segmenta-tion of mri data. IEEE Transactions on MedicalImaging, 21:193–199, 2002.

[5] J.C. Bezdek. Pattern Recognition with Fuzzy Ob-jective Function Algorithms. Plenum Press, NewYork, 1981.

[6] C.M. Bishop. Pattern Recognition and MachineLearning. Springer, 2006.

[7] H.J.G. Bloom and W.W. Richardson. Histologi-cal grading and prognosis in breast cancer. BritishJournal of Cancer, 11:359–377, 1957.

[8] J.C. Dunn. A fuzzy relative of the isodata processand its use in detecting compact well-separatedclusters. Journal of Cybernetics, 3:32–57, 1973.

[9] A. Ethem. Introduction to Machine Learning. MITPress, Boston, 2010.

[10] J. Ferlay, I. Soerjomataram, M. Ervik, R. Dik-shit, S. Eser, C. Mathers, M. Rebelo, D.M. Parkin,D. Forman, and F. Bray. Cancer incidence andmortality worldwide. IARC Cancer Base, No. 11,2012.

[11] P. Filipczuk, T. Fevens, A. Krzyzak, and R. Mon-czak. Computer-aided breast cancer diagnosis

based on the analysis of cytological images of fineneedle biopsies. IEEE Transactions on MedicalImaging, PP(99):1–1, 2013.

[12] P. Filipczuk, M. Kowal, and A. Obuchowicz.Fuzzy clustering and adaptive thresholding basedsegmentation method for breast cancer diagno-sis. Computer Recognition Systems, 4(5):613–622,2011.

[13] D.L. Fisher. Data, documentation and decision ta-bles. Comm ACM, 9(1):26–31, 1966.

[14] Y.M. George, H.H. Zayed, M.I. Roushdy, and B.M.Elbagoury. Remote computer-aided breast cancerdetection and diagnosis system based on cytolog-ical images. IEEE Systems Journal, PP(99):1–16,2013.

[15] T. Hastie, R. Tibshirani, and J. Friedman. The ele-ments of statistical learning, 2nd. edition. Springer,New York, 2009.

[16] S. Haykin. Neural Networks: A ComprehensiveFoundation. Prentice Hall, 1998.

[17] R.C. Holte. Very simple classification rules per-form well on most commonly used datasets. Ma-chine Learning, 11(1):63–90, 1993.

[18] T. Kanungo, D. M. Mount, N. Netanyahu, C. Pi-atko, R. Silverman, and A. Y. Wu. An efficientk-means clustering algorithm: Analysis and imple-mentation. In Proc. IEEE Conf. Computer Visionand Pattern Recognition, pages 881–892, 2002.

[19] S.B. Kotsiantis. Supervised machine learning: Areview of classification techniques. Informatica,pages 249–268, 2007.

[20] B. Krawczyk and P. Filipczuk. Cytological imageanalysis with firefly nuclei detection and hybridone–class classification decomposition. Engineer-ing Applications of Artificial Intelligence, 31:126–135, 2014.

[21] B. Krawczyk, Ł. Jelen, A. Krzyzak, and T. Fevens.Oversampling methods for classification of im-balanced breast cancer malignancy data. LectureNotes in Computer Science (LNCS), 7594:483–490, 2012.

[22] B. Krawczyk and G. Schaefer. A hybrid classi-fier committee for analysing asymmetry featuresin breast thermograms. Applied Soft Computing,20:112–118, 2014.

[23] Jihene Malek, Abderrahim Sebri, Souhir Mabrouk,Kholdoun Torki, and Rached Tourki. Automatedbreast cancer diagnosis based on gvf-snake seg-mentation, wavelet features extraction and fuzzyclassification. Journal of Signal Processing Sys-tems, 55(1-3):49–66, 2009.


[24] O.L. Mangasarian, R. Setiono, and W.H. Wolberg.Pattern Recognition via Linear Programming: The-ory and Application to Medical Diagnosis. Large-Scale Num. Opt., Philadelphia: SIAM, pages 22–31, 1990.

[25] A. Marcano-Cedeno, J. Quintanilla-Domınguez,and D. Andina. WBCD breast cancer databaseclassification applying artificial metaplasticity neu-ral network. Expert Systems with Applications,38(8):9573 – 9579, 2011.

[26] T. Mitchell. Machine Learning, Generative andDiscriminative Classifiers: Naive Bayes and Lo-gistic Regression (Draft Version). McGraw Hill,2005.

[27] S.I. Niwas, P. Palanisamy, and K. Sujathan.Wavelet based feature extraction method for breastcancer cytology images. In IEEE Symposium onIndustrial Electronics Applications (ISIEA), pages686–690, Oct 2010.

[28] J.R. Quinlan. C4.5: Programs for Machine Learn-ing. Morgan Kaufmann Publishers, 1993.

[29] R.L. Rivest. Learning decision lists. MachineLearning, 2:229–246, 1987.

[30] J.B.T.M Roerdink and A. Meijster. The watershedtransform: definitions, algorithms, and paralleliza-tion strategies. Fundamenta Informaticae, 41:187–228, 2000.

[31] W.N. Street, W.H. Wolberg, and O.L. Mangasarian.Nuclear feature extraction for breast tumor diagno-sis. In IS&T/SPIE Inter. Symp. on Electronic Imag-ing: Science and Technology, volume 1905, pages861–870, 1993.

[32] W.H Wolberg and O.L. Mangasarian. MultisurfaceMethod of Pattern Separation for Medical Diagno-sis Applied to Breast Cytology. Proceedings of Na-tional Academy of Science, USA, 87:9193–9196,1990.

[33] Xiangchun Xiong, Yangon Kim, Yuncheol Baek,Dae Wong Rhee, and Soo-Hong Kim. Analysis ofbreast cancer using data mining & statistical tech-niques. In Proc. 6th Int. Conf. on Software En-gineering, Artificial Intelligence, Networking andParallel/Distributed Computing and 1st ACIS Int.Worksh. on Self-Assembling Wireless Networks,pages 82–87, 2005.


Figure 3. Low magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation.

Figure 4. High magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation

a) b)

c) d)

Figure 3: Low magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation.

a) b)

c) d)

Figure 4: High magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation

15

a) b)

c) d)

Figure 3: Low magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation.

a) b)

c) d)

Figure 4: High magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation

15


Figure 3. Low magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation.

Figure 4. High magnification segmentation results. a) Original image, b) K–means segmentation, c) Fuzzyc–means segmentation, d) Watershed segmentation


Table 1. Sample results of feature extraction for three segmentation algorithms.

Feature K–means Fuzzy c–means WatershedGroups area [px] 360.9 387.8 6893.2

Number of groups 118 110 5Dispersion 2137 2226 8010

Nuclei area [px] 1959.3 1836.8 3079.8Perimeter [px] 236.0 220.5 323.3

Convexity 0.918 0.927 0.873X–centroid 52.00 49.36 70.64Y–centroid 46.75 44.60 64.51Orientation 0.526 0.527 0.417

Vertical projection 82.64 80.70 178.62Horizontal projection 61.55 60.11 133.03

Luminance mean 153.42 146.95 171.17Luminance variance 10.54 0.00 20.36

Eccentricity 0.039 0.039 0.152Distance from centroid 363 369 359Histogram mean for:

R channel 246.2 246.2 246.2G channel 209.83 209.83 209.83B channel 199.18 199.18 199.18

Histogram variance for:R channel 21.93 21.93 21.93G channel 40.22 40.22 40.22B channel 28.57 28.57 28.57

Histogram energy for:R channel 0.22 0.22 0.22G channel 0.037 0.037 0.037B channel 0.014 0.014 0.014

Distance fromcentroid RGB 7 0 –

Computing time [ms] 1220 38593 156


Table 2. Error rates for different segmentation set–ups.

Segmentation C4.5 PART Decision Decision Multilayer LMT Logistic Naıveset–up table table perceptron Bayes

Low magnificationwatershed,

High magnification 16.18 % 15.61% 15.03 % 16.18 % 15.61 % 12.14 % 14.45 % 17.92 %Fuzzy c–means


High magnification 18.50 % 18.50% 16.19 % 17.34 % 10.98 % 11.56 % 16.18 % 17.34 %K–means

Low magnificationFuzzy c–means,

High magnification 16.76 % 20.23% 17.92 % 25.43 % 15.03 % 15.61 % 15.61 % 35.84 %watershed



Low magnificationK–means,




Low magnificationand



High magnification 17.92 % 18.50% 18.50 % 21.39 % 11.56 % 16.76 % 13.29 % 19.07 %Fuzzy c–meanss



Bruzdzinski T., Krzyzak A., Fevens T. and Jelen Ł.

Table 2. Error rates for different segmentation set–ups.

Segmentation C4.5 PART Decision Decision Multilayer LMT Logistic Naıveset–up table table perceptron Bayes
















High magnification 17.92 % 18.50% 18.50 % 21.39 % 11.56 % 16.76 % 13.29 % 19.07 %Fuzzy c–meanss



issn 2083-2567 - jaiscrjaiscr.eu/issuespdf/jaiscr_vol4_no2_2014.pdfscientific results and methods...

Documents