Proceedings of IC2IT 2012

Full proceedings from IC2IT 2012, the 8th academic conference on computing and information technology, organized by the Faculty of Information Technology, King Mongkut's University of Technology North Bangkok, and held at the Dusit Thani Hotel, Pattaya, on 9-10 May 2012.


Table of Contents

Message from KMUTNB President ... ii
Message from General Chair ... iii
Conference Organizers ... iv
Conference Organization Committee ... v
Technical Program Committee ... vi
Keynote Speaker ... vii
Technical Program Contents ... x
Invited Papers ... 1
Regular Papers ... 8
Author Index ... 186


Message from KMUTNB President

Nowadays, it is generally accepted that a nation's development stems from technical advancement, which has become the key factor dictating the progress of any country. Many issues can affect development, such as international economics, highly competitive markets, social and cultural differences, and global environmental problems. Strengthening a country's capability to gain knowledge, both physically and mentally, therefore provides the ability to deal with critical issues, not only in urban society but in rural communities as well.

To ensure the long-term stability of a country, technology is considered one of the most important mechanisms for supporting the management of educational quality and for continually improving it. This can be seen in many developed countries that have invested resources and fundamental infrastructure in deploying Information Technology in their education systems. The development of innovative teaching and learning that brings Information Technology to the forefront of education benefits the population as a whole. The main goal is to improve the quality of life and to progress evenly and equally toward a better future.

I would like to say a special thank you to everybody involved in this conference, from partners to stakeholders; without you this would not be possible. I hope this conference provides a good opportunity for all your voices to be heard.

(Professor Dr. Teravuti Boonyasopon)
President, King Mongkut's University of Technology North Bangkok


Message from General Chair

Some of the most dramatic changes in the world are caused by the development of technology, which continually reshapes our daily lives. Education, research, and development are necessary to understand the modern world we are now a part of, particularly in the field of computing and information technology. The Faculty of Information Technology, KMUTNB, will hold the 8th International Conference on Computing and Information Technology on 9-10 May 2012 at the Dusit Thani Hotel, Pattaya City, with the goal of serving as a platform for publishing academic research in computing and information technology from students, professors, researchers, and the general public. The conference is organized in cooperation with local and international institutions, including Fern University in Hagen (Germany), Oklahoma State University (USA), Chemnitz University of Technology (Germany), Edith Cowan University (Australia), National Taiwan University (Taiwan), Hanoi National University of Education (Vietnam), Nakhon Pathom Rajabhat University, Kanchanaburi Rajabhat University, Siam University, and Ubon Ratchathani University. I would like to thank the President of King Mongkut's University of Technology North Bangkok and all involved organizations and committees who support and drive this conference to success.

(Associate Professor Dr.Monchai Tiantong)

General Chair


Conference Organizers

King Mongkut’s University of Technology North Bangkok, Thailand

Fern University in Hagen, Germany

Oklahoma State University, USA

Edith Cowan University, Australia

National Taiwan University, Taiwan

Hanoi National University of Education, Vietnam

Mahasarakham University, Thailand

Kanchanaburi Rajabhat University, Thailand

Siam University, Thailand

Nakhon Pathom Rajabhat University, Thailand

Ubon Ratchathani University, Thailand

Chemnitz University of Technology, Germany


Conference Organization Committee

General Chair: Assoc. Professor Dr. Monchai Tiantong, King Mongkut's University of Technology North Bangkok
Technical Program Chair: Prof. Dr. Herwig Unger, Fern University in Hagen, Germany
Conference Treasurer: Assist. Prof. Dr. Supot Nitsuwat, King Mongkut's University of Technology North Bangkok
Secretary and Publication Chair: Assist. Prof. Dr. Phayung Meesad, King Mongkut's University of Technology North Bangkok


Technical Program Committee

Alain Bui, Uni Paris 8, France Alisa Kongthon, NECTEC, Thailand Anirach Mingkhwan, KMUTNB, Thailand Apiruck Preechayasomboon, TOT, Thailand Armin Mikler, University of North Texas, USA Atchara Masaweerawat, UBU, Thailand Banatus Soiraya, Thailand Bogdan Lent, Lent AG, Switzerland Chatchawin Namman, UBU, Thailand Chayakorn Netramai, KMUTNB, Thailand Cholatip Yawut, KMUTNB, Thailand Choochart Haruechaiyasak, NECTEC, Thailand Claudio Ramirez, USL, Mexico Craig Valli, ECU, Australia Dietmar Tutsch, Wuppertal, Germany Doy Sundarasaradula, TOT, Thailand Dursun Delen, OSU, USA Gerald Eichler, Telecom, Germany Gerald Quirchmayr, UNIVIE, Austria Hsin-mu Tsai, NTU, Taiwan Ho Cam Ha, HNUE, Vietnam Jamornkul Laokietkul, CRU, Thailand Janusz Kacprzyk, Polish Academy of Science, Poland Jie Lu, Univ. of Technology, Sydney, Australia Kairung Hengpraphrom, NPRU, Thailand Kamol Limtunyakul, KMUTNB, Thailand Kriengsak Treeprapin, UBU, Thailand Kunpong Voraratpunya, KMITL, Thailand Kyandoghere Kyamakya, Klagenfurt, Austria Maleerat Sodanil, KMUTNB, Thailand Marco Aiello, Groningen, The Netherlands Mark Weiser, OSU, USA Martin Hagan, OSU, USA Mirko Caspar, Chemnitz, Germany Nadh Ditcharoen, UBU, Thailand Nawaporn Visitpongpun, KMUTNB, Thailand Nattavee Utakrit, KMUTNB, Thailand Nguyen The Loc, HNUE, Vietnam Nalinpat Porrawatpreyakorn, UNIVIE, Austria Padej Phomsakha Na Sakonnakorn, Thailand Parinya Sanguansat, PIM, Thailand Passakon Prathombutr, NECTEC, Thailand Peter Kropf, Neuchatel, Switzerland Phayung Meesad, KMUTNB, Thailand

Prasong Praneetpolgrang, SPU, Thailand Roman Gumzej, University of Maribor, Slovenia Saowaphak Sasanus, TOT, Thailand Sirapat Boonkrong, KMUTNB, Thailand Somchai Prakarncharoen, KMUTNB, Thailand Soradech Krootjohn, KMUTNB, Thailand Suksaeng Kukanok, Thailand Surapan Yimman, KMUTNB, Thailand Sumitra Nuanmeesri, SSRU, Thailand Sunantha Sodsee, KMUTNB, Thailand Supot Nitsuwat, KMUTNB, Thailand Taweesak Ganjanasuwan, Thailand Tong Srikhacha, TOT, Thailand Tossaporn Joochim, UBU, Thailand Thibault Bernard, Uni Reims, France Thippaya Chintakovid, KMUTNB, Thailand Tobias Eggendorfer, Hamburg, Germany Thomas Böhme, TU Ilmenau, Germany Thomas Tilli, Telecom, Germany Dang Hung Tran, HNUE, Vietnam Ulrike Lechner, UniBw, Germany Uraiwan Inyaem, RMUTT, Thailand Wallace Tang, CityU Hongkong Wolfram Hardt, Chemnitz, Germany Winai Bodhisuwan, KU, Thailand Wongot Sriurai, UBU, Thailand Woraniti Limpakorn, TOT, Thailand


Keynote Speaker

Professor Dr. Martin Hagan
School of Electrical and Computer Engineering

Oklahoma State University, USA

Topic: Dynamic Neural Networks: What Are They, and How Can We Use Them?

Abstract: Neural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This talk will address the theory and applications of this interesting class of neural network. Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics. These dynamic networks are generally trained using gradient-based (steepest descent, conjugate gradient, etc.) or Jacobian-based (Gauss-Newton, Levenberg-Marquardt, Extended Kalman filter, etc.) optimization algorithms. The methods for computing the gradients and Jacobians fall generally into two categories: real time recurrent learning (RTRL) or backpropagation through time (BPTT). In this talk we will present a unified view of the training of dynamic networks. We will begin with a very general framework for representing dynamic networks and will demonstrate how BPTT and RTRL algorithms can be efficiently developed using this framework. While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this talk, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.


Keynote Speaker

Prof. Dr. rer. nat. Ulrike Lechner
Institut für Angewandte Informatik, Fakultät für Informatik
Universität der Bundeswehr München, Germany

Topic: Innovation Management and the IT-Industry – Enabler of innovations or truly innovative?

Abstract: Who wants to be innovative? Who needs to be innovative? Everybody? Innovation seems to be paramount in today's economy, and IT is an important driver for innovation. Think of E-Business and all the consumer electronics. Can it be safely assumed that this industry is innovative and masters the art and science of innovation management? What about important business model trends of today, outsourcing and cloud computing, or the bread-and-butter business of the many IT-consulting companies? How important is innovation to them, and how do they master innovations and do innovation management? Empirical data is rather inconclusive about business model innovations of the IT-industry and the need to be innovative. The talk reports on experiences in innovation management in the IT-industry and discusses awareness of innovation, innovation in services vs. product innovations and the various options to design innovation management. It provides an overview of the innovation landscape and ecosystems in the IT-industry as well as the theoretical background to analyze innovation networks. The talk discusses scientific approaches and open questions in this field.


Keynote Speaker

Dr. Hsin-Mu Tsai
Department of Computer Science and Information Engineering

National Taiwan University, Taiwan

Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System

Abstract: For the past decade, various safety systems have been realized by car manufacturers to reduce the number of accidents drastically. However, the conventional approach has limited the classes of risks which can be detected and handled by the safety systems to those which have a line-of-sight path to the sensors installed on vehicles. In this talk, I will propose the next-generation vehicle safety system, which utilizes two fundamental technologies. The system gives out warnings for the vehicle or the driver to react to potential risks in a timely manner and extends the classes of risks which can be detected by vehicles from only “risks which have appeared” to also “risks which have not yet appeared”. Conceptually, this increases the size of the “safety shield” of the vehicle, since most accidents caused by detectable risks could be avoided. I will also present the related research challenges in implementing such a system and some preliminary results from the measurements we carried out at National Taiwan University.


Technical Program Contents


Wednesday May 9, 2012

8:00-9:00 Registration

9:00-9:30 Opening Ceremony by Prof.Dr. Teravuti Boonyasopon, President of King Mongkut’s University of Technology North Bangkok

9:30-10:30

Invited Keynote Speech by Prof. Dr. Martin Hagan, Oklahoma State University, USA
Topic: Dynamic Neural Networks: What Are They, and How Can We Use Them?

10:30-11:00 Coffee Break

11:00-12:00

Invited Keynote Speech by Prof. Dr. rer. nat. Ulrike Lechner, Universität der Bundeswehr München, Germany
Topic: Innovation Management and the IT-Industry – Enabler of innovations or truly innovative?

12:00-13:00 Lunch

13:00-18:00 Parallel Session Presentation

18:00-22:00 Welcome Dinner

IC2IT 2012 Session I

Network & Security and Fuzzy Logic

Chair Session: Dr. Nawaporn Wisitpongphan

Time/Paper-ID Title/Author Page

13:00-13:20 IC2IT2012-71

Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys Montida Pattaranantakul, Paramin Sangwongngam, and Keattisak Sripimanwat

8

13:20-13:40 IC2IT2012-33

Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation Boukary Ouedraogo, Thibault Bernard, and Alain Bui

14

13:40-14:00 IC2IT2012-107

Statistical-Based Car Following Model for Realistic Simulation of Wireless Vehicular Networks Kitipong Tansriwong and Phongsak Keeratiwintakorn

19

14:00-14:20 IC2IT2012-34

Rainfall Prediction in the Northeast Region of Thailand Using Cooperative Neuro-Fuzzy Technique Jesada Kajornrit, Kok Wai Wong, and Chun Che Fung

24

14:20-14:40 IC2IT2012-46

Interval-Valued Intuitionistic Fuzzy ELECTRE Method Ming-Che Wu and Ting-Yu Chen 30

14:40-15:00 Coffee Break


IC2IT 2012 Session II

Fuzzy Logic, Neural Network, and Recommendation Systems

Chair Session: Dr. Maleerat Sodanil

Time/Paper-ID Title/Author Page

15:00-15:20 IC2IT2012-81

Optimizing of Interval Type-2 Fuzzy Logic Systems Using Hybrid Heuristic Algorithm Evaluated by Classification Adisak Sangsongfa and Phayung Meesad

36

15:20-15:40 IC2IT2012-60

Neural Network Modeling for an Intelligent Recommendation System Supporting SRM for Universities in Thailand Kanokwan Kongsakun, Jesada Kajornrit, and Chun Che Fung

42

15:40-16:00 IC2IT2012-44

Recommendation and Application of Fault Tolerance Patterns to Services Tunyathorn Leelawatcharamas and Twittie Senivongse

48

16:00-16:20 IC2IT2012-43

Development of Experience Base Ontology to Increase Competency of Semi-Automated ICD-10-TM Coding System Wansa Paoin and Supot Nitsuwat

54

16:20-16:30 Break

IC2IT 2012 Session III

Natural Language Processing and Machine Translation

Chair Session: Dr. Maleerat Sodanil

16:30-16:50 IC2IT2012-110

Collocation-Based Term Prediction for Academic Writing Narisara Nakmaetee, Maleerat Sodanil, and Choochart Haruechaiyasak

58

16:50-17:10 IC2IT2012-65

Thai Poetry in Machine Translation Sajjaporn Waijanya and Anirach Mingkhwan 64

17:10-17:30 IC2IT2012-45

Keyword Recommendation for Academic Publication Using Flexible N-gram Rugpong Grachangpun, Maleerat Sodanil, and Choochart Haruechaiyasak

70

17:30-17:50 IC2IT2012-70

Using Example-Based Machine Translation for English – Vietnamese Translation Minh Quang Nguyen, Dang Hung Tran, and Thi Anh Le Pham

75

18:00-22:00 Welcome Dinner


Thursday May 10, 2012

8:00-9:00 Registration

9:00-10:00

Invited Keynote Speech by Dr. Hsin-Mu Tsai, National Taiwan University, Taiwan
Topic: Extend the Safety Shield - Building the Next Generation Vehicle Safety System

10:00-10:20 Coffee Break

10:20-12:00 Parallel Session Presentation

12:00-13:00 Lunch

13:00-18:00 Parallel Session Presentation

IC2IT 2012 Session IV

Image Processing, Web Mining, Clustering, and e-Business

Chair Session: Prof.Dr. Herwig Unger

Time/Paper-ID Title/Author Page

10:20-10:40 IC2IT2012-57

Cross-Ratio Analysis for Building up the Robustness of Document Image Watermark Wiyada Yawai and Nualsawat Hiransakolwong

81

10:40-11:00 IC2IT2012-73

PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network Ravi Sheth and Kinjal Mehta

87

11:00-11:20 IC2IT2012-68

Web Mining Using Concept-Based Pattern Taxonomy Model Sheng-Tang Wu, Yuefeng Li, and Yung-Chang Lin 92

11:20-11:40 IC2IT2012-59

A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps Marcin Zimniak, Johannes Fliege, and Wolfgang Benn

98

11:40-12:00 IC2IT2012-74

Detecting Source Topics Using Extended HITS Mario Kubek and Herwig Unger 104

12:00-13:00 Lunch


IC2IT 2012 Session V

Evolutionary Algorithm, Heuristic Search, and Graphics Processing & Representation

Chair Session: Dr. Sunantha Sodsee

Time/Paper-ID Title/Author Page

13:00-13:20 IC2IT2012-91

Blended Value Based e-Business Modeling Approach: A Sustainable Approach Using QFD Mohammed Dewan and Mohammed Quaddus

109

13:20-13:40 IC2IT2012-94

Protein Structure Prediction in 2D Triangular Lattice Model Using Differential Evolution Algorithm Aditya Narayan Hati, Nanda Dulal Jana, Sayantan Mandal, and Jaya Sil

116

13:40-14:00 IC2IT2012-48

Elimination of Materializations from Left/Right Deep Data Integration Plans Janusz Getta

121

14:00-14:20 IC2IT2012-24

A Variable Neighbourhood Search Heuristic for the Design of Codes Roberto Montemanni, Matteo Salani, Derek H. Smith, and Francis Hunt

127

14:20-14:40 IC2IT2012-63

Spatial Join with R-Tree on Graphics Processing Units Tongjai Yampaka and Prabhas Chongstitwattana 133

14:40-15:00 Coffee Break

IC2IT 2012 Session VI

Web Services, Ontology, and Agents

Chair Session: Dr. Sucha Smanchat

15:00-15:20 IC2IT2012-41

Ontology Driven Conceptual Graph Representation of Natural Language Supriyo Ghosh, Prajna Devi Upadhyay, and Animesh Dutta

138

15:20-15:40 IC2IT2012-88

Web Services Privacy Measurement Based on Privacy Policy and Sensitivity Level of Personal Information Punyaphat Chaiwongsa and Twittie Senivongse

145

15:40-16:00 IC2IT2012-64

Measuring Granularity of Web Services with Semantic Annotation Nuttida Muchalintamolee and Twittie Senivongse

151

16:00-16:20 IC2IT2012-83

Decomposing Ontology in Description Logics by Graph Partitioning Pham Thi Anh Le, Le Thanh Nhan, and Nguyen Minh Quang

157

16:20-16:30 Break


Time/Paper-ID Title/Author Page

16:30-16:50 IC2IT2012-49

An Ontological Analysis of Common Research Interest for Researchers Nawarat Kamsiang and Twittie Senivongse

163

16:50-17:10 IC2IT2012-36

Automated Software Development Methodology: An Agent Oriented Approach Prajna Devi Upadhyay, Sudipta Acharya, and Animesh Dutta

169

17:10-17:30 IC2IT2012-53

Agent Based Computing Environment for Accessing Privileged Services Navin Agarwal and Animesh Dutta

176

17:30-17:50 IC2IT2012-52

An Interactive Multi-touch Teaching Innovation for Preschool Mathematical Skills Suparawadee Trongtortam, Peraphon Sophatsathit, and Achara Chandrachai

181


Dynamic Neural Networks: What Are They, and How Can We Use Them?

Martin Hagan
School of Electrical and Computer Engineering,

Oklahoma State University, Stillwater, Oklahoma, 74078

[email protected]

Abstract—Neural networks can be classified into static and dynamic categories. In static networks, which are more commonly used, the output of the network is computed uniquely from the current inputs to the network. In dynamic networks, the output is also a function of past inputs, outputs or states of the network. This paper will address the theory and applications of this interesting class of neural network.

Dynamic networks have memory, and therefore they can be trained to learn sequential or time-varying patterns. This has applications in such disparate areas as control systems, prediction in financial markets, channel equalization in communication systems, phase detection in power systems, sorting, fault detection, speech recognition, and even the prediction of protein structure in genetics.

While dynamic networks are more powerful than static networks, it has been known for some time that they are more difficult to train. In this paper, we will also investigate the error surfaces for these dynamic networks, which will provide interesting insights into the difficulty of dynamic network training.

I. INTRODUCTION

Dynamic networks are networks that contain delays (or integrators, for continuous-time networks). These dynamic networks can have purely feedforward connections, or they can also have some feedback (recurrent) connections. Dynamic networks have memory. Their response at any given time will depend not only on the current input, but also on the history of the input sequence.

Because dynamic networks have memory, they can be trained to learn sequential or time-varying patterns. This has applications in such diverse areas as control of dynamic systems [1], prediction in financial markets [2], channel equalization in communication systems [3], phase detection in power systems [4], sorting [5], fault detection [6], speech recognition [7], learning of grammars in natural languages [8], and even the prediction of protein structure in genetics [9].

Dynamic networks can be trained using standard gradient-based or Jacobian-based optimization methods. However, the gradients and Jacobians that are required for these methods cannot be computed using the standard backpropagation algorithm. In this paper we will discuss a general dynamic network framework, in which dynamic backpropagation algorithms can be efficiently developed.

There are two general approaches (with many variations) to gradient and Jacobian calculations in dynamic networks: backpropagation-through-time (BPTT) [10] and real-time recurrent learning (RTRL) [11]. In the BPTT algorithm, the network response is computed for all time points, and then the gradient is computed by starting at the last time point and working backwards in time. This algorithm is computationally efficient for the gradient calculation, but it is difficult to implement on-line, because the algorithm works backward in time from the last time step.

In the RTRL algorithm, the gradient can be computed at the same time as the network response, since it is computed by starting at the first time point, and then working forward through time. RTRL requires more calculations than BPTT for calculating the gradient, but RTRL allows a convenient framework for on-line implementation. For Jacobian calculations, the RTRL algorithm is generally more efficient than the BPTT algorithm [12,13].

In order to more easily present general BPTT [10, 15] and RTRL [11, 14] algorithms, it will be helpful to introduce modified notation for networks that can have recurrent connections. In Section II, we will introduce this notation and will develop a general dynamic network framework. As a general rule, there have been two major approaches to using dynamic training. The first approach has been to use the general RTRL or BPTT concepts to derive algorithms for particular network architectures. The second approach has been to put a given network architecture into a particular canonical form (e.g., [16-18]), and then to use the dynamic training algorithm which has been previously designed for the canonical form. Our approach is to develop a very general framework in which to conveniently represent a large class of dynamic networks, and then to derive the RTRL and BPTT algorithms for the general framework.

In Section III, we will demonstrate how this general dynamic framework can be applied to solve many real-world problems. Section IV will present procedures for computing gradients for the general framework. In this way, one computer code can be used to train arbitrarily constructed network architectures, without requiring that each architecture be first converted to a particular canonical form. Finally, Section V describes some complexities in the error surfaces of dynamic networks, and shows how we can mitigate these complexities to achieve successful training for dynamic networks.

II. A GENERAL CLASS OF DYNAMIC NETWORK

Our general dynamic network framework is called the Layered Digital Dynamic Network (LDDN) [12]. The fundamental building block for the LDDN is the layer. A layer contains the following components:

• a set of weight matrices (input weights from external inputs, and layer weights from the outputs of other layers),

• tapped delay lines that appear at the input of a weight matrix,

• bias vector,

• summing junction,

• transfer function.

A prototype layer is shown in Fig. 1. The equations that define a layer response are

$$\mathbf{n}^{m}(t) = \sum_{l \in I_{m}} \sum_{d \in DI_{m,l}} \mathbf{IW}^{m,l}(d)\,\mathbf{p}^{l}(t-d) + \sum_{l \in L^{f}_{m}} \sum_{d \in DL_{m,l}} \mathbf{LW}^{m,l}(d)\,\mathbf{a}^{l}(t-d) + \mathbf{b}^{m} \quad (1)$$

$$\mathbf{a}^{m}(t) = \mathbf{f}^{m}\big(\mathbf{n}^{m}(t)\big) \quad (2)$$

where $I_{m}$ is the set of indices of all inputs that connect to layer m, $L^{f}_{m}$ is the set of indices of all layers that connect forward to layer m, $\mathbf{p}^{l}(t)$ is the lth input to the network, $\mathbf{IW}^{m,l}$ is the input weight between input l and layer m, $\mathbf{LW}^{m,l}$ is the layer weight between layer l and layer m, $\mathbf{b}^{m}$ is the bias vector for layer m, $DL_{m,l}$ is the set of all delays in the tapped delay line between Layer l and Layer m, and $DI_{m,l}$ is the set of all delays in the tapped delay line between Input l and Layer m. For the LDDN class of networks, we can have multiple weight matrices associated with each layer - some coming from external inputs, and others coming from other layers. An example of a dynamic network in the LDDN framework is shown in Fig. 2.
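As a concrete illustration of Eqs. (1)-(2), the following minimal Python sketch (not from the paper; the array shapes, the history buffers p_hist and a_hist, and the dictionary-of-delays layout are assumptions made for the example) computes one layer's response at time t.

def layer_response(t, IW, LW, b, f, p_hist, a_hist, DI, DL):
    # Compute a^m(t) for a single layer, following Eqs. (1)-(2).
    # IW[l][d], LW[l][d]: input/layer weight matrices at delay d (assumed layout)
    # p_hist[l], a_hist[l]: past input / layer-output vectors, indexed by time
    # DI[l], DL[l]: delay sets of the tapped delay lines feeding this layer
    n = b.copy()
    for l, delays in DI.items():                 # input-weight terms of Eq. (1)
        for d in delays:
            n = n + IW[l][d] @ p_hist[l][t - d]
    for l, delays in DL.items():                 # layer-weight terms of Eq. (1)
        for d in delays:
            n = n + LW[l][d] @ a_hist[l][t - d]
    return f(n)                                  # Eq. (2): a^m(t) = f^m(n^m(t))

Here f would be, for example, numpy.tanh, and the weight matrices NumPy arrays keyed first by source index and then by delay.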

The LDDN framework is quite general. It is equivalent to the class of general ordered networks discussed in [10] and [19]. It is also equivalent to the signal flow graph class of networks used in [15] and [20]. However, we can increase the generality of the LDDN further. In LDDNs, the weight matrix multiplies the corresponding vector coming into the layer (from an external input in the case of IW, and from another layer in the case of LW). This means that a dot product is formed between each row of the weight matrix and the input vector.

Figure 1. Example Layer

Figure 2. Example Dynamic Network in the LDDN Framework

We can consider more general weight functions than simply the dot product. For example, radial basis layers compute the distances between the input vector and the rows of the weight matrix. We can allow weight functions with arbitrary (but differentiable) operations between the weight matrix and the input vector. This enables us to include higher-order networks as part of our framework.

Another generality we can introduce is for the net input function. This is the function that combines the results of the weight function operations with the bias vector. In LDDNs, the net input function has been a simple summation. We can allow arbitrary, differentiable net input functions to be used.

The resulting network framework is the Generalized LDDN (GLDDN). A block diagram for a simple GLDDN (without delays) is shown in Fig. 3. The equations of operation for a GLDDN are

Weight Functions:

izm,l t,d( ) = ihm,l IWm,l d( ),pl t − d( )( ) (3)

lzm,l t,d( ) = lhm,l LWm,l d( ),al t − d( )( ) (4)


Net Input Functions:

$$\mathbf{n}^{m}(t) = \mathbf{o}^{m}\Big( \big\{\mathbf{iz}^{m,l}(t,d)\big\}_{l \in I_{m},\, d \in DI_{m,l}},\ \big\{\mathbf{lz}^{m,l}(t,d)\big\}_{l \in L^{f}_{m},\, d \in DL_{m,l}},\ \mathbf{b}^{m} \Big) \quad (5)$$

Transfer Functions:

$$\mathbf{a}^{m}(t) = \mathbf{f}^{m}\big(\mathbf{n}^{m}(t)\big) \quad (6)$$
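To make the generalization concrete, here is a small illustrative sketch (our own, not code from the paper) of interchangeable weight functions and net input functions: the LDDN defaults (dot product and summation) next to a distance-based weight function of the kind used by radial basis layers.

import numpy as np

def dot_weight_fn(W, v):
    # the standard LDDN weight function: a dot product of each row of W with v
    return W @ v

def dist_weight_fn(W, v):
    # a radial-basis-style weight function: distance from v to each row of W
    v = np.ravel(v)
    return np.linalg.norm(W - v, axis=1)

def sum_net_input(z_terms, b):
    # the standard LDDN net input function: sum the weighted terms and the bias
    return b + sum(z_terms)

def prod_net_input(z_terms, b):
    # an alternative (elementwise product) net input function, as Eq. (5) allows
    out = np.ravel(b).copy()
    for z in z_terms:
        out = out * np.ravel(z)
    return out

With dot_weight_fn and sum_net_input the GLDDN reduces to the LDDN of Section II.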

Figure 3. Example Network with General Weight and Net Input Functions

III. APPLICATIONS OF DYNAMIC NETWORKS

Dynamic networks have been applied to a wide variety of application areas. In this section, we would like to give just a brief overview of some of these.

A. Phase Detection in Power Systems

Voltage phase and local frequency deviation are used in disturbance monitoring and control for power systems. Modern power electronic devices introduce complex interharmonics, which make it difficult to extract the phase. The dynamic neural network shown in Fig. 4 has been used [4] to detect phase in power systems. The input to the network is the line voltage:

$$p(t) = A(t)\sin\big(2\pi f_{c} t + \phi(t)\big) + v(t) .$$

The target output is the phase $\phi(t)$. The equations of operation for the network are

$$\mathbf{n}^{1}(t) = \sum_{d \in DI_{1,1}} \mathbf{IW}^{1,1}(d)\,\mathbf{p}^{1}(t-d) + \mathbf{LW}^{1,1}(1)\,\mathbf{a}^{1}(t-1) + \mathbf{b}^{1}$$

$$\mathbf{a}^{1}(t) = \mathbf{f}^{1}\big(\mathbf{n}^{1}(t)\big)$$
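A hedged sketch of how training data for this application might be synthesized (the sampling rate, amplitude and phase profiles, and noise level below are arbitrary choices for illustration, not values from [4]):

import numpy as np

fs, fc, T = 2000.0, 50.0, 1.0                     # sample rate, line frequency, duration (assumed)
t = np.arange(0.0, T, 1.0 / fs)

A   = 1.0 + 0.05 * np.sin(2 * np.pi * 0.5 * t)    # slowly varying amplitude A(t)
phi = 0.2 * np.sin(2 * np.pi * 1.0 * t)           # slowly varying phase phi(t): the target
v   = 0.01 * np.random.randn(t.size)              # additive noise v(t)

p = A * np.sin(2 * np.pi * fc * t + phi) + v      # network input: the line voltage
target = phi                                      # network target output: the phase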

Figure 4. Phase Detection Network for Power Systems

B. Speech Prediction

Predictive coding of speech signals is commonly used for data compression. The standard method has used Linear Predictive Coding (LPC). Neural networks allow the use of nonlinear predictive coding. Fig. 5 shows a pipeline recurrent neural network [21], which can be used for speech prediction. The target output would be the next value of the input sequence.
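For illustration, one-step-ahead training pairs of the kind described above can be built as follows (the tap count and the synthetic test signal are arbitrary choices, not taken from [21]):

import numpy as np

def one_step_ahead_pairs(signal, taps=8):
    # Build (window of past samples, next sample) pairs: the prediction target
    # at each step is simply the next value of the input sequence.
    X = np.array([signal[k - taps:k] for k in range(taps, len(signal))])
    y = np.asarray(signal[taps:])
    return X, y

# toy "speech-like" signal: a decaying tone plus a little noise
t = np.arange(0, 1, 1 / 8000)
s = np.sin(2 * np.pi * 220 * t) * np.exp(-3 * t) + 0.01 * np.random.randn(t.size)
X, y = one_step_ahead_pairs(s)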

Figure 5. Speech Prediction Network

C. Channel Equalization

The performance of a communication system can be seriously impaired by channel effects and noise. These may cause the transmitted signal of one symbol to spread out and overlap successive symbol intervals - commonly termed Intersymbol Interference. Dynamic neural networks, like the one in Fig. 4, can be used to perform channel equalization, to compensate for the effects of the channel [3]. Fig. 6 shows the block diagram of such a system.

Figure 6. Channel Equalization System

D. Model Reference Control

Dynamic networks are suitable for many types of control systems. Fig. 7 shows the architecture of a model reference control system [14].

Figure 7. Model Reference Control System


E. Grammatical Inference

Grammars are a way to define languages. They consist of rules that describe how to construct valid strings. Dynamic neural networks can be trained to recognize which strings belong to a language and which don't. Dynamic networks can also perform grammatical inference - learning a grammar from example strings. Fig. 8 shows a dynamic network that can be used for grammatical inference [8]. The error function is defined by a single output neuron. At the end of each string presentation it should be 1 if the string is valid and 0 if not.
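As an illustrative stand-in (the grammar below is a toy example, not the one used in [8]), training data for such a network could be generated like this:

import random

def random_string(max_len=10):
    # a random binary string, one symbol per time step
    return [random.randint(0, 1) for _ in range(random.randint(1, max_len))]

def in_language(s):
    # toy "grammar": the language of strings containing an even number of 1s
    return 1 if sum(s) % 2 == 0 else 0

dataset = [(s, in_language(s)) for s in (random_string() for _ in range(100))]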

Figure 8. Grammar Inference Network

F. Protein Folding

Each gene within the DNA molecule codes for a protein. The amino acid sequence (A,T,G,C) determines the protein structure (e.g., secondary structure = helix, strand, coil). However, the relationship between the sequence and the structure is very complex. In the network in Fig. 9, the sequence is provided at the input to the network, and the output of the network indicates the secondary structure [9].

Figure 9. Protein Structure Identification Network

IV. GRADIENT CALCULATION FOR THE GLDDN

Dynamic networks are generally trained with a gradient or Jacobian-based algorithm. In this section we describe an algorithm for computing the gradient for the GLDDN. This can be done using the BPTT or the RTRL approaches. Because of limited space, we will describe only the RTRL algorithm in this paper. (Both approaches are described for the LDDN framework in [12].)

To explain the gradient calculation for the GLDDN, we must create certain definitions. We do that in the following paragraphs.

A. Preliminary Definitions

First, as we stated earlier, a layer consists of a set of weights, associated weight functions, associated tapped delay lines, a net input function, and a transfer function. The network has inputs that are connected to special weights, called input weights. The weights connecting one layer to another are called layer weights. In order to calculate the network response in stages, layer by layer, we need to proceed in the proper layer order, so that the necessary inputs at each layer will be available. This ordering of layers is called the simulation order. In order to backpropagate the derivatives for the gradient calculations, we must proceed in the opposite order, which is called the backpropagation order.

In order to simplify the description of the gradient calculation, some layers of the GLDDN will be assigned as network outputs, and some will be assigned as network inputs. A layer is an input layer if it has an input weight, or if it contains any delays with any of its weight matrices. A layer is an output layer if its output will be compared to a target during training, or if it is connected to an input layer through a matrix that has any delays associated with it.

For example, the LDDN shown in Fig. 2 has two output layers (1 and 3) and two input layers (1 and 2). For this network the simulation order is 1-2-3, and the backpropagation order is 3-2-1. As an aid in later derivations, we will define U as the set of all output layer numbers and X as the set of all input layer numbers. For the LDDN in Fig. 2, U = {1, 3} and X = {1, 2}.
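The simulation order can be obtained by topologically sorting the layers over their zero-delay connections; the sketch below is our own illustration, assuming (as the description of Fig. 2 suggests) that the only connections without delays are from layer 1 to layer 2 and from layer 2 to layer 3.

from collections import defaultdict, deque

def simulation_order(layers, zero_delay_edges):
    # Topological sort of the layers over the connections that carry no delay;
    # delayed connections are ignored because their source values come from
    # earlier time steps and are already available.
    indeg = {m: 0 for m in layers}
    succ = defaultdict(list)
    for src, dst in zero_delay_edges:
        succ[src].append(dst)
        indeg[dst] += 1
    ready = deque(m for m in layers if indeg[m] == 0)
    order = []
    while ready:
        m = ready.popleft()
        order.append(m)
        for n in succ[m]:
            indeg[n] -= 1
            if indeg[n] == 0:
                ready.append(n)
    return order            # the backpropagation order is the reverse of this list

print(simulation_order([1, 2, 3], [(1, 2), (2, 3)]))    # -> [1, 2, 3]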

B. Gradient Calculation

The objective of training is to optimize the network performance, quantified in the performance index F(x), where x is a vector containing all of the weights and biases in the network. In this paper we will consider gradient-based algorithms for optimizing the performance (e.g., steepest descent, conjugate gradient, quasi-Newton, etc.). For the RTRL approach, the gradient is computed using

$$\frac{\partial F}{\partial \mathbf{x}} = \sum_{t} \sum_{u \in U} \left[ \frac{\partial \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} \right]^{T} \frac{\partial^{e} F}{\partial \mathbf{a}^{u}(t)} , \quad (7)$$

where

$$\frac{\partial \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} = \frac{\partial^{e} \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} + \sum_{x \in X} \sum_{u' \in U} \sum_{d \in DL_{x,u'}} \frac{\partial^{e} \mathbf{a}^{u}(t)}{\partial \mathbf{n}^{x}(t)^{T}} \, \frac{\partial^{e} \mathbf{n}^{x}(t)}{\partial \mathbf{a}^{u'}(t-d)^{T}} \, \frac{\partial \mathbf{a}^{u'}(t-d)}{\partial \mathbf{x}^{T}} \quad (8)$$

The superscript e in these expressions indicates an explicit derivative, not accounting for indirect effects through time.

Many of the terms in Eq. 8 will be zero and need not be included. To take advantage of these efficiencies, we introduce the following definitions


$$E^{U}_{LW}(x) = \big\{ u \in U \ni \exists\, \mathbf{LW}^{x,u} \neq 0 \big\} , \quad (9)$$

$$E^{X}_{S}(u) = \big\{ x \in X \ni \exists\, \mathbf{S}^{u,x} \neq 0 \big\} , \quad (10)$$

where

$$\mathbf{S}^{u,m}(t) \equiv \frac{\partial^{e} \mathbf{a}^{u}(t)}{\partial \mathbf{n}^{m}(t)^{T}} \quad (11)$$

is the sensitivity matrix.

Using these definitions, we can rewrite Eq. 8 as

$$\frac{\partial \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} = \frac{\partial^{e} \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} + \sum_{x \in E^{X}_{S}(u)} \sum_{u' \in E^{U}_{LW}(x)} \sum_{d \in DL_{x,u'}} \mathbf{S}^{u,x}(t) \, \frac{\partial^{e} \mathbf{n}^{x}(t)}{\partial \mathbf{a}^{u'}(t-d)^{T}} \, \frac{\partial \mathbf{a}^{u'}(t-d)}{\partial \mathbf{x}^{T}} \quad (12)$$

The sensitivity matrix can be computed using static backpropagation, since it describes derivatives through a static portion of the network. The static backpropagation equation is

$$\mathbf{S}^{u,m}(t) = \left[ \sum_{l \in L^{b}_{m} \cap E_{S}(u)} \mathbf{S}^{u,l}(t)\, \frac{\partial^{e} \mathbf{n}^{l}(t)}{\partial\, \mathbf{lz}^{l,m}(t,0)^{T}}\, \frac{\partial^{e}\, \mathbf{lz}^{l,m}(t)}{\partial \mathbf{a}^{m}(t)^{T}} \right] \dot{\mathbf{F}}^{m}\big(\mathbf{n}^{m}(t)\big) , \quad u \in U , \quad (13)$$

where m is decremented from u through the backpropagation order, $L^{b}_{m}$ is the set of indices of layers that are directly connected backwards to layer m (or to which layer m connects forward) and that contain no delays in the connection, and

. (14)

There are four terms in Eqs. 12 and 13 that need to be computed:

$$\frac{\partial^{e} \mathbf{n}^{x}(t)}{\partial \mathbf{a}^{u'}(t-d)^{T}} , \quad \frac{\partial^{e} \mathbf{n}^{l}(t)}{\partial\, \mathbf{lz}^{l,m}(t,0)^{T}} , \quad \frac{\partial^{e}\, \mathbf{lz}^{l,m}(t)}{\partial \mathbf{a}^{m}(t)^{T}} , \quad \text{and} \quad \frac{\partial^{e} \mathbf{a}^{u}(t)}{\partial \mathbf{x}^{T}} . \quad (15)$$

The first term can be expanded as follows:

$$\frac{\partial^{e} \mathbf{n}^{x}(t)}{\partial \mathbf{a}^{u'}(t-d)^{T}} = \frac{\partial^{e} \mathbf{n}^{x}(t)}{\partial\, \mathbf{lz}^{x,u'}(t,d)^{T}} \; \frac{\partial^{e}\, \mathbf{lz}^{x,u'}(t,d)}{\partial \mathbf{a}^{u'}(t-d)^{T}} \quad (16)$$

The first term on the right of Eq. 16 is the derivative of the net input function, which is the identity matrix if the net input is the standard summation. The second term is the derivative of the weight function, which is the corresponding weight matrix if the weight function is the standard dot product. Therefore, the right side of Eq. 16 becomes simply a weight matrix for LDDN networks.

The second term in Eq. 15 is the same as the first term on the right of Eq. 16. It is the derivative of the net input function. The third term in Eq. 15 is the same as the second term on the right of Eq. 16. It is the derivative of the weight function.

The final term that we need to compute is the last term in Eq. 15, which is the explicit derivative of the network outputs with respect to the weights and biases in the network. One element of that matrix can be written

(17)

The first term in this summation is an element of the sensitivity matrix, which is computed using Eq. 13. The second term is the derivative of the net input, and the third term is the derivative of the weight function. (We have made the assumption here that the net input function operates on each element individually.) Eq. 17 is the equation for an input weight. Layer weights and biases would have similar equations.

This completes the RTRL algorithm for networks that can be represented in the GLDDN framework. The main steps of the algorithm are Eqs. 7 and 12, where the components of Eq. 12 are computed using Eqs. 16 and 17. Computer code can be written from these equations, with modules for weight functions, net input functions and transfer functions added as needed. Each module should define the function response, as well as its derivative. The overall framework is independent of the particular form of these modules.
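A possible shape for such modules (purely illustrative; the dictionary interface and names below are assumptions, not the authors' code) is a response/derivative pair per function:

import numpy as np

# Each module supplies a response and the derivatives the gradient equations
# need; the training code stays independent of which functions a layer uses.
dot_weight = {
    "response": lambda W, v: W @ v,
    "d_input":  lambda W, v: W,                    # derivative w.r.t. the incoming vector
    "d_weight": lambda W, v: np.ravel(v),          # derivative of row i w.r.t. its weights
}

sum_net_input = {
    "response": lambda terms, b: b + sum(terms),
    "d_term":   lambda terms, b: np.eye(np.ravel(b).size),   # identity for a plain sum
}

tanh_transfer = {
    "response":   np.tanh,
    "derivative": lambda n: 1.0 - np.tanh(n) ** 2,
}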

V. TRAINING DIFFICULTIES FOR DYNAMIC NETWORKS

From the previous section on dynamic network applications, it is clear that these types of networks are very powerful and have many uses. However, they have not yet been adopted comprehensively. The main reason for this is the difficulty in training these types of networks. The reasons for these difficulties are not completely understood, but it has been shown that one of the reasons is the existence of spurious valleys in the error surfaces of these networks. In this section, we will provide a quick overview of the causes of these spurious valleys and suggestions for mitigating their effects.

Fig. 10 shows an example of spurious valleys in the error surface of a neural network model reference controller (as shown in Fig. 7). In this particular example, the network had 65 weights. The plot shows the error surface along the direction of search in a particular iteration of a quasi-Newton optimization algorithm. It is clear from this profile that any standard line search, using a combination of interpolation and sectioning, will have great difficulty in locating the minimum along the search direction. There are many local minima contained in very narrow valleys. In addition, the bottoms of the valleys are often cusps. Even if our line search were to locate the minimum, it is not clear that the minimum represents an optimal weight location. In fact, in the remainder of this section, we will demonstrate that spurious minima are introduced into the error surface due to characteristics of the input sequence.

Figure 10. Example of Spurious Valleys

In order to understand the spurious valleys in the error surfaces of dynamic networks it is best to start with the simplest network for which such valleys will appear. We have found that these valleys even appear in a linear network with one neuron, as shown in Fig. 11.

Figure 11. Single Neuron Recurrent Network

In order to generate an error surface, we first develop training data using the network of Fig. 11, where both weights are set to 0.5. We use a Gaussian white noise input sequence with mean zero and variance one for p(t), and then use the network to generate a sequence of outputs. In Fig. 12 we see a typical error surface, as the two weights are varied. Although this network architecture is simple, the error surfaces generated by these networks have spurious valleys similar to those encountered in more complicated networks.
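A minimal sketch reproducing this experiment (the sequence length, the grid resolution, and the reading of the recurrence as a(t) = w1 p(t) + w2 a(t-1) are our assumptions, chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)
Q = 50
p = rng.standard_normal(Q)                    # Gaussian white noise input, mean 0, variance 1

def simulate(w1, w2, p, a0=0.0):
    # single-neuron linear recurrent network, read as a(t) = w1 p(t) + w2 a(t-1)
    a, prev = np.empty(len(p)), a0
    for t in range(len(p)):
        prev = w1 * p[t] + w2 * prev
        a[t] = prev
    return a

target = simulate(0.5, 0.5, p)                # training data generated with both weights at 0.5

w_grid = np.linspace(-10, 10, 201)
sse = np.array([[np.sum((simulate(w1, w2, p) - target) ** 2) for w1 in w_grid]
                for w2 in w_grid])            # error surface over the (w1, w2) plane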

The two valleys in the error surface occur for two different reasons. One valley occurs along the line w1 = 0. If this weight is zero, and the initial condition is zero, the output of the network will remain zero. Therefore, our mean squared error will be constant and equal to the mean square value of the target outputs.

Figure 12. Single Neuron Network Error Surface

To understand where the second valley comes from, consider the network response equation:

$$a(t+1) = w_{1}\, p(t) + w_{2}\, a(t)$$

If we iterate this equation from the initial condition a(0), we get

$$a(t) = w_{1}\, p(t) + w_{2}\, p(t-1) + (w_{2})^{2} p(t-2) + \dots + (w_{2})^{t-1} p(1) + (w_{2})^{t} a(0)$$

Here we can see that the response at time t is a polynomial in the parameter w2. (It will be a polynomial of degree t-1, if the initial condition is zero.) The coefficients of the polynomial involve the input sequence and the initial condition. We obtain the second valley because this polynomial contains a root outside the unit circle. There is some value of w2 that is larger than 1 in magnitude for which the output is almost zero.

Of course, having a single output close to zero would not produce a valley in the error surface. However, we discovered that once the polynomial shown above has a root outside the unit circle at time t, that same root also appears in the next polynomial at time t+1, and therefore, the output will remain small for all future times for the same weight value.
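A small numerical check of this mechanism (assuming a zero initial condition and an arbitrary sequence length) locates the candidate valley positions as the real roots of that polynomial lying outside the unit circle:

import numpy as np

rng = np.random.default_rng(0)
w1 = 0.5
p = rng.standard_normal(40)                   # the input sequence p(1) ... p(t)

# Coefficients of a(t) as a polynomial in w2 (zero initial condition),
# highest power first: p(1) multiplies (w2)^(t-1), the constant term is w1 p(t).
coeffs = np.concatenate([p[:-1], [w1 * p[-1]]])
roots = np.roots(coeffs)

valley_w2 = [r.real for r in roots if abs(r.imag) < 1e-9 and abs(r) > 1.0]
print(valley_w2)                              # real roots outside the unit circle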

Fig. 13 shows a cross section of the error surface presented in Fig. 12 for w1=0.5 using different sequence lengths. The error falls abruptly near w2=-3.8239. That is the root of the polynomial described above. The root maintains its location as the sequence increases in length. This causes the valley in the error surface.

We have since studied more complex networks, with nonlinear transfer functions and multiple layers. The number of spurious valleys increases in these cases, and they become more complex. However, the causes of the valleys remain similar. They are affected by initial conditions and roots of the input sequence (or subsequence). This leads to several procedures for improving the training for these networks.

The first training modification is to switch training sequences often during training. If training is becoming trapped in a spurious valley, the valley will move when the training sequence is changed. Also, since some of the valleys are affected by the choice of initial condition, a second modification is to use small random initial conditions for neuron outputs and change them periodically during training. A further modification is to use a regularized performance index to force weights into the stable region. Since the deep valleys occur in regions where the network is unstable, we can avoid the valleys by maintaining a stable network. We generally decay the regularization factor during training, so that the final weights will not be biased.
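A schematic training loop combining these three mitigations might look as follows (the net interface, its methods sum_squared_error, instability_penalty and gradient_step, and the decay rate are hypothetical, shown only to mark where each modification enters):

import numpy as np

def train(net, sequences, epochs=200, reg0=0.1, decay=0.95):
    # Hypothetical interface: net.state_size, net.sum_squared_error(seq),
    # net.instability_penalty() and net.gradient_step(loss) stand in for
    # whatever dynamic-network implementation is being trained.
    reg = reg0
    for epoch in range(epochs):
        seq = sequences[epoch % len(sequences)]            # switch training sequences often
        net.initial_state = 0.01 * np.random.randn(net.state_size)   # small random a(0), changed periodically
        loss = net.sum_squared_error(seq) + reg * net.instability_penalty()
        net.gradient_step(loss)                            # any gradient-based update
        reg *= decay                                       # decay the regularization factor
    return net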

VI. CONCLUSIONS

Dynamic neural networks represent a very powerful paradigm, and, as we have shown in this paper, they have a very wide variety of applications. However, they have not been as widely implemented as their power would suggest. The reason for this discrepancy is related to the difficulties in training these networks. The first obstacle in dynamic network training is the calculation of training gradients. In most cases, the gradient algorithm is custom designed for a specific network architecture, based on the general concepts of BPTT or RTRL. This creates a barrier to using dynamic networks. We propose a general dynamic network framework, the GLDDN, which encompasses almost all dynamic networks that have been proposed. This enables us to have a single code to calculate gradients for arbitrary networks, and reduces the initial barrier to using dynamic networks. The second obstacle to dynamic network training relates to the complexities of their error surfaces. We have described some of the mechanisms that cause these complexities – spurious valleys. We have also shown how to modify training algorithms to avoid these spurious valleys. We hope that these new developments will encourage the increased adoption of dynamic neural networks.

REFERENCES

[1] Hagan, M., Demuth, H., De Jesús, O., "An Introduction to the Use of Neural Networks in Control Systems," invited paper, International Journal of Robust and Nonlinear Control, Vol. 12, No. 11 (2002) pp. 959-985.

[2] Roman, J. and Jameel, A., “Backpropagation and recurrent neural networks in financial analysis of multiple stock market returns,” Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences, vol. 2, (1996) pp. 454-460.

[3] Feng, J., Tse, C.K., Lau, F.C.M., “A neural-network-based channel-equalization strategy for chaos-based communication systems,” IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 7 ( 2003) pp. 954-957.

[4] Kamwa, I., Grondin, R., Sood, V.K., Gagnon, C., Nguyen, V. T., Mereb, J., “Recurrent neural networks for phasor detection and adaptive identification in power system control and protection,” IEEE Transactions on Instrumentation and Measurement, vol. 45, no. 2, (1996) pp. 657-664.

[5] Jayadeva and Rahman, S.A., “A neural network with O(N) neurons for ranking N numbers in O(1/N) time,” in IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, (2004) pp. 2044-2051.

[6] Chengyu, G. and Danai, K., “Fault diagnosis of the IFAC Benchmark Problem with a model-based recurrent neural network,” in Proceedings of the 1999 IEEE International Conference on Control Applications, vol. 2, (1999) pp. 1755-1760.

[7] Robinson, A.J., “An application of recurrent nets to phone probability estimation,” in IEEE Transactions on Neural Networks, vol. 5, no. 2 (1994).

[8] Medsker, L.R. and Jain, L.C., Recurrent neural networks: design and applications, Boca Raton, FL: CRC Press (2000).

[9] Gianluca, P., Przybylski, D., Rost, B., Baldi, P., “Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles,” in Proteins: Structure, Function, and Genetics, vol. 47, no. 2 , (2002) pp. 228-235.

[10] Werbos, P. J., “Backpropagation through time: What it is and how to do it,” Proceedings of the IEEE, vol. 78, (1990) pp. 1550–1560.

[11] Williams, R. J. and Zipser, D., “A learning algorithm for continually running fully recurrent neural networks,” Neural Computation, vol. 1, (1989) pp. 270–280.

[12] De Jesús, O., and Hagan, M., “Backpropagation Algorithms for a Broad Class of Dynamic Networks,” IEEE Transactions on Neural Networks, Vol. 18, No. 1 (2007) pp. 14 -27.

[13] De Jesús, O., Training General Dynamic Neural Networks, Doctoral Dissertation, Oklahoma State University, Stillwater OK, (2002).

[14] Narendra, K. S. and Parthasarathy, K., "Identification and control of dynamical systems using neural networks," IEEE Transactions on Neural Networks, Vol. 1, No. 1 (1990) pp. 4-27.

[15] Wan, E. and Beaufays, F., “Diagrammatic Methods for Deriving and Relating Temporal Neural Networks Algorithms,” in Adaptive Processing of Sequences and Data Structures, Lecture Notes in Artificial Intelligence, Gori, M., and Giles, C.L., eds., Springer Verlag (1998).

[16] Dreyfus, G., Idan, Y., “The Canonical Form of Nonlinear Discrete-Time Models,” Neural Computation 10, 133–164 (1998).

[17] Tsoi, A. C., Back, A., “Discrete time recurrent neural network architectures: A unifying review,” Neurocomputing 15 (1997) 183-223.

[18] Personnaz, L. Dreyfus, G., “Comment on ‘Discrete-time recurrent neural network architectures: A unifying review’,” Neurocomputing 20 (1998) 325-331.

[19] Feldkamp, L.A. and Puskorius, G.V., “A signal processing framework based on dynamic neural networks with application to problems in adaptation, filtering, and classification,” Proceedings of the IEEE, vol. 86, no. 11 (1998) pp. 2259 - 2277.

[20] Campolucci, P., Marchegiani, A., Uncini, A., and Piazza, F., “Signal-Flow-Graph Derivation of On-line Gradient Learning Algorithms,” Proceedings of International Conference on Neural Networks ICNN'97 (1997) pp.1884-1889.

[21] Haykin, S. and L. Li, "Nonlinear adaptive prediction of nonstationary signals", IEEE Trans. Signal Process., vol. 43, no. 2, pp.526 -535 1995.

[Figure 13: log sum squared error vs. w2]


Improving VPN Security Performance Based on One-Time Password Technique Using Quantum Keys

Montida Pattaranantakul, Paramin Sangwongngam and Keattisak Sripimanwat
Optical and Quantum Communications Laboratory
National Electronics and Computer Technology Center, Pathumthani, Thailand

[email protected], [email protected], [email protected]

Abstract—Network encryption technology has become an essential factor in organizational security. Virtual Private Network (VPN) encryption is the most popular technique used to prevent unauthorized users from accessing a private network. This technique normally relies on mathematical functions to generate periodic keys. As a result, security performance may decrease and the system may become vulnerable if high-performance computing advances far enough to reverse the mathematical calculation and discover the next secret key pattern. The main contribution of this paper is to improve VPN performance by adopting quantum keys as a seed value for a one-time password technique that encompasses the whole process of authentication, data confidentiality and security key management, in order to protect against eavesdroppers during data transmission over an insecure network.

Keywords: Quantum Keys, One-Time Password, Virtual Private Network

I. INTRODUCTION

Information technologies have evolved rapidly to meet today's communication needs, and the security of data transmission has always been a concern when transferring information from sender to receiver over an internet channel. Addressing network security issues is the main priority in protecting against unauthorized users, since a security technique should also cover data integrity, confidentiality, authorization and, further, non-repudiation services.

A lack of adequate knowledge and understanding of software architecture and security engineering leads to security vulnerabilities: eavesdroppers might gain information by monitoring the transmission for patterns of communication, capture data packets during transmission over the internet, or access information within private data storage, which may lead to data loss and data corruption. This is a critical factor that causes new threats to arise and may force business objectives to change. In the worst case, it will affect organizational stability and business opportunities, and may even become a national security threat. For this reason, many organizations have to pay attention to finding ways to protect their information from eavesdroppers, based on security technology solutions that are agile enough to adapt and combat existing threats arising from security breaches.

Therefore, data reliability and security protection are the primary concerns for information exchange over unprotected network connections: users must be verified so that only authorized users can enter the system and govern resource access, while encryption technology is also required for further data protection.

Presently, several types of cryptography [1] are used to achieve comprehensive data protection based on proven standard technology; cryptography is the most important aspect of network and communication security and provides a basic building block for computer security. End-to-end encryption typically operates at the application layer closest to the end user, so only the data itself is encrypted. For network-layer encryption, IPsec comes into play to cover confidentiality by encapsulating the security payload in both transport mode and tunnel mode; with this type of encryption the entire IP packet, including headers and payload, is encrypted. IPsec encryption based on Virtual Private Network technology [2] presents an alternative approach for network encryption, since it provides a fully trusted collaboration framework for communicating over a private network. Nevertheless, the user authentication mechanism, cryptographic algorithms, key exchange procedure and traffic selector information need to be configured and maintained between the two endpoints in order to establish a trusted VPN tunnel before data transmission begins.

Although the widespread use of classical VPNs can improve data transfer rates with maximum throughput, minimum delay and good protection against bottlenecks, because every communication route is built as the shortest-path connection with independent IPsec to improve elastic traffic performance, the key exchange procedure during VPN setup remains a major source of vulnerability if the secret key is intercepted or the key pattern is broken. In addition, most of the random numbers used as secret keys in cryptographic algorithms are derived from mathematical functions. This manner of key generation is one of the potential security vulnerabilities for data communication once computers become powerful enough to reverse the mathematical calculation and find the secret key value.

A one-time password mechanism [3] using quantum keys as a seed value for a hash function can solve this traditional VPN security problem, in that it can eliminate the spoofing attack in which an eavesdropper successfully masquerades as another party by falsifying data and thereby gains an illegitimate advantage. The main contribution of this paper is to improve VPN security performance by adopting a one-time password technique that generates a fresh symmetric key for each VPN tunnel establishment, using quantum keys as the seed value. Thus, the two endpoints authenticate themselves in a secure manner that relies on confidential protection. Quantum keys are proposed to avoid repeating the same password, since traditional password creation derived from mathematical calculation may lead to system vulnerabilities. Quantum keys bring a strong security enhancement to password generation, because quantum key distribution (QKD) [4] promises to revolutionize secure communication by providing security based on the fundamental laws of physics [5], instead of the current state of mathematical algorithms or computing technology [6].

This paper is organized as follows. Section II gives an overview of the VPN architecture and mechanism on which the design, technical solution and implementation approach are based. Section III presents the design of a VPN security architecture for VPN tunnel establishment, since all information is transferred through the corresponding tunnel subject to authorization control. Section IV compares and analyzes existing VPN security methods and the proposed approach. Finally, concluding remarks and future work are given in Section V.

II. VPN ARCHITECTURE AND MECHANISM

Several network encryption technologies have been used to protect private information from eavesdroppers on insecure networks. Currently, VPN encryption has become an attractive and widely used choice for protection against network security attacks.

VPN encryption normally operates in a client/server fashion: a direct tunnel is established between the source and destination addresses when the virtual private network is built up, and all data packets are then passed through the VPN tunnel. Because VPN technology reduces the network cost that physical leased lines would incur, users can exchange private information with strong data protection and trust. In addition, the VPN architecture encompasses authentication [7], confidentiality and key management functional areas. The authentication service controls users entering the system; only an authorized user may proceed to encrypt a tunnel when the VPN connection starts up. As a result, the authentication header is inserted between the original IP header and the new IP header, as shown in Figure 1. The confidentiality service provides message encryption to prevent eavesdropping by third parties. Finally, the key management service handles a protocol for secure key exchange.

III. DESIGNING A NEW VPN SECURITY ARCHITECTURE

Basically, VPN tunnel encryption can be classified into two main methods: public key encryption and symmetric key encryption. This paper addresses only symmetric key encryption with a one-time password mechanism: a one-time key is used while the VPN connection is up and is destroyed on disconnection. The one-time keys originate from quantum keys used as a seed value for a hash function [8][9]. The mechanism covers both the user authentication process and tunnel establishment in order to protect data integrity. The proposed VPN security architecture is divided into three major modules.

A. User Registration Module

To improve the security of such a connection, the user registration module is required either on first entry or when a password has expired. This module is activated when new users enroll in the system to request a legitimate password. Figure 2 shows the user registration procedure; each individual step is explained below. The result of this phase is the corresponding password for each user, which is then used in the user authentication and negotiation step.

1) New user login / password expired: This case occurs for two reasons: a new user who needs to register with the server asks for a legitimate password, or an existing password has exceeded its lifetime and expired. The registration phase is then activated to generate a new password.

2) Request for the password: The user transfers his or her identity information, including official name, identification or passport number, date of birth, address and so on, to the server in order to request a legitimate password. The corresponding user information is stored in the user rights database for future reference.

Figure 1. The scenario of VPN encryption techniques

3) Generate a unique 10-digit code: A legitimate password is generated from random output of a Quantum Key Distribution (QKD) device.

4) Store username and password: The username and the legitimate password created from the QKD device (the quantum keys) are fed into a hash function. Only the username and the password hash value are stored in the user rights database, so even the server does not know the exact password value.

5) Transfer the legitimate password: The legitimate password is sent back to the corresponding user over a trusted channel to avoid password attacks.

6) Treat the password as confidential information: The legitimate password is used in the authentication phase to verify whether the user is authorized to perform VPN establishment.
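To make the registration flow above concrete, the following minimal Python sketch (our illustration, not the paper's implementation) stores only a hash of the username and the 10-digit password, so the server never keeps the cleartext value; the function names, the use of SHA-256 and the use of `secrets` as a stand-in for the QKD device are all assumptions.

```python
import hashlib
import secrets  # stand-in for the QKD device; a real system would read a quantum RNG

user_rights_db = {}  # username -> hex digest of H(username || password)

def generate_quantum_password() -> str:
    """Placeholder for a 10-digit code drawn from a QKD / quantum RNG device."""
    return "".join(str(secrets.randbelow(10)) for _ in range(10))

def register(username: str) -> str:
    """Create a legitimate password, store only its hash (step 4), and return the
    password so it can be delivered to the user over a trusted channel (step 5)."""
    password = generate_quantum_password()
    digest = hashlib.sha256((username + password).encode()).hexdigest()
    user_rights_db[username] = digest
    return password
```

In the paper's design the 10-digit code would come from the QKD device rather than a software generator; the rest of the sketch only mirrors steps 3-5.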

B. User Authentication Mechanism Module

The user authentication procedure maintains a high level of security through checking at a single site. Before a VPN tunnel is created, users must verify themselves to the server by logging into the system with the username and password obtained during the registration phase. The process, shown in Figure 3, is similar to the S/KEY authentication mechanism [10]. Only an authorized user can proceed to create a secure VPN tunnel. The user authentication procedure is as follows.

1) Logging into the system: After the registration phase has finished, a user who wishes to create a secure VPN tunnel proceeds to the authentication phase. The username and password are submitted to the server so it can determine whether the user is authorized to perform the task.

2) Receive username and password: When the service starts, the server waits for user requests. When it receives an authentication request, the username and password are stored temporarily as the input to the hash function.

3) Password expiration checking: This function examines the password life cycle, because a password that has exceeded its permitted lifetime may weaken security. The expiration check is therefore introduced to guard against password attacks.

4) Alert that the password has expired: The result of the expiration check is sent back to the user to indicate whether the password is still valid. If it is invalid, the user is returned to the registration phase for re-enrollment; otherwise, the procedure continues to the next step.

5) Compute the password hash value: The password hash value is computed from the username and password obtained in the previous step.

6) Compare with the stored password hash value: The calculated hash value is compared with the existing password hash value in the user rights database.

7) Alert that the username and password are invalid: The comparison result is acknowledged back to the user; only an authorized user can continue to VPN tunnel establishment.
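A minimal, self-contained Python sketch of steps 2-7 (our own illustration, not the authors' implementation): the server recomputes the hash of the submitted username and password, checks the password lifetime first, and compares against the stored digest. The 90-day lifetime and the SHA-256 hash are assumed values, since the paper does not specify them.

```python
import hashlib
import time

PASSWORD_LIFETIME = 90 * 24 * 3600   # assumed lifetime in seconds; the paper gives no value

# username -> (stored hash of username||password, time the password was issued)
user_rights_db = {
    "alice": (hashlib.sha256(b"alice" + b"0123456789").hexdigest(), time.time()),
}

def authenticate(username, password, now=None):
    """Return 'ok', 'expired' or 'invalid', mirroring steps 2-7 above."""
    now = time.time() if now is None else now
    record = user_rights_db.get(username)
    if record is None:
        return "invalid"
    stored_digest, issued_at = record
    if now - issued_at > PASSWORD_LIFETIME:      # steps 3-4: expiration check
        return "expired"                         # the user must re-register
    digest = hashlib.sha256((username + password).encode()).hexdigest()  # step 5
    return "ok" if digest == stored_digest else "invalid"                # steps 6-7
```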

C. VPN Tunnel Establishment Based On One-Time Password Mechanism Module

The proposed technique combines two features: a one-time password mechanism and quantum key exchange. In the one-time password mechanism, each password is used only once and is refreshed whenever a new connection is established, eliminating attacks such as replay, spoofing or birthday attacks. In addition, a one-time password mechanism based on a hash chain has an elegant design and attractive properties that help achieve high security.

Figure 2. User registration phase
Figure 3. User authentication phase


Applying the quantum keys as a seed value for the hash function further improves the efficiency and security of the system. Quantum technology uses the polarization property of photons to protect the transmitted keys: the keys cannot be intercepted by eavesdroppers without raising the key error rate above a certain threshold value.

Moreover, VPN tunnel establishment based on the one-time password mechanism, using quantum keys as the seed input to the hash function, is illustrated in Figure 4. A hash chain value, called the response value, is computed from the password, the quantum seed and the sequence number in order to establish a highly secure VPN tunnel. The process starts from the Nth element of the hash chain, where N is given by the sequence number; this identifies the current response value to be used, which is finally destroyed after the VPN tunnel is disconnected. The procedure for creating a highly secure VPN tunnel is as follows.

1) Manually copy the quantum key and sequence number: When the VPN tunnel is first set up, the quantum key seed (QKS) is generated from a Quantum Random Number Generator (QRNG) [11], and the sequence number (SN) indicates the hashing order. Both values are distributed to the user manually to prevent them from being attacked, and they form the input of the hash function. When a VPN tunnel is re-established later, this step is not repeated until the sequence number reaches zero.

2) Generate the response value at the user site: The user combines the legitimate password acquired in the registration phase with the quantum key seed and the sequence number allocated by the server as the inputs of the hash function. An intermediate response value is generated when this processing finishes.

3) Transfer the response value: The response value from the user site is transmitted to the server for identity comparison.

4) Generate the response value at the server site: The server also generates its own response value from the relevant user information stored in the user rights database.

5) Compare the two response values: The two response values are compared with each other; if they match, the value is used as the symmetric key for VPN tunnel establishment. This procedure is attractive because it offers strong protection without performing a key exchange over an insecure network.

6) Establish the VPN tunnel at the server site: If the two response values match, tunnel establishment proceeds. To create its side of the secure connection, the server assigns the user a virtual IP address from the local virtual subnet.

7) Establish the VPN tunnel at the user site: The procedure for setting up the tunnel at the user site is similar to that at the server site, via the external network interface.

8) Site-to-site VPN tunnel: The user and the server use virtual network interfaces to maintain an encrypted virtual tunnel. All information, including the actual user data and the ultimate source and destination addresses, is carried as a payload with an authentication header. Finally, the virtual IP address is inserted into the packet before transmission.
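The following Python sketch condenses steps 2-5 into code (an illustration under assumptions, not the authors' implementation): the response value is an element of a hash chain seeded with the password and the quantum seed, the client and server compute it independently, and a match yields the symmetric tunnel key. SHA-256 and the exact chaining rule are our assumptions, since the paper does not fix them.

```python
import hashlib
import hmac

def response_value(password: str, quantum_seed: bytes, sequence_number: int) -> bytes:
    """Hash-chain response value: start from H(password || seed) and hash it
    sequence_number times, so each tunnel establishment consumes one chain element."""
    value = hashlib.sha256(password.encode() + quantum_seed).digest()
    for _ in range(sequence_number):
        value = hashlib.sha256(value).digest()
    return value

# Client side (step 2): compute and send the response value.
client_resp = response_value("alice-password", b"\x07QKS-from-QRNG", sequence_number=42)

# Server side (steps 4-5): recompute from stored user information and compare in
# constant time; on a match the value would be used as the symmetric tunnel key.
server_resp = response_value("alice-password", b"\x07QKS-from-QRNG", sequence_number=42)
tunnel_key = client_resp if hmac.compare_digest(client_resp, server_resp) else None
```

Decrementing the sequence number on each successful establishment would give the S/KEY-like "use once, then move down the chain" behavior described above.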

IV. COMPARISON AND ANALYSIS

This section compares and analyzes the qualities of VPN connections using different encryption techniques: symmetric key encryption, public key encryption and the proposed system. The proposed system is distinguished in that it adopts classical encryption techniques but adds the one-time password mechanism and quantum keys to improve security.

A. Key Pattern Generator and Its Properties

Figure 4. VPN tunnel establishment phase

Most highly secure communications depend on two key factors: the quality of the random numbers used as cryptographic keys and the complexity of the encryption algorithms. An algorithm produces different output depending on the specific key in use at the time. Key generators can be built in two different ways. Pseudo-random number generators are algorithms based on mathematical functions, or simply on precalculated tables, that produce sequences which only appear random and follow a periodic pattern; this can weaken key security, because it is feasible to predict the next key value from the existing pattern. As a result, using pseudo-random numbers to produce keys for cryptographic systems carries risks. The proposed technique instead applies quantum keys to both password generation and VPN tunnel encryption. Quantum keys come from a true random number generator based on quantum physics: subatomic particles behave randomly in circumstances that make it very difficult to determine the key value. Key generation with an aperiodic pattern and a random distribution scheme increases the quality of the keys and thereby enhances data security.
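As a small illustration of the contrast drawn here (not code from the paper): a pseudo-random generator is fully determined by its seed, so anyone who recovers that state can regenerate the key, whereas keys drawn from an entropy source such as a quantum RNG cannot be replayed from a formula.

```python
import os
import random

def prng_key(seed: int, nbytes: int = 16) -> bytes:
    """Key material from a deterministic PRNG: knowing the seed (or recovering
    the internal state from enough output) reproduces the key exactly."""
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(nbytes))

assert prng_key(1234) == prng_key(1234)   # fully determined by the seed

# By contrast, an entropy-based source (os.urandom here, a quantum RNG in the
# paper's setting) has no seed to replay, so outputs cannot be regenerated.
key_a, key_b = os.urandom(16), os.urandom(16)
assert key_a != key_b                      # equal only with negligible probability
```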

B. Security Key Protection and Performance

One of the best characteristics of QKD technology is that it offers a promising, essentially unbreakable way to secure communications. Any eavesdropper attempting to intercept the quantum keys during key exchange is detectable, because the attempt introduces an abnormal Quantum Bit Error Rate (QBER). The error rate can exceed a certain threshold because of unavoidable disturbances, including imperfect system configuration, noise on the quantum channel, or an eavesdropper, as secret keys are generated over time. Hence, exploiting quantum mechanics offers highly secure communications.
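A minimal sketch of the QBER check described above (our illustration, not the authors' code): the two parties disclose and compare a random sample of sifted key bits, estimate the error rate, and abort when it exceeds a chosen threshold; the 11% threshold is only an assumed example value.

```python
def qber(sample_alice, sample_bob):
    """Fraction of the disclosed sample bits on which the two parties disagree."""
    errors = sum(a != b for a, b in zip(sample_alice, sample_bob))
    return errors / len(sample_alice)

def accept_key(sample_alice, sample_bob, threshold=0.11):
    """Abort key generation when the estimated QBER exceeds the threshold,
    which signals channel noise or a possible eavesdropper."""
    return qber(sample_alice, sample_bob) <= threshold

# Example: identical samples pass; a heavily disturbed sample is rejected.
print(accept_key([0, 1, 1, 0, 1], [0, 1, 1, 0, 1]))   # True
print(accept_key([0, 1, 1, 0, 1], [1, 0, 1, 1, 0]))   # False
```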

C. Key Exchange Protocol

In general, QKD describes the process of using quantum communication to establish a shared secret key between two parties, which resembles a secret key exchange architecture. The proposed technique takes the quantum keys acquired from the QKD system and distributes them securely to each corresponding user. When the server receives a password request, a partial key is dedicated to that user for later identification, and some of the quantum keys are assigned as the seed value for the hash function that generates the particular secret key used to establish a VPN tunnel for secure data communication.

D. Mechanism Used

VPN encryption technology provides client/server confidentiality. The proposed technique focuses on establishing a highly secure VPN tunnel by applying a one-time password mechanism that combines the quantum keys, the sequence number and the user's secret password to produce the response value used as a specific symmetric key. This symmetric key is used once, at a particular time, and destroyed after the VPN tunnel is disconnected. In addition, a new symmetric key is produced periodically through the one-way hash function, which enhances data confidentiality and network security.

V. CONCLUSIONS AND FUTURE WORK

Improving VPN security using a one-time password technique with quantum keys provides a new mechanism to protect against data snooping by eavesdroppers when data is transmitted over an insecure network. The proposed technique consists of three main procedures, the user registration stage, the user authentication stage and the VPN tunnel establishment stage, designed to address various vulnerabilities and attacks. The user registration stage performs key generation to create a secret password for the corresponding user for later authentication. This secret password is a truly random number based on QKD technology, whose key exchange method protects against eavesdroppers, rather than a pseudo-random number derived from mathematical functions, which could be guessed and brute-forced. The user authentication stage provides legitimate users with transparent authentication while managing and monitoring access to private resources. Moreover, password life-cycle management and hash functions are applied to address security vulnerabilities. Finally, the VPN tunnel establishment stage based on the one-time password mechanism provides an attractive approach to building a highly secure virtual private network in which each key is used only once and then destroyed. The proposed mechanism is part of a project on a high-efficiency key management methodology for advanced communication services (a pilot study for a video conferencing system); it covers the user authentication and VPN establishment phases, preventing unauthorized access to the restricted network and its resources before the quantum keys are distributed for secure video conferencing and other data communication services, since data protection and network security are top priorities for IT organizations.

TABLE I. PERFORMANCE COMPARISON OF DIFFERENT ENCRYPTION TECHNIQUES (PROPERTIES OF VPN CONNECTIONS)

Key Pattern Generator
- Symmetric key encryption: periodic pattern based on mathematical functions
- Public key encryption: periodic pattern based on mathematical functions
- Proposed mechanism: random pattern based on quantum phenomena

Key Properties
- Symmetric key encryption: pseudo-random numbers based on mathematical functions
- Public key encryption: pseudo-random numbers based on mathematical functions
- Proposed mechanism: true random numbers based on the laws of quantum physics

Security Key Protection and Performance
- Symmetric key encryption: no key protection mechanism provided
- Public key encryption: no key protection mechanism provided
- Proposed mechanism: Quantum Bit Error Rate (QBER) ratio

Key Exchange Protocol
- Symmetric key encryption: secret key exchange over one classical communication link
- Public key encryption: public key exchange protocol over one classical communication link
- Proposed mechanism: secret key exchange over two links (quantum channel and classical channel)

Mechanism Used
- Symmetric key encryption: secret key encryption
- Public key encryption: public and private key encryption
- Proposed mechanism: secret key encryption based on a one-time password mechanism using quantum keys

ACKNOWLEDGMENT

The authors would like to thank Dr. Weetit Wanalertlak for invaluable feedback and technical support in bringing this research paper to completion. The authors also thank the NECTEC steering committee for research funding and for the valuable opportunity for the team to introduce new approaches to data protection and to improve data reliability over insecure networks. Finally, the authors thank Mr. Sakdinan Jantarachote and all staff of the Optical and Quantum Communications Laboratory (OQC) for their kind support and encouragement.

REFERENCES

[1] William Stallings, "Cryptography and Network Security: Principles and Practices", Fourth Edition, November 2005.

[2] Kazunori Ishimura, Toshihiko Tamura, Shiro Mizuno, Haruki Sato and Tomoharu Motono, "Dynamic IP-VPN architecture with secure IP tunnels", Information and Telecommunication Technologies, June 2010, pp. 1-5.

[3] Young Sil Lee, Hyo Taek Lim and Hoon Jae Lee, "A Study on Efficient OTP Generation using Stream with Random Digit", International Conference on Advanced Communication Technology 2010, volume 2, pp. 1670-1675.

[4] W. Heisenberg, "Über den anschaulichen Inhalt der quantentheoretischen Kinematik und Mechanik", Zeitschrift für Physik, 43, 1927, pp. 172-198.

[5] W.K. Wootters and W.H. Zurek, "A Single Quantum Cannot be Cloned", Nature, 299, pp. 802-803, 1982.

[6] Erica Klarreich, "Quantum Cryptography: Can You Keep a Secret?", Nature, 418, pp. 270-272, July 18, 2002.

[7] Hyun Chul Kim, Hong Woo Lee, Kyung Seok Lee and Moon Seong Jun, "A Design of One-Time Password Mechanism using Public Key Infrastructure", Fourth International Conference on Network Computing and Advanced Information Management, September 2008, pp. 18-24.

[8] Harshvardhan Tiwari, "Cryptographic Hash Function: An Elevated View", European Journal of Scientific Research, ISSN 1450-216X, Vol. 43, No. 4 (2010), pp. 452-465.

[9] Peiyue Li, Yongxin Sui and Huaijiang Yang, "The Parallel Computation in One-Way Hash Function Designing", International Conference on Computer, Mechatronics, Control and Electronic Engineering, August 2010, pp. 189-192.

[10] C.J. Mitchell and L. Chen, "Comments on the S/KEY user authentication scheme", ACM Operating Systems Review, Vol. 30, No. 4, October 1996, pp. 12-16.

[11] ID Quantique White Paper, "Random Number Generation Using Quantum Physics", Version 3.0, April 2010.


Experimental Results on the Reloading Wave Mechanism for Randomized Token Circulation

Boukary Ouedraogo
PRiSM - CARO, UVSQ
45, avenue des Etats-Unis
F-78035 Versailles Cedex, France
Email: [email protected]

Thibault Bernard
CRESTIC - Syscom, URCA
Moulin de la Housse BP-1039
F-51687, Reims cedex 2, France
Email: [email protected]

Alain Bui
PRiSM - CARO, UVSQ
45, avenue des Etats-Unis
F-78035 Versailles Cedex, France
Email: [email protected]

Abstract—In this paper, we evaluate experimentally the gain of a distributed mechanism called the reloading wave to accelerate the recovery of a randomised token circulation algorithm. Experimentation is realised in different contexts: static networks and dynamic networks. The impact of different parameters such as connectivity or frequency of failures is investigated.

I. INTRODUCTION

Concurrency control is one of the most important requirements in distributed systems. The emergence of wireless mobile networks has renewed the challenge of designing concurrency control solutions. These networks require new modeling and new solutions to take into account their intrinsic dynamicity. In [Ray91], the author classifies concurrency control into two types: quorum-based solutions and token-circulation-based solutions.

Numerous papers deal with token-circulation-based solutions because they are easier to implement: a single circulating token represents the privilege to access the shared resource (unicity of the token guarantees safety, and perpetual circulation among all nodes guarantees liveness). In the context of dynamic networks, random-walk-based solutions have been designed (see [Coo11]).

Properties of random walks allow the design of a traversal scheme using only local information [AKL+79]: such a scheme is not designed for one particular topology and needs no adaptation to other ones. Moreover, random walks offer the interesting property of adapting to the insertion or deletion of nodes or links in the network without modifying any of the functioning rules. With the increasing dynamicity of networks, these features are becoming crucial: redesigning a new browsing scheme at each modification of the topology is impossible.

An important result of this paradigm is that the token will eventually visit (with probability 1) all the nodes of a system. However, it is impossible to capture an upper bound on the time required to visit all the nodes of the system. Only average quantities for the cover time, defined as the average time to visit all the nodes, are available.
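Since only average cover-time figures are available, they are usually estimated by simulation. The short Python sketch below (our own illustration, unrelated to the authors' DASOR code) measures the cover time of a random-walk token on a small hypothetical topology and averages it over many runs.

```python
import random

def cover_time(adjacency, start=0):
    """Number of steps a single random-walk token needs to visit every node once."""
    visited = {start}
    node, steps = start, 0
    while len(visited) < len(adjacency):
        node = random.choice(adjacency[node])  # uniform random move to a neighbour
        visited.add(node)
        steps += 1
    return steps

# Average cover time over many trials on a small ring of 6 nodes (hypothetical topology).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
avg = sum(cover_time(ring) for _ in range(1000)) / 1000
print(f"estimated average cover time: {avg:.1f} steps")
```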

The token circulation can suffer different kinds of failures: in particular, (i) situations with no token and (ii) situations with multiple tokens may occur. Both of them have to be managed to guarantee the liveness and safety properties of concurrency control solutions.

The concept of self-stabilization introduced in [Dij74] is the most general technique to design a system that tolerates arbitrary transient faults. A self-stabilizing system is guaranteed to converge to a legitimate state in finite time no matter what initial state it may start from. This makes a self-stabilizing system able to recover from transient faults automatically, without any intervention.

To design self-stabilizing token circulation, numerous authors build and maintain spanning structures like trees or rings (cf. [CW05], [HV01]) and use the "counter flushing" mechanism ([Var00]) to guarantee the presence of a single token. In the case of a random-walk-based token circulation, counter flushing cannot be used. In [DSW06], the authors use randomly circulating tokens (which they call agents) to broadcast information in a communication group. To cope with the situation where no agent exists in the system, the authors use a timer based on the cover time of an agent (k × n³). They state as a concluding remark: "The requirements will hold with higher probability if we enlarge the parameter k for ensuring the cover time [. . . ]". In the case of a concurrency control mechanism, the obtention of a single token is a strong requirement, and the use of a parameter k that merely increases the probability of reaching a legitimate configuration cannot be used.

We have introduced in [BBS11] the reloading wave mechanism. This mechanism ensures the obtention of a single token and thus the safety property of the concurrency control solution.

We propose in this paper an experimental evaluation of this mechanism under different parameters: timeout initialization, connectivity of the network, dynamicity of the network and failure frequency.

In order to test or validate a solution, the authors of [GJQ09] proposed four classes of methodologies: (i) in-situ, where one executes a real application (program, set of services, communications, etc.) on a real environment (set of machines, OS, middleware, etc.), (ii) emulation, where one executes a real application on a model of the environment, (iii) benchmarking, where one executes a model of an application on a real environment and (iv) simulation, where one executes a model of an application on a model of the environment. To each of these methodologies corresponds a class of tools: real-scale environments, emulators, benchmarks and simulators.


In this paper, we adopt the simulation class of methodologies and use simulators, because simulation allows us to perform highly reproducible experiments over a large set of platforms and experimental conditions. Simulation tools support the creation of repeatable and controllable environments for feasibility studies and performance evaluation [GJQ09], [SYB04].

Simulation tools for parallel and distributed systems can be classified into three main categories:
(i) Network simulation tools. Network Simulator NS-2 is a simulator that supports several levels of abstraction to simulate a wide range of network protocols via numerous simulation interfaces. It simulates network protocols over wired and wireless networks. SimJava [HM98] provides a core set of foundation classes for simulating discrete events. It simulates distributed hardware systems, communication protocols and computer architectures.
(ii) Simulation tools for grids. The most common tools include: GridSim [BM02], which supports simulation of space-based and time-based, large-scale resources in the Grid environment; SimGrid [CLA+08], which simulates a single or multiple scheduling entities and timeshared systems operating in a Grid computing environment, and simulates distributed Grid applications for resource scheduling; and Dasor [Rab09], a C++ library for discrete event simulation of distributed algorithms (management of networks (with topologies), failure models, mobility models, communication models, structures (trees, matrices, etc.), . . .), based on a multi-layer model (Application, Grid Middleware, Network).
(iii) Simulation tools for peer-to-peer networks. PeerSim [MJ09] supports extreme scalability and dynamicity. It is composed of two simulation engines, a simplified (cycle-based) one and an event-driven one.

In [GJQ09], the comparison made between different simulators for networking and large-scale distributed systems shows that any such tool provides very high control of the experimental conditions (only limited by tool scalability) and perfect reproducibility by design. The main differences between the tools are (i) the abstraction level (moderate for network simulators, high for grid ones and very high for P2P ones) and (ii) the achieved scale, which also varies greatly from tool to tool.

II. THE RELOADING WAVE MECHANISM

The reloading wave mechanism has been fully described in [BBS11]. It has been designed to satisfy the specification of a single token circulation in the presence of faults:
(i) there is exactly one token in the system,
(ii) each component of the system is infinitely often visited by the token.

The random-walk token moving scheme ensures that the second part (ii) of the specification is verified (as long as no adversary plays with the mobility of the components against the random moves of the token). Starting from an arbitrary configuration, the first part of the specification can be endangered by the following situations:

• absence of token,
• multiple tokens.

To manage the absence of token, each node sets up a timeout mechanism. Upon a timeout triggering, a node creates a new token, and the absence-of-token situation no longer occurs. The multiple-token situation is managed as in [IJ90]: when several tokens meet on a node, they are merged into one by a merging mechanism. Unfortunately, the combination of the two mechanisms does not guarantee the presence of exactly one token: if a subset of nodes is not visited by the token for a sufficiently long period, token creations can still occur even if a token already exists. The goal of the reloading wave mechanism is to prevent these unnecessary creations of tokens.

This prevention is realized by the token itself: it periodically propagates information meaning that it is still alive. The reloading wave uses several tools for its operation:

• A timeout mechanism: all nodes in the network run a timeout procedure, a timer whose value decrements at each clock tick. At the expiration of a node's timer, the corresponding node creates a new token and sends it to one of its neighbors following the random walk moving scheme. Note that several tokens can then circulate in the network.
• An adaptive spanning structure of the network topology that is stored in the token. The spanning structure is stored as a circulating word that represents a spanning tree. This tree is used to propagate the reloading wave. Every node that receives a reloading wave message resets its local timer and then propagates the reloading wave message to all its sons according to the spanning tree maintained in the word of the token.
• A hop counter that is stored in the structure of the token. Initialized to zero when the token is created, this hop counter is incremented at every step of the random walk. It is reset to zero each time the node that owns the token triggers a reloading wave propagation.

The different phases of the reloading wave mechanism are the following:

1) Phase of reloading wave triggering
• At the reception of a token on a node, the word content and the hop counter of the token are updated.
• The reloading wave mechanism begins as soon as a node, at the reception of a token, is aware that the triggering condition is satisfied. The triggering condition is: the received token's hop counter (NbHop) is equal to the difference between the initialization value of the timer (Tmax) and the network size (N). In other words, the reloading wave is triggered at each interval of (Tmax − N) steps of the token random walk. During this phase, the hop counter of the token is reset to zero.
• Reloading wave messages are created by nodes at the initiative of a token, more precisely of its hop counter (NbHop).
• Several reloading waves can be created (simultaneously or not) and propagated through the tree maintained in the token word.

2) Phase of reloading wave propagation
The propagation of the wave takes place along an adaptive tree contained in the word of the circulating token. Every node that receives a reloading wave message resets its local timeout and then propagates the wave to all its sons according to the adaptive tree maintained in the token word.

3) Phase of reloading wave termination
The reloading wave mechanism terminates when the reloading waves have reached all nodes of the virtual tree maintained in the token word or when transient faults obstruct its diffusion.
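The sketch below (our own condensed Python illustration, not the authors' DASOR implementation) shows how the timeout rule and the phase-1 triggering condition fit together: a node whose timer expires creates a token, and a token that has walked Tmax − N hops resets its hop counter and reloads the timers along the spanning tree it carries. Tmax, N and the data structures are assumed placeholders.

```python
# Single-process sketch of the timeout and triggering rule described above.

TMAX = 5 * 300   # timeout initialization, e.g. 5n for n = 300 nodes (assumed)
N = 300          # network size (assumed)

class Node:
    def __init__(self):
        self.timer = TMAX

    def tick(self):
        """Local clock tick: create a new token when the timer expires."""
        self.timer -= 1
        if self.timer <= 0:
            self.timer = TMAX
            return Token()          # absence-of-token recovery
        return None

    def reload(self):
        """Receiving a reloading-wave message resets the local timer."""
        self.timer = TMAX

class Token:
    def __init__(self):
        self.hops = 0               # hop counter carried in the token

    def step(self, holder, tree_children):
        """One random-walk step; trigger the reloading wave every Tmax - N hops."""
        self.hops += 1
        if self.hops >= TMAX - N:   # phase-1 triggering condition
            self.hops = 0
            holder.reload()
            for child in tree_children:   # propagate along the spanning tree (phase 2)
                child.reload()
```

In the real algorithm the reload calls are messages propagated along the tree stored in the token word, and the wave simply stops on branches broken by transient faults (phase 3).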

The complete implementation of the mechanism can be found in [BBS11]. In the next section, we experiment with the reloading wave mechanism to evaluate its relevance.

III. EXPERIMENTAL RESULTS

Our simulation model is written in C++ using DASOR [Rab09], a C++ library for discrete event simulation of distributed algorithms. The DASOR library provides many useful structures and tools that make it easy to write simulators.

We investigate experimental results of the reloading wave mechanism in three contexts: a static network (no node connection/disconnection, no failure), a dynamic network (node connection/disconnection, no failure) and a network subject to failure (node connection/disconnection, token creation/deletion).

A. Experimental protocol

For each parameter investigated, we measure the time elapsed in a satisfying configuration for two solutions:

1) A solution where a token circulates according to a random walk scheme. A timeout is initialized on each node to eventually create new tokens when it expires. The merger mechanism is triggered when several tokens are present on the same node.

2) The same protocol, but with the addition of the reloading wave mechanism as described in the previous section.

A satisfying configuration is a configuration where exactly one token is present in the system. For each set of parameters, we present the result as the difference between the time elapsed in a satisfying configuration with solution 2 and the time elapsed in a satisfying configuration with solution 1.

We evaluate the impact of several important parameters:

• Size of the network
• Timeout initialization
• Mobility range of the nodes
• Frequency of failures

B. Experimentation

Each experiment is repeated 100 times; all the results reported are the mean over all the runs. The standard deviation has been computed but is negligible.

1) Static networks, impact of size and of the timeout initialization: We set up the timeout values as a function of the size of the network (n). We take 2n, 3n, 4n and 5n as timeout initializations. Intuitively, the greater the timeout, the smaller the difference between the solutions with and without the reloading wave, since token creation occurs on a timeout triggering (an unnecessary token creation compromises the satisfying configuration). On the other hand, the greater the size, the better the solution with the reloading wave works, since the mechanism avoids all unnecessary token creations. The solution without the reloading wave mechanism has to ensure the visit of all nodes during a timeout period to avoid token creation, and the greater the size of the network, the more difficult it is to visit all nodes with a random moving policy.

The results are given in Table I in the form T1 − T2 = ∆, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and ∆ the difference.

Timeout initialization T = f(n)
Size n    2n            3n            4n            5n
50        99-22= 77%    99-46= 53%    99-67= 32%    99-82= 17%
100       99-16= 83%    99-38= 60%    99-58= 41%    99-74= 25%
200       99-11= 88%    99-32= 67%    99-50= 49%    99-66= 33%
300       99-10= 89%    99-27= 72%    99-46= 53%    99-62= 37%

Table I
DIFFERENCE BETWEEN THE SOLUTION WITH RELOADING WAVE AND THE SOLUTION WITHOUT RELOADING WAVE FOR STATIC NETWORKS

Our intuition is verified: the reloading wave avoids all unnecessary token creations (the system is in a satisfying configuration 99% of the time; the remaining 1% corresponds to the initialization phase, where there is not enough collected data to propagate the reloading wave). The network size decreases the performance of the solution without the reloading wave, and a larger timeout improves its performance.

2) Dynamic networks, impact of the dynamicity and failures: A dynamic network is subject to topological reconfigurations and failures; we investigate the impact of these two parameters on the behavior of the reloading wave mechanism. The two solutions (with and without reloading wave) have been experimented on a random graph of 300 nodes with a density of 60% (i.e. a link between two nodes has probability 0.6 to exist at the initialization of the network).

We have used a mobility pattern where: (i) the movements of nodes are independent of each other, (ii) at any time there is a fixed number of randomly chosen nodes that are disconnected, (iii) the duration of the disconnection is set arbitrarily to 1 time unit. This model can be assimilated to the random walk mobility model (cf. [CBD02]).

We set this parameter to get:
• A low mobility pattern: at a given time, 1% of nodes are disconnected. This value is reasonable to evaluate the performance of the algorithm in the conditions of a slowly moving network.
• An average mobility pattern: at a given time, 5% of nodes are disconnected. This value is reasonable to evaluate the performance of the algorithm in the conditions of a medium-speed moving network.
• A high mobility pattern: at a given time, 10% of nodes are disconnected. This value is reasonable to evaluate the performance of the algorithm in the conditions of a fast moving network.

In the same way, a failure model has been applied: all token messages have the same probability p of failing at every time interval t. We set p to 0.05%, since this seems a realistic value for message loss in a network, and t to:

• A low failure pattern: every 1000 turns, each token has a probability of 0.05% of being lost.
• An average failure pattern: every 100 turns, each token has a probability of 0.05% of being lost.
• A high failure pattern: every 10 turns, each token has a probability of 0.05% of being lost.

Results are given in Table II in the form T1 − T2 = ∆, where T1 is the percentage of time elapsed in a satisfying configuration with the reloading wave solution, T2 the percentage without the reloading wave mechanism, and ∆ the difference.

                          Token loss frequency
Mob. freq.   None          Low           Average       High
None         99-10= 89%    9-12= -3%     0-0= 0%       0-0= 0%
Low          34-31= 3%     30-29= 1%     26-26= 0%     25-25= 0%
Average      34-31= 3%     29-29= 0%     26-26= 0%     25-25= 0%
High         34-31= 3%     30-29= 1%     26-26= 0%     25-25= 0%

Table II
DIFFERENCE BETWEEN THE SOLUTION WITH RELOADING WAVE AND THE SOLUTION WITHOUT RELOADING WAVE FOR DYNAMIC NETWORKS

Frequent token loss greatly decreases the performance of both solutions (for the given token loss frequency parameter, between 30% and 25% of satisfying configurations). The gain of the reloading wave observed in the static context becomes marginal in the context where tokens can be lost (less than 1% for low, average and high token loss frequency). This is not surprising: the reloading wave mechanism relies on the persistence of the token. As soon as the token can be lost, the spanning tree stored inside the token cannot be built, and then several nodes can create new tokens.

The impact of mobility is not the same. Of course, a too-frequent mobility pattern strongly decreases the performance of both solutions, but the reloading wave solution keeps a marginal gain over the solution without reloading wave when there is no token loss (about 3%). We think the reloading wave could be used for networks with a very slow mobility pattern: if the frequency of node moves is low, the spanning tree stored inside the token has enough time to be updated, and the reloading wave mechanism could work correctly.

IV. CONCLUSION

In this paper, we have investigated experimental results of the reloading wave, a mechanism that avoids unnecessary token creations, in static networks, dynamic networks and networks subject to failure.

In a static environment, the reloading wave works almost perfectly (about 99% of satisfying configurations; the remaining 1% most likely corresponds to the initialization of the spanning structure on which the reloading wave is broadcast). The difference between the two solutions (with/without reloading wave) decreases as the timeout value increases and grows with the size of the network.

The mobility of nodes has an impact on the functioning of the reloading wave: node mobility can break the spanning structure used to broadcast the reloading wave. In [BBS11] we exhibit a mobility pattern for which the reloading wave works correctly. In our experimentation this mobility pattern has not been implemented. We think the mobility used in the experimentation is too strong to fit the criterion of the mobility pattern for which the reloading wave works. A new set of experiments on the mobility pattern is under investigation.

The occurrence of failures also has an impact on the reloading wave mechanism. As the mechanism is initiated by the token, frequent token loss strongly decreases the performance of the reloading wave (about 25% of satisfying configurations with the given parameters). In most token circulation algorithms, token loss is considered an improbable event, and recovery has to be managed carefully. Our solution is no exception to the rule: a recovery takes a long time (according to the timeout value) elapsed in a set of non-satisfying configurations.

REFERENCES

[AKL+79] R. Aleliunas, R. Karp, R. Lipton, L. Lovasz, and C. Rackoff. Random walks, universal traversal sequences and the complexity of maze problems. In 20th Annual Symposium on Foundations of Computer Science, pages 218-223, 1979.

[BBS11] Thibault Bernard, Alain Bui, and Devan Sohier. Universal adaptive self-stabilizing traversal scheme: random walk and reloading wave. CoRR, abs/1109.3561, 2011.

[BM02] Rajkumar Buyya and Manzur Murshed. GridSim: A toolkit for the modeling and simulation of distributed resource management and scheduling for grid computing. Concurrency and Computation: Practice and Experience (CCPE), 14(13-15):1175-1220, December 2002.

[CBD02] T. Camp, J. Boleng, and V. Davies. A survey of mobility models for ad hoc network research. Wireless Communications and Mobile Computing (WCMC): Special issue on Mobile Ad Hoc Networking: Research, Trends and Applications, 2(5):483-502, 2002.

[CLA+08] Henri Casanova, Arnaud Legrand, and Martin Quinson. SimGrid: a Generic Framework for Large-Scale Distributed Experiments. In 10th IEEE International Conference on Computer Modeling and Simulation, March 2008.

[Coo11] Colin Cooper. Random walks, interacting particles, dynamic networks: Randomness can be helpful. In 18th International Colloquium on Structural Information and Communication Complexity, Gdansk, Poland, June 2011, volume 6796 of Lecture Notes in Computer Science, pages 1-14. Springer, 2011.

[CW05] Yu Chen and Jennifer L. Welch. Self-stabilizing dynamic mutual exclusion for mobile ad hoc networks. J. Parallel Distrib. Comput., 65(9):1072-1089, 2005.

[Dij74] Edsger W. Dijkstra. Self-stabilizing systems in spite of distributed control. Commun. ACM, 17(11):643-644, 1974.

[DSW06] S. Dolev, E. Schiller, and J. L. Welch. Random walk for self-stabilizing group communication in ad hoc networks. IEEE Trans. Mob. Comput., 5(7):893-905, 2006.

[GJQ09] Jens Gustedt, Emmanuel Jeannot, and Martin Quinson. Experimental methodologies for large-scale systems: a survey. Parallel Processing Letters, 19(3):399-418, 2009.

[HM98] Fred Howell and Ross McNab. SimJava: A discrete event simulation package for Java with applications in computer systems modelling. In Proceedings of the First International Conference on Web-based Modelling and Simulation, San Diego, CA, January 1998. Society for Computer Simulation.

[HV01] Rachid Hadid and Vincent Villain. A new efficient tool for the design of self-stabilizing l-exclusion algorithms: The controller. In Ajoy Kumar Datta and Ted Herman, editors, WSS, volume 2194 of Lecture Notes in Computer Science, pages 136-151. Springer, 2001.

[IJ90] Amos Israeli and Marc Jalfon. Token management schemes and random walks yield self-stabilizing mutual exclusion. In PODC, ACM, pages 119-131, 1990.

[MJ09] Alberto Montresor and Mark Jelasity. PeerSim: A scalable P2P simulator. In Proc. of the 9th Int. Conference on Peer-to-Peer (P2P'09), pages 99-100, Seattle, WA, September 2009.

[Rab09] C. Rabat. Dasor, a Discrete Events Simulation Library for Grid and Peer-to-peer Simulators. Studia Informatica Universalis, 7(1), 2009.

[Ray91] Michel Raynal. A simple taxonomy for distributed mutual exclusion algorithms. Operating Systems Review, 25(2):47-50, 1991.

[SYB04] Anthony Sulistio, Chee Shin Yeo, and Rajkumar Buyya. A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools. Softw. Pract. Exper., 34:653-673, June 2004.

[Var00] George Varghese. Self-stabilization by counter flushing. SIAM J. Comput., 30(2):486-510, 2000.


Statistical-based Car Following Model for Realistic Simulation of Wireless Vehicular Networks

Kitipong Tansriwong and Phongsak Keeratiwintakorn

Department of Electrical Engineering, King Mongkut’s University of Technology North Bangkok,

Bangkok, THAILAND
[email protected] and [email protected]

Abstract— At present, research on mobile and wireless vehicular networks focuses on the communication technologies. However, the behavioral study of vehicular networks is also important, and it can be costly. Therefore, simulation software has been developed and used for research on vehicle movement and the variability of network performance, and the realism of such simulation software is an ongoing research topic. The major problem of the simulation software is that the mobility pattern or model of the vehicles under study is unrealistic, because of the complexity introduced by the variation in driver behavior, and a realistic model can be so costly that it becomes impractical for the study. In this research, we propose a realistic mobility model that integrates statistical analysis with the car following model. By using real collected data to create a mobility model based on a probability distribution, integrated with the well-known car following model, vehicular network simulation studies can become more realistic. The results of this study show the opportunity to combine the proposed model with a network simulator such as NCTU-ns or the ns-2 simulator.

Keywords- vehicular network, realistic mobility model, car following models, statistical model

I. INTRODUCTION

Vehicles are part of our business logistics and everyday living, and the number of vehicles increases every year, but road capacity cannot be increased at the same rate. This causes many problems, such as accidents and traffic jams, and economic loss in terms of transportation expense. Another issue is the lack of real-time traffic data that could help mitigate such problems. Wireless vehicular technology has emerged with the goal of enabling communication between vehicles (V2V) or between vehicles and infrastructure (V2I). Such technology allows information from road devices, such as detectors, to be exchanged with the infrastructure or with vehicles directly, for faster response to events such as accidents or emergency rescues in the V2I case. In addition, data collected on site can be analyzed into traffic information that can be used in several ways, for applications in traffic engineering or for a traveler information center (TIC). In the V2V case, traffic information can be broadcast or distributed over a group of vehicles to immediately inform drivers of such events.

V2V communication involves moving vehicles with a variety of movement patterns that can be uncertain, depending on many factors such as driver behavior, road conditions and traffic conditions. Thus, the mobility model used to simulate vehicle communication scenarios in conjunction with V2V communication can be erroneous when compared to the real system. This paper proposes a solution to reduce the error caused by the theoretical mobility model of vehicle movement in the simulation. To keep the mobility as close to reality as possible, we propose using statistical analysis techniques on collected data samples to form a statistical model that is integrated with the car following model. The car following model is a movement model proposed in transportation engineering that incorporates driver behavior on multi-lane road infrastructure. The outcome of our proposed model can be used in available simulation software such as the NCTU-ns [1], ns-2 [2], or ONE [3] simulators.

II. RELATED WORK

Many researchers have proposed studies involving mobility models of vehicles on the road that can be applied to V2V communication, from both the traffic engineering and the computer engineering perspectives. It is therefore necessary to study the mobility models in their different forms and the different simulation software packages.

A. VANET mobility models

Several mobility models have been proposed for simulation studies of vehicular networks (VANETs), such as the Freeway model, the Manhattan model, and the City Section model.

Freeway is a map-generation-based model as defined in [4]. This model is intended for highways or streets without traffic-light junctions. A vehicle is forced to follow the vehicle in front of it; there is no overtaking or lane changing. The speed of a vehicle in this model is defined by its history-based speed and a random acceleration. When the distance to the vehicle ahead in the same lane is less than the model's specified threshold, the acceleration becomes negative. The movement pattern of this model is not realistic. An example of vehicle movement in the Freeway model is shown in Figure 1.


Figure 1: The vehicle movement in the Freeway model

Manhattan is also a map-generation-based model, introduced in [5] to simulate an urban environment. Each road has crossroads. Using the same speed behavior as the Freeway model but with additional traffic lanes, the model allows lane changes at traffic intersections. The direction of movement at an intersection is chosen with a probability, and a vehicle that is already moving cannot be stopped. An example of vehicle movement in the Manhattan model is shown in Figure 2.

Figure 2: The vehicle movement in the Manhattan model

The City Section Mobility model [6] is created by combining the principles of the Random Waypoint model and the Manhattan model. It adds a pause time and a random selection of the destination; the model uses a map based on the Manhattan model, places a random number of vehicles on the road, and moves each vehicle to its destination using a shortest-path algorithm. The speed of movement depends on the distance to the vehicle ahead and on the maximum speed of the road. The downside of this model is the use of a grid-like map that is not as complex as realistic road networks. An example of the City Section Mobility model is shown in Figure 3.

B. Simulation Software Package

Along with the mobility models comes simulation software for vehicular networks. Several simulation software packages offer the use of several mobility models, as described in this section.

MOVE [7] is a software package that generates a traffic trace based on a realistic mobility model for VANETs and works with the SUMO simulation software. SUMO [8] is a vehicular traffic simulator that includes the management of traffic elements such as traffic light signals and the model of traffic lanes for vehicular communication. The SUMO program can import a real map to analyze traffic and to generate a mobility model based on such traffic. The MOVE program is built with Java and can import real map file formats such as the TIGER Line file or the Google Earth KML file. It can specify the properties of each car, such as speed and acceleration. MOVE operates and interfaces with SUMO to generate a trace file to be used in a network simulation such as ns-2. Figure 4 shows a snapshot of the SUMO and MOVE software.

Figure 3: The pattern of the vehicle location in the City Section Mobility model

Figure 4: SUMO and MOVE

VanetMobiSim [9] is a modifier of the CANU Mobility Simulation Environment, which is based on flexible frameworks for mobility modeling. CANU MobiSim is written in the JAVA language and can generate movement trace files in different formats, supporting different simulators or simulation tools for mobile networks. CANU MobiSim originally includes parsers for maps in the Geographical Data Files (GDF) format and provides implementations of several random mobility models as well as models from physics and vehicular dynamics. VanetMobiSim is designed for vehicular mobility modeling and features realistic automotive motion models at both the macroscopic and microscopic levels. At the macroscopic level, VanetMobiSim can import maps from the TIGER Line database, or randomly generate them using Voronoi tessellation. VanetMobiSim supports multi-lane roads, separate directional flows, differentiated speed constraints and traffic signs at intersections. At the microscopic level, VanetMobiSim implements mobility models providing realistic V2V and V2I interaction. According to these models, vehicles regulate their speed depending on nearby cars, overtake each other and act according to traffic signs in the presence of intersections. Figure 5 shows a snapshot of the VanetMobiSim software.

Figure 5: The snapshot of the VanetMobiSim software

C. The Car Following Model

In the car following models, the behavior of each driver is described in relation to the vehicle ahead. Since each single car is regarded as an independent entity, the car following model falls into the category of microscopic-level mobility models. Each vehicle in the car following model computes its speed or acceleration as a function of factors such as the distance to the front car, the current speed of both vehicles, and the availability of the side lane. Figure 6 shows an example of the vehicle movement calculation in the car following model. Each vehicle is assigned its lane, i or j, and its location on the road segment. At each time slot, each vehicle's speed, acceleration and lane-change probability are calculated based on the microscopic view of the road network and traffic. As a result, each vehicle may increase or decrease its speed and/or acceleration. In addition, a vehicle may overtake or change lane with the assigned probability when the side lanes are available. The speed and the acceleration of the vehicle keep changing based on the conditions occurring during the simulation, such as a car stopping, congestion, or a traffic stop light at a crossroad. The change is based entirely on the random model, without any integration of road structure properties, such as lane narrowness, that can affect the speed of vehicles. In addition, as roads are connected in a network, they differ in type, structure and sometimes driving culture depending on the country or city, which can be classified as localized parameters. Therefore, the car following model is a realistic model that is suitable for vehicular network simulation, but it lacks an adaptive method to change the driving behavior according to those localized parameters.

Figure 6 The example of the car following model
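To make the microscopic speed update described above concrete, the sketch below shows a simple car-following step in Python. It only illustrates the general idea (speed and acceleration driven by the gap to the front car and both vehicles' speeds); the IDM-style formula and all parameter values are assumptions for illustration, not the exact model used by SUMO or VanetMobiSim.

import math

def follow_speed(gap_m, v_self, v_front, dt=1.0,
                 v_max=30.0, a_max=1.5, b_comf=2.0, min_gap=2.0, headway=1.5):
    """One car-following update step (illustrative IDM-style parameters)."""
    dv = v_self - v_front                      # closing speed toward the front car
    desired_gap = min_gap + v_self * headway + (v_self * dv) / (2.0 * math.sqrt(a_max * b_comf))
    accel = a_max * (1.0 - (v_self / v_max) ** 4 - (desired_gap / max(gap_m, 0.1)) ** 2)
    return max(0.0, v_self + accel * dt)       # speed never becomes negative

# Example: a car at 20 m/s, 25 m behind a car doing 15 m/s, decelerates.
print(round(follow_speed(25.0, 20.0, 15.0), 2))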

III. THE PROPOSED STATISTICAL-BASED CAR FOLLOWING MODEL

The proposed design technique for a realistic mobility model for use in wireless vehicular network simulation is described in this section. First, the statistical model is presented. Because the calculation of each vehicle's speed and acceleration does not take localized parameters into account, we propose a statistical model of each road, created from the data collected on that road, as a representative input for the calculation in the car following model. Figure 7 shows the block process of our proposed work for the integration with the car following model.

Based on the variety of the vehicle speed data collected from different road types, we analyze the data and find a representative of each data set as a probability function. The probability function is used to generate vehicle speed data for a specific period based on the ID of the road on the map (Map ID). The outcome of the function, which is the speed, is used as an input to the car following model to calculate the speed, the acceleration and the lane change probability during that period. The choice of the period length is a tradeoff between the additional workload on the simulation and the realism of the model.
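As a sketch of how the block in Figure 7 could be wired into a simulator, the snippet below draws a target speed for a given road ID from its fitted distribution at the start of each period and would hand it to the car-following step. The road table, the distribution parameters and the function name are hypothetical placeholders, not values measured in this study.

import random

# Hypothetical per-road statistical models: (distribution family, parameters), speeds in km/h.
ROAD_MODELS = {
    "expressway_1": ("weibull", {"shape": 3.0, "scale": 90.0}),
    "local_7":      ("gamma",   {"shape": 2.0, "scale": 12.0}),
}

def draw_period_speed(road_id):
    """Draw one representative speed for a simulation period on this road."""
    family, p = ROAD_MODELS[road_id]
    if family == "weibull":
        return random.weibullvariate(p["scale"], p["shape"])
    return random.gammavariate(p["shape"], p["scale"])

# The drawn value would then seed the car-following calculation for that period.
print(round(draw_period_speed("expressway_1"), 1))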

Figure 8 shows the process of data collection on each specific road for our study. We use a Nokia mobile phone running our own software on Symbian OS to collect data such as the current location of the vehicle and the current vehicle speed. The locating device used in our data collection is a Global Positioning System (GPS) receiver connected to the mobile phone via a Bluetooth connection.

Several research studies on vehicle movement patterns examine the effect of road types by collecting real data for road traffic verification and validation. The road types currently in use can be divided into the expressway or freeway system, major or arterial roads, collector roads and local roads.

Figure 7 The proposed statistical based car following model


Figure 8 The data collection process for road specific information

After we collect the data on the different road types, we use a statistical method to curve-fit the collected data. The distribution function resulting from the curve fitting is then verified and validated. Once we have the distribution function, we implement it into the calculation of the car following model specific to those road types.
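A minimal sketch of the curve-fitting step using SciPy is shown below; it fits both a Weibull and a Gamma distribution to a sample of speeds and keeps the one with the better Kolmogorov-Smirnov statistic. The candidate list and the acceptance criterion are assumptions for illustration, not the exact statistical procedure used by the authors.

import numpy as np
from scipy import stats

def fit_speed_distribution(speeds_kmh):
    """Fit candidate distributions to collected speeds and return the best fit."""
    candidates = {"weibull": stats.weibull_min, "gamma": stats.gamma}
    best = None
    for name, dist in candidates.items():
        params = dist.fit(speeds_kmh, floc=0)          # fix the location parameter at zero
        ks_stat, _ = stats.kstest(speeds_kmh, dist.cdf, args=params)
        if best is None or ks_stat < best[2]:
            best = (name, params, ks_stat)
    return best                                        # (name, parameters, KS statistic)

# Toy usage with synthetic data standing in for the collected samples.
sample = stats.weibull_min.rvs(3.0, scale=100.0, size=500, random_state=0)
print(fit_speed_distribution(sample)[0])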

IV. MEASUREMENTS AND RESULTS

We collect traffic data, namely the vehicle location and speed, on four types of roads: the expressway, the major road, the collector road and the local road. These road types have different properties that can affect the behavior of drivers. We then propose a statistical model for each road type based on the curve fitting method.

A. The Model of Expressway

The sample of the expressway that we used to collect data in Bangkok is the Ngamwongwan-Chiang Rak road. We collect the speed data at a sampling interval of 2 seconds per sample. The location samples of the vehicle running on the Ngamwongwan road are shown as yellow dots in Figure 9.

Figure 9 The location of vehicles on the Expressway road

In total, we collected about 786 samples of speed data. The result of curve fitting the collected speed data on the expressway is shown in Figure 10. The statistical model that fits the distribution of the speed data is the Weibull function, where most vehicle speeds are high due to the nature of the expressway, which has no traffic lights and wide lanes. The peak speed from this collection is around 110 km/hr.

Figure 10 The PDF function of the speed on the Expressway.

B. The Model of Major Road

For the major road data collection, we chose the Vibhavadi road, which is a straight and long road with two or three traffic lanes in each direction. Although the road has no traffic lights, a temporary stop may occur due to the density of cars or the movement of trucks. We collect each speed sample every 2 seconds, giving 1227 values in total. Figure 11 shows the result of the curve fitting method over the collected data. It shows that the speed distribution on the major road is also a Weibull distribution. However, the parameters of the function are different. For example, the peak speed of the major road is around 65 km/hr, which is much lower than that of the expressway.

Figure 11 The PDF function of the speed on the major road.

C. The Model of Collector Road

In this collection, we chose the Charansanitwong road as our example of the collector road. The road has two lanes in each direction. It has a few traffic lights with short stop times. The samples are collected every 2 seconds, giving 511 values in total. Figure 12 shows the result of the curve fitting method on the collected data. The result of the curve fitting is the Gamma distribution function. Due to the traffic lights, vehicles may stop or slow down. As a result,


most of the speed samples are in the lower range, which results in the Gamma function.

Figure 12 The PDF function of the speed on the collector road.

D. The Model of Local Road

In Bangkok, many local roads are used to connect major and collector roads. Vehicles on local roads tend to be stopped at traffic lights for longer periods due to their low priority in traffic light management. In addition, local roads have many traffic lights and many junctions without traffic lights, as well as cars parked alongside the road, all of which tend to slow down the traffic. For our data collection, we chose the Prachacheun road, since the road is straight but has many traffic lights. It also crosses many major roads such as Tivanon Road and Vibhavadi Road. We collect each data sample every 2 seconds, giving 382 values in total. The result of the curve fitting model of the data collected on the local road is shown in Figure 13. The distribution of the collected data can be represented as a Gamma distribution, similar to that of the collector road. This is due to the nature of the road with traffic lights, where vehicles may be stopped or slowed down. However, the average vehicle speed of the local road obtained from the distribution is much lower than that of the collector road.

From the results, it is shown that the distribution function of the speed of vehicles on different types of roads can be different. The speed distribution function is not uniformly random, as is assumed in several mobility models for network simulation. The speed distribution on a road tends to be either a Weibull or a Gamma distribution function. The speed distribution of a road without traffic lights is probably a Weibull distribution, while that of a road with traffic lights is probably a Gamma distribution. The average speed of each road type tends to be affected by the number of lanes and other properties, such as the narrowing of the available lanes caused by cars parked along the road. In addition, the average speed of a road can be unique to that road. Further studies are required to find the "identity" or the "fingerprint" of such a road.
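For reference, the standard two-parameter Weibull and Gamma probability density functions to which the fitted curves above refer are written out below in the conventional shape/scale parameterization; the paper does not report the fitted parameter values, so none are assumed here.

$$f_{\mathrm{Weibull}}(v; k, \lambda) = \frac{k}{\lambda}\left(\frac{v}{\lambda}\right)^{k-1} e^{-(v/\lambda)^{k}}, \qquad f_{\mathrm{Gamma}}(v; \alpha, \theta) = \frac{v^{\alpha-1} e^{-v/\theta}}{\Gamma(\alpha)\,\theta^{\alpha}}, \qquad v \ge 0.$$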

Figure 13 The PDF function of the speed on the local road.

V. CONCLUSION

In this paper, we proposed a statistical-based car following model concept that integrates the uniqueness of each road type into the speed calculation of the car following model. The uniqueness of a road can arise from the different road structures that are specific to each area of the road network. The behavior of drivers on each road type can also differ. We have found that road types such as the expressway or the major road tend to have a Weibull speed distribution function, but with different average speeds. For the collector road type and the local road type, the speed distribution function tends to be a Gamma distribution, again with different average speeds. It may be concluded that the speed on a road without traffic lights can be modeled as a Weibull distribution, and that on a road with traffic lights as a Gamma distribution. However, the average speed of the distribution may vary based on the nature of each road. Further research is necessary to investigate the traffic data in more detail.

REFERENCES
[1] NCTU-ns simulation software, available at http://nsl.csie.nctu.edu.tw/nctuns.html, last accessed date: 30/1/2012.
[2] The ns-2 simulation software, available at http://www.isi.edu/nsnam/ns/, last accessed date: 30/1/2012.
[3] The ONE simulation software, available at http://www.netlab.tkk.fi/tutkimus/dtn/theone/, last accessed date: 30/1/2012.
[4] F. Bai, N. Sadagopan, A. Helmy, "Important: a framework to systematically analyze the impact of mobility on performance of routing protocols for ad hoc networks", in Proc. 22nd IEEE Annual Joint Conference on Computer Communications and Networking INFOCOM'03, 2003, pp. 825-835.
[5] V. Davies, "Evaluating mobility models within an ad hoc network", Colorado School of Mines, Colorado, USA, Tech. Rep., 2000.
[6] F. K. Karnadi, Z. H. Mo, K. Chan Lan, "Rapid generation of realistic mobility models for VANET", in Proc. IEEE Wireless Communications and Networking Conference, 2007, pp. 2506-2511.
[7] MOVE mobility model, available at http://lens.csie.ncku.edu.tw/Joomla_version/index.php/research-projects/past/18-rapid-vanet, last accessed date: 30/1/2012.
[8] "SUMO - Simulation of Urban Mobility", available at http://sumo.sourceforge.net/, last accessed date: 30/1/2012.
[9] J. Härri, F. Filali, C. Bonnet, and M. Fiore, "VanetMobiSim: generating realistic mobility patterns for VANETs", in Proc. 3rd ACM International Workshop on Vehicular Ad Hoc Networks (VANET '06), ACM Press, 2006, pp. 96-97.


Rainfall Prediction in the Northeast Region of Thailand

using Cooperative Neuro-Fuzzy Technique

Jesada Kajornrit, Kok Wai Wong, Chun Che Fung

School of Information Technology, Murdoch University

South Street, Murdoch, Western Australia, 6150

Email: [email protected], [email protected], [email protected]

Abstract—Accurate rainfall forecasting is a crucial task for reservoir operation and flood prevention because it can provide an extension of lead-time for flow forecasting. This study proposes two rainfall time series prediction models, the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique. The case study is located in the northeast region of Thailand and the proposed models are evaluated on four monthly rainfall time series. The experimental results showed that the proposed models could be a good alternative method providing both accurate results and a human-understandable prediction mechanism. Furthermore, this study found that when the number of training data was small, the proposed models provided better prediction accuracy than artificial neural networks.

Keywords-Rainfall Prediction; Seasonal Time Series; Artificial Neural Networks; Fuzzy Inference System; Average-Based Interval.

I. INTRODUCTION

Rainfall forecasting is indispensable for water management because it can provide an extension of lead-time for flow forecasting used in water strategic planning. This is especially important when it is used in reservoir operation and flood prevention. Usually, rainfall time series prediction has used conventional statistical models and Artificial Neural Networks (ANN) [8]. However, such models are difficult for human analysts to interpret, because the prediction mechanism is in parametric form. From a hydrologist's point of view, the accuracy of prediction and an understanding of the prediction mechanism are equally important.

Fuzzy Inference System (FIS) uses the process of mapping from a given set of input variables to outputs based on a set of human-understandable fuzzy rules [19]. In the last decades, FIS has been successfully applied to various problems [3], [4]. An advantage of FIS is that its decision mechanism is interpretable. As fuzzy rules are close to human reasoning, an analyst can understand how the model performs the prediction. If necessary, the analyst can also use his/her knowledge to modify the prediction model [5]. However, the disadvantage of FIS is its lack of ability to learn from the given data. In contrast, an ANN is capable of adapting itself from training data. In many cases where human understanding of the physical process is not clear, ANN has been used to learn the relationship between the observed data [6]. However, the disadvantage of ANN is its black-box nature, which is difficult to interpret. In order to combine the advantages of both models, this paper proposes two rainfall time series prediction models, the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique.

This paper is organized as follows. Section 2 discusses the related works and Section 3 describes the case study area. Input identification and the proposed models are presented in Sections 4 and 5, respectively. Section 6 shows the experimental results. Finally, Section 7 provides the conclusion of this paper.

II. SOFT COMPUTING TECHNIQUES IN HYDROLOGICAL TIME SERIES PREDICTION

In the hydrological discipline, rainfall prediction is relatively more difficult than the prediction of other climate variables such as temperature. This is due to the highly stochastic nature of rainfall, which shows a lower degree of spatial and temporal variability. To address this challenge, ANN has been adopted in the past decades. For example, Coulibaly and Evora [7] compared six different ANNs to predict daily rainfall data. Among the different types of ANN, they suggested that the Multilayer Perceptron, the Time-lagged Feedforward Network, and the Counter-propagation Fuzzy-Neural Network provided higher accuracy than the Generalized Radial Basis Function Network, the Recurrent Neural Network and the Time Delay Recurrent Neural Network. Another work was Wu et al. [8]. They proposed the use of data-driven models with data preprocessing techniques to predict precipitation data at daily and monthly scales. They proposed three preprocessing techniques, namely Moving Average, Principal Component Analysis and Singular Spectrum Analysis, to smoothen the time series data. Somvanshi et al. [1] confirmed in their work that ANN provided better accuracy than the ARIMA model for daily rainfall time series prediction.

Time series prediction is used not only for rainfall data but also for streamflow and rainfall-runoff modeling. Wang et al. [9] compared several computational models, namely Auto-Regressive Moving Average (ARMA), ANN, Adaptive Neuro-Fuzzy Inference System (ANFIS), Genetic Programming (GP) and Support Vector Machine (SVM), to predict monthly discharge time series. Their results indicated that ANFIS, GP and SVM provided the best performance. Lohani [10] compared ANN, FIS and a linear transfer model for daily rainfall-runoff modeling under different input domains. The results also showed that FIS outperformed the linear model and ANN. Nayak et al. [11] and Kermani et al. [12] proposed the use of the ANFIS model for river flow time series. In addition, Jain and Kumar [13] applied conventional preprocessing approaches (de-trending and de-seasonalizing) to ANN for streamflow time series data.


Figure 1. The case study area is located in the northeast region of Thailand. The positions of four rainfall stations are illustrated by star marks.

Up to this point, among all the works mentioned, FIS itself has not been used as widely as ANN for time series prediction. Especially for rainfall time series prediction, reports on applications of FIS are limited. Thus, the primary aim of this study is to investigate an appropriate way to use FIS for the rainfall time series prediction problem.

III. CASE STUDY AREA AND DATA

The case study described here is located in the northeast region of Thailand (Fig 1). The four selected rainfall time series are depicted in Fig 2. Table 1 shows the statistics of the datasets used. The data from 1981 to 1998 were used to calibrate the models and the data from 1999 to 2001 were used to validate the developed models. This study used the models to predict one step ahead, that is, one month. To validate the models, the Mean Absolute Error (MAE) is adopted as given in equation (1). The Coefficient of Fit (R) is also used to confirm the results. The performance of the proposed models is compared with conventional Box-Jenkins (BJ) models: Autoregressive (AR), Autoregressive Integrated Moving Average (ARIMA) and Seasonal Autoregressive Integrated Moving Average (SARIMA) [1], [8], [10], [13] and [15].

$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|O_i - P_i\right|$ (1)

TABLE I. DATASETS' STATISTICS

Statistics   TS356010   TS381010   TS388002   TS407005
Mean         1303.34    889.04     1286.28    1319.70
SD           1382.98    922.99     1425.88    1346.80
Kurtosis     -0.10      0.808      0.532      -0.224
Skewness     0.95       1.080      1.131      0.825
Minimum      0          0          0          0
Maximum      5099       4704       6117       5519
Longitude    104.13E    102.88E    104.05E    104.75E
Latitude     17.15N     16.66N     16.65N     15.50N
Altitude     176        164        155        129

Figure 2. The four selected monthly rainfall time series used in this study (panels, top to bottom: TS356010, TS381010, TS388002, TS407005).


IV. INPUT IDENTIFICATION

In general, the inputs of a time series model are normally based on previous data points (lags). For BJ models, analysis of the autocorrelation function (ACF) and the partial autocorrelation function (PACF) is used as a guide to identify the appropriate inputs. However, in the case of ANN or other related non-linear models, there is no theory to support the use of these functions [14]. Although some of the literature addresses the applicability of ACF and PACF to non-linear models [15], other studies prefer to conduct experiments to identify the appropriate inputs [11].

This study conducted an experiment to find an appropriate input based on data from five rainfall stations. Data from 1981 to 1995 were used for calibration and data from 1996 to 1998 were used for validation. By increasing the number of lags given to the ANNs, six different input models were prepared and tested. To predict x(t), the first input model is x(t-1), the second input model is x(t-1), x(t-2), and so on. Fig 3 shows the results of the experiment. In this figure, the average normalized MAEs from the five time series are illustrated by the bold line. The results show that the MAE is lowest at lag 5. The five-previous-lags model is therefore expected to be an appropriate input. Since increasing the number of input lags does not significantly improve the prediction performance, additional methods may be needed.

In the case of seasonal data, there are other methods to identify an appropriate input and improve the prediction accuracy, for example, using the Phase Space Reconstruction (PSR) [16] or adding a time coefficient as a supplementary feature [2]. However, the first method needs a large number of training data. According to "The Curse of Dimensionality", when the number of input dimensions increases, the number of training data must be increased as well [17]. In this case study, the number of records is limited to 15 years, which can be considered relatively small. Therefore it is more appropriate to add the time coefficient.

The time coefficient (Ct) is used to help the model scope its prediction to a specific period. It may be Ct = 2 (wet and dry periods), Ct = 4 (winter, spring, summer and fall periods), or Ct = 12 (calendar months). This study adopted Ct = 12 as the supplementary feature. In Fig 3, Ct is added to the original input data and tested with ANNs (light line). The results show that using Ct with 2 previous lags provides the lowest average MAE and can improve the prediction performance by up to 26% (dash line). Therefore, the appropriate input used in this study is the rainfall from lag 1 and lag 2, plus Ct; a sketch of how such input vectors can be assembled is given below.
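The following minimal Python sketch builds the [lag-1, lag-2, Ct] feature matrix from a monthly rainfall series. The array names are illustrative, the toy series is synthetic, and the month index assumes the series starts in January; any ANN library could consume the result.

import numpy as np

def build_inputs(rainfall, n_lags=2):
    """Return (X, y) where each row of X is [x(t-1), ..., x(t-n_lags), Ct]."""
    X, y = [], []
    for t in range(n_lags, len(rainfall)):
        lags = [rainfall[t - k] for k in range(1, n_lags + 1)]
        ct = (t % 12) + 1                     # calendar-month coefficient, 1..12 (assumes January start)
        X.append(lags + [ct])
        y.append(rainfall[t])
    return np.array(X), np.array(y)

# Toy monthly series (36 months) standing in for one rainfall station.
series = np.abs(np.sin(np.arange(36) * np.pi / 6)) * 1000
X, y = build_inputs(series)
print(X.shape, y.shape)                       # (34, 3) (34,)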

This experimental result is related to the work of Raman and Sunilkumar [18], who studied monthly inflow time series. In the hydrological process, inflow is directly affected by rainfall; consequently, the characteristics of the flow graph and the rainfall graph are rather similar. They suggested using data from 2 previous lags for ANN models; however, instead of using a single ANN, they created twelve ANN models, one for each specific month, and used the "month" to select the associated model to feed the data into. If one considers this model as a black box, one can see that their input is the inflow from 2 previous lags plus Ct, which is relatively similar to this study.

Figure 3. Average MAE measure of ANN models among different inputs.

V. THE PROPOSED MODELS

This paper adopted the Mamdani fuzzy inference system [20] since such a model is more intuitive than the Sugeno approach [21]. To reduce the computational cost, triangular Membership Functions (MFs) are used. This study proposes two FIS models, namely the Single Fuzzy Inference System (S-FIS) and the Modular Fuzzy Inference System (M-FIS), which use the concept of the cooperative neuro-fuzzy technique. In the S-FIS model, there is one single FIS model. Rainfall data from lag 1, lag 2 and Ct are fed directly into the model. In the M-FIS model, there are twelve FIS models, one associated with each calendar month. The Ct is used to select the associated model into which the rainfall data from lag 1 and lag 2 are fed. The architectural overview of these two models is shown in Fig 4.

Fig 5 shows the general steps to create these FIS models. The first step is to calculate the appropriate interval length between two consecutive MFs and then generate the Mamdani FIS rule base model. At this step, the Average-Based Interval is adopted. The second step is to create the fuzzy rules. In this study, a Back-Propagation Neural Network (BPNN) is used to generalize from the training data and is then used to extract the fuzzy rules.

Figure 4. The architectural overview of the S-FIS (top) and M-FIS (bottom) models (inputs Xt-1, Xt-2 and Ct; output Xt)


Figure 5. General steps to create the S-FIS and M-FIS models

In the S-FIS model, the MFs of Ct are simply those depicted in Fig 6 (a). For the rainfall input, the interval length between two consecutive MFs is very important to define. When the length of the interval is too large, it may not be able to represent the fluctuation in the time series. On the other hand, when it is too small, the objective of the FIS is diminished.

Huarng [22] proposed the Average-Based Interval to define the appropriate interval length of the MFs for fuzzy time series data, based on the concept that "at least half of the fluctuations in the time series are reflected by the effective length of interval". A fluctuation in time series data is the absolute value of the first difference of any two consecutive data points. In this method, half of the average value of all fluctuations in the time series is defined as the interval length between two consecutive MFs. This method was successfully applied in the work reported in [23]. In this paper, the method is slightly adapted to fit the nature of rainfall time series for this application.
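A minimal sketch of the average-based interval calculation as just described: the interval length is half of the mean absolute first difference of the series. The rounding/base-mapping step of Huarng's original method is omitted here for brevity, and the sample values are toy numbers.

import numpy as np

def average_based_interval(series):
    """Half of the average absolute fluctuation between consecutive observations."""
    fluctuations = np.abs(np.diff(series))
    return 0.5 * fluctuations.mean()

monthly_rain = np.array([0, 20, 180, 340, 120, 60, 10])   # toy values (mm)
print(round(average_based_interval(monthly_rain), 2))      # ~55.83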

Figure 6. An example of membership functions in TS356010’s S-FIS model, Ct (a) and Rainfall (b)

Fig 6 (b) shows the rainfall MFs of the S-FIS for station TS356010. One can see that there are two interval lengths. The point at which the interval length changes is around the 50th percentile of all the data. The data are separated into a lower area and an upper area using the 50th percentile as the boundary. Average-based intervals are calculated for both areas. Since the beginning and ending rainfall periods have smaller fluctuations than the middle period, using a smaller interval length there is more appropriate [2]. In the M-FIS model, using two interval lengths is not necessary, since each sub-model is created for a specific month.

As mentioned before, the drawback of FIS is its lack of ability to learn from data. Such a model needs experts or other supplementary procedures to help create the fuzzy rules. In this study, the proposed methodology uses a BPNN to learn and generalize the features of the training data [5]; the BPNN is then used to extract the fuzzy rules, after which it is not used anymore. The steps to create the fuzzy rules are as follows:

Step 1: Train the BPNN with the training data. At this step, the BPNN learns and generalizes from the training data.

Step 2: Prepare the set of input data. The set of input data, in this case, consists of all the points in the input space where the degree of membership of an FIS input MF is 1 in every dimension. These input data form the premise part of the fuzzy rules.

Step 3: Feed the input data into the BPNN; the outputs of the BPNN are mapped to the nearest MF of the FIS output. These output data form the consequence part of the fuzzy rules.

For example, considering the MFs in Figure 6, the input-output pair [3, 500, 750:1700] is replaced with the fuzzy rule "IF Ct=Mar and Lag1=A3 and Lag2=A4 THEN Predicted=A6". This step uses a BPNN with one hidden layer. The numbers of hidden nodes and input nodes are 3 for S-FIS and 2 for M-FIS.
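The rule-extraction loop in Steps 2-3 can be sketched as below: every combination of MF centres is fed to the trained network and the network output is mapped to the nearest output-MF centre to form one rule. The predict function and the MF centre lists are hypothetical stand-ins for whatever BPNN and membership functions are actually trained; this is not the authors' code.

from itertools import product

def extract_rules(predict, input_mf_centres, output_mf_labels, output_mf_centres):
    """Turn a trained network into Mamdani-style rules, one per grid point."""
    rules = []
    for premise in product(*input_mf_centres):            # points where each input MF degree is 1
        y = predict(list(premise))
        # consequence = label of the output MF whose centre is closest to the network output
        label = min(zip(output_mf_labels, output_mf_centres), key=lambda lc: abs(lc[1] - y))[0]
        rules.append((premise, label))
    return rules

# Toy stand-in for a trained BPNN and a coarse MF grid.
fake_net = lambda x: 0.5 * x[0] + 0.5 * x[1]
rules = extract_rules(fake_net, [[0, 500, 1000], [0, 500, 1000]], ["A1", "A2", "A3"], [0, 500, 1000])
print(rules[:3])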

VI. EXPERIMENTAL RESULTS

The experimental results are shown in Table 2 and Table 3. In the tables, S-ANN and M-ANN are the neural networks used to create the fuzzy rules for S-FIS and M-FIS, respectively. In fact, the S-ANN and M-ANN are themselves also prediction models. The performance of S-ANN and S-FIS is quite similar. It can be noted that the conversion from an ANN-based to an FIS-based model does not reduce the prediction performance of the ANN. However, this conversion improves the S-ANN model from a qualitative point of view, since S-FIS is interpretable with a set of human-understandable fuzzy rules. The interesting point is the performance of M-ANN versus M-FIS: this conversion can improve the performance of M-ANN.

Next, the proposed models are compared with three conventional BJ models. The comparison results are depicted in Fig 7. Since the results from the MAE and R measures are consolidated, these experimental results are rather consistent. Similar to the work by Raman and Sunilkumar [18], the AR model uses degree 2 because it uses the same input as the proposed models. The ARIMA and SARIMA models used in the study were automatically generated and optimized by statistical software. However, these generated models were also rechecked to ensure that they provided the best accuracy.

(Figure 6 content: (a) the twelve monthly MFs of Ct, jan-dec, and (b) the thirteen rainfall MFs A1-A13 over the range 0-5000; the vertical axis of both panels is the degree of membership.)

(Figure 5 content: Training data → Calculate Average-Based Interval Length → Generate FIS Rule Base and its MFs → Train BPNN → Generate Fuzzy Rules → FIS model.)


TABLE II. MAE MEASURE OF VALIDATION PERIOD

Datasets   S-ANN    S-FIS    M-ANN    M-FIS    AR       ARIMA    SARIMA
TS356010   450.99   447.56   560.44   496.35   747.37   747.01   538.99
TS381010   332.71   343.88   439.91   442.32   534.32   402.42   503.99
TS388002   736.70   725.39   811.99   639.29   912.64   856.88   714.74
TS407005   636.37   634.65   776.63   661.30   901.76   672.35   799.34

TABLE III. R MEASURE OF VALIDATION PERIOD

Datasets   S-ANN    S-FIS    M-ANN    M-FIS    AR       ARIMA    SARIMA
TS356010   0.884    0.887    0.755    0.850    0.650    0.759    0.837
TS381010   0.719    0.709    0.606    0.668    0.464    0.733    0.575
TS388002   0.760    0.773    0.712    0.871    0.606    0.685    0.769
TS407005   0.768    0.770    0.633    0.736    0.594    0.755    0.681

In terms of MAE, among the three BJ models, the AR model provides the lowest accuracy on all datasets. ARIMA shows higher accuracy than SARIMA on two of the datasets. On stations TS356010 and TS407005 the proposed models show higher performance than all BJ models, especially the S-FIS model. On station TS381010, the ARIMA model is better than M-FIS but worse than S-FIS. On station TS388002, the SARIMA model shows better performance than S-FIS but worse than M-FIS. The average normalized MAE and average R measures over all datasets are shown in Fig 8. It can be seen from the figure that, overall, the proposed models performed better than the AR, ARIMA and SARIMA models.

All the aforementioned results validate the experiments from a quantitative point of view. From a qualitative point of view, the proposed models are easier to interpret than the other models because their decision mechanism is in the form of fuzzy rules, which is close to human reasoning [5]. Furthermore, when the models are in rule-base form, further enhancement and optimization by a human expert is easier. The advantage of the S-FIS model is that the time coefficient is expressed in terms of MFs, so it is possible to apply optimization methods to this feature. However, a large number of fuzzy rules is needed for the single model. On the other hand, the M-FIS model has a smaller number of fuzzy rules compared to S-FIS, but it does not use any time feature.

VII. CONCLUSION

Accurate rainfall forecasting is crucial for reservoir operation and flood prevention because it can provide an extension of the lead-time of flow forecasting, and many time series prediction models have been applied to it. However, the prediction mechanisms of those models may be difficult for human analysts to interpret. This study proposed the Single Fuzzy Inference System and the Modular Fuzzy Inference System, which use the concept of the cooperative neuro-fuzzy technique to predict monthly rainfall time series in the northeast region of Thailand. The reported models used the average-based interval method to determine the fuzzy intervals and used a BPNN to extract the fuzzy rules. The prediction performance of the proposed models was compared with conventional Box-Jenkins models. The experimental results showed that the proposed models could be a good alternative. Furthermore, the prediction mechanism can be interpreted through the human-understandable fuzzy rules.

Figure 7. The performance comparison between the proposed models and the conventional Box-Jenkins models: MAE (a) and R (b).


Figure 8. The average normalized MAE (a) and average R (b) of all datasets

REFERENCES

[1] V. K. Somvanshi, et al., "Modeling and prediction of rainfall using artificial neural network and ARIMA techniques," J. Ind. Geophys. Union, vol. 10, no. 2, pp. 141-151, 2006.
[2] Z. F. Toprak, et al., "Modeling monthly mean flow in a poorly gauged basin by fuzzy logic," Clean, vol. 37, no. 7, pp. 555-567, 2009.
[3] S. Kato and K. W. Wong, "Intelligent automated guided vehicle with reverse strategy: a comparison study," in M. Köppen, N. K. Kasabov, G. G. Coghill (Eds.), Advances in Neuro-Information Processing, Lecture Notes in Computer Science, Springer-Verlag, Berlin Heidelberg, pp. 638-646, 2009.
[4] K. W. Wong and T. D. Gedeon, "Petrophysical properties prediction using self-generating fuzzy rules inference system with modified alpha-cut based fuzzy interpolation," in Proc. Seventh International Conference on Neural Information Processing (ICONIP), pp. 1088-1092, November 2000, Korea.
[5] K. W. Wong, P. M. Wong, T. D. Gedeon, and C. C. Fung, "Rainfall prediction model using soft computing technique," Soft Computing, vol. 7, no. 6, pp. 434-438, 2003.
[6] C. C. Fung, K. W. Wong, H. Eren, R. Charlebois, and H. Crocker, "Modular artificial neural network for prediction of petrophysical properties from well log data," IEEE Transactions on Instrumentation & Measurement, vol. 46, no. 6, pp. 1259-1263, December 1997.
[7] P. Coulibaly and N. D. Evora, "Comparison of neural network methods for infilling missing daily weather records," Journal of Hydrology, vol. 341, pp. 27-41, 2007.
[8] C. L. Wu, K. W. Chau, and C. Fan, "Prediction of rainfall time series using modular artificial neural networks coupled with data-preprocessing techniques," Journal of Hydrology, vol. 389, pp. 146-167, 2010.
[9] W. Wang, K. Chau, C. Cheng, and L. Qiu, "A comparison of performance of several artificial intelligence methods for forecasting monthly discharge time series," Journal of Hydrology, vol. 374, pp. 294-306, 2009.
[10] A. K. Lohani, N. K. Goel, and K. K. S. Bhatia, "Comparative study of neural network, fuzzy logic and linear transfer function techniques in daily rainfall-runoff modeling under different input domains," Hydrological Processes, vol. 25, pp. 175-193, 2011.
[11] P. C. Nayak, et al., "A neuro-fuzzy computing technique for modeling hydrological time series," Journal of Hydrology, vol. 291, pp. 52-66, 2004.
[12] M. Z. Kermani and M. Teshnehlab, "Using adaptive neuro-fuzzy inference system for hydrological time series prediction," Applied Soft Computing, vol. 8, pp. 928-936, 2008.
[13] A. Jain and A. M. Kumar, "Hybrid neural network models for hydrologic time series forecasting," Applied Soft Computing, vol. 7, pp. 585-592, 2007.
[14] M. Khashei, M. Bijari, and G. A. R. Ardali, "Improvement of Auto-Regressive Integrated Moving Average models using fuzzy logic and Artificial Neural Networks (ANNs)," Neurocomputing, vol. 72, pp. 956-967, 2009.
[15] K. P. Sudheer, "A data-driven algorithm for constructing artificial neural network rainfall-runoff models," Hydrological Processes, vol. 16, pp. 1325-1330, 2002.
[16] C. L. Wu and K. W. Chau, "Data-driven models for monthly streamflow time series prediction," Engineering Applications of Artificial Intelligence, vol. 23, pp. 1350-1367, 2010.
[17] S. Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, 2009.
[18] H. Raman and N. Sunilkumar, "Multivariate modeling of water resources time series using artificial neural network," Hydrological Sciences Journal, vol. 40, pp. 145-163, 1995.
[19] L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, pp. 338-353, 1965.
[20] E. H. Mamdani and S. Assilian, "An experiment in linguistic synthesis with a fuzzy logic controller," International Journal of Man-Machine Studies, vol. 7, no. 1, pp. 1-13, 1975.
[21] M. Sugeno, Industrial Applications of Fuzzy Control, North-Holland, Amsterdam, 1985.
[22] K. Huarng, "Effective lengths of intervals to improve forecasting in fuzzy time series," Fuzzy Sets and Systems, vol. 123, pp. 387-394, 2001.
[23] H. Liu and M. Wei, "An improved fuzzy forecasting method for seasonal time series," Expert Systems with Applications, vol. 37, pp. 6310-6318, 2010.


Interval-valued Intuitionistic Fuzzy ELECTRE Method

Ming-Che Wu
Graduate Institute of Business and Management, College of Management, Chang Gung University
Taoyuan 333, Taiwan
[email protected]

Ting-Yu Chen
Department of Industrial and Business Management, College of Management, Chang Gung University
Taoyuan 333, Taiwan
[email protected]

Abstract—In this study, the proposed method replaces crisp evaluation data with vague values, i.e. interval-valued intuitionistic fuzzy (IVIF) data, and develops the IVIF Elimination and Choice Translating Reality (ELECTRE) method for solving multiple criteria decision making problems. The analyst can use the characteristics of IVIF sets to classify different kinds of concordance (discordance) sets using the score and accuracy functions, the membership uncertainty degree and the hesitation uncertainty index, and then apply the proposed method to select the better alternatives.

Keywords-interval-valued intuitionistic fuzzy; ELECTRE; multiple criteria decision making; score function; accuracy function

I. INTRODUCTION

The Elimination et Choice Translating Reality (ELECTRE) method is one of the outranking relation methods and was first introduced by Roy [3]. The threshold values in the classical ELECTRE method play an important role in filtering alternatives, and different threshold values produce different filtering results. As is known, the evaluation data in the classical ELECTRE method are mostly exact values, which can affect the threshold values. Moreover, in real-world cases, exact values can be difficult to determine precisely since analysts' judgments are often vague; for these reasons, some studies [4,5,8] have developed the ELECTRE method with type-2 fuzzy data. Vahdani and Hadipour [4] presented a fuzzy ELECTRE method using the concept of the interval-valued fuzzy set (IVFS) with unequal criteria weights, with the criteria values considered as triangular interval-valued fuzzy numbers, which are also used to distinguish the concordance and discordance sets, in order to solve multi-criteria decision-making (MCDM) problems. Vahdani et al. [5] proposed an ELECTRE method using the concepts of interval weights and interval data to distinguish the concordance and discordance sets and then to evaluate a set of alternatives, and applied it to the problem of supplier selection. Wu and Chen [8] proposed an intuitionistic fuzzy (IF) ELECTRE method that uses the concept of the score and accuracy functions, i.e. calculates different combinations of the membership and non-membership functions and the hesitancy degree, to distinguish different kinds of concordance and discordance sets, and then uses the result to rank all alternatives for solving MCDM problems.

The intuitionistic fuzzy set (IFS) was first introduced by Atanassov [1], and the IFS generalizes the fuzzy set introduced by Zadeh [11]. The interval-valued intuitionistic fuzzy set (IVIFS), which combines the IFS concept with the interval-valued fuzzy set concept, was introduced by Atanassov and Gargov [2]. Each IVIFS is characterized by a membership function and a non-membership function whose values are intervals rather than exact numbers, which makes IVIFSs a very useful means of describing the decision information in the process of decision making.

As the literature review shows, few studies have applied the ELECTRE method with IVIFSs to real-life cases. The main purpose of this paper is to further extend the ELECTRE method and develop a new method to solve MCDM problems in interval-valued intuitionistic fuzzy (IVIF) environments. The major difference between the current study and other available papers is the proposed method, whose logic is simple but which is suitable for the vagueness of real-life situations. The proposed method also uses the score and accuracy functions, and adds two more factors, the membership uncertainty and hesitation uncertainty indices; that is, it applies the factors of the membership and non-membership functions and the hesitancy degree to distinguish different kinds of concordance and discordance sets, and then to select the best alternative. The remainder of this paper is organized as follows. Section 2 introduces the decision environment with IVIF data, the score and accuracy functions and some indices, and the construction of the IVIF decision matrix. Section 3 introduces the IVIF ELECTRE method and its algorithm. Section 4 illustrates the proposed method with a numerical example. Section 5 presents the discussion.

II. DECISION ENVIRONMENT WITH IVIF DATA

A. Interval-valued intuitionistic fuzzy sets

Based on the definition of IVIFS in the Atanassov and Gargov study [2], we have:

Definition 1: Let X be a non-empty set of the universe, and $D[0,1]$ be the set of all closed subintervals of $[0,1]$. An IVIFS $\tilde{A}$ in X is an expression defined by

This research is supported by the National Science Council (No. NSC 99-2410-H-182-022-MY3).


$\tilde{A} = \{\langle x, M_{\tilde{A}}(x), N_{\tilde{A}}(x)\rangle \mid x \in X\} = \{\langle x, [M^{L}_{\tilde{A}}(x), M^{U}_{\tilde{A}}(x)], [N^{L}_{\tilde{A}}(x), N^{U}_{\tilde{A}}(x)]\rangle \mid x \in X\},$ (1)

where $M_{\tilde{A}}(x): X \to D[0,1]$ and $N_{\tilde{A}}(x): X \to D[0,1]$ denote the membership degree and the non-membership degree for any $x \in X$, respectively. $M_{\tilde{A}}(x)$ and $N_{\tilde{A}}(x)$ are closed intervals rather than real numbers, and their lower and upper boundaries are denoted by $M^{L}_{\tilde{A}}(x)$, $M^{U}_{\tilde{A}}(x)$, $N^{L}_{\tilde{A}}(x)$ and $N^{U}_{\tilde{A}}(x)$, respectively, with $0 \le M^{U}_{\tilde{A}}(x) + N^{U}_{\tilde{A}}(x) \le 1$.

Definition 2 [2]: For each element $x \in X$, the hesitancy degree of an intuitionistic fuzzy interval of $x$ in $\tilde{A}$ is defined as follows:

$\pi_{\tilde{A}}(x) = 1 - M_{\tilde{A}}(x) - N_{\tilde{A}}(x) = [1 - M^{U}_{\tilde{A}}(x) - N^{U}_{\tilde{A}}(x),\ 1 - M^{L}_{\tilde{A}}(x) - N^{L}_{\tilde{A}}(x)] = [\pi^{L}_{\tilde{A}}(x), \pi^{U}_{\tilde{A}}(x)].$ (2)

Definition 3: The operations on IVIFSs [2,9] are defined as follows: for two IVIFSs $A, B \in \mathrm{IVIFS}(X)$,

(a) $A \subset B$ iff $M^{L}_{A}(x) \le M^{L}_{B}(x)$, $M^{U}_{A}(x) \le M^{U}_{B}(x)$ and $N^{L}_{A}(x) \ge N^{L}_{B}(x)$, $N^{U}_{A}(x) \ge N^{U}_{B}(x)$;

(b) $A = B$ iff $A \subset B$ and $B \subset A$;

(c) $d_{1}(A,B) = \frac{1}{4}\sum_{j=1}^{n}\big[\,|M^{L}_{A}(x_j) - M^{L}_{B}(x_j)| + |M^{U}_{A}(x_j) - M^{U}_{B}(x_j)| + |N^{L}_{A}(x_j) - N^{L}_{B}(x_j)| + |N^{U}_{A}(x_j) - N^{U}_{B}(x_j)|\,\big]$;

(d) $d_{2}(A,B) = \frac{1}{4n}\sum_{j=1}^{n}\big[\,|M^{L}_{A}(x_j) - M^{L}_{B}(x_j)| + |M^{U}_{A}(x_j) - M^{U}_{B}(x_j)| + |N^{L}_{A}(x_j) - N^{L}_{B}(x_j)| + |N^{U}_{A}(x_j) - N^{U}_{B}(x_j)|\,\big]$;

(e) $d_{3}(A,B) = \frac{1}{4}\sum_{j=1}^{n} w_j\big[\,|M^{L}_{A}(x_j) - M^{L}_{B}(x_j)| + |M^{U}_{A}(x_j) - M^{U}_{B}(x_j)| + |N^{L}_{A}(x_j) - N^{L}_{B}(x_j)| + |N^{U}_{A}(x_j) - N^{U}_{B}(x_j)|\,\big]$, (3)

where $w = (w_1, w_2, \ldots, w_n)$ is the weight vector of the elements $x_j\ (j = 1,2,\ldots,n)$. The distances $d_{1}(A,B)$, $d_{2}(A,B)$ and $d_{3}(A,B)$ are the Hamming distance, the normalized Hamming distance, and the weighted Hamming distance, respectively.
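A small Python sketch of the distance measures in (3) follows, with an IVIFN represented as a tuple (M^L, M^U, N^L, N^U); the representation and function names are my own, chosen only to illustrate the formulas.

def hamming_terms(a, b):
    """Sum of absolute differences of the four interval bounds of two IVIFNs."""
    return sum(abs(x - y) for x, y in zip(a, b))

def d1(A, B):                                   # Hamming distance
    return 0.25 * sum(hamming_terms(a, b) for a, b in zip(A, B))

def d2(A, B):                                   # normalized Hamming distance
    return d1(A, B) / len(A)

def d3(A, B, w):                                # weighted Hamming distance
    return 0.25 * sum(wj * hamming_terms(a, b) for wj, a, b in zip(w, A, B))

# Two alternatives described on one criterion, each as (M^L, M^U, N^L, N^U).
A = [(0.4, 0.6, 0.2, 0.4)]
B = [(0.6, 0.7, 0.2, 0.3)]
print(d1(A, B))                                 # 0.1, matching the worked example in Section IV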

B. The score, accuracy functions and some indices

The reviewed studies of score and accuracy functions for handling multi-criteria fuzzy decision-making problems are as follows. In Definition 1, an IVIFS $\tilde{A}$ in X is defined as $\tilde{A} = \{\langle x, [M^{L}_{\tilde{A}}(x), M^{U}_{\tilde{A}}(x)], [N^{L}_{\tilde{A}}(x), N^{U}_{\tilde{A}}(x)]\rangle \mid x \in X\}$; for convenience, we call $\tilde{A}_n = \langle [M^{L}_{\tilde{A}_n}(x), M^{U}_{\tilde{A}_n}(x)], [N^{L}_{\tilde{A}_n}(x), N^{U}_{\tilde{A}_n}(x)]\rangle$ an interval-valued intuitionistic fuzzy number (IVIFN) [10], where $[M^{L}_{\tilde{A}_n}(x), M^{U}_{\tilde{A}_n}(x)] \subset [0,1]$, $[N^{L}_{\tilde{A}_n}(x), N^{U}_{\tilde{A}_n}(x)] \subset [0,1]$, and $M^{U}_{\tilde{A}_n}(x) + N^{U}_{\tilde{A}_n}(x) \le 1$.

Xu [10] defined a score function s to measure the degree of suitability of an IVIFN $\tilde{A}_n$ as follows: $s(\tilde{A}_n) = \frac{1}{2}\big(M^{L}_{\tilde{A}_n}(x) - N^{L}_{\tilde{A}_n}(x) + M^{U}_{\tilde{A}_n}(x) - N^{U}_{\tilde{A}_n}(x)\big)$, where $s(\tilde{A}_n) \in [-1,1]$. The larger the value of $s(\tilde{A}_n)$, the higher the degree of the IVIFN $\tilde{A}_n$. Wei and Wang [7] defined an accuracy function h to evaluate the accuracy degree of an $\tilde{A}_n$ as follows: $h(\tilde{A}_n) = \frac{1}{2}\big(M^{L}_{\tilde{A}_n}(x) + M^{U}_{\tilde{A}_n}(x) + N^{L}_{\tilde{A}_n}(x) + N^{U}_{\tilde{A}_n}(x)\big)$, where $h(\tilde{A}_n) \in [0,1]$. The larger the value of $h(\tilde{A}_n)$, the higher the degree of the IVIFN $\tilde{A}_n$. The membership uncertainty index T was proposed [6] to evaluate the membership uncertainty degree of an IVIFN $\tilde{A}_n$ as follows: $T(\tilde{A}_n) = M^{U}_{\tilde{A}_n}(x) + N^{L}_{\tilde{A}_n}(x) - M^{L}_{\tilde{A}_n}(x) - N^{U}_{\tilde{A}_n}(x)$, where $-1 \le T(\tilde{A}_n) \le 1$. The larger the value of $T(\tilde{A}_n)$, the smaller the IVIFN $\tilde{A}_n$.

The hesitation uncertainty index G of an $\tilde{A}_n$ is defined as follows: $G(\tilde{A}_n) = M^{U}_{\tilde{A}_n}(x) + N^{U}_{\tilde{A}_n}(x) - M^{L}_{\tilde{A}_n}(x) - N^{L}_{\tilde{A}_n}(x)$, and the larger the value of $G(\tilde{A}_n)$, the smaller the IVIFN $\tilde{A}_n$.

In this study, we classify different types of concordance and discordance sets using the concepts of the score and accuracy functions, the membership uncertainty index and the hesitation uncertainty index in the proposed method.
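The four indicator functions above reduce to one-liners; the sketch below uses the same tuple representation (M^L, M^U, N^L, N^U) as the earlier distance sketch and is only an illustration of the formulas, not code from the paper.

def score(a):                    # s in [-1, 1]
    mL, mU, nL, nU = a
    return 0.5 * (mL - nL + mU - nU)

def accuracy(a):                 # h in [0, 1]
    mL, mU, nL, nU = a
    return 0.5 * (mL + mU + nL + nU)

def membership_uncertainty(a):   # T in [-1, 1]
    mL, mU, nL, nU = a
    return mU + nL - mL - nU

def hesitation_uncertainty(a):   # G
    mL, mU, nL, nU = a
    return mU + nU - mL - nL

x = (0.4, 0.5, 0.3, 0.4)         # an IVIFN taken from the numerical example in Section IV
print(score(x), accuracy(x), membership_uncertainty(x), hesitation_uncertainty(x))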

C. Construction of the IVIF decision matrix

We extend the canonical matrix format to an IVIF decision matrix $\tilde{M}$. An IVIFS $\tilde{A}_i$ of the ith alternative on X is given by

$\tilde{A}_i = \{\langle x_j, \tilde{X}_{ij}\rangle \mid x_j \in X\},$

where $\tilde{X}_{ij} = ([M^{L}_{ij}(x), M^{U}_{ij}(x)], [N^{L}_{ij}(x), N^{U}_{ij}(x)])$.

The $\tilde{X}_{ij}$ indicate the membership and non-membership intervals of the ith alternative with respect to the jth criterion. The IVIF decision matrix $\tilde{M}$ can be expressed as follows:


$\tilde{M} = \begin{bmatrix} \tilde{X}_{11} & \cdots & \tilde{X}_{1n} \\ \vdots & \ddots & \vdots \\ \tilde{X}_{m1} & \cdots & \tilde{X}_{mn} \end{bmatrix} = \begin{bmatrix} ([M^{L}_{11}, M^{U}_{11}],[N^{L}_{11}, N^{U}_{11}]) & \cdots & ([M^{L}_{1n}, M^{U}_{1n}],[N^{L}_{1n}, N^{U}_{1n}]) \\ \vdots & \ddots & \vdots \\ ([M^{L}_{m1}, M^{U}_{m1}],[N^{L}_{m1}, N^{U}_{m1}]) & \cdots & ([M^{L}_{mn}, M^{U}_{mn}],[N^{L}_{mn}, N^{U}_{mn}]) \end{bmatrix}.$ (4)

An IVIFS W, a set of grades of importance, in X is defined as follows:

$W = \{\langle x_j, w_j(x_j)\rangle \mid x_j \in X\},$ (5)

where $0 \le w_j(x_j) \le 1$, $\sum_{j=1}^{n} w_j(x_j) = 1$, and $w_j(x_j)$ is the degree of importance assigned to each criterion.

III. ELECTRE METHOD WITH IVIF DATA

The proposed method utilizes the concept of the score and accuracy functions to distinguish the concordance set and the discordance set from the evaluation information with IVIFS data, then constructs the concordance, discordance, and concordance (discordance, aggregate) dominance matrices, respectively, and finally selects the best alternative from the aggregate dominance matrix. In this section, the IVIF ELECTRE method and its algorithm are introduced and used throughout this paper.

A. The IVIF ELECTRE method

The concordance and discordance sets with IVIF data and their definitions are as follows.

Definition 4: The concordance set $C_{kl}$ is defined as

$C^{1}_{kl} = \{\, j \mid M^{L}_{kj} - N^{L}_{kj} + M^{U}_{kj} - N^{U}_{kj} > M^{L}_{lj} - N^{L}_{lj} + M^{U}_{lj} - N^{U}_{lj} \,\},$ (6)

$C^{2}_{kl} = \{\, j \mid M^{L}_{kj} + M^{U}_{kj} + N^{L}_{kj} + N^{U}_{kj} > M^{L}_{lj} + M^{U}_{lj} + N^{L}_{lj} + N^{U}_{lj} \,\}$ when $s(\tilde{X}_{kj}) = s(\tilde{X}_{lj})$, (7)

$C^{3}_{kl} = \{\, j \mid M^{U}_{kj} + N^{L}_{kj} - M^{L}_{kj} - N^{U}_{kj} < M^{U}_{lj} + N^{L}_{lj} - M^{L}_{lj} - N^{U}_{lj} \,\}$ when $h(\tilde{X}_{kj}) = h(\tilde{X}_{lj})$, (8)

$C^{4}_{kl} = \{\, j \mid M^{U}_{kj} + N^{U}_{kj} - M^{L}_{kj} - N^{L}_{kj} \le M^{U}_{lj} + N^{U}_{lj} - M^{L}_{lj} - N^{L}_{lj} \,\}$ when $T(\tilde{X}_{kj}) = T(\tilde{X}_{lj})$, (9)

where $C_{kl} = \{C^{1}_{kl}, C^{2}_{kl}, C^{3}_{kl}, C^{4}_{kl}\}$, $J = \{j \mid j = 1,2,\ldots,n\}$, and $\tilde{X}_{kj}, \tilde{X}_{lj}$ stand for the lower and upper boundaries of alternatives k and l in criterion j, respectively.

The $s(\tilde{X}_{kj})$, $h(\tilde{X}_{kj})$ and $T(\tilde{X}_{kj})$ are the score function, accuracy function and membership uncertainty index, respectively, which are defined in Section II.B.

Definition 5: The discordance set $D_{kl}$ is defined as

$D^{1}_{kl} = \{\, j \mid M^{L}_{kj} - N^{L}_{kj} + M^{U}_{kj} - N^{U}_{kj} < M^{L}_{lj} - N^{L}_{lj} + M^{U}_{lj} - N^{U}_{lj} \,\},$ (10)

$D^{2}_{kl} = \{\, j \mid M^{L}_{kj} + M^{U}_{kj} + N^{L}_{kj} + N^{U}_{kj} < M^{L}_{lj} + M^{U}_{lj} + N^{L}_{lj} + N^{U}_{lj} \,\}$ when $s(\tilde{X}_{kj}) = s(\tilde{X}_{lj})$, (11)

$D^{3}_{kl} = \{\, j \mid M^{U}_{kj} + N^{L}_{kj} - M^{L}_{kj} - N^{U}_{kj} > M^{U}_{lj} + N^{L}_{lj} - M^{L}_{lj} - N^{U}_{lj} \,\}$ when $h(\tilde{X}_{kj}) = h(\tilde{X}_{lj})$, (12)

$D^{4}_{kl} = \{\, j \mid M^{U}_{kj} + N^{U}_{kj} - M^{L}_{kj} - N^{L}_{kj} > M^{U}_{lj} + N^{U}_{lj} - M^{L}_{lj} - N^{L}_{lj} \,\}$ when $T(\tilde{X}_{kj}) = T(\tilde{X}_{lj})$, (13)

where $D_{kl} = \{D^{1}_{kl}, D^{2}_{kl}, D^{3}_{kl}, D^{4}_{kl}\}$.

The relative value of the concordance set of the IVIF ELECTRE method is measured through the concordance index. The concordance index $g_{kl}$ between $A_k$ and $A_l$ is defined as:

$g_{kl} = \sum_{j \in C_{kl}} \omega_C \times w_j(x_j),$ (14)

where $\omega_C$ is the weight of the concordance set, and $w_j(x_j)$ is defined in (5).

The concordance matrix G is defined as follows:

The concordance matrix G is defined as follows:

12 1

21 23 2

1 1 1

1 2 1

( ) ( )

( )

... ......

... ... ... ...... ...

...

m

m

m m m

m m m m

g gg g g

Gg g

g g g− −

−⎡ ⎤⎢ ⎥−⎢ ⎥⎢ ⎥−=⎢ ⎥

−⎢ ⎥⎢ ⎥−⎣ ⎦

, (15)

where the maximum value of klg is denoted by *g .

The discordance index reflects the degree to which the evaluation of a certain $A_k$ is worse than the evaluation of a competing $A_l$. The discordance index is defined as follows:


$h_{kl} = \dfrac{\max_{j \in D_{kl}} \omega_D \times d(\tilde{X}_{kj}, \tilde{X}_{lj})}{\max_{j \in J} d(\tilde{X}_{kj}, \tilde{X}_{lj})},$ (16)

where $d(\tilde{X}_{kj}, \tilde{X}_{lj})$ is defined in (3), and $\omega_D$ is the weight of the discordance set in the IVIF ELECTRE method.

The discordance matrix H is defined as follows:

$H = \begin{bmatrix} - & h_{12} & \cdots & h_{1m} \\ h_{21} & - & \cdots & h_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ h_{m1} & h_{m2} & \cdots & - \end{bmatrix},$ (17)

where the maximum value of $h_{kl}$ is denoted by $h^{*}$, which is more discordant than the other cases.

The concordance dominance matrix K is defined as follows:

$K = \begin{bmatrix} - & k_{12} & \cdots & k_{1m} \\ k_{21} & - & \cdots & k_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ k_{m1} & k_{m2} & \cdots & - \end{bmatrix},$ (18)

where $k_{kl} = g^{*} - g_{kl}$, and a higher value of $k_{kl}$ indicates that $A_k$ is less favorable than $A_l$.

The discordance dominance matrix L is defined as follows:

$L = \begin{bmatrix} - & l_{12} & \cdots & l_{1m} \\ l_{21} & - & \cdots & l_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & - \end{bmatrix},$ (19)

where $l_{kl} = h^{*} - h_{kl}$, and a higher value of $l_{kl}$ indicates that $A_k$ is preferred over $A_l$.

The aggregate dominance matrix R is defined as follows:

$R = \begin{bmatrix} - & r_{12} & \cdots & r_{1m} \\ r_{21} & - & \cdots & r_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m1} & r_{m2} & \cdots & - \end{bmatrix},$ (20)

where

$r_{kl} = \dfrac{l_{kl}}{k_{kl} + l_{kl}},$ (21)

$k_{kl}$ and $l_{kl}$ are defined in (18) and (19), and $r_{kl}$ is in the range from 0 to 1. A higher value of $r_{kl}$ indicates that the alternative $A_k$ is more concordant than the alternative $A_l$; thus, it is a better alternative. In the best alternative selection process,

$T_k = \dfrac{1}{m-1}\sum_{l=1,\, l \ne k}^{m} r_{kl}, \quad k = 1,2,\ldots,m,$ (22)

and $T_k$ is the final value of the evaluation. All alternatives can be ranked according to the value of $T_k$. The best alternative $A^{*}$ with $T^{*}_k$ can be generated and defined as follows:

$T^{*}(A^{*}) = \max_k T_k,$ (23)

where $T^{*}_k$ is the final value of the best alternative and $A^{*}$ is the best alternative.
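The dominance-matrix and ranking steps (18)-(23) are mechanical once G and H are known. The NumPy sketch below reproduces them, using the G and H matrices of the numerical example in Section IV (with the diagonal stored as NaN); it is purely an illustration of the formulas, not the authors' implementation.

import numpy as np

def rank_alternatives(G, H):
    """Compute T_k (eq. 22) from the concordance and discordance matrices."""
    K = np.nanmax(G) - G                               # k_kl = g* - g_kl, eq. (18)
    L = np.nanmax(H) - H                               # l_kl = h* - h_kl, eq. (19)
    R = L / (K + L)                                    # r_kl, eq. (21); NaN stays on the diagonal
    return np.nanmean(R, axis=1)                       # average over the m-1 off-diagonal entries

G = np.array([[np.nan, 0.8, 0.8, 0.8],
              [1.0, np.nan, 1.0, 0.5],
              [0.5, 1.0, np.nan, 0.5],
              [1.0, 1.0, 1.0, np.nan]])
H = np.array([[np.nan, 0.267, 0.143, 0.357],
              [0.0, np.nan, 0.0, 1.0],
              [0.143, 0.0, np.nan, 1.0],
              [0.0, 0.0, 0.0, np.nan]])
print(np.round(rank_alternatives(G, H), 3))            # approximately [0.786 0.667 0.544 1.000]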

B. Algorithm

The algorithm and decision process of the IVIF ELECTRE method can be summarized in the following four steps; Step 3 consists of calculating the concordance and discordance matrices, constructing the concordance dominance and discordance dominance matrices, and determining the aggregate dominance matrix. Figure 1 illustrates a conceptual model of the proposed method.

Step 1. Construct the decision matrix, using (4) and (5).
Step 2. Identify the concordance and discordance sets, using (6)-(13).
Step 3. Calculate the matrices, using (14)-(21).
Step 4. Choose the best alternative, using (22) and (23).

Figure 1. The process of the IVIF ELECTRE method algorithm.

IV. NUMERICAL EXAMPLE

In this section, we present an example connected to a decision-making problem of best alternative selection. Suppose a potential banker intends to invest money in one of four possible alternatives (companies), named A1, A2, A3, and A4. The criteria for a company are $x_1$ (risk analysis), $x_2$ (growth analysis), and $x_3$ (environmental impact analysis) in the selection problem. The subjective importance levels of the different criteria, W, are given by the decision makers:


$W = [w_1, w_2, w_3] = [0.35, 0.25, 0.4]$. The decision makers also give the relative weights as follows: $W' = [\omega_C, \omega_D] = [1, 1]$. The IVIF decision matrix $\tilde{M}$ is given with cardinal information:

$\tilde{M} = \begin{bmatrix} ([0.4,0.5],[0.3,0.4]) & ([0.4,0.6],[0.2,0.4]) & ([0.1,0.3],[0.5,0.6]) \\ ([0.4,0.6],[0.2,0.3]) & ([0.6,0.7],[0.2,0.3]) & ([0.4,0.7],[0.1,0.2]) \\ ([0.3,0.6],[0.3,0.4]) & ([0.5,0.6],[0.3,0.4]) & ([0.5,0.6],[0.1,0.3]) \\ ([0.7,0.8],[0.1,0.2]) & ([0.6,0.7],[0.1,0.3]) & ([0.3,0.4],[0.1,0.2]) \end{bmatrix}$

(Step 1 has been completed.)

Applying Step 2, the concordance and discordance sets are identified using the result of Step 1.

The concordance set, applying (6) - (9), is:

$C_{kl} = \begin{bmatrix} - & \{1,3\} & \{1,3\} & \{1,3\} \\ \{1,2,3\} & - & \{1,2,3\} & \{2,3\} \\ \{2,3\} & \{1,2,3\} & - & \{2,3\} \\ \{1,2,3\} & \{1,2,3\} & \{1,2,3\} & - \end{bmatrix}.$

For example, $C_{24}$, which is in the 2nd row and the 4th column of the concordance set matrix, is {2,3}.

The discordance set, obtained by applying (10) - (13), is as follows:

$D_{kl} = \begin{bmatrix} - & \{2\} & \{2\} & \{2\} \\ - & - & - & \{1\} \\ \{1\} & - & - & \{1\} \\ - & - & - & - \end{bmatrix}.$

Applying Step 3, the concordance matrix is calculated.

$G = \begin{bmatrix} - & 0.8 & 0.8 & 0.8 \\ 1 & - & 1 & 0.5 \\ 0.5 & 1 & - & 0.5 \\ 1 & 1 & 1 & - \end{bmatrix}.$

For example, $g_{21} = \omega_C \times w_1 + \omega_C \times w_2 + \omega_C \times w_3 = 1 \times 0.35 + 1 \times 0.25 + 1 \times 0.40 = 1.0$.

The discordance matrix is calculated:

$H = \begin{bmatrix} - & 0.267 & 0.143 & 0.357 \\ 0 & - & 0 & 1 \\ 0.143 & 0 & - & 1 \\ 0 & 0 & 0 & - \end{bmatrix}.$

For example:

$h_{12} = \dfrac{\max_{j \in D_{12}} \omega_D \times d(\tilde{X}_{1j}, \tilde{X}_{2j})}{\max_{j \in J} d(\tilde{X}_{1j}, \tilde{X}_{2j})} = \dfrac{0.100}{0.375} = 0.267,$

where

$d(\tilde{X}_{13}, \tilde{X}_{23}) = \tfrac{1}{4}\big(|0.1 - 0.4| + |0.3 - 0.7| + |0.5 - 0.1| + |0.6 - 0.2|\big) = 0.375,$

and

$\omega_D \times d(\tilde{X}_{12}, \tilde{X}_{22}) = 1 \times \tfrac{1}{4}\big(|0.4 - 0.6| + |0.6 - 0.7| + |0.2 - 0.2| + |0.4 - 0.3|\big) = 0.100.$

The concordance dominance matrix is constructed as follows.

$K = \begin{bmatrix} - & 0.2 & 0.2 & 0.2 \\ 0 & - & 0 & 0.5 \\ 0.5 & 0 & - & 0.5 \\ 0 & 0 & 0 & - \end{bmatrix}.$

The discordance dominance matrix is constructed as follows.

$L = \begin{bmatrix} - & 0.733 & 0.857 & 0.643 \\ 1 & - & 1 & 0 \\ 0.857 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{bmatrix}.$

The aggregate dominance matrix is determined:

$R = \begin{bmatrix} - & 0.786 & 0.811 & 0.763 \\ 1 & - & 1 & 0 \\ 0.632 & 1 & - & 0 \\ 1 & 1 & 1 & - \end{bmatrix}.$

Applying Step 4, the best alternative is chosen:

$T_1 = 0.786$, $T_2 = 0.667$, $T_3 = 0.544$, $T_4 = 1.000$.

The optimal ranking order of the alternatives is $A_4 \succ A_1 \succ A_2 \succ A_3$. The best alternative is $A_4$.

V. DISCUSSION

In this study, we provide a new method, the IVIF ELECTRE method, for solving MCDM problems with IVIF information. A decision maker can use the proposed method to gain valuable information from the evaluation data provided by users, who do not usually provide preference data. Decision makers utilize IVIF data instead of single values in the evaluation process of the ELECTRE method and use those data to classify different kinds of concordance and discordance sets



to fit a real decision environment. This new approach integrates the concept of the outranking relationship of the ELECTRE method. In the proposed method, we can classify different types of concordance and discordance sets using the concepts of the score function, accuracy function, membership uncertainty degree and hesitation uncertainty index, and use the concordance and discordance sets to construct the concordance and discordance matrices. Furthermore, decision makers can choose the best alternative using the concepts of positive and negative ideal points. We used the proposed method to rank all alternatives and determine the best alternative. This paper is a first step in using the IVIF ELECTRE method to solve MCDM problems. In a future study, we will apply the proposed method to predict consumer decision making using a questionnaire in an empirical study of a service provider selection issue.

REFERENCES

[1] K. T. Atanassov, “Intuitionistic fuzzy sets,” Fuzzy sets and Systems, vol. 20, pp. 87-96, 1986.

[2] K. Atanassov and G. Gargov, “Interval valued intuitionistic fuzzy sets,” Fuzzy sets and Systems, vol. 31, pp. 343-349, 1989.

[3] B. Roy, “Classement et choix en présence de points de vue multiples (la méthode ELECTRE),” RIRO, vol. 8, pp. 57-75, 1968.

[4] B. Vahdani and H. Hadipour, “Extension of the ELECTRE method based on interval-valued fuzzy sets,” Soft Computing, vol. 15, pp. 569-579, 2011.

[5] B. Vahdani, A. H. K. Jabbari, V. Roshanaei, and M. Zandieh, “Extension of the ELECTRE method for decision-making problems with interval weights and data,” International Journal of Advanced Manufacturing Technology, vol. 50, pp. 793-800, 2010.

[6] Z. Wang, K. W. Li, and W. Wang, “An approach to multiattribute decision making with interval-valued intuitionistic fuzzy assessments and incomplete weights,” Information Sciences, vol. 179, pp. 3026-3040, 2009.

[7] G. W. Wei, and X. R. Wang, “Some geometric aggregation operators on interval-valued intuitionistic fuzzy sets and their application to group decision making,” International conference on computational intelligence and security, pp. 495-499, December 2007.

[8] M.-C. Wu and T.-Y. Chen, “The ELECTRE multicriteria analysis approach based on Atanassov's intuitionistic fuzzy sets,” Expert Systems with Applications, vol. 38, pp. 12318-12327, 2011.

[9] Z. S. Xu, “On similarity measures of interval-valued intuitionistic fuzzy sets and their application to pattern recognitions,” Journal of Southeast University, vol. 23, pp. 139 -143, 2007a.

[10] Z. S. Xu, “Methods for aggregating interval-valued intuitionistic fuzzy information and their application to decision making,” Control and Decision, vol. 22, pp. 215 -219, 2007b.

[11] L. A. Zadeh, “Fuzzy Sets,” Information and Control, vol. 8, pp. 338-353, 1965.


Optimizing of Interval Type-2 Fuzzy Logic Systems Using Hybrid Heuristic Algorithm Evaluated by Classification

Adisak Sangsongfa and Phayung Meesad
Department of Information Technology, Faculty of Information Technology
King Mongkut's University of Technology North Bangkok, Bangkok, Thailand
Email: adisak [email protected], [email protected]

Abstract— In this research, an optimization of the rule base and the parameters of interval type-2 fuzzy set generation by a hybrid heuristic algorithm using particle swarm and genetic algorithms is proposed for a classification application. For the Iris data set, 90 records were selected randomly for training, and the rest, 60 records, were used for testing. For the Wisconsin Breast Cancer data set, the 16 records with missing attribute values were deleted, 500 records were selected randomly for training, and the rest, 183 records, were used for testing. The proposed method was able to minimize the rule base, minimize the number of linguistic variables, and produce an accurate classification of 95% with the first dataset and 98.71% with the second dataset.

Keywords-Interval Type-2 Fuzzy Logic Systems; GA; PSO;

I. INTRODUCTION

In 1965, Lotfi A. Zadeh, professor of computer science at the University of California, Berkeley, developed a fuzzy logic system which has been widely used in many areas such as decision making, classification, control, prediction, optimization and so on. However, this fuzzy logic system is the original system, now called the type-1 fuzzy set, and it sometimes cannot solve certain problems, especially problems that are very large, complex and/or uncertain. Therefore, in 1975 Zadeh developed and formulated a type-2 fuzzy set to meet the needs of data sets which are complex and uncertain. Thus, the type-2 fuzzy set has been used widely and continuously in many cases [1].

Recently, there has been growing interest in the interval type-2 fuzzy set, which is a special case of the type-2 fuzzy set, because Mendel and John [2] reformulated all set operations in both the vertical-slice and wavy-slice manner. They concluded that general type-2 fuzzy set operations are too complex to understand and implement, but operations using the interval type-2 fuzzy set involve only simple interval arithmetic, which means computation costs are reduced. The interval type-2 fuzzy logic system consists of four parts: fuzzification, fuzzy rule base, inference engine and defuzzification. Moreover, the fuzzy rule base and interval type-2 fuzzy sets are complicated when determining the exact membership function and a complete fuzzy rule base. So, the

optimization of the interval type-2 fuzzy set and fuzzy rule base must be used to estimate the values given by an expert system. Many researchers have proposed and introduced optimization of the interval type-2 fuzzy set and fuzzy rule base: Zhao [3] proposed an adaptive interval type-2 fuzzy set using gradient descent algorithms to optimize the inference engine and fuzzy rule base, and Hidalgo [4] proposed optimization of interval type-2 fuzzy sets applied to modular neural networks using a genetic algorithm. Moreover, many researchers apply the interval type-2 fuzzy logic system to uncertain datasets, and the creation of an optimized interval type-2 fuzzy logic system will give the most accurate outputs. There are also many optimization techniques which have been proposed for building interval type-2 fuzzy systems. Some traditional optimization techniques are based on mathematics and some are based on heuristic algorithms. Some optimization techniques, such as heuristic optimization, are often difficult and time consuming. Sometimes, an improvement of the heuristic algorithms provides good performance, such as the hybrid heuristic algorithms [5]. Moreover, the hybrid heuristic is a much younger candidate algorithm compared to the genetic algorithm and particle swarm optimization in the domain of meta-heuristic-based optimization.

In this paper, a new algorithm called the hybrid heuristic algorithm, which combines a genetic algorithm with particle swarm optimization, is proposed, together with an optimization of the interval type-2 fuzzy set and fuzzy rule base using the proposed hybrid heuristic algorithm. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables and maximizing the accuracy of the output. Then the framework and the corresponding algorithms are tested and evaluated to prove the concept by applying them to the Iris dataset [6] and the Wisconsin Breast Cancer dataset as examples of classification [7].


II. RELATED WORK

A. Particle Swarm Optimization (PSO)

The PSO initializes a swarm of particles at random, with each particle deciding its new velocity and position based on its past optimal position Pi and the past optimal position of the swarm Pg. Let xi=(xi1, xi2, ..., xin) represent the current position of particle i, vi=(vi1, vi2, ..., vin) its current velocity and Pi=(pi1, pi2, ..., pin) its past optimal position; then the particle uses the following equations to adjust its velocity and position:

$V_{i,(t+1)} = wV_{i,(t)} + c_1 r_1 (P_i - x_{i,(t)}) + c_2 r_2 (P_g - x_{i,(t)})$   (1)

$x_{i,(t+1)} = x_{i,(t)} + V_{i,(t+1)}$   (2)

where c1 and c2 are acceleration constants in the range [0, 2], r1 and r2 are random numbers in [0, 1], and w is the inertia weight, which is used to maintain the momentum of the particle. The first term on the right-hand side of (1) is the particle's velocity at time t. The second term represents "self learning" by the particle based on its own history. The last term reflects "social learning" through information sharing among individual particles in the swarm. All three parts contribute to the particle's search ability in the analyzed space, which simulates the swarm behavior mathematically [8].
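A minimal sketch of the update in equations (1) and (2) is shown below; the inertia weight, acceleration constants and starting values are illustrative placeholders, not settings from this paper.

import random

def pso_step(x, v, p_best, g_best, w=0.7, c1=2.0, c2=2.0):
    """One PSO update for a single particle (lists of equal length)."""
    new_v, new_x = [], []
    for xi, vi, pi, gi in zip(x, v, p_best, g_best):
        r1, r2 = random.random(), random.random()
        vi_next = w * vi + c1 * r1 * (pi - xi) + c2 * r2 * (gi - xi)  # eq. (1)
        new_v.append(vi_next)
        new_x.append(xi + vi_next)                                    # eq. (2)
    return new_x, new_v

# Illustrative call with placeholder positions and bests.
x, v = [0.5, -1.0], [0.0, 0.0]
x, v = pso_step(x, v, p_best=[0.2, 0.1], g_best=[0.0, 0.0])
print(x, v)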

B. Genetic Algorithm (GA)

A GA generally has four components: 1) a population of individuals where each individual in the population represents a possible solution, 2) a fitness function which is an evaluation function by which we can tell if an individual is a good solution or not, 3) a selection function which decides how to pick good individuals from the current population for creating the next generation, and 4) genetic operators such as crossover and mutation which explore new regions of search space while keeping some of the current information at the same time.

GAs are based on genetics, especially on Darwin's theory of survival of the fittest. This states that the weaker members of a species tend to die away, leaving the stronger and fitter. The surviving members create offspring and ensure the continuing survival of the species. This concept, together with the concept of natural selection, is used in information technology to enhance the performance of computers [9].
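The four components listed above can be sketched as follows; the bit-string encoding, toy fitness function and tournament selection are illustrative placeholders rather than the encoding used later in this paper.

import random

def fitness(ind):                 # 2) evaluation function (toy: count of 1-bits)
    return sum(ind)

def select(pop):                  # 3) tournament selection of one parent
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):            # 4a) one-point crossover
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(ind, rate=0.05):       # 4b) bit-flip mutation
    return [1 - g if random.random() < rate else g for g in ind]

pop = [[random.randint(0, 1) for _ in range(10)] for _ in range(20)]  # 1) population
for _ in range(30):               # evolve a few generations
    pop = [mutate(crossover(select(pop), select(pop))) for _ in pop]
print(max(fitness(ind) for ind in pop))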

C. Interval Type-2 Fuzzy Set

Interval type-2 fuzzy sets are particularly useful when it is difficult to determine the exact membership function, or in modeling the diverse opinions of different individuals. The membership function, by which the interval type-2 fuzzy inference system approximates expert knowledge and judgment under uncertain conditions, can be constructed from surveys or using optimization algorithms. Its basic framework consists of four parts: fuzzification, fuzzy rule base, fuzzy inference engine and defuzzification, as shown in Fig. 1.

Fig. 1. Interval Type-2 Fuzzy System

We can describe the interval type-2 fuzzy logic system as follows: the crisp inputs are first fuzzified into input interval type-2 fuzzy sets. The fuzzifier creates the membership functions, which are characterized by the type of membership function, the linguistic variables and the fuzzy rule base. There are many types of membership function, such as the triangular membership function, trapezoidal membership function, Gaussian membership function, smooth membership function, Z-membership function and so on. The fuzzifier then sends the interval type-2 fuzzy sets into the inference engine and the rule base to produce output type-2 fuzzy sets. The interval type-2 fuzzy logic system rules remain the same as in the type-1 fuzzy logic system, but the antecedents and/or consequents are represented by interval type-2 fuzzy sets. A finite number of fuzzy rules can be represented in if-then form and integrated into the fuzzy rule base. A standard fuzzy rule base is shown below.

$R^1$: If $x_1$ is $A^1_1$ and $x_2$ is $A^1_2$, ..., $x_n$ is $A^1_n$ Then $y$ is $B^1$.
$R^2$: If $x_1$ is $A^2_1$ and $x_2$ is $A^2_2$, ..., $x_n$ is $A^2_n$ Then $y$ is $B^2$.
...
$R^M$: If $x_1$ is $A^M_1$ and $x_2$ is $A^M_2$, ..., $x_n$ is $A^M_n$ Then $y$ is $B^M$.

where $x_1, \dots, x_n$ are state variables and $y$ is the control variable. The linguistic values $\tilde{A}^j_1, \dots, \tilde{A}^j_n$ and $B^j$ ($j = 1, 2, \dots, M$) are respectively defined on the universes $U_1, \dots, U_n$ and $V$. In fuzzification, the crisp input variable $x_i$ is mapped into an interval type-2


fuzzy set $\tilde{A}_{x_i}$, $i = 1, 2, \dots, n$. The inference engine combines all the fired rules and gives a non-linear mapping from the input interval type-2 fuzzy sets to the output interval type-2 fuzzy sets. The multiple antecedents in each rule are connected by the Meet operation, the membership grades in the input sets are combined with those in the output sets by the extended sup-star composition, and multiple rules are combined by the Join operation. The type-2 fuzzy outputs of the inference engine are then processed by the type-reducer, which combines the output sets and performs a centroid calculation that leads to type-1 fuzzy sets called the type-reduced sets. After the type-reduction process, the type-reduced sets are defuzzified (by taking the average of the type-reduced set) to obtain crisp outputs [3].

In the interval type-2 fuzzy logic system design, we assumed a Z-membership function for the first membership function, a triangular membership function for the second membership function and a smooth membership function for the last membership function, with center-of-sets type reduction and defuzzification using the centroid of the type-reduced set.
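To make the footprint-of-uncertainty idea concrete, the sketch below evaluates an interval membership grade for a crisp input using a lower and an upper triangular membership function; this is a generic illustration, not the exact Z/triangular/smooth functions and parameters used in this paper.

def tri(x, a, b, c):
    """Standard triangular membership value of x for vertices a <= b <= c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def interval_membership(x, lower_params, upper_params):
    """Return the (lower, upper) membership grade of crisp input x."""
    lo = tri(x, *lower_params)
    hi = tri(x, *upper_params)
    return min(lo, hi), max(lo, hi)

# Illustrative lower/upper membership functions forming a footprint of uncertainty.
print(interval_membership(5.0, (2.0, 5.0, 7.0), (1.0, 5.0, 8.0)))  # (1.0, 1.0)
print(interval_membership(6.0, (2.0, 5.0, 7.0), (1.0, 5.0, 8.0)))  # (0.5, ~0.667)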

III. THE PROPOSED FRAMEWORK

In our framework, we present a new hybrid heuristic algorithm which is developed to optimize the interval type-2 fuzzy logic system using the Iris and breast cancer datasets. The new algorithm for optimizing the interval type-2 fuzzy sets and fuzzy rule base uses hybrid heuristic searches which are a sequential combination of GA and PSO. The proposed algorithm is used to optimize the number of linguistic variables, the parameters of the membership functions and the rule base, subject to the constraints of a minimum number of linguistic variables, a minimum rule base and maximum accuracy. The framework is shown in Fig. 2.

From the framework, we can describe the steps of the proposed method for optimizing the interval type-2 fuzzy set and fuzzy rule base using hybrid heuristic searches. The framework is given in four steps described below.

Step 1: Determine the structure of the interval type-2 fuzzy system framework.

Step 2: Determine the fuzzy rule base using clustering.

Step 3: Determine the universes of the input and output variables, their types of membership functions and the linguistic parameters of the membership functions.

Step 4: Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithm, which is a combination of GA and PSO.

Fig. 2. Framework of Optimization of the Interval Type-2 Fuzzy System Using Hybrid Heuristic Algorithms

1) Determine the structure of the interval type-2 fuzzy system framework

In Fig. 2, the framework shows the structure of the optimization of the interval type-2 fuzzy sets and rules based on hybrid heuristic algorithms. The hybrid heuristic algorithm uses sequential hybridization. The GA is used for the first local optimization of the interval type-2 fuzzy sets, which consist of the interval type-2 membership functions, the interval type-2 linguistic parameters (LMF, UMF) and the rule base. Moreover, the PSO is used for the final optimization, which obtains the best "don't care" rules.

2) Determine the fuzzy rule base using clustering

We used the K-means clustering algorithm [10] to group the dataset to determine the feasibility of a fuzzy rule base. The standard K-means clustering objective is shown as follows:

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 , \qquad (3)$$

where $k$ is the number of clusters, $\left\| x_i^{(j)} - c_j \right\|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the $n$ data points from their respective cluster centres.
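As an illustration of Step 2, the sketch below clusters a set of training records and treats each cluster centre as the seed of a candidate fuzzy rule; it assumes scikit-learn is available, uses random placeholder data, and K=7 simply mirrors the Iris setting reported in Section IV.

import numpy as np
from sklearn.cluster import KMeans  # assumes scikit-learn is available

# Placeholder data: e.g. 90 Iris training records with 4 attributes each.
X = np.random.rand(90, 4)
km = KMeans(n_clusters=7, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_.shape)  # (7, 4): one candidate rule seed per cluster
print(round(km.inertia_, 3))      # J in equation (3): sum of squared distances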

3) Determine the universes of the input and output variables and their types of membership functions

In the universes of the input and output variables and their primary membership functions, the Z-membership function, triangular membership function and smooth membership function were used, as shown in Fig. 3. In Fig. 3, the four attributes of the Iris membership functions are displayed and graded as attribute1=2, attribute2=2, attribute3=5 and attribute4=5. The definition of the linguistic labels and the number of linguistic variables are given in Table I.

TABLE I. PREDEFINED MEMBERSHIP FUNCTIONS FOR FIVE LINGUISTIC VARIABLES

Linguistic Index   Linguistic Term
0                  Don't Care
1                  Very Low
2                  Low
3                  Medium
4                  High
5                  Very High

Fig. 3. The Example of Interval type-2 Membership Functions

4) Determine and optimize the fuzzy inference engine using the hybrid heuristic algorithms

Firstly, the fuzzy rule-based system is encoded into a genotype, or chromosome. Each chromosome represents a fuzzy system composed of the number of linguistic variables in each dimension, the membership function parameters of each linguistic variable, and the fuzzy rules, which include the don't-care rules from the PSO. A chromosome (chrom) consists of 4 parts or genes:

$$\mathrm{chrom} = [\,\underbrace{IM,\ IL,\ R}_{\text{by GA}},\ \underbrace{DcR}_{\text{by PSO}}\,] \qquad (4)$$

where $IL = [IL_1, IL_2, \dots, IL_n]$ is the set of the numbers of interval linguistic variables, $IM = [im_{11}, im_{12}, \dots, im_{n,IL_n}]$ is the set of interval membership function parameters of the interval linguistic variables, and $R = [R_1, R_2, \dots, R_{IL_1 \times IL_2 \times \dots \times IL_n}]$ is the fuzzy rule part; $R_1$ is an integer number that is the index of the linguistic variable of each dimension. $DcR = [Ra^{111}_{L_1 \times L_2 \times \dots \times L_n}, Ra^{112}_{L_1 \times L_2 \times \dots \times L_n}, \dots, Ra^{lmk}_{L_1 \times L_2 \times \dots \times L_n}]$ is the don't-care rule part, where $Ra^{lmk}_{L_1 \times L_2 \times \dots \times L_n}$ is the integer number which is the index of the don't-care rule of each rule. The length of a chromosome can vary depending on the fuzzy partition created by cross sections of the linguistic variables from each dimension. Then, the fitness function is

$$Fit = Acc(\mathrm{chrom}_i),$$

where $\mathrm{chrom}_i \in [\mathrm{chrom}_1, \mathrm{chrom}_2, \dots, \mathrm{chrom}_n]$, the set of chromosomes. The accuracy ($Acc$) is

$$Acc = \frac{\text{Number of Correct Classifications}}{\text{Total Number of Training Data}} .$$
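A hedged sketch of the chromosome of equation (4) and the accuracy-based fitness is given below; the classify function is a placeholder for the interval type-2 fuzzy inference step, and the sample records and parameter values are illustrative only.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Chromosome:
    IM:  List[float]   # interval membership function parameters (by GA)
    IL:  List[int]     # number of linguistic variables per dimension (by GA)
    R:   List[int]     # rule consequents, one per fuzzy partition cell (by GA)
    DcR: List[int]     # "don't care" indices per rule (by PSO)

def classify(chrom: Chromosome, x: Tuple[float, ...]) -> int:
    """Placeholder for interval type-2 fuzzy inference on one record."""
    return 0

def fitness(chrom: Chromosome, data: List[Tuple[Tuple[float, ...], int]]) -> float:
    """Acc = number of correct classifications / total training records."""
    correct = sum(1 for x, label in data if classify(chrom, x) == label)
    return correct / len(data)

# Illustrative training records and chromosome values.
train = [((5.1, 3.5, 1.4, 0.2), 0), ((6.3, 2.9, 5.6, 1.8), 2)]
chrom = Chromosome(IM=[1.98, 3.46], IL=[2, 2, 5, 5], R=[1, 2, 3], DcR=[0, 0, 1])
print(fitness(chrom, train))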

IV. THE EXPERIMENTAL EVALUATION SETUP

To evaluate the proposed Hybrid Heuristic Type-2 (HHType-2) algorithm for building interval type-2 fuzzy systems, two datasets were used, which are benchmark classification datasets from the UCI machine learning data repository: Fisher's Iris data and the Wisconsin Breast Cancer data.

A. Datasets

The Iris dataset has 4 variables with 3 classes; 90 records were selected randomly for training, and the rest, 60 records, were used for testing. The Wisconsin Breast Cancer data set has 699 records; the 16 records with missing attribute values were deleted. Each record consists of 9 features plus the class attribute; 500 records were selected randomly for training, and the rest, 183 records, were used for testing.

Fig. 4 shows the scatter plot of the Iris dataset, and Fig. 5 illustrates the scatter plot of the Iris dataset with clustering using the K-means algorithm (K=7). Fig. 6 shows the scatter plot of the Wisconsin Breast Cancer dataset, and Fig. 7 shows the scatter plot of the Wisconsin Breast Cancer dataset with clustering using the K-means algorithm (K=4).

Fig. 4. The scatter plot of the Iris Dataset (* represents Setosa, × represents Versicolor, and ? represents Virginica)

B. Experimental Results

The experiments were performed on a MacBook Pro with an Intel Core 2 Duo CPU (2.66 GHz) and 4.00 GB RAM,


Fig. 5. The scatter plot of the Iris Dataset with Clustering (* represents Setosa, × represents Versicolor, and ? represents Virginica)

Fig. 6. The scatter plot of the Wisconsin Breast Cancer Dataset (* represents Class 2, and ? represents Class 4)

running on Mac OS. All algorithms were implemented in Matlab.

For the first dataset (Iris), the algorithm ran 20 times with an average execution time of 662.2635 s. The simulation population was 100 individuals. Then, the fittest individuals were used by the PSO to optimize the "don't care" rules. In the PSO, each of the individuals was simulated with 50 swarms and 5 particles. The PSO completed 20 runs with an execution time of 429.7597 s.

For the second dataset (Wisconsin Breast Cancer, WBC), the algorithm ran 20 times with an average execution time of 3679.2428 s. The simulation population was 100 individuals. The individuals were then used by the PSO to optimize the "don't care" rules. The individuals of the PSO were simulated with 50 swarms

Fig. 7. The scatter plot of Wisconsin Breast Cancer Dataset with Clustering(* represents Class 2, and ? represents Class 4)

TABLE II. CONFUSION MATRIX FOR THE IRIS CLASSIFICATION DATA.

Dataset  Membership            Rule                 Class
Iris     [2 2 5 5]             0 0 1 1              1
                               0 1 3 3              2
                               2 1 5 5              3
         Total Acc: 95%
WBC      [2 2 3 2 2 3 2 2 2]   0 0 0 0 0 0 0 0 0    2
                               1 0 1 0 2 1 2 2 2    4
                               2 2 3 2 2 1 2 2 2    4
         Total Acc: 98.71%

TABLE III. CONFUSION MATRIX FOR THE IRIS CLASSIFICATION DATA.

Class        Setosa   Versicolor   Virginica   Total Testing
Setosa         20         0            0            20
Versicolor      0        19            1            20
Virginica       0         2           18            20
Total          20        21           19            60

and 5 particles. Then, the PSO completed 20 runs with an execution time of 2387.5543 s. The optimal fuzzy system, optimized using the hybrid heuristic algorithm, produced the accuracy performance shown in Tables II and III. An example of a chromosome from the WBC dataset is shown in Fig. 8.

Membership: [2 2 3 2 2 3 2 2 2]
Linguistic parameters: 1.9782 3.4612 7.8462 9.1217 3.3353 3.3353 6.5211 1.8434 4.2727 1.0098 1.0312 1.6815 8.3999 1.9247 3.5459 1.9992 5.2612 1.0692 1.1521 2.1435 2.1556 3.6942 7.6163, ..., 2.6585 3.1273 7.0503 9.8831 3.9131 3.1549 6.9534
Rule base: 111111111 → 1, 222123221 → 4, 223222221 → 4, 000000000 → 2, 101021222 → 4, 223221222 → 4

Fig. 8. Chromosome of the Interval Type-2 Fuzzy Logic System for the WBC dataset

To demonstrate the performance of the proposed framework, we compared its accuracy with other well-known classifiers applied to the same problem. Table IV presents the accuracy performance of the classifiers with these algorithms. From Table IV, it can be seen that the accuracy performance of the proposed hybrid heuristic algorithm is among the best achieved.

In the same way, we compared the confidence obtained from experiments using these algorithms on the same problem with other algorithms. Table V shows the accuracy performance of the classifiers from these algorithms and the confidence on the Wisconsin Breast Cancer dataset using the Hybrid Heuristic Type-2 (HHType-2) algorithm, whose results were competitive with or even better than any other algorithm. Although GA and PSO are not new, when the two come together they make a powerful new algorithm (Hybrid


Fig. 9. The bar chart of comparisons of the HHType-2 and the other algorithms, for the Iris data.

Fig. 10. The bar chart of comparisons of the HHType-2 and the other algorithms, for the WBC data.

Heuristic Type-2) for optimization, which is quite efficient in terms of performance.

V. CONCLUSION

In this paper, a methodology based on a hybrid heuristic algorithm, a combination of PSO and GA approaches, is proposed to build interval type-2 fuzzy sets for classification. The algorithm is used to optimize a model by minimizing the number of fuzzy rules, minimizing the number of linguistic variables and maximizing the accuracy of the fuzzy rule base. The performance of the proposed hybrid heuristic algorithm was demonstrated by applying it to benchmark problems and comparing it with several other algorithms.

For future research, the application of the proposed algorithm to other problems, such as intrusion detection networks and network forensics, and the use of larger datasets than in this research, such as breast cancer diagnosis and traffic network datasets, will be covered. An adaptive on-line inference engine for the interval type-2 fuzzy set will also be selected in future research on breast cancer diagnosis for medical training and testing.

TABLE IV. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE IRIS DATA.

Algorithm          Setosa   Versicolor   Virginica   Acc
1. VSM [11]        100%     93.33%       94%         95.78%
2. NT-growth [11]  100%     93.5%        91.13%      94.87%
3. Dasarathy [11]  100%     98%          86%         94.67%
4. C4 [11]         100%     91.07%       90.61%      93.87%
5. IRSS [12]       100%     92%          96%         96%
6. PSOCCAS [13]    100%     96%          98%         98%
7. HHTypeI [5]     100%     97%          98%         98%
8. HHType II       100%     95%          90%         95%

TABLE V. COMPARISONS OF THE HHTYPE-2 AND THE OTHER ALGORITHMS, FOR THE WBC DATA.

Algorithm            Accuracy
1. SANFIS [14]       96.07%
2. FUZZY [15]        96.71%
3. ILFN [15]         97.23%
4. ILFN-FUZZY [15]   98.13%
5. IGANFIS [16]      98.24%
6. HHType II         98.71%

REFERENCES

[1] J. M. Mendel, "Why we need type-2 fuzzy logic system?" May 2001, http://www.informit.com/articles/article.asp.

[2] J. M. Mendel and R. I. B. John, "Type-2 fuzzy sets made simple," IEEE Trans. Fuzzy Syst., vol. 10, pp. 117–127, April 2002.

[3] L. Zhao, "Adaptive interval type-2 fuzzy control based on gradient descent algorithm," in Intelligent Control and Information Processing (ICICIP), vol. 2, 2011, pp. 899–904.

[4] D. Hidalgo, P. Melin, O. Castillo, and G. Licea, "Optimization of interval type-2 fuzzy systems based on the level of uncertainty, applied to response integration in modular neural networks with multimodal biometry," in The 2010 International Joint Conference on Digital Object Identifier, 2010, pp. 1–6.

[5] A. Sangsongfa and P. Meesad, "Fuzzy rule base generation by a hybrid heuristic algorithm and application for classification," in National Conference on Computing and Information Technology, vol. 1, 2010, pp. 14–19.

[6] Iris Dataset, http://www.ailab.si/orange/doc/datasets/Iris.htm.

[7] Breast Cancer Dataset, http://www.breastcancer.org.

[8] J. Zeng and L. Wang, "A generalized model of particle swarm optimization," Pattern Recognition and Artificial Intelligence, vol. 18, pp. 685–688, 2005.

[9] H. Ishibuchi, T. Nakashima, and T. Murata, "Three-objective genetic-based machine learning for linguistic rule extraction," Information Sciences, vol. 136, pp. 109–133, 2001.

[10] R. Salman, V. Kecman, Q. Li, R. Strack, and E. Test, "Fast k-means algorithm clustering," Transactions on Machine Learning and Data Mining, vol. 3, p. 16, 2011.

[11] T. P. Hong and J. B. Chen, "Processing individual fuzzy attributes for fuzzy rule induction," in Fuzzy Sets and Systems, vol. 10, 2000, pp. 127–140.

[12] A. Chatterjee and A. Rakshit, "Influential rule search scheme (IRSS), a new fuzzy pattern classifier," in IEEE Transactions on Knowledge and Data Engineering, vol. 16, 2004, pp. 881–893.

[13] L. Hongfei and P. Erxu, "A particle swarm optimization-aided fuzzy cloud classifier applied for plant numerical taxonomy based on attribute similarity," in Expert Systems with Applications, vol. 36, 2009, pp. 9388–9397.

[14] H. Song, S. Lee, D. Kim, and G. Park, "New methodology of computer aided diagnostic system on breast cancer," in Second International Symposium on Neural Networks, 2005, pp. 780–789.

[15] P. Meesad and G. Yen, "Combined numerical and linguistic knowledge representation and its application to medical diagnosis," in Component and Systems Diagnostics, Prognostics, and Health Management II, 2003.

[16] M. Ashraf, L. Kim, and X. Huang, "Information gain and adaptive neuro-fuzzy inference system for breast cancer diagnoses," in Computer Sciences and Convergence Information Technology (ICCIT), 2010, pp. 911–915.


Neural Network Modeling for an Intelligent Recommendation System Supporting SRM for Universities in Thailand

Kanokwan Kongsakun

School of Information Technology

Murdoch University, South Street,

Murdoch,WA 6150 AUSTRALIA

[email protected]

Jesada Kajornrit

School of Information Technology

Murdoch University, South Street,

Murdoch,WA 6150 AUSTRALIA

[email protected]

Chun Che Fung

School of Information Technology

Murdoch University, South Street,

Murdoch,WA 6150 AUSTRALIA

[email protected]

Abstract— In order to support the academic management

processes, many universities in Thailand have developed

innovative information systems and services with an aim to

enhance efficiency and student relationship. Some of these

initiatives are in the form of a Student Recommendation

System (SRM). However, the success or appropriateness of such a

system depends on the expertise and knowledge of the counselor.

This paper describes the development of a proposed Intelligent

Recommendation System (IRS) framework and experimental

results. The proposed system is based on an investigation of the

possible correlations between the students’ historic records and

final results. Neural Network techniques have been used with an

aim to find the structures and relationships within the data, and

the final Grade Point Averages of freshmen in a number of

courses are the subjects of interest. This information will help the

counselors in recommending the appropriate courses for students

thereby increasing their chances of success.

Keywords-Intelligent Recommendation System; Student

Relationship Management; data mining; neural network

I. INTRODUCTION

The growing complexity of technology in educational

institutions creates opportunities for substantial improvements

for management and information systems. Many designs and

techniques have allowed for better results in analysis and

recommendations. With this in mind, universities in Thailand

are working hard to improve the quality of education and

many institutes are focusing on how to increase the student

retention rates and the number of completions. In addition, a

university’s performance is also increasingly being used to

measure its ranking and reputation [1]. One form of service

which is normally provided by all universities is Student

Counseling. Archer and Cooper [2] stated that the provision of

counseling services is an important factor contributing to

students’ academic success. In addition, Urata and Takano [3]

stated that the essence of student counseling should include

advice on career guidance, identification of learning strategies, handling of inter-personal relations, along with self-

understanding of the mind and body. It can be said that a key

aspect of student services is to provide course guidance as this

will assist the students in their course selection and future

university experience.

On the other hand, many students have chosen particular

courses of study just because of perceived job opportunities,

peer pressure and parental advice. Issues may arise if a student

is not interested in the course, or if the course or career is not

suitably matched with the student's capability [4]. In

Thailand’s tertiary education sector, teaching staff may have

insufficient time to counsel the students due to high workload

and there are inadequate tools to support them. Hence, it is

desirable that some forms of intelligent recommendation tools

could be developed to assist staff and students in the

enrolment process. This forms the motivation of this research.

One of the initiatives designed to help students and staff is

the Student Recommendation System. Such system could be

used to provide course advice and counseling for freshmen in

order to achieve a better match between the student’s ability

and success in course completion. In the case of Thai

universities, this service is normally provided by counselors or

advisors who have many years of experience within the

organisation. However, with an increasing number of students and an expanded number of choices, the workload on the advisors

is becoming too much to handle. It becomes apparent that

some forms of intelligent system will be useful in assisting the

advisors.

In this paper, a proposed intelligent recommendation

system is reported. This paper is structured as follows. Section

2 describes literature reviews of Student Relationship

Management (SRM) in universities and issues faced by Thai

university students. Section 3 describes Neural Network

techniques which are used in the reported Intelligent

Recommendation System, and Section 4 focuses on the

proposed framework, which presents the main idea and the

research methodology. Section 5 describes the experiments

and the results. This paper then concludes with discussions on

the work to be undertaken and future development.


II. LITERATURE REVIEW

A. Student Relationship Management in Universities

According to the literature, the problem of low student

retention in higher education could be attributed to low student satisfaction, student transfers and drop-outs [5]. This issue leads to a reduction in the number of enrolments and revenue, and increasing cost of replacement. On the other hand, it was found that the quality and convenience of support services are other factors that influence students to change educational institutes [6]. Consequently, the concept of SRM has been implemented in various universities so as to assist the improvement of the quality of learning processes and student activities.

Definitions of SRM have been adopted from the

established practices of Customer Relationship Management

(CRM) which focuses on customers and are aimed to establish

effective competition and new strategies in order to improve

the performance of a firm [7]. In the case of SRM, the context

is within the education sector. Although there have been many

research focused on CRM, few research studies have

concentrated on SRM. In addition, the technological supports

are inadequate to sustain SRM in universities. For instance, a

SRM system’s architecture has been proposed so as to support

the SRM concepts and techniques that assist the university’s

Business Intelligent System [8]. This project provided a tool to

aid the tertiary students in their decision-making process. The

SRM strategy also provided the institution with SRM

practices, including the planned activities to be developed for

the students, as well other relevant participants. However, the

study verified that the technological support to the SRM

concepts and practices were insufficient at the time of writing

[8].

In the context of educational institutes, the students may

be considered having a role as “customers”, and the objective

of Student Relationship Management is to increase their

satisfaction and loyalty for the benefits of the institute. SRM

may be defined under a similar view as CRM and aims at

developing and maintaining a close relationship between the

institute and the students by supporting the management

processes and monitoring the students’ academic activities

and behaviors. Piedade and Santos (2008) explained that

SRM involves the identification of performance indicators

and behavioral patterns that characterize the students and the

different situations under which the students are supervised.

In addition, the concept of SRM is “understood as a process

based on the student acquired knowledge, whose main

purpose is to keep a close and effective students institution

relationship through the closely monitoring of their academic

activities along their academic path” [9]. Hence, it can be said

that SRM can be utilised as an important means to support

and enhance a student’s satisfaction. Since understanding the

needs of the students is essential for their satisfaction, it is

necessary to prepare strategies in both teaching and related

services to support Student Relationship Management. This

paper therefore proposes an innovative information system to

assist students in universities in order to support the SRM

concept.

B. Issues Faced By Thai University Students

Another study at Dhurakij Pundit University, Thailand

looked at the relationship between learning behaviour and low

academic achievement (below 2.0 GPA) of the first year

students in the regular four-year undergraduate degree

programs. The results indicated that students who had low

academic achievement had a moderate score in every aspect of

learning behaviour. On average, the students scored highest in

class attendance, followed by the attempt to spend more time

on study after obtaining low examination grades. Some of the

problems and difficulties that mostly affected students’ low

academic achievement were the students’ lack of

understanding of the subject and lack of motivation and

enthusiasm to learn [10].

Moreover, some other studies had focused on issues

relating to students’ backgrounds prior to their enrolment,

which may have effects on the progress of the students’

studies. For example, a research group from the Department of

Education [11], Thailand studied the backgrounds of 289,007

Grade twelve students which may have affected their

academic achievements. The study showed that the factors

which could have effects on the academic achievement of the

students may be attributed to personal information such as

gender and interests, parental factors such as their jobs and

qualifications, and information on the schools such as their

sizes, types and ranking.

Therefore, in the recruitment and enrolment of students in

higher education, it is necessary to meet the student’s needs

and to match their capability with the course of their choice.

The students’ backgrounds may also have a part to play in the

matching process. Understanding the student’s needs will

implicitly enhance the student’s learning experience and

increase their chances of success, and thereby reduce the

wastage of resources due to dropouts, and change of programs.

These factors are therefore taken into consideration in the

proposed recommendation system in this study.

III. NEURAL NETWORK BASED INTELLIGENT

RECOMMENDATION SYSTEM TO SUPPORT SRM

In terms of education systems, Ackerman and Schibrowsky

[12] have applied the concept of business relationships and

proposed the business relationship marketing framework. The

framework provided a different view on retention strategies

and an economic justification on the need for implementing

retention programs. The prominent result is the improvement

of graduation rates by 65% by simply retaining one additional

student out of every ten. The researcher added that this

framework is appropriate both on the issues of places on

quality of services. Although some problems could not be

solved directly, it is recognized that Information and

Communication Technologies (ICT) can be used and

contributes towards maintaining a stronger relationship with

students in the educational systems [8].

In this study, a new intelligent Recommendation System is

proposed to support university students in Thailand. This

system is a hybrid system which is based on Neural Network



and Data Mining techniques; however, this paper only focuses

on the aspect of Neural Network (NN) techniques.

With respect to the Neural Network algorithm that was used

in this study, the feed-forward neural network, also called the Multilayer Perceptron, was used. In the training of a Multilayer Perceptron, the back-propagation (BP) learning algorithm was used to perform the supervised learning process [13]. In the feed-forward calculations used in this experiment, the activations are set to the values of the encoded input fields in the input neurons. The activation of each neuron in a hidden or

output layer is calculated and shown as follows:

$b_i = \sigma\!\left( \sum_{j} w_{ij} P_j \right)$   (1)

where $b_i$ is the activation of neuron i, j ranges over the set of neurons in the preceding layer, $w_{ij}$ is the weight of the connection between neuron i and neuron j, $P_j$ is the output of neuron j, and

$\sigma(m)$ is the sigmoid or logistic transfer function, shown as follows:

$\sigma(m) = \dfrac{1}{1 + e^{-m}}$   (2)

The implementation of back propagation learning updates

the network weights and biases in the direction in which the

system performance increases most rapidly.

This study used a feed-forward network architecture and the

Mean Absolute Error (MAE) to define the accuracy of the

models.
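A minimal sketch of equations (1) and (2), together with the MAE measure used to assess the models, is shown below; the weights and values are illustrative placeholders.

import math

def sigmoid(m):
    return 1.0 / (1.0 + math.exp(-m))             # eq. (2)

def activation(weights, prev_outputs):
    # eq. (1): weighted sum of the preceding layer passed through the sigmoid
    return sigmoid(sum(w * p for w, p in zip(weights, prev_outputs)))

def mae(predicted, actual):
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

prev = [0.59, 0.89, 0.25]                          # outputs of the preceding layer
print(round(activation([0.4, -0.2, 0.7], prev), 3))
print(round(mae([3.6, 2.9], [3.75, 3.05]), 3))     # e.g. predicted vs actual GPA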

IV. THE PROPOSED FRAMEWORK

Several solutions have been proposed to support SRM in the

universities; however, not many systems in Thailand have

focused on recommendation systems using historic records

from graduated students. A recommendation system could

apply statistical, artificial intelligence and data mining

techniques by making appropriate recommendation for the

students. Figure 1 illustrates the proposed recommendation

system architecture. This proposal aims to analyse student

background such as the high school where the student studied

previously, school results and student performance in terms of

GPA’s from the university’s database. The result can then be

used to match the profiles of the new students. In this way, the

recommendation system is designed to provide suggestions on

the most appropriate courses and subjects for the students,

based on historical records from the university’s database.

A. Data-Preprocessing

Initially, data on the student records are collected from the

university enterprise database. The data is then re-formatted in

the stage of data transformation in order to prepare for

processing by subsequent algorithms. In the data cleaning

process, the parameters used in the data analysis are identified

and the missing data are either eliminated or filled with null

values [15]. Preparation of analytical variables is done in the

data transformation step or is completed in a separate

process. Integrity of the data is checked by validating the data

Figure 1. Proposed Hybrid Recommendation System Framework to Support

Student Relationship Management

against the legitimate range of values and data types. Finally,

the data is separated randomly into training and testing data

for processing by the Neural Network.
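The pre-processing steps described above can be sketched as follows; the field names, the encoding of school types and the dummy records are illustrative assumptions, while the random split into training and testing data follows the experiment description given later (70%/30%).

import random

SCHOOL_TYPES = {"A": 0, "B": 1, "C": 2, "D": 3, "E": 4, "F": 5, "G": 6}

def clean_and_encode(records):
    out = []
    for r in records:
        if any(v is None for v in r.values()):        # data cleaning: drop missing values
            continue
        if not 0.0 <= r["pre_gpa"] <= 4.0:            # integrity check against legitimate range
            continue
        out.append([r["pre_gpa"], SCHOOL_TYPES[r["school_type"]], r["uni_gpa"]])
    return out

records = [
    {"pre_gpa": 2.35, "school_type": "C", "uni_gpa": 3.75},
    {"pre_gpa": 3.55, "school_type": "B", "uni_gpa": 3.05},
    {"pre_gpa": None, "school_type": "A", "uni_gpa": 2.09},
]
data = clean_and_encode(records)
random.shuffle(data)
cut = int(0.7 * len(data))                            # random 70/30 split
train, test = data[:cut], data[cut:]
print(len(train), len(test))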

B. Data Analysis

It can be seen in Fig. 1 that the Association rules, Decision

Tree, Support Vector Machines and Neural Network are used

to train the input data; however, this paper focuses on Neural

Network which uses the feed-forward algorithm to classify the

data and to establish the approximate function. The

network is a multilayer feed-forward network trained with the backpropagation algorithm; it uses log-sigmoid as the transfer function, logsig. In the training process, the backpropagation training function for feed-forward networks is used to predict the output based on the

input data.

(Figure 1 components: 1. Data Pre-processing of the student historic data (data transformation, data cleaning); data analysis with Neural Network, Decision Tree, SVM and Association Rules, followed by result comparison to select the best result; 3. Intelligent Prediction Models, with sub-models per department (e.g., Computer Business, Communication Arts, Information Technology) for A. Course Recommendation (course ranking and likelihood of overall GPA), B. Likelihood of GPA for years 1 to 4, and C. Subject Recommendation; 4. Model Validation; and the Electronic Intelligent Recommendation System (e-IRS) serving the student.)


C. Intelligent Recommendation Model

The Integrated Recommendation Model is composed of three

parts: Course Recommendation for freshmen, Likelihood of

GPA for students (years 1 to 4), and Subject Recommendation

for students (year 1 to 4) respectively.

Part A focuses on the course recommendation for freshmen

and it is composed of two sections, which are the Overall GPA

Recommendation, and the Course Ranking Recommendation

respectively. In the Overall GPA section, the output of the recommendation is an expected overall GPA. The outputs of the Course Ranking Recommendation use the ranking of results from the first section to indicate five appropriate courses. The results of both sections can be used as suggestions to the

freshmen during the enrolment process. Some example results

from Part A are shown in this paper, and the input data of

these 2 sections in the model are shown in Table 1.

Another part of the framework focuses on Likelihood of

GPA for students in each year. After the students selected the

course to study and completed the enrolment process, the

Likelihood of GPA for year 1 results can be used to monitor

the performance of this group of students. The input data of

this process is the same as the one shown in Table 1, with the

addition of the GPA scores from the previous year. These are

used as the extended features in the input to the neural

network model. The result of the Recommendation is the GPA

score of the year. In the same way, the system may be used to

perform a Likelihood of GPA for Year 2 based on results from

the first year. Similar approach can be adopted for the

Likelihood of Year 3 and 4 results. Some example results of

this part are shown in this paper.

The final part of the recommendation model focuses on the

subject recommendation for students in each year. This can also help the counselor or the student's supervisor recommend which subjects a student should enrol in each semester.

To address the issue of imbalanced number of students in

each course, the prediction model shown in Fig. 1 can be

duplicated for different departments. The models’ computation

is entirely data-driven and not based on subjective opinion,

hence, the prediction models are unbiased and they will be

used as an integral part of an Electronic Intelligent

Recommendation System.

D. Electronic Intelligent Recommendation System (e-IRS)

It is planned that the new intelligent Recommendation

Models will form an integral part of an online system for

private universities in Thailand. The developed system will be

evaluated by the university management and feedback from

experienced counselors will be sought. The proposed system

will also be available for use by new students who will access

the online-application in their course selection during the

enrolment process. As for the recommendation of the Year 2

and subsequent years’ results, this could be used by the

counselors, staff, student’s supervisor and university

management to provide supports for students who are likely to

need help with their studies. This information will enable the

university to better focus on the utilisation of their resources.

In particular, this could be used to improve the retention rate

by providing additional supports to the group of students who

may be at risk.

V. EXPERIMENT DESIGN

The data preparation and selection process involves a

dataset of 3,550 student records from five academic years. All

the student data have included records from the first year to

graduation. Due to privacy issues, the data in this study do not

indicate any personal information, and no student is identified

in the research. The student data has been randomised, and all

private information has been removed. Example data from the

dataset is shown below.

TABLE I. EXAMPLE OF TRAINING SAMPLE DATASET

UniID   Pre-GPA   Type of school   No. of Awards   Talent and Interest   Channels    Admission Round   Guardian Occupation   Gender   Uni GPA (target)
4800    2.35      C                0.2             1                     Poster      1                 Police                F        3.75
4801    3.55      B                0.3             4                     Brochure    2                 Governor              M        3.05
5001    2.55      A                0.9             3                     Friend      5                 Teacher               F        2.09
5002    2.75      G                0.4             5                     Family      4                 Nurse                 F        2.58
5003    3.00      F                0.2             7                     Newspaper   3                 Teacher               M        2.77
5101    2.00      E                0.1             2                     Others      1                 Farmer                F        2.11

(Columns Pre-GPA through Gender are the input data from the previous school; Uni GPA is the target.)

Table 1 shows the randomized student ID, GPA from previous

study, the type of school, awards received, talent and interest,

channels to know the university, admission round, Guardian

Occupation, Gender and Overall GPA from university. Table 2

provides the definitions for the variables used in the above

table.

TABLE II. DEFINITIONS OF VARIABLES

No.  Variable              Definition
1    UniID                 Randomized student ID, not included in the clustering process; used only to identify different students.
2    GPA                   Overall GPA from previous study prior to admission to university.
3    Type of school        A: High School; B: Technical College; C: Commercial College; D: Open School; E: Sports, Thai Dancing, Religion or Handcraft Training Schools; F: Other Universities (change of universities or courses); G: Vocational Training Schools.
4    Number of Awards      Awards received from previous study, normalized between 0.0 and 4.0 (0.0 = no award received, 4.0 = maximum number of awards in the dataset).
5    Talent and Interest   Group number: 1 = sports, 2 = music and entertainment, 3 = presentation, 4 = academic, 5 = others, 6 = involved with 2 to 3 talents and interests, 7 = involved with more than 3 talents and interests.
6    Channels              The channel through which the student came to know the university, such as television or family.
7    Admission Round       Admission round of the university (round 1 to 5).
8    Guardian Occupation   The occupation of the guardian, such as teacher or governor.
9    Gender                Female or Male.
10   Uni GPA               Overall GPA at the university (range 0 to 4).

Figure 2. Number of samples in each department

The student records have been divided into 70% of training

data and 30% of testing data randomly. The dataset includes

both qualitative and quantitative information, as shown in Tables 1 and 2. In terms of training, this study used a two-layer feed-forward network architecture. Moreover, this study used the Mean Absolute Error (MAE) to define the accuracy of the models.

VI. EXPERIMENTAL RESULTS

Based on MAE, the experimental results have shown that

the Neural Network based models can be utilised to predict the

GPA results of students with a good degree of accuracy.


Figure 3. Comparison of MAE of testing data of sub-models for overall GPA

and course ranking Recommendation

The testing was carried out in the final step of the

experiment in each model, which used 30% of the available

data. In Fig. 3, it is shown that the lowest value of Mean

Absolute Error (MAE) is 0.069 based on data from the

Department of Accounting. On the other hand, the highest

value is 0.344. The average of MAE of all models is 0.142.

The overall results indicate that reasonable predictions were obtained.

Figure 4. Comparison of MAE of testing data on the Likelihood of GPA in

each Year

Fig. 4 shows a comparison of MAE of the results of the

sub-models from each department in each year. It can be seen

that the range of values of MAE is the lowest based on data

from the Department of Education. On the other hand, the

highest value is based on the Department of Communication

Arts, which is similar to the results for overall GPA. The average MAE of all models is 0.393. The Department of Public Administration gives similar results across the years, while the Department of Communication Arts and the Department of Industrial Management give the most varied results across years, with MAE values much higher than the others in year 4 and year 2, respectively. It is possible that the differences in MAE are due to


the number of training and testing data. The overall results indicate that reasonable recommendations were obtained.

VII. CONCLUSIONS

This article describes a recommendation system in support

of SRM and to address issues related to the problem of course

advice or counseling for university students in Thailand. The

recent work is focusing on the development and

implementation of each process in the framework. The

experiments have been based on Neural Network models and

the accuracy of the recommendation model is reasonable. It is

expected that the recommendation system will provide a

useful service for the university management, course

counselors, academic staff and students. The proposed system

will also support Student Relationship Management strategies

among the Thai private universities.

REFERENCES

[1] R. Ackerman and J. Schibrowsky, “A Business Marketing Strategy Applied to Student Retention: A higher Education Initiative,” Journal of College Student Retention . vol. 9(3), pp. 330-336, 2007-2008

[2] J. Jr. Archer and S. Cooper,“Counselling and Mental Health Services on Campus. In A handbook of Contemporary Practices and Challenges, ” Jossy-Bass, ed. San Francisco, CA., 1998

[3] A.L. Caison, “Determinates of Systemic Retention: Implications for improving retention practice in higher education. ” Journal of College Student Retention., vol. 6, pp. 425-441, 2004-2005

[4] K. L. Du and M.N.S Swamy, “Neural Networks in a Softcomputing Framework,” Germany: Springer , vol. 1, 2006

[5] Education, “Research group of department of A study of the background of grade twelve affect students different academic achievements,” Education Research, 2000

[6] D.T. Gamage, J. Suwanabroma, T. Ueyama, S. Hada. and E.Sekikawa, “The impact of quality assurance measures on student services at the Japanese and Thai private universities,” Quality Assurance in Education, vol 16(2), pp.181-198, 2008

[7] Y. Gao and C. Zhang, “Research on Customer Relationship Management Application System of Manufacturing Enterprises,” Wireless Communications, Networking and Mobile computing, 2008 Wicom'08.4th International conference, Dalian , pp. 1-4, 2008

[8] K. Harej and R.V. Horvat, “Customer Relationship Management Momentum for Business Improvement,” Information Technology Interfaces(ITI), pp.107-111, 2004

[9] Helland, P., H.J. Stallings, and J.M. Braxton, “The fulfillment of expectations for college and student departure decisions,”, Journal of College Student Retention, vol. 3(4), pp.381-396, 2001-2002

[10] N. Jantarasapt, “The relationship between the study behavior and low academic achievement of students of Dhurakij Pundit University, Thailand”, Dhurakij Pundit University, 2005

[11] K. Jusoff, S.A.A. Samah, and P.M. Isa, “Promoting university community's creative citizenry. Proceedings of world academy of science, 2008,” Engineering and technology , vol. 33, pp. 1-6.. 2008

[12] M.B. Piedade and M. Y. Santos, “Student Relationship Management: Concept, Practice and Technological Support,” IEEE Xplore, pp. 2-5, 2008

[13] S. Subyam, “Causes of Dropout and Program Incompletion among Undergraduate Students from the Faculty of Engineering,King Mongkut's University of Technology North Bangkok”., In The 8th National Conference on Engineering Education. Le Meridien Chiang Mai, Muang, Chiang Mai, Thailand, 2009

[14] U. Uruta and A. Takano, “Between psychology and college of education,” Journal of Educational Psychology, vol 51, pp. 205-217, 2003

[15] K.W. Wong, C.C. Fung and T.D. Gedeon, “Data Mining Using Neural Fuzzy for Student Relationship Management,” International Conference of Soft Computing and Intelligent Systems, Tsukuba, Japan, 2002


Recommendation and Application of Fault Tolerance Patterns to Services

Tunyathorn Leelawatcharamas and Twittie Senivongse
Computer Science Program, Department of Computer Engineering

Faculty of Engineering, Chulalongkorn University Bangkok, Thailand

[email protected], [email protected]

Abstract—Service technology such as Web services has been one of the mainstream technologies in today’s software development. Distributed services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption. A solution is to create services which can tolerate faults so that failures can be made transparent to the consumers. Since there are many patterns of software fault tolerance available, we end up with a question of which pattern should be applied to a particular service. This paper attempts to recommend to service developers the patterns for fault tolerant services. A recommendation model is proposed based on characteristics of the service itself and of the service provision environment. Once a fault tolerance pattern is chosen, a fault tolerant version of the service can be created as a WS-BPEL service. A software tool is developed to assist in pattern recommendation and generation of the fault tolerant service version.

Keywords - fault tolerance patterns; Web services; WS-BPEL

I. INTRODUCTION

Service technology has been one of the mainstream technologies in today’s software development since it enables rapid flexible development and integration of software systems. The current Web services technology builds software upon basic building blocks called Web services. They are software units that provide certain functionalities over the Web and involve a set of interface and protocol standards, e.g. Web Service Definition Language (WSDL) for describing service interfaces, SOAP as a messaging protocol, and Business Process Execution Language (WS-BPEL) for describing business processes of collaborating services [1]. Like other software, services may suffer from communication problems or contain faults themselves, and hence service consumers may experience service interruption.

Different types of faults have been classified for services [2], [3], [4], and can be viewed roughly in three categories: (1) Logic faults comprise calculation faults, data content faults, and other logic-related faults thrown specifically by the service. Web service consumers can detect logic faults by WSDL fault messages or have a way to check correctness of service

responses. (2) System and network faults are those that can be identified, for example, through HTTP status code and detected by execution environment, e.g., communication timeout, server error, service unavailable. (3) SLA faults are raised when services violate SLAs, e.g., response time requirements, even though functional requirements are fulfilled. For service providers, one of the main goals of service provision is service reliability. Services should be provided in a reliable execution environment and prepared for various faults so that failures can be made as transparent as possible to service consumers. Service designers should therefore design services with a fault tolerance mindset, expecting the unexpected and preparing to prevent and handle potential failures.

There are many fault tolerance patterns or exception handling strategies that can be applied to make software and systems more reliable. Common patterns involve how to handle or recover from failures, such as communication retry or the use of redundant system nodes. In a distributed services context, we end up with the question of which fault tolerance pattern should be applied to a particular service. We argue that not all patterns are equally appropriate for every service, due to the characteristics of each service, including its semantics and the environment of service provision. In this paper, we propose a mathematical model that can assist service designers in designing fault tolerant versions of services. The model helps recommend which fault tolerance patterns are suitable for particular services. With a supporting tool, service designers can choose a recommended pattern and have fault tolerant versions of the services generated as WS-BPEL services.

Section II discusses related work in Web services fault tolerance. Section III lists the fault tolerance patterns that are considered in our work. The characteristics of the services and the conditions of service provision that we use as criteria for pattern recommendation are given in Section IV. Section V presents how service designers can be assisted by the pattern recommendation model. The paper concludes in Section VI with an outlook on future work.

II. RELATED WORK

A number of studies in the area of fault tolerant services address the application of fault tolerance patterns to WS-BPEL processes, even though they may use different fault tolerance terminology for similar patterns or strategies. For example, Dobson’s work [5] is among the first in this area; it proposes how to use BPEL language constructs to implement fault tolerant service invocation using four different patterns, i.e., retry, retry on a backup, and parallel invocations to different backups with voting on all responses or taking the first response. Lau et al. [6] use BPEL to specify passive and active replication of services in a business process and also support a backup of the BPEL engine itself. Liu et al. [2] propose a service framework which combines exception handling and transaction techniques to improve reliability of composite services. Service designers can specify exception handling logic for a particular service invocation as an Event-Condition-Action rule, and eight strategies are supported, i.e., ignore, notify, skip, retry, retryUntil, alternate, replicate, and wait. Thaisongsuwan and Senivongse [7] define the implementation of fault tolerance patterns, as classified by Hanmer [8], on BPEL processes. Nine of the architectural, detection, and recovery patterns are addressed, i.e., Units of Mitigation, Quarantine, Error Handler, Redundancy, Recovery Block, Limit Retries, Escalation, Roll-Forward, and Voting. These studies suggest that different patterns can be applied to different service invocations as appropriate but are not specific on when to apply which. Nevertheless, we adopt their BPEL implementations of the patterns for the generation of our fault tolerant services.

Zheng and Lyu present interesting approaches to fault tolerant Web services which support strategies including retry, recovery block, N-version programming (i.e., parallel service invocations with voting on all responses), and active (i.e., parallel service invocations taking the first response). For composite services, they propose a QoS model for fault tolerant service composition which helps determine which combination of the fault tolerance strategies gives a composite service the optimal quality [9]. In the context of individual Web services, they propose a dynamic fault tolerance strategy selection for a service [3]; the optimal strategy is the one that gives the optimal service roundtrip time and failure rate. Both user-defined service constraints and current QoS information of the service are considered in the selection algorithm. In [10], they view fault tolerance strategies as time redundancy and space redundancy (i.e., passive and active replication) as well as combinations of those strategies. Although their approaches and ours share the same motivation, their fault tolerance strategy selection requires an architecture that supports service QoS monitoring and provision of replica services. This could be too much to afford for strategy selection, for example, if it turns out that expensive strategies involving replica nodes are not appropriate. Our approach can be complementary to theirs but is more lightweight, merely recommending which fault tolerance strategies are likely to match the service characteristics that are of concern to service designers.

III. FAULT TOLERANCE PATTERNS

In our approach, the following fault tolerance patterns are supported (Fig. 1). They are addressed in Section II and can be expressed using BPEL which is the target implementation of our fault tolerant services. Here the term “service” to which a pattern will be applied refers to the smallest unit of service provision, e.g., an operation of a Web service implementation.

Figure 1. Fault tolerance patterns: (1) Retry, (2) Wait, (3) RB Replica, (4) RB NVP, (5) Active Replica, (6) Active NVP, (7) Voting Replica, (8) Voting NVP, (9) Retry + Wait.


1) Retry: When service invocation is not successful, invocation to the same service is repeated until it succeeds or a condition is evaluated to true. A common condition is the allowed retry times.

2) Wait: Service invocation is delayed until a specified time. If the service is expected to be busy or unavailable at a particular time, delaying invocation until a later time could help decrease failure probability.

3) RecoveryBlockReplica: When service invocation is not successful, invocation is made sequentially to a number of functionally equivalent alternatives (i.e., recovery blocks) until the invocation succeeds or all alternatives are used. Here the alternatives are replicas of the original service; they can be different copies of the original service but are provided in different execution environments.

4) RecoveryBlockNVP: This pattern is similar to 3) but adopts N-version programming (NVP). Here the original service and its alternatives are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environment. This would be more reliable than having replicas of the original services as alternatives since it can decrease the failure probability caused by faults in the original service.

5) ActiveReplica: To increase the probability that service invocation will return in a timely manner, invocation is made to a group of functionally equivalent services in parallel. The first successful response from any service is taken as the invocation result. Here the group are replicas of each other; they can be different copies of the same service but are provided in different execution environments.

6) ActiveNVP: This pattern is similar to 5) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environment. This would be more reliable than having the group as replicas of each other since it can decrease the failure probability caused by faults in the replicas.

7) VotingReplica: To increase the probability that service invocation will return a correct result despite service faults, invocation is made to a group of functionally equivalent services in parallel. Given that there will be several responses from the group, one of the voting algorithms can be used to determine the final result of the invocation, e.g. majority voting. Here the group are replicas of each other; they can be different copies of the same service but are provided in different execution environments.

8) VotingNVP: This pattern is similar to 7) but adopts NVP. Here the services in the group are developed by different development teams or with different technologies, algorithms, or programming languages, but they may be provided in the same or different execution environment.

9) Retry + Wait: This pattern is an example of a possible combination of different patterns. When service invocation is not successful, invocation is retried a number of times and, if still unsuccessful, the process waits until a specified time before another invocation is made.

All patterns except Wait employ redundancy. Retry is a form of time redundancy taking extra communication time to tolerate faults whereas RecoveryBlock, Active, and Voting employ space redundancy using extra resources to mask faults [10]. RecoveryBlock uses the passive replication technique; invocation is made to the original (primary) service first and alternatives (backup services) will be invoked only if the original service or other alternatives fail. Active and Voting both use the active replication technique; all services in a group execute a service request simultaneously, but they determine the final result differently. Retry, Wait, and RecoveryBlock can help tolerate system and network faults. Voting can be used to mask logic faults, e.g., when majority voting is used and the majority of service responses are correct. It can even detect logic faults if a correct response is known. Active can help with SLA faults that relate to late service responses.
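To make these categories concrete, the following sketch (ours, in Python rather than the WS-BPEL used as the target implementation) shows how the time-redundant Retry, the passive RecoveryBlock, and the active first-response strategy could wrap an ordinary invocation; each `call` stands for a hypothetical service invocation.

```python
import concurrent.futures

# Illustrative sketch only (the paper's target implementation is WS-BPEL, not Python).
# Each `call` below stands for a hypothetical service invocation that raises an
# exception on failure and returns a response on success.

def retry(call, max_retries=3):
    """Retry: time redundancy -- repeat the same invocation up to a limit."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise

def recovery_block(calls):
    """RecoveryBlock: passive replication -- try the primary, then each alternative in turn."""
    last_error = None
    for call in calls:
        try:
            return call()
        except Exception as exc:
            last_error = exc
    raise last_error

def active(calls):
    """Active: active replication -- invoke all members in parallel, take the first success."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(calls)) as pool:
        futures = [pool.submit(call) for call in calls]
        for future in concurrent.futures.as_completed(futures):
            if future.exception() is None:
                return future.result()
    raise RuntimeError("all parallel invocations failed")
```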

IV. SERVICE CHARACTERISTICS

The following are the criteria regarding service characteristics and condition of service execution environment which the service designer/provider will consider for a particular service. These characteristics will influence the recommendation of fault tolerance patterns for the service.

1) Transient Failure: The service environment is generally reliable and potential failure would only be transient. For example, the service may be inaccessible at times due to network problems, but a retry or invocation after a wait should be successful.

2) Instance Specificity: The service is specific and consumers are tied to use this particular service. It can be that there are no equivalent services provided by other providers, or the service maintains specific data of the consumers. For example, a CheckBalance service of a bank is specific because a customer can only check an account balance through the service of this bank with which he/she has an account, and not through the services of other banks.

3) Replica Provision: This relates to the ability of the service designer/provider to accommodate different replicas of the service. The replicas should be provided in different execution environments, e.g., on different machines or processing different copies of data. This ability helps improve reliability since service provision does not rely on a single service.

4) NVP Provision: This relates to the ability of the service designer/provider to accommodate different versions of the service. The service versions may be developed by different development teams or with different technologies, algorithms, or programming languages, and they may be provided in the same or different execution environment. This ability helps improve reliability since service provision does not rely on any single version of the service.

5) Correctness: The service designer expects that the service and execution environment should be managed to provide correct results. This relates to the quality of the service environment in providing reliable communication, including the mechanisms to check for correctness of messages even in the presence of logic faults.

6) Timeliness: The service designer expects that the service and execution environment should be managed to react quickly to requests and give timely results.

7) Simplicity: The service designer/provider may be concerned with simplicity of the service. Provision for fault tolerance can complicate service logic, add more interactions to the service, and increase latency of service access. When service provision is more complex, more faults can be introduced.

8) Economy: The service designer/provider may be concerned with the economy of making the service fault tolerant. Fault tolerance patterns consume extra time, cost, and computing resources. For example, sequential invocation is cheaper than parallel invocation to a group of services, and providing replicas of the service is cheaper than NVP.

V. FAULT TOLERANCE PATTERNS RECOMMENDATION

The recommendation of fault tolerance patterns to a service is based on what characteristics the service possesses and which patterns suit such characteristics.

A. Service Characteristics-Fault Tolerance Patterns Relationship

We first define a relationship between service characteristics and fault tolerance patterns as in Table I. Each cell of the table represents the relationship level, i.e., how well the pattern can respond to the service characteristic. The relationship level ranges from 0 to 8 since there are eight basic patterns. Level 8 means the pattern responds very well to the characteristic, level 7 responds well, and so on. Level 0 means there is no relationship between the pattern and service characteristic.

For example, for Economy, Retry and Wait are cheaper than other patterns that employ space redundancy since both of them require only one service implementation. But Wait responds best to economy (i.e., level 8) since there is only a single call to the service whereas Retry involves multiple invocations (i.e., level 7). Sequential invocation in RecoveryBlock is cheaper than parallel invocation in Active and Voting because not all service implementations will have to be invoked; a particular alternative of the service will be invoked only if the original service and other alternatives fail, whereas parallel invocation requires that different service implementations be invoked simultaneously. RecoveryBlockReplica (level 6) is cheaper than RecoveryBlockNVP (level 5) because providing replicas of the service should cost less than development of NVP. Similarly ActiveReplica (level 4) is cheaper than ActiveNVP (level 3) and VotingReplica (level 2) is cheaper than VotingNVP (level 1). Note that Voting is more expensive than Active due to development of a voting algorithm to determine the final result. For a combination of patterns such as Retry+Wait, the relationship level is an average of the levels of the combining patterns.

TABLE I. RELATIONSHIP BETWEEN SERVICE CHARACTERISTICS AND FAULT TOLERANCE PATTERNS

Service Characteristic        Retry  Wait  RBReplica  RBNVP  ActiveReplica  ActiveNVP  VotingReplica  VotingNVP  Retry+Wait
Transient Failure (TF)          8      7       0        0          0            0             0            0        7.5
Instance Specificity (IS)       8      8       7        6          5            4             5            4        8
Replica Provision (RP)          0      0       8        0          8            0             8            0        0
NVP Provision (NP)              0      0       0        8          0            8             0            8        0
Correctness (CO)                2      2       3        4          5            6             7            8        2
Timeliness (TI)                 4      1       5        6          7            8             2            3        2.5
Simplicity (SI)                 8      8       7        6          5            4             3            2        8
Economy (EC)                    7      8       6        5          4            3             2            1        7.5

For the relationship between the other characteristics and the patterns, we reason in a similar manner. Retry and Wait suit an environment with Transient Failure. The patterns that rely on the execution of a single service at a time respond better to Instance Specificity than those that employ multiple service implementations. Replica Provision and NVP Provision are relevant to the patterns that employ space redundancy. For Correctness, Voting is the best since it is the only pattern that can mask/detect Byzantine failures (i.e., cases where the services give incorrect results). Active is better than RecoveryBlock with regard to Byzantine failures because the chance of getting an incorrect result should be lower, given that the result of Active can come from any one of the redundant services that are invoked in parallel. Retry and Wait do not suit Correctness since they rely on the execution of a single service. For Timeliness, the comparison of the patterns on time performance given in [2], [3] (ranked in descending order) is as follows: Active, RecoveryBlock, Retry, Voting, Wait. For Simplicity, the logic of Retry and Wait, which involves a single service, is the simplest.

B. Assessment of Service Characteristics

The next step is to have the service designer assess what characteristics the service possesses; the characteristics would influence pattern recommendation.

1) Identify Dominant Characteristics: The service designer will consider service semantics and condition of service provision, and identify dominant characteristics that should influence pattern recommendation. For each characteristic that is of concern, the service designer defines a dominance level. Level 1 means the characteristic is the most dominant (i.e., ranked 1st), level 2 means less dominant (i.e., ranked 2nd), and so on. Level 0 means the service does not have the characteristic or the characteristic is of no concern.

For example, during the design of a CheckBalance service of a bank, the service designer considers Instance Specificity as the most dominant characteristic (i.e., dominance level 1) since bank customers would be tied to their bank accounts that are associated with this particular service. From experience, the designer sees that the computing environment of the bank provides a reliable service and, if there is a problem, it is generally transient; hence a simple fault handling strategy is preferred (i.e., Transient Failure and Simplicity have dominance level 2). Nevertheless, the designer is able to afford exact replicas of the service if something more serious happens (i.e., Replica Provision has dominance level 3). Suppose the designer is not concerned with other characteristics; then the others would have dominance level 0. Table II shows the dominance levels of all characteristics of this CheckBalance service.

2) Convert Dominance Level to Dominance Weight:
a) Convert Dominance Level to Raw Score: The dominance level of each characteristic will be converted to a raw score. The most dominant characteristic gets the highest score, which is equal to the dominance level of the least dominant characteristic that is considered. Less dominant characteristics get lower scores accordingly. From the example of the CheckBalance service, Replica Provision has the least dominance level of 3, so the raw score of the most dominant characteristic – Instance Specificity – is 3. Then the score for Transient Failure and Simplicity would be 2, and Replica Provision gets 1. Table III shows the raw scores of the service characteristics.

b) Compute Dominance Weight: First, divide 1 by the summation of the raw scores. For example, for the CheckBalance service, the summation of the raw scores in Table III is 8 (2+3+1+0+0+0+2+0) and the quotient would be 1/8 (0.125). Then, multiply this quotient by the raw score of each characteristic. The results are the dominance weights of the characteristics (the summation of the weights is 1). The weights will be used later in the recommendation model. For the CheckBalance service, the dominance weights of all characteristics are shown in Table IV.
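A minimal sketch of this two-step conversion, using the CheckBalance values above (the dictionary layout and variable names are ours):

```python
# Dominance levels from the CheckBalance example (0 = characteristic not considered).
dominance_levels = {"TF": 2, "IS": 1, "RP": 3, "NP": 0, "CO": 0, "TI": 0, "SI": 2, "EC": 0}

# a) Convert levels to raw scores: the most dominant characteristic (level 1) gets the
#    highest score, which equals the least dominant level considered (here 3).
max_level = max(v for v in dominance_levels.values() if v > 0)
raw_scores = {k: (max_level - v + 1 if v > 0 else 0) for k, v in dominance_levels.items()}

# b) Normalize the raw scores so that the dominance weights sum to 1.
total = sum(raw_scores.values())
weights = {k: v / total for k, v in raw_scores.items()}

print(raw_scores)  # {'TF': 2, 'IS': 3, 'RP': 1, ..., 'SI': 2, ...}
print(weights)     # {'TF': 0.25, 'IS': 0.375, 'RP': 0.125, ..., 'SI': 0.25, ...}
```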

C. Fault Tolerance Patterns Recommendation Model

We propose a model for fault tolerance patterns recommendation as in (1)

TABLE II. DOMINANCE LEVELS OF SERVICE CHARACTERISTICS

Service Characteristic     Dominance Level
Transient Failure          2
Instance Specificity       1
Replica Provision          3
NVP Provision              0
Correctness                0
Timeliness                 0
Simplicity                 2
Economy                    0

TABLE III. RAW SCORES OF SERVICE CHARACTERISTICS

Service Characteristic     Raw Score
Transient Failure          2
Instance Specificity       3
Replica Provision          1
NVP Provision              0
Correctness                0
Timeliness                 0
Simplicity                 2
Economy                    0

TABLE IV. DOMINANCE WEIGHTS OF SERVICE CHARACTERISTICS

Service Characteristic     Dominance Weight
Transient Failure          0.25
Instance Specificity       0.375
Replica Provision          0.125
NVP Provision              0
Correctness                0
Timeliness                 0
Simplicity                 0.25
Economy                    0

P = D x R (1)

where P = A vector of fault tolerance pattern scores

D = A vector of dominance weights of service characteristics as computed in Section V.B

R = A relationship matrix between service characteristics and fault tolerance patterns as proposed in Section V.A

Therefore, given R (rows TF, IS, RP, NP, CO, TI, SI, EC; columns Retry, Wait, RBReplica, RBNVP, ActiveReplica, ActiveNVP, VotingReplica, VotingNVP, Retry+Wait, as in Table I) as

        Retry  Wait  RBReplica  RBNVP  ActiveReplica  ActiveNVP  VotingReplica  VotingNVP  Retry+Wait
  TF      8      7       0        0          0            0             0            0        7.5
  IS      8      8       7        6          5            4             5            4        8
  RP      0      0       8        0          8            0             8            0        0
  NP      0      0       0        8          0            8             0            8        0
  CO      2      2       3        4          5            6             7            8        2
  TI      4      1       5        6          7            8             2            3        2.5
  SI      8      8       7        6          5            4             3            2        8
  EC      7      8       6        5          4            3             2            1        7.5

and, in the case of the CheckBalance service, D as

  D = [TF = 0.25, IS = 0.375, RP = 0.125, NP = 0, CO = 0, TI = 0, SI = 0.25, EC = 0].

The pattern recommendation P would then be

  P = [Retry = 7.00, Wait = 6.75, RBReplica = 5.38, RBNVP = 3.75, ActiveReplica = 4.12, ActiveNVP = 2.50, VotingReplica = 3.62, VotingNVP = 2.00, Retry+Wait = 6.88].

The recommendation says how well each pattern suits the service according to the characteristic assessment. The pattern with the highest score would be best suited for the service. Since the designer of the CheckBalance service pays most attention to Instance Specificity, Transient Failure, and Simplicity, the designer inclines to rely on reliable provision of a single service. The patterns that respond well to these characteristics, i.e., Retry, Wait, and Retry+Wait, are among the first to be recommended. Here, Retry is the best-suited pattern with the highest score. Since the designer can provide replica services as well but still has simplicity in mind, RecoveryBlockReplica is the next to be recommended. The Voting patterns and those which require NVP services are more complex strategies, so they get lower scores.
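The recommendation in (1) is a plain vector-matrix product, so the scores above can be reproduced with a short script; the sketch below is ours and is not part of the supporting tool.

```python
# Reproduce P = D x R for the CheckBalance example.
patterns = ["Retry", "Wait", "RBReplica", "RBNVP", "ActiveReplica",
            "ActiveNVP", "VotingReplica", "VotingNVP", "Retry+Wait"]

# Rows of R follow the characteristic order TF, IS, RP, NP, CO, TI, SI, EC (Table I).
R = [
    [8, 7, 0, 0, 0, 0, 0, 0, 7.5],   # TF
    [8, 8, 7, 6, 5, 4, 5, 4, 8],     # IS
    [0, 0, 8, 0, 8, 0, 8, 0, 0],     # RP
    [0, 0, 0, 8, 0, 8, 0, 8, 0],     # NP
    [2, 2, 3, 4, 5, 6, 7, 8, 2],     # CO
    [4, 1, 5, 6, 7, 8, 2, 3, 2.5],   # TI
    [8, 8, 7, 6, 5, 4, 3, 2, 8],     # SI
    [7, 8, 6, 5, 4, 3, 2, 1, 7.5],   # EC
]
D = [0.25, 0.375, 0.125, 0, 0, 0, 0.25, 0]  # dominance weights (Table IV)

# P[j] = sum over characteristics i of D[i] * R[i][j]
P = [sum(D[i] * R[i][j] for i in range(len(D))) for j in range(len(patterns))]

for name, score in sorted(zip(patterns, P), key=lambda x: -x[1]):
    print(f"{name}: {score:.2f}")
# Retry: 7.00, Retry+Wait: 6.88, Wait: 6.75, RBReplica: 5.38, ...
```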

D. Generation of Fault Tolerant Service

A software tool has been developed to support fault tolerance patterns recommendation and generation of fault tolerant services as a BPEL service. The service designer will first be prompted to select service characteristics that are of interest, and then specify a dominance level for each chosen characteristic. The tool will calculate and rank the pattern scores as shown in Fig. 2 for the CheckBalance service. The designer can choose one of the recommended patterns and the tool will prompt the designer to specify the WSDL of the service together with any parameters necessary for the generation of the BPEL version. For Retry, the parameter is the number of retry times. For RecoveryBlock, Active, and Voting, the parameter is a set of WSDLs of all service implementations involved. For Wait, the parameter is the wait-until time. In this example, Retry is chosen and the number of retry times is 5. Then, a fault tolerant version of the service will be generated as a BPEL service for GlassFish ESB v2.2 as shown in Fig. 3. The BPEL version invokes the service in a fault tolerant way, implementing the pattern structure we adopt from [2], [7].

Figure 2. Pattern recommendation by supporting tool.

Figure 3. BPEL structure for Retry.

VI. CONCLUSION

In this paper, we propose a model to recommend fault tolerance patterns to services. The recommendation considers service characteristics and condition of service environment. A supporting tool is developed to assist in the recommendation and generation of fault tolerant service versions as BPEL services. As mentioned earlier, it is a lightweight approach which helps to identify fault tolerance patterns that are likely to match service characteristics according to subjective assessment of service designers. At present the recommendation is aimed for a single service. The approach can be extended to accommodate pattern recommendation and generation of fault tolerant composite services. More combinations of patterns can also be supported. In addition, we are in the process of trying the model with services in business organizations for further evaluation.

REFERENCES

[1] M. P. Papazoglou, Web Services: Principles and Technology. Pearson Education Prentice Hall, 2008.

[2] A. Liu, Q. Li, L. Huang, and M. Xiao, “FACTS: A framework for fault–tolerant composition of transactional Web services”, IEEE Trans. on Services Computing, vol.3, no.1, 2010, pp. 46-59.

[3] Z. Zheng and M. R. Lyu, “An adaptive QOS-aware fault tolerance strategy for Web services”, Empirical Software Engineering, vol.15, issue 4, 2010, pp. 323-345.

[4] A. Avizienis, J. C. Laprie, B. Randell, and C. Landwehr, “Basic concepts and taxonomy of dependable and secure computing”, IEEE Trans. on Dependable and Secure Computing, vol.1, no.1, 2004, pp. 11-33.

[5] G. Dobson, “Using WS-BPEL to implement software fault tolerance for Web services”, In Procs. of 32nd EUROMICRO Conf. on Software Engineering and Advanced Applications (EUROMICRO-SEAA’06), 2006, pp. 126-133.

[6] J. Lau, L. C. Lung, J. D. S. Fraga, and G. S. Veronese, “Designing fault tolerant Web services using BPEL”, In Procs. of 7th IEEE/ACIS Int. Conf. on Computer and Information Science (ICIS 2008), 2008, pp. 618-623.

[7] T. Thaisongsuwan and T. Senivongse, “Applying software fault tolerance patterns to WS-BPEL processes”, In Procs. of Int. Joint Conf. on Computer Science and Software Engineering (JCSSE2011), 2011, pp. 269-274.

[8] R. Hanmer, Patterns for Fault Tolerant Software. Chichester: Wiley Publishing, 2007.

[9] Z. Zheng and M. R. Lyu, “A QoS-aware fault tolerant middleware for dependable service composition”, In Procs. of IEEE Int. Conf. on Dependable Systems & Networks (DSN 2009), 2009, pp. 239-249.

[10] Z. Zheng and M. R. Lyu, “Optimal fault tolerance strategy selection for Web services”, Int. J. of Web Services Research, vol.7, issue 4, 2010, pp.21-40.


Development of Experience Base Ontology to Increase Competency of Semi-automated ICD-10-TM Coding System

Wansa Paoin
Faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
[email protected]

Supot Nitsuwat
Faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
[email protected]

Abstract— The objectives of this research were to create the International Classification of Diseases, 10th Revision, Thai Modification (ICD-10-TM) experience base ontology, to test the usability of the ICD-10-TM experience base together with the knowledge base in a semi-automated ICD coding system, and to increase the competency of the system. The ICD-10-TM experience base ontology was created by collecting 4,880 anonymous patient records coded into ICD codes by 32 volunteer expert coders working in different hospitals. Data were checked for misspellings and mismatched elements and converted into the experience base ontology using the n-triple (N3) format of the Resource Description Framework. The semi-automated coding software could search the experience base when the initial search of the ICD knowledge base yielded no result. The competency of the semi-automated coding system was tested using another data set containing 14,982 diagnoses from 5,000 medical records of anonymous patients. All ICD codes produced by the semi-automated coding system were checked against the correct ICD codes validated by ICD expert coders. When the system used only the ICD knowledge base for automated coding, it could find 7,142 ICD codes (47.67%), recall = 0.477, precision = 0.909; but when it used the ICD knowledge base with the experience base search, it could find 9,283 ICD codes (61.96%), recall = 0.677, precision = 0.928. This increase in the ability of the system was statistically significant (paired t-test, p-value = 0.008 < 0.05). This research demonstrated a novel mechanism to use an experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.

Keywords-experience base, knowledge base, ontology, semi-automated ICD coding system

I. INTRODUCTION

An ontology is a data structure, a data representation tool to share and reuse knowledge between artificial intelligence systems which share a common vocabulary. An ontology can be used as a knowledge base for a computer system to compute intelligent answers to complex questions such as ICD-10-TM (the International Classification of Diseases and Related Health Problems, 10th Revision, Thai Modification) [1] coding.

ICD-10 is a classification that has been created and maintained by the World Health Organization (WHO) since 1992 [2]. The electronic versions of ICD-10 were released in 2004 as browsing software in a CD-ROM package [3] and as ICD-10 online on the WHO website [4]. Both electronic versions provided only a simple word search service that facilitated only a minor part of the complex ICD coding process. Since 2000, some countries have added more codes from medical expert opinions into ICD-10, so ICD-10 has been modified in some countries, e.g., Australia, Canada, and Germany. In Thailand, ICD-10 has been modified into ICD-10-TM (Thai Modification) since 2000 [5] and is maintained by the Ministry of Public Health, Thailand.

ICD coding is an important task for every hospital. After a medical doctor completes treatment of a patient, the doctor must summarize all diagnoses of the patient in a diagnosis and procedure summary form. Then a clinical coder performs ICD coding for that case using the manual ICD coding process, which uses two ICD books as reference sources. All ICD codes for each patient are used for morbidity and mortality statistical analysis and for reimbursement of medical care costs in the hospital. Manual ICD coding processes are complex. ICD coding cannot be finished merely by word matching between diagnosis words and a list of ICD codes/labels; a clinical coder may assign two different ICD codes to two patients with the same diagnosis word based on each patient’s context. Unfortunately, this complexity of ICD coding was not recognized by most researchers who tried to develop semi-automated and automated ICD coding systems in the past.

Several research works have addressed automated ICD coding. The Diogene 2 program [6] built a medical terminology table and used it to map diagnosis words into a morphosemantem (word-form) layer, then converted the terms into a concept layer before matching them to labels of ICD codes in an expression layer. Heja et al. [7] matched diagnosis words with a list of ICD code labels and suggested that a hybrid model yields better matching results. Pakhomov et al. [8] designed an automated coding system to assign codes for out-patient diagnoses using example-based and machine learning techniques. Periera et al. [9] built a semi-automated coding help system using an automated MeSH-based indexing system and a mapping between MeSH and ICD-10 extracted from the UMLS metathesaurus. These previous works used only word-matching approaches and never covered the full standard ICD coding process, which is summarized in ICD-10 Volume 2 [10].

In our previous work [11] we created an ICD-10-TM ontology as a knowledge base for the development of semi-automated ICD coding. The ICD-10-TM ontology contains two main knowledge bases, i.e., a tabular list knowledge base and an index knowledge base, with 309,985 concepts and 162,092 relations. The tabular list knowledge base can be divided into an upper level ontology, which defines the hierarchical relationships among the 22 ICD chapters, and a lower level ontology, which defines relations between chapters, blocks, categories, rubrics and basic elements (include, exclude, synonym, etc.) of the ICD tabular list. The index knowledge base describes relations between keywords and modifiers in the general format and the table format of the ICD index.

The ICD-10-TM ontology was implemented in the semi-automated ICD-10-TM coding software as a knowledge base. The software is distributed by the Thai Health Coding Center, Ministry of Public Health, Thailand [12]. The coding algorithm searches for matching keywords and modifiers in the index ontology and the diagnosis knowledge base, then verifies the code definition and the include and exclude conditions from the tabular list ontology. The program displays all ICD-10-TM codes found (or reports that none were found) to the clinical coder, who can then accept the codes or change them to other codes based on her judgment and standard coding guidelines. A user survey revealed good results from the ontology search, with high user satisfaction (>95%) with the usability of the ontology. When we tried to use the system for automated coding, i.e., to code all diagnoses before a clinical coder starts coding in order to reduce the number of diagnoses to be coded by the clinical coder, we found that automated coding based on the ICD-10-TM ontology could successfully code 24-50% of all diagnosis words. To increase the competency of the system, we created another ontology, called the “experience base”, to help the system code more diagnosis words than previously possible.
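A simplified sketch of this two-stage lookup is given below; it is not the actual Lucene-based implementation, and the `index_kb` and `tabular_kb` dictionaries are hypothetical stand-ins for the index and tabular list ontologies.

```python
# Simplified sketch of the knowledge-base lookup described above (not the actual
# Lucene-based implementation). `index_kb` maps diagnosis words to candidate codes,
# and `tabular_kb` holds include/exclude conditions per code; both are hypothetical
# stand-ins for the ICD-10-TM index and tabular list ontologies.

def search_knowledge_base(diagnosis, patient_context, index_kb, tabular_kb):
    candidates = index_kb.get(diagnosis, [])          # step 1: index ontology lookup
    for code in candidates:
        rules = tabular_kb.get(code, {})
        excludes = rules.get("exclude", set())
        includes = rules.get("include", set())
        # step 2: verify include/exclude conditions against the patient context
        if patient_context in excludes:
            continue
        if includes and patient_context not in includes:
            continue
        return code
    return None  # no code found; the coder (or the experience base) takes over
```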

In this paper, we present the ICD-10-TM experience base and the application of a novel mechanism that uses the experience base ontology to enhance the competency of a semi-automated ICD coding system. The model of interaction between knowledge base and experience base developed in this work could also be used as basic knowledge for the development of other computer systems that compute intelligent answers to complex questions.

II. METHODOLOGY

To create the experience base, we asked expert coders in Thailand to volunteer to participate in this project. To be able to participate, an expert coder must have had at least 10 years of experience in ICD coding, or have passed the examination for certified coder (intermediate level) of the Thai Health Coding Center, Ministry of Public Health. The project committee selected 42 expert coders from 198 volunteers based on their ability to devote time to the project, hospital size, the location of the hospital where the coders work, and competency in using computers and software.

All selected expert coders attended a one-day training on how to use the semi-automated coding system. Each of them was assigned to use the system to do ICD coding. They used medical records of patients admitted to their hospitals during January to November 2011 as input to the system. The input data did not include patient identification data. Only the sex, age and obstetric condition of each patient had to be input into the system, since these data elements, as well as all diagnosis words, are essential for ICD code selection by the system. Each expert coder had to input at least 100 different cases into the system within 30 days. After finishing the task, each coder sent the saved data to the project coordinator by email. Data from all expert coders were checked for misspellings and mismatched elements (for example, a male patient could not be an obstetric case). Records of the patient type with each diagnosis word and ICD code from every case were created using the n-triple (N3) format of the Resource Description Framework (RDF) [13] to build the experience base ontology. The ontology was built into the system as an inverted index structure by transforming it into the Lucene 3.4 [14] search engine library, which is the core engine of the semi-automated ICD coding system. The new semi-automated coding system therefore has another ontology, the ICD experience base, created from the expert coders’ work. The automated coding algorithm has one new step, which is executed when searching the ICD knowledge base yields no result: when an ICD code is not found in the ICD knowledge base, the system searches the ICD experience base. Since the ICD code for a diagnosis with the same patient context sometimes varies from one expert opinion to another, the system selects the ICD code with the highest frequency among expert opinions.
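The experience base fallback with highest-frequency selection can be sketched as follows; this is a simplified illustration (the real system stores the triples in a Lucene inverted index), and the dictionary-based lookup is our assumption.

```python
from collections import Counter

# Simplified sketch of the experience-base fallback (the real system stores the
# triples in a Lucene inverted index; here a plain dictionary stands in for it).
# experience_base maps (diagnosis_word, patient_context) to a list of ICD codes,
# one per expert opinion collected from the coded records.

def search_experience_base(diagnosis_word, patient_context, experience_base):
    opinions = experience_base.get((diagnosis_word, patient_context), [])
    if not opinions:
        return None
    # When expert opinions disagree, take the code with the highest frequency.
    code, _count = Counter(opinions).most_common(1)[0]
    return code

# Example with opinions similar to those reported in Table II (counts illustrative):
eb = {("dyslipidemia", "man_not_newborn"): ["E78.5", "E78.5", "E78.6", "E78.9", "E78.5"]}
print(search_experience_base("dyslipidemia", "man_not_newborn", eb))  # E78.5
```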

The competency of the semi-automated coding system was tested using another set of patient data. This dataset contains 14,982 diagnoses from 5,000 medical records of patients admitted during January to June 2011 to another hospital, which did not participate in the experience base creation. Every ICD code in this dataset was validated for 100% accuracy by another three expert coders. All ICD codes produced by the semi-automated coding system when using the knowledge base only and when using the knowledge base with the experience base were checked against the correct ICD codes in the dataset for accuracy.


III. RESULTS

By the end of the project, 4,880 diagnosis words and patient contexts had been collected from 32 expert coders. Ten expert coders did not send their cases by the deadline, so their data were excluded from analysis in this phase. All 4,880 diagnosis words and patient contexts were used to create the experience base ontology. A Python script was written and used to transform each record from the comma-separated value file format into RDF N3 files.
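The paper does not show the conversion script; a minimal sketch of such a CSV-to-N3 transformation, with hypothetical column names and placeholder namespace URIs, could look like this:

```python
import csv

# Hypothetical CSV columns: diagnosis_word, ptdx_id, patient_context, expert, icd_code.
# Namespace prefixes follow the paper's examples (dxword:, word:, ptdxid:, pt:,
# ptcontext:, expert:, icd:, icd10:); the @prefix URIs below are placeholders.
PREFIXES = """@prefix dxword: <http://example.org/dxword#> .
@prefix word: <http://example.org/word#> .
@prefix ptdxid: <http://example.org/ptdxid#> .
@prefix pt: <http://example.org/pt#> .
@prefix ptcontext: <http://example.org/ptcontext#> .
@prefix expert: <http://example.org/expert#> .
@prefix icd: <http://example.org/icd#> .
@prefix icd10: <http://example.org/icd10#> .
"""

def csv_to_n3(csv_path, n3_path):
    with open(csv_path, newline="") as fin, open(n3_path, "w") as fout:
        fout.write(PREFIXES)
        for row in csv.DictReader(fin):
            fout.write(f"dxword:{row['diagnosis_word']} word:hasPtDxId ptdxid:{row['ptdx_id']} .\n")
            fout.write(f"ptdxid:{row['ptdx_id']} pt:isA ptcontext:{row['patient_context']} .\n")
            fout.write(f"ptdxid:{row['ptdx_id']} icd:codeBy expert:{row['expert']} .\n")
            fout.write(f"ptdxid:{row['ptdx_id']} icd:hasCode icd10:{row['icd_code']} .\n")
```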

The experience base ontology contains five concepts and four relations, as shown in Table I. Each diagnosis word in a patient record can be uniquely identified. Each ICD expert opinion on the ICD code that should be used for a diagnosis word, given the patient context, is an important concept in the ontology. All these concepts and relations were used to construct the RDF statements in the experience base ontology. For example, if an expert ‘[email protected]’ gave the opinion that the diagnosis word ‘disseminated tuberculosis’ in the patient context ‘man not newborn’ should be coded to ICD code ‘A18.3’, the RDF statements in N3 format would be written as follows:

  dxword:disseminated_tuberculosis word:hasPtDxId ptdxid:001 .
  ptdxid:001 pt:isA ptcontext:man_not_newborn .
  ptdxid:001 icd:codeBy expert:[email protected] .
  ptdxid:001 icd:hasCode icd10:A183 .

The experience base ontology concepts and relations can be presented as a graph, as shown in Figure 1.

TABLE I. ALL EXPERIENCE BASE CONCEPTS AND RELATIONS IN RDF N3 FORMAT

Experience Base Concept/Relation   Ontology type   RDF format        Example
Diagnosis Word                     Concept         dxword:           dxword:disseminated_tuberculosis
PatientDiag ID                     Concept         ptdxid:           ptdxid:001
Patient Context                    Concept         ptcontext:        ptcontext:man_not_newborn
Expert                             Concept         expert:           expert:[email protected]
ICD10 Code                         Concept         icd10:            icd10:A183
hasPtDxId                          Relation        word:hasPtDxId    dxword:dyslipidemia word:hasPtDxId ptdxid:101
isA                                Relation        pt:isA            ptdxid:101 pt:isA ptcontext:man_not_newborn
codeBy                             Relation        icd:codeBy        ptdxid:101 icd:codeBy expert:abc
hasCode                            Relation        icd:hasCode       ptdxid:101 icd:hasCode icd10:E78.5

The system was used to automatically code the 14,982 diagnoses in the test dataset. When the system used only the ICD knowledge base, it could find 7,142 ICD codes (47.67%), but when it used the ICD knowledge base with the experience base search, the system could find 9,283 ICD codes (61.96%). This increase in ability was tested for statistical significance using a paired t-test with an alpha value of 0.05; t-stat = -79.30 with p-value = 0.008 (< 0.05).

Recall and precision of the system were calculated. The recall and precision values when the system used the ICD knowledge base only were 0.477 and 0.909, while the recall and precision values when the system used the ICD knowledge base with the experience base were 0.677 and 0.928.
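For reference, recall and precision here follow the usual information retrieval definitions; the sketch below uses placeholder counts for illustration only, not the paper's raw tallies.

```python
def recall_precision(num_correct, num_returned, num_expected):
    """Standard IR definitions: recall = correct / expected, precision = correct / returned."""
    recall = num_correct / num_expected if num_expected else 0.0
    precision = num_correct / num_returned if num_returned else 0.0
    return recall, precision

# Placeholder counts for illustration only (not the paper's raw tallies):
print(recall_precision(num_correct=9000, num_returned=9283, num_expected=14982))
```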

Figure 1. A part of the ICD experience base. A diagnosis word “Dyslipidemia” in each patient record could be coded to various ICD codes, based on each expert opinion and each patient context.

IV. DISCUSSION

ICD-10 coding is not a simple word matching process. Qualified human ICD coders never do a simple diagnosis word search or browse for a diagnosis term in a list of ICD codes and labels. Unfortunately, research on semi-automated and automated ICD coding systems in the past [6-9] never recognized this important concept. This explains why there is no truly workable automated ICD coding system until now.

The ICD index and tabular list of diseases were created in 1992, so the diagnosis words in ICD do not include every synonym, alternative name, or some specific diagnoses in highly specialized medical services. On the other hand, ICD added some patient context into the classification scheme, so coding one disease name may produce different ICD codes if the patient context changes. For example, the ICD code for the diagnosis “internal hemorrhoids” would be O22.4 when the patient is a pregnant woman, but the code would be I84.2 for an adult male patient. These facts make ICD coding a complex job that needs human coders. A clinical coder must know how to change some diagnosis words when the first round of searching cannot find the code. She must have the patient record in hand all the time she is coding, in order to check the patient context that may affect the choice of the correct ICD code.

Our semi-automated ICD coding system was not developed to replace all the clinical coders’ work on ICD coding. However, if the system can find initial ICD codes for some of the diagnosis words summarized by the medical doctor, the coder’s work will be reduced to some extent. Our system used the ICD ontology created from the ICD-10-TM alphabetical index and tabular list of diseases as knowledge bases to search for the correct ICD code for each diagnosis word plus patient context. Automated coding based on this knowledge could code 47.67% of all diagnoses with good accuracy (90.9%).

The recall ability of the old system was low because, in real-world medical records, there are many varieties of words that doctors may use for a diagnosis. Some are new words which appeared after the creation of ICD-10; for example, “dyslipidemia, chronic kidney disease, diabetes mellitus type 2” are more commonly used by doctors today than the older terms “hyperlipidemia, chronic renal failure, non-insulin dependent diabetes mellitus” found in ICD-10.

Adding an experience base created from real-world cases to the system could increase its recall ability. The ICD experience base ontology contains diagnosis words from real medical records with ICD codes assigned to these new words, so the system searches the experience base if the first round of searching the knowledge base yields no ICD code. The recall ability of the system increased from 0.477 to 0.677 with good precision (0.928).

Different expert opinions for the same diagnosis were anticipated in the experience base. In fact, a consensus of expert opinion was rarely found in the ICD coding experience base. The varieties of expert opinions on the coding of some diagnosis words are shown in Table II. The system chooses the code with the highest frequency to be used as the “correct” code. This strategy should work well unless there are too few opinions for some rare diagnosis words.

TABLE II. EXPERT OPINION OF SOME DIAGNOSIS WORD IN ICD EXPERIENCE BASE ONTOLOGY

Diagnosis words            ICD codes from expert opinion   Highest frequency code
Dyslipidemia               E78.5, E78.6, E78.9             E78.5 (64.5%)
Chronic kidney disease     N18.0, N18.9, N19               N18.9 (35.5%)
Triple vessels disease     I21.4, I25.1, I25.9, N18.9      I25.1 (80%)
Diabetes mellitus type 2   E11.9, E11                      E11.9 (95.8%)

Although the ICD experience base ontology at this stage contains only 4,880 cases, this experiment encourages the use of an experience ontology to increase the recall ability of the semi-automated ICD coding system. In future research work, we plan to add more cases to the experience base and to test the ability of the system with more test data.

V. CONCLUSION

An ICD experience base ontology could be created using ICD codes from medical records which were coded by expert coders. This experience base ontology was implemented in the semi-automated ICD coding system. Searching the experience base was very useful when the first round of searching the knowledge base yielded no result. The recall ability of the system could be increased by adding experience base searching to its algorithm, while good precision was preserved.

ACKNOWLEDGMENT

This research was supported by the Thai National Health Security Office, Thai Health Standard Coding Center (THCC), Ministry of Public Health, Thailand and Thai Collaborating Center for WHO-Family of International Classification.

REFERENCES

[1] Bureau of Policy and Strategy, Ministry of Public Health, International Statistical Classification of Disease and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, 2009.

[2] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision. Geneva, Switzerland: The World Health Organization, 1992.

[3] The World Health Organization, International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition. Geneva, Switzerland: The World Health Organization, 2004.

[4] The World Health Organization. ICD-10 online [internet]. Geneva, Switzerland: The World Health Organization; 2011 [cited 2011 Jun 30]. Available from http://www.who.int/classifications/icd/en/.

[5] Bureau of Policy and Strategy, Ministry of Public Health, Thailand. International Statistical Classification of Disease and Related Health Problems, 10th Revision, Thai Modification (ICD-10-TM). Nonthaburi, Thailand: The Ministry of Public Health, Thailand: 2000.

[6] C. Lovis, R. Buad, A.M. Rassinoux, P.A. Michel and J.R. Scherrer, “Building medical dictionaries for patient encoding systems: A methodology,” in: Artificial Intelligence in Medicine. Heidelberg: Springer, 1997, pp. 373–380.

[7] G. Heja and G. Surjan, “Semi-automatic classification of clinical diagnoses with hybrid approach,” in: Proceedings of the 15th symposium on computer based medical system - CBMS 2002. IEEE Computer Society Press; 2002,pp. 347–352.

[8] S.V.S. Pakhomov, J.D. Buntrock and C.G. Chute. “Automating the assignment of diagnosis codes to patient encounters using example-based and machine learning techniques,” J Am Med Inform Assoc, 2006, 13 pp.516 –525.

[9] S. Periera, A. Neveol , P. Masari and M. Joubert, “Construction of a semi-automated ICD-10 coding help system to optimize medical and economic coding” in A. Hasman et al, editors. Ubiquity: Technologies for Better Health in Aging Societies, VA: IOS Press, 2006 pp.845-850.

[10] The World Health Organization. International Statistical Classification of Diseases and Related Health Problems, 10th Revision, 2nd Edition, Volume 2. Geneva, Switzerland: The World Health Organization; 2004. p.32.

[11] S. Nitsuwat and W. Paoin, “Development of ICD-10-TM ontology for semi-automated morbidity coding system in Thailand” Methods of Information in Medicine, in press.

[12] Semi-automated ICD-10-TM coding system [internet]. Nonthaburi, Thailand: The Thai Health Coding Center, Ministry of Public Health, Thailand; [cited 2011 Aug 12]. Available from : http://www.thcc.or.th/formbasic/regis.php.

[13] RDF Notation 3 [internet]: The World Wide Web Consortium; [cited 2011 Jun 12]. Available from: http://www.w3.org/DesignIssues/Notation3.

[14] Apache Lucene [internet]: The Apache Software Foundation; [cited 2012 Jan 24]. Available from http://lucene.apache.org/java/docs/index.html.


Collocation-Based Term Prediction for Academic Writing

Narisara Nakmaetee*, Maleerat Sodanil*
*Faculty of Information Technology, King Mongkut’s University of Technology North Bangkok, Bangkok, Thailand
[email protected], [email protected]

Choochart Haruechaiyasak†
†Speech and Audio Technology Laboratory (SPT), National Electronics and Computer Technology Center, Pathumthani, Thailand
[email protected]

Abstract—A research paper is a kind of academic writing, which is formal writing. Academic writing should not contain any mistakes; otherwise, the mistakes would make the authors look unprofessional. In general, academic writing is a difficult task, especially for non-native speakers. Appropriate vocabulary selection and perfect grammar are two of the many important factors that make writing appear formal. In this paper, we propose and compare various collocation-based feature sets for training classification models to predict verbs and verb tense patterns for academic writing. The proposed feature sets include n-grams of both Part-of-Speech (POS) tags and collocated terms preceding and following the predicted term. From the experimental results, using the combination of POS and selected terms yielded the best accuracy of 50.21% for term prediction and 73.64% for verb tense prediction.

Keywords: Academic writing; collocation; n-gram; Part-of-Speech (POS)

I. INTRODUCTION

In a broad definition, academic writing is any writing done to fulfill a requirement of a college or university [16]. There are several academic document types such as book reports, essays, dissertations and research papers. Academic writing is different from general writing because it is formal writing. Many factors contribute to the formality of a text; major influences include vocabulary selection, perfect grammar and writing structures. For researchers, academic writing is an important channel to publish their new knowledge, ideas or arguments. Mistakes should not occur in academic writing, because they would make the researcher look unprofessional. Moreover, errors in academic writing may result in the rejection of a research paper. Thus, academic writing is a difficult task, especially for non-native speakers.

At present, there are many software packages that help researchers write research papers. The software can be classified into two groups: academic writing software and grammar checker software. Academic writing software provides academic writing style templates such as APA, MLA and Chicago, as well as page layout control and reference and citation features. Grammar checker software provides grammar checking, spelling checking, and dictionary-based grammar suggestion features. Some packages also provide general writing templates such as e-mail and business letter templates. From our review, we found that academic writing software cannot suggest suitable vocabulary for academic writing because it suggests vocabulary based on synonyms. Synonymous words may be formal or informal, whereas words in academic writing should only be formal.

In this paper, we focus on two factors that affect formal writing: vocabulary selection and perfect grammar. For vocabulary selection, there are two associated problems. The first problem is appropriate word selection. Non-native speakers often have difficulty selecting appropriate vocabulary for academic writing because they tend to look up a word in a dictionary and use it without considering the word sense. They probably do not know the exact meaning of the word. Moreover, they often tend to use very basic vocabulary instead of a more sophisticated word. For example, consider the following two sentences.

(1) We talk about the main advantages of our methodology.

(2) We discuss the main advantages of our methodology.

Even though sentences (1) and (2) have the same meaning, they use different verbs, talk about and discuss. Sentence (2) is more formal than sentence (1). The second problem is collocation. Collocation errors are a common and persistent error type among non-native speakers. Due to collocation errors, a piece of writing may lack significant knowledge, which might cause a loss of precision. For example, consider the following two sentences.

(3) Numerous NLP applications rely search engine queries.

(4) Numerous NLP applications rely on search engine queries.

Sentence (3) contains a common error often made by a non-native speaker. Regarding perfect grammar, non-native speakers frequently produce incorrect grammar, such as sentence fragments and wrong verb tense usage. In this paper, we focus on two specific tasks: verb prediction and verb tense pattern prediction. Verb prediction is for suggesting a verb in a sentence which is suitable for a given context. Verb tense pattern prediction is for suggesting the correct tense for a given verb.

The remainder of this paper is organized as follows. In the next section, we review some related work in academic writing. Section III gives the details of our proposed approach. The experiments and discussion are given in Section IV. We conclude the paper and give directions for future work in Section V.

II. RELATED WORKS

There is some research related to academic writing, which can be classified into two groups: phrasal expression extraction [3][10] and word suggestion [4][13]. The phrasal expression extraction approach is based on statistical and rule-based algorithms for suggesting useful phrasal expressions. The word suggestion approach adopts probabilistic models or machine learning to discover word associations and to build a model for word suggestion.

Collocation is a group of two or more words that usually go together. It is useful for helping English learners improve their fluency. Moreover, we can predict the meaning of an expression from the meaning of its parts [7]. Consequently, collocation information is useful for natural language processing. Collocations include noun phrases, phrasal verbs, and other stock phrases [7]. However, in our study we focus on phrasal verbs. There are many works related to collocation, which can be presented in four groups: lexical ambiguity resolution [1][8][14], machine translation [5][6][11], collocation extraction [9][12], and collocation suggestion [4][13][15].

For collocation extraction and suggestion, Church and Hanks [12] proposed techniques that used mutual information to measure the association between words. Pearce [9] described a collocation extraction technique using WordNet; the technique relied on a synonym mapping for each of the word senses. Futagi [2] discussed how dealing with some "non-pertinent" English language learner issues in the development of an automated tool to detect miscollocations in learner texts significantly reduces possible tool errors; their work focused on the factors that affected the design of the collocation detection tool. Zaiu Inkpen and Hirst [15] presented an unsupervised method to acquire knowledge about the collocational behavior of near-synonyms; they used mutual information, Dice, chi-square, log-likelihood, and Fisher's exact test to measure the degree of association between two words. Li-E Liu, Wible, and Tsao [4] proposed a probabilistic collocation suggestion model which incorporated three features: word association strength, semantic similarity and the notion of shared collocations. Wu, Chang, Mitamura, and S. Chang [13] introduced a machine learning model based on classification results to provide verb-noun collocation suggestions; they extracted collocations whose components have a syntactic relationship with one another. In this paper, we construct feature sets based on collocation: POS and collocated terms.

III. OUR PROPOSED APPROACH

In this section, we describe the details of the different feature set approaches for verb and verb tense pattern prediction. Both approaches are based on collocation: Part-of-Speech (POS) tags and collocated terms. Fig. 1 illustrates the process of preparing the feature sets for training the prediction models. Firstly, we collect a large number of research papers from the ACL Anthology website to develop our corpus. Secondly, we convert the papers from PDF format into text files and extract the abstracts from the text files. Next, we extract sentences from the documents. Then, we tokenize the input sentences into tokens. Furthermore, we tag the tokens in a sentence with POS tags. Given a sentence from our corpus in (5), the process of POS tagging yields the result in sentence (6). The POS tag set is based on the Penn Treebank II guideline [18].

Figure 1. The process of feature set extraction


(5) More specifically this paper focuses on the robust extraction of Named Entities from speech input where a temporal mismatch between training and test corpora occurs.

(6) More/RBR specifically/RB this/DT paper/NN focuses/VBZ on/IN the/DT robust/JJ extraction/NN of/IN Named/VBN Entities/NNS from/IN speech/NN input/NN where/WRB a/DT temporal/JJ mismatch/NN between/IN training/NN and/CC test/NN corpora/NN occurs/NNS ./.
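The tokenization and POS tagging step can be sketched as follows; this is only an illustration, assuming NLTK and its Penn Treebank tagger, since the paper does not name the tagging tool actually used.

```python
# Illustrative sketch of the tokenization and POS-tagging step (assumes NLTK;
# the paper does not state which tagger was actually used).
# May first require: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk

sentence = ("More specifically this paper focuses on the robust extraction of "
            "Named Entities from speech input where a temporal mismatch between "
            "training and test corpora occurs.")

tokens = nltk.word_tokenize(sentence)            # split the sentence into tokens
tagged = nltk.pos_tag(tokens)                    # Penn Treebank II POS tags
print(" ".join(f"{w}/{t}" for w, t in tagged))   # e.g. "More/RBR specifically/RB ..."
```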

Next, we identify the verb and the verb tense pattern in each sentence. Table I and Table II give some examples of verbs and verb tense patterns.

TABLE I. EXAMPLE SENTENCES FOR VERB PREDICTION

Example sentence | Verb tag
We present the technique of Virtual Annotation as a specialization of Predictive Annotation for answering definitional what is questions. | present
This paper proposes a practical approach employing n-gram models and error correction rules for Thai key prediction and Thai English language identification. | propose
This paper investigates the use of linguistic knowledge in passage retrieval as part of an open-domain question answering system. | investigate
In this paper, we demonstrate a discriminative approach to training simple word alignment models that are comparable in accuracy to the more complex generative models normally used. | demonstrate
We evaluate the results through measuring the overlap of our clusters with clusters compiled manually by experts. | evaluate

TABLE II. EXAMPLE POS TAGGED SENTENCES FOR VERB TENSE PATTERN PREDICTION

Example POS tagged sentence | Verb tense pattern tag
This/DT demonstration/NN will/MD motivate/VB some/DT of/IN the/DT significant/JJ properties/NNS of/IN the/DT Galaxy/NNP Communicator/NNP Software/NNP Infrastructure/NNP and/CC show/VB how/WRB they/PRP support/VBP the/DT goals/NNS of/IN the/DT DARPA/NNP Communicator/NNP program/NN ./. | /MD /VB
First/RB we/PRP describe/VBP the/DT CU/NNP Communicator/NNP system/NN that/WDT integrates/VBZ speech/NN recognition/NNS synthesis/NN and/CC natural/JJ language/NN understanding/NN technologies/NNS using/VBG the/DT DARPA/NNP Hub/NNP Architecture/NNP ./. | /VBP
BravoBrava/NNP is/VBZ expanding/VBG the/DT repertoire/NN of/IN commercial/JJ user/NN interfaces/NNS by/IN incorporating/VBG multimodal/JJ techniques/NNS combining/VBG traditional/JJ point/NN and/CC click/NN interfaces/NNS with/IN speech/NN recognition/JJ speech/NN synthesis/NN and/CC gesture/NN recognition/NN ./. | is /VBG
We/PRP have/VBP aligned/VBN Japanese/JJ and/CC English/JJ news/NN articles/NNS and/CC sentences/NNS to/TO make/VB a/DT large/JJ parallel/NN corpus/NN ./. | have /VBN
Recently/RB confusion/NN network/NN decoding/NN has/VBZ been/VBN applied/VBN in/IN machine/NN translation/NN system/NN combination/NN ./. | has been /VBN

TABLE III. EXAMPLE OF FEATURE LABEL ASSIGNMENT FOR A POS TAGGED SENTENCE

POS tagged sentence: In /IN contrast /NN to /TO previous /JJ work /NN we /PRP particularly /RB focus /VBP exclusively /RB on /IN clustering /VBG polysemic /JJ verbs /NNS (target verb: focus /VBP)

N-gram term and POS features:
pre3-gram: work, prePOS3-gram: /NN
pre2-gram: we, prePOS2-gram: /PRP
pre1-gram: particularly, prePOS1-gram: /RB
post1-gram: exclusively, postPOS1-gram: /RB
post2-gram: on, postPOS2-gram: /IN
post3-gram: clustering, postPOS3-gram: /VBG

Selected term features:
preNoun/prePronoun: we
preAdv: particularly
postAdv: exclusively
postPrepo: on


TABLE IV. FEATURE SETS FOR VERB PREDICTION AND VERB TENSE PATTERN PREDICTION

Verb prediction feature sets:
• Term-only: 1-gram, 2-gram, 3-gram
• Term&POS: 1-gram, 2-gram, 3-gram
• Selected Term-only: 3-gram, preNoun/prePronoun, postNoun; 3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv; 3-gram, preNoun/prePronoun, postNoun, postPrepo; 3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv, postPrepo
• Selected Term&POS: 3-gram, preNoun/prePronoun, postNoun; 3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv; 3-gram, preNoun/prePronoun, postNoun, postPrepo; 3-gram, preNoun/prePronoun, postNoun, preAdv, postAdv, postPrepo

Verb tense pattern prediction feature sets:
• Term-only: 1-gram, 2-gram, 3-gram
• POS-only: 1-gram, 2-gram, 3-gram
• Term&POS: 1-gram, 2-gram, 3-gram
• Selected Term&POS: 1-gram, preNoun/prePronoun, postNoun; 2-gram, preNoun/prePronoun, postNoun; 3-gram, preNoun/prePronoun, postNoun

Then, we examine the POS tagged sentences and assign the feature labels as shown in Table III. From this examination, we find that a noun usually occurs in the positions preceding and following a verb. Therefore, based on linguistic knowledge and this observation, we select nouns as part of our feature set.

In Table III, "pre3-gram" and "prePOS3-gram" denote the term and the Part-of-Speech (POS) tag in the third position preceding the verb. The features "pre2-gram" and "prePOS2-gram" denote the term and POS tag in the second preceding position, while "pre1-gram" and "prePOS1-gram" denote the term and POS tag immediately preceding the verb. "post1-gram" and "postPOS1-gram" indicate the term and POS tag immediately following the verb, and "post2-gram", "postPOS2-gram", "post3-gram" and "postPOS3-gram" indicate the terms and POS tags in the second and third following positions. The feature "preNoun/prePronoun" denotes a noun or pronoun occurring before the verb, "preAdv" denotes an adverb occurring before the verb, and "postAdv" and "postPrepo" indicate an adverb and a preposition occurring after the verb.
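A minimal sketch of this feature labelling is given below; it assumes the sentence is already POS tagged, and the function and its return format are illustrative rather than the authors' actual implementation.

```python
# Sketch of extracting the collocation features around a target verb from a
# POS-tagged sentence (feature names follow Table III; details are assumed).
def extract_features(tagged, verb_index):
    feats = {}
    for i in (1, 2, 3):                              # n-gram term and POS features
        if verb_index - i >= 0:
            w, t = tagged[verb_index - i]
            feats[f"pre{i}-gram"], feats[f"prePOS{i}-gram"] = w, t
        if verb_index + i < len(tagged):
            w, t = tagged[verb_index + i]
            feats[f"post{i}-gram"], feats[f"postPOS{i}-gram"] = w, t
    # selected terms: nearest noun/pronoun, adverb and preposition around the verb
    for w, t in reversed(tagged[:verb_index]):
        if t.startswith("NN") or t.startswith("PRP"):
            feats["preNoun/prePronoun"] = w
            break
    for w, t in reversed(tagged[:verb_index]):
        if t.startswith("RB"):
            feats["preAdv"] = w
            break
    for w, t in tagged[verb_index + 1:]:
        if t.startswith("RB"):
            feats["postAdv"] = w
            break
    for w, t in tagged[verb_index + 1:]:
        if t == "IN":
            feats["postPrepo"] = w
            break
    return feats

tagged = [("In","IN"),("contrast","NN"),("to","TO"),("previous","JJ"),("work","NN"),
          ("we","PRP"),("particularly","RB"),("focus","VBP"),("exclusively","RB"),
          ("on","IN"),("clustering","VBG"),("polysemic","JJ"),("verbs","NNS")]
print(extract_features(tagged, tagged.index(("focus", "VBP"))))
```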

The final feature sets based on the selected terms and POS tags are shown in Table IV. "Term-only" indicates the feature sets that include the terms occurring in the positions preceding and following a verb. There are three Term-only feature sets: "1-gram Term-only" uses the term immediately preceding and following the verb, "2-gram Term-only" uses the terms in the first and second preceding and following positions, and "3-gram Term-only" uses the terms in the first, second and third preceding and following positions. "POS-only" represents the feature sets that consist of the POS tags in those same positions, again with 1-gram, 2-gram and 3-gram variants. "Term&POS" indicates the feature sets that combine both the terms and the POS tags in those positions, with 1-gram, 2-gram and 3-gram variants. "Selected Term-only" represents the feature sets that combine the 3-gram terms with the selected terms, i.e., the nouns or pronouns, the adverbs, or a preposition occurring around the verb. There are four such feature sets: (a) the 3-gram terms plus the nouns or pronouns before and after the verb; (b) the same plus the adverbs; (c) the same as (a) plus a preposition; and (d) the 3-gram terms plus the nouns or pronouns, the adverbs, and a preposition.

In Table IV, there are two Selected Term&POS groups: the Selected Term&POS verb feature sets and the Selected Term&POS verb tense pattern feature sets. For the verb feature sets, there are four patterns: (a) the 3-gram terms, the 3-gram POS tags and the nouns or pronouns occurring before and after the verb; (b) the same plus the adverbs; (c) the same as (a) plus a preposition; and (d) the 3-gram terms, the 3-gram POS tags, the nouns or pronouns, the adverbs, and a preposition. For the verb tense pattern feature sets, there are three patterns: the 1-gram, 2-gram and 3-gram terms and POS tags, each combined with the nouns or pronouns occurring before and after the verb.

IV. EXPERIMENTS AND DISCUSSION

From the research paper archive, the ACL Anthology [17], we collected 3,637 abstracts from ACL and HLT conferences from 2000 to 2011, from which we extracted 15,151 sentences. To evaluate the performance of all feature set approaches, we use the Naive Bayes classification algorithm with 10-fold cross validation on the data set.
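A minimal sketch of this evaluation setup is given below, assuming scikit-learn; the toy feature dictionaries and labels merely stand in for the features and target verbs extracted from the corpus.

```python
# Sketch of the evaluation: Naive Bayes with 10-fold cross validation
# (assumes scikit-learn; the data below is a toy stand-in for the corpus).
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

feature_dicts = [{"pre1-gram": "we", "prePOS1-gram": "PRP", "postPrepo": "on"},
                 {"pre1-gram": "paper", "prePOS1-gram": "NN", "postPrepo": "of"}] * 20
verb_labels = ["present", "propose"] * 20          # one target verb per sentence

X = DictVectorizer().fit_transform(feature_dicts)  # one-hot encode categorical features
scores = cross_val_score(MultinomialNB(), X, verb_labels, cv=10)
print("mean accuracy: %.2f%%" % (100 * scores.mean()))
```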

For verb prediction, we selected the top 10 ranked verbs found in the corpus: be, describe, present, demonstrate, propose, achieve, use, evaluate, investigate, and compare. From our corpus, we selected the 3,149 sentences that contain these top-10 verbs for evaluating the verb prediction feature sets. There are 14 feature set approaches, which can be classified into four groups: term-only, term&POS, selected term-only, and selected term&POS. Table V presents the performance evaluation of the verb prediction feature sets based on accuracy. From the table, it can be observed that the performance improves as n in the n-gram features increases. Using POS tags alone does not increase the performance of verb prediction, because a POS tag is only the linguistic category of a word in the sentence. However, we found that combining POS tags with the selected noun terms performs better than the selected noun terms alone. Moreover, using adverbs does not help increase the performance, whereas prepositions do help, because some prepositions usually collocate with a verb, as in "rely on". In summary, the best feature set uses the 3-gram terms and POS tags together with the selected noun, pronoun and preposition terms, and the highest accuracy is approximately 50%.

TABLE V. EVALUATION RESULTS FOR FEATURE SETS OF VERB PREDICTION

Approach | Accuracy (%)
Term-only, 1-gram | 42.89
Term-only, 2-gram | 48.10
Term-only, 3-gram | 49.03
Term&POS, 1-gram | 42.27
Term&POS, 2-gram | 47.31
Term&POS, 3-gram | 48.14
Selected Term-only: 3-gram, preNoun/pronoun, postNoun | 49.16
Selected Term-only: 3-gram, preNoun/pronoun, postNoun, preAdv, postAdv | 48.99
Selected Term-only: 3-gram, preNoun/pronoun, postNoun, postPrepo | 50.05
Selected Term-only: 3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo | 49.32
Selected Term&POS: 3-gram, preNoun/pronoun, postNoun | 49.29
Selected Term&POS: 3-gram, preNoun/pronoun, postNoun, preAdv, postAdv | 49.16
Selected Term&POS: 3-gram, preNoun/pronoun, postNoun, postPrepo | 50.21
Selected Term&POS: 3-gram, preNoun/pronoun, postNoun, preAdv, postAdv, postPrepo | 49.95

TABLE VI. EVALUATION RESULTS FOR FEATURE SETS OF VERB TENSE PATTERN PREDICTION

Approach | Accuracy (%)
Term-only, 1-gram | 68.71
Term-only, 2-gram | 70.27
Term-only, 3-gram | 69.99
POS-only, 1-gram | 67.52
POS-only, 2-gram | 65.58
POS-only, 3-gram | 65.14
Term&POS, 1-gram | 72.72
Term&POS, 2-gram | 70.96
Term&POS, 3-gram | 70.49
Selected Term&POS: 1-gram Term&POS, preNoun/pronoun, postNoun | 73.64
Selected Term&POS: 2-gram Term&POS, preNoun/pronoun, postNoun | 72.44
Selected Term&POS: 3-gram Term&POS, preNoun/pronoun, postNoun | 71.92

For verb tense pattern prediction, we used the full corpus of 15,151 sentences. Similar to verb prediction, the verb tense pattern feature sets can be classified into four groups: term-only, POS-only, term&POS, and selected term&POS. Table VI presents the performance evaluation based on accuracy. It can be observed that the performance of POS-only is quite low, and that combining selected terms with POS increases the performance. The combination of POS tags and the selected noun terms performs better than the other feature sets; the best feature set is the 1-gram Term&POS with the selected preceding noun/pronoun, reaching 73.64% accuracy. The reason is that nouns and pronouns provide a very good clue for predicting the verb tense, since they typically act as the subject of the sentence.

V. CONCLUSION AND FUTURE WORKS

We performed a comparative study of various feature sets for predicting the verb and the verb tense pattern in sentences. Four groups of feature sets based on Part-of-Speech (POS) tags and selected terms, such as nouns and pronouns, were evaluated in the experiments. We performed the experiments using the abstract corpus as the data set and Naive Bayes as the classification algorithm. From the experimental results, verb prediction using the 3-gram terms and POS tags with the selected noun, pronoun and preposition terms yielded the best result of 50.21% accuracy. For verb tense prediction, the highest accuracy of 73.64% was obtained using the 1-gram terms and POS tags with the selected noun and pronoun terms.

For our future work, we will improve the performance of verb prediction by using WordNet, a large lexical database that helps find synonyms of a word with the appropriate word sense. Moreover, instead of a multi-class classification model, we will adopt a one-against-all classification model to improve the verb prediction results.

REFERENCES
[1] D. Biber, "Co-occurrence patterns among collocations: a tool for corpus-based lexical knowledge acquisition," Comput. Linguist. 19, pp. 531-538, 1993.
[2] Y. Futagi, "The effects of learner errors on the development of a collocation detection tool," Proc. of the fourth workshop on Analytics for noisy unstructured text data, pp. 27-33, 2010.
[3] S. Kozawa, Y. Sakai, et al., "Automatic Extraction of Phrasal Expression for Supporting English Academic Writing," Proc. of the 2nd KES International Symposium IDT 2010, pp. 485-493, 2010.
[4] A. Li-E Liu, D. Wible, and N. Tsao, "Automated Suggestions for Miscollocations," Proc. of the NAACL HLT Workshop on Innovative Use of NLP for Building Educational Applications, pp. 47-50, 2009.
[5] Z. Liu et al., "Improving Statistical Machine Translation with monolingual collocation," Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 825-833, 2010.
[6] Y. Lu and M. Zhou, "Collocation translation acquisition using monolingual corpora," Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics, 2004.
[7] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, 1999.
[8] D. Martinez and E. Agirre, "One sense per collocation and genre/topic variations," Proc. of the 2000 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, held in conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, pp. 207-215, 2000.
[9] D. Pearce, "Synonymy in Collocation Extraction," Proc. of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, 2001.
[10] Y. Sakai, K. Sugiki, et al., "Acquisition of useful expressions from English research papers," Natural Language Processing, SNLP '09, pp. 59-62, 2009.
[11] F. Smadja et al., "Translating collocations for bilingual lexicons: a statistical approach," Comput. Linguist. 22, pp. 1-38, 1996.
[12] K. Ward Church and P. Hanks, "Word Association Norms, Mutual Information, and Lexicography," Proc. of the 27th Annual Meeting of the Association for Computational Linguistics, pp. 76-83, 1989.
[13] J. Wu et al., "Automatic Collocation Suggestion in Academic Writing," Proc. of the ACL 2010 Conference Short Papers, pp. 116-119, 2010.
[14] D. Yarowsky, "One sense per collocation," Proc. of the workshop on Human Language Technology, pp. 266-271, 1993.
[15] D. Zaiu Inkpen and G. Hirst, "Acquiring collocations for lexical choice between near-synonyms," Proc. of the ACL-02 Workshop on Unsupervised Lexical Acquisition, pp. 67-76, 2002.
[16] "Academic writing definition," available at: http://reference.yourdictionary.com/word-definitions/definition-of-academic-writing.html
[17] "ACL Anthology," available at: http://aclweb.org/anthology-new/
[18] "Penn Treebank II Tags," available at: http://bulba.sdsu.edu/jeanette/thesis/PennTags.html


Thai Poetry in Machine Translation
An Analysis of Poetry Translation using Statistical Machine Translation

Sajjaporn Waijanya
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected]

Anirach Mingkhwan
Faculty of Industrial and Technology Management
King Mongkut's University of Technology North Bangkok
Prachinburi, Thailand
[email protected]

Abstract—The translation of poetry from its original language into another is very different from general machine translation because a poem is written with prosody. Thai poetry is composed of sets of syllables, and the rhymes spanning stanzas, lines and the text of the poem may not form complete syntactic units. This research focuses on the Google and Bing machine translators and on tuning the prosody with respect to syllables and rhyme. We compare the error rates (in percent) of the standard translators with those of the translators after tuning. Before tuning, the error rate of both translators was 97% per rhyme; after tuning, the percentage of errors decreased to about 60% per rhyme. To evaluate the meaning of the results of both translators, we use the BLEU (Bilingual Evaluation Understudy) metric to compare the candidates against a reference. The BLEU score of Google is 0.287 and that of Bing is 0.215. We conclude that general machine translators cannot provide good results for translating Thai poetry. This research should be the starting point for a new kind of machine translator for Thai poetry. Furthermore, it is a way to bring this Thai language art to a global audience.

Keywords-Thai poetry translation; translation evaluation; Poem machine translator

I. INTRODUCTION

Poetry is one of the fine arts in every country. The French poet Paul Valéry defined poetry as "a language within a language" [1]. Poetry can tell a story, communicate through sound and sight, and simply express feelings. Translating poetry from its original language into other languages is a way to propagate a country's culture to the rest of the world.

Machine translation of poetry is a challenge for researchers and developers [2]. According to Robert Frost's definition, "Poetry is what gets lost in translation." Indeed, it is very difficult to translate poetry from the original language into other languages while keeping the original prosody, because each poetry type has its own specific syntax (prosody): they differ in line length (number of syllables), rhyme, meter and pattern. Many studies have tried to develop poetry machine translators to translate Chinese, Italian, Japanese (Haiku) and Spanish poetry into English, and to translate back from English into the original language, for example the poetry of William Shakespeare. These systems were developed based on statistical machine translation techniques.

As for Thai poetry and Thai poets, Phra Sunthorn Vohara, known as Sunthorn Phu (26 June 1786–1855), is Thailand's best-known royal poet [3]. In 1986, the 200th anniversary of his birth, Sunthorn Phu was honored by UNESCO as a great world poet. His Phra Aphai Mani poems describe a fantastical world where people of all races and religions live and interact together in harmony. In the machine translation area, however, we have not found any research on Thai poetry machine translation. Thai poetry has five major types: Klong, Chann, Khapp, Klonn and Raai.

In this paper we use the Thai prosody "Klon-Pad (Klon Suphap)" and translate it into English. Klon-Pad has rules for syllables, lines (Wak), Baat and Bot, and relational rules for the syllables in each Wak [4]. These rules relate to the beauty of the creative writing and differ between prosody types. Thai poetry has a complex structure of rhymes and syllables, and each line (Wak) of a Thai poem is not necessarily a complete subject-object-verb sentence. Furthermore, some Thai words can have several meanings when translated into English. These are the reasons why it is difficult to develop a Thai poetry machine translator. Our study translates two Bot of Klon-Pad Thai poetry with two statistical machine translators, Google Translator [5] and Bing Translator [6]. We then tune the prosody using a dictionary and compare the resulting English poetry against the Thai prosody in Section 3. We use a case study from "Sakura, TaJ Mahal" [7] by Professor Srisurang Poolthupya as the reference for evaluation with the BLEU (Bilingual Evaluation Understudy) metric in Section 4. Section 5 concludes this paper and points out possible further work in this direction.

II. RELATED WORKS

Although we cannot find any research related to machine translation of Thai poetry into English, there are several research papers related to machine translation of poetry from Chinese, Italian and French into English.

A. A Study of Computer Aided Poem Translation Appreciation [8]

This paper collects three English versions of "Yellow Crane Tower", a poem of the Tang dynasty, applies the available computational linguistic techniques for a quantitative analysis, and uses BLEU metrics for automatic machine translation evaluation.

The conclusion is that the currently available computational linguistic technology is not capable of semantic analysis, which is, without a doubt, a severe drawback for poetry translation evaluation.

B. “Poetic” Statistical Machine Translation: Rhyme and Meter[9]

This is a paper from the Google MT (Machine Translation) lab. Using the Google translator, they implement the ability to produce translations with meter and rhyme for phrase-based MT. They train a baseline phrase-based French-English system using the WMT-09 corpora for training and evaluation, and use a proprietary pronunciation module to provide phonetic representations of English words. The evaluation uses the BLEU score.

The baseline BLEU score of this research is 10.27. This baseline score is quite low, and the system also has a performance problem: it is still slow.

C. Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation[10]

This paper applies unsupervised learning to reveal word-stress patterns in a corpus of raw poetry and uses these word-stress patterns, in addition to rhyme and discourse models, to generate English love poetry. Finally, they translate Italian poetry into English, choosing target realizations that conform to desired rhythmic patterns. For poetry generation, finite-state transducers (FSTs) are used; however, this approach has various problems when the results have to be evaluated by humans. For poetry translation they use phrase-based translation with meter. The advantage of poetry translation over generation is that the source text provides a coherent sequence of propositions and images, allowing the machine to focus on "how to say" instead of "what to say."

III. OUR PROPOSED APPROACH

A. Methodology

1) Machine Translation: Machine translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one natural language to another. MT has two major types: rule-based machine translation and statistical machine translation.

a) Rule-based machine translation: relies on countless built-in linguistic rules and millions of bilingual dictionary entries for each language pair. Rule-based machine translation includes the transfer-based, interlingual and dictionary-based machine translation paradigms. A typical English sentence consists of two major parts: a noun phrase (NP) and a verb phrase (VP).

b) Statistical machine translation: is based on bilingual text corpora. The statistical approach contrasts with the rule-based approaches to machine translation as well as with example-based machine translation.

Both the Google and Bing translators are statistical machine translators. We use the Google and Bing translator APIs to translate the Thai poetry.

2) English Syllable Rules and Phonetics

Syllables are very important in the prosody of Thai poetry. Each Wak has a rule for the number of syllables, and the relation between Wak and Bot is checked through the sounds of the syllables. Every syllable contains a nucleus and an optional coda; this is the part of the syllable used in poetic rhyme, and the part that is lengthened or stressed when a person elongates or stresses a word in speech.

The simplest model of syllable structure [11] divides each syllable into an optional onset, an obligatory nucleus, and an optional coda. Figure 1 shows the structure of a syllable.

Figure 1. structure of syllable

Normally, we can check the relation of rhymes by checking the relation of the sounds in the syllables; this is the domain of phonetics, which tells us how a word is pronounced.

B. An Algorithm and Case Study

1) System Flowchart: To study Thai poetry in machine translation, we use a Thai Klon-Pad poem of two Bot (8 lines) as the input to this process. Figure 2 shows the system flowchart of this process.

Figure 2. System flowchart of Thai poetry machine translation: the Thai poetry is processed by (1) the Language Translator, (2) Poetry Checking and (3) Poetry Prosody Tuning modules, producing poetry in English both with and without tuning.


In Figure 2, we design three modules to translate Thai poetry to English.

a) Language Translator: we use the Google and Bing machine translation APIs to translate the Thai poetry into English.

b) Poetry Checking: this module checks the prosody of the poetry after translation into English. The result of this module is the Thai poetry in English together with the error points of the poetry itself.

c) Poetry Prosody Tuning: after module 2 (Poetry Checking), we collect the error points and tune the poetry using a dictionary and a thesaurus. The expected result of this module is a decreased error percentage.

Case study: we process twenty Klon-Pad Thai poems through the three modules without a professional English translation as reference, and we process one Thai poem from "Sakura, TaJ Mahal" by Professor Srisurang Poolthupya, using its published translation as the reference and the results from the Google and Bing APIs as candidates to calculate the BLEU score.

We describe the three modules in sub-sections 2), 3) and 4) and in Figure 3 and Figure 4.

2) Language Translator Module: This module takes the input Thai poetry (Klon-Pad) in Thai and translates it into English with the Google and Bing machine translation APIs. Figure 3 shows the process of this module.

Figure 3. Language Translator Module

a) Case Study 1, original Thai poetry: the Thai poem "Deuan-chaai" from the book "Oh ja o dòk máai oie".

เปนพนธไมลมลกปลกแสนงาย ชองามเดน“เดอนฉาย”ใจถวล สเหลองบางแดงบางชางโศภน ดอกเฉดฉนสลางบานทงวน “เดอนฉาย”ใชจะแขงแสงเดอนสอง เพยงชอพองเพราะพรงสมจรงนน ยงไดงใสปยเดอนฉายยงพรายพรรณ เกนจะสรรเสกพราคาเยนยอ

b) Case Study 1, translated by the Google API: A herbaceous plant species growing easy.

A good performance ‘in film’ I Tawil. Some red, some yellow Ospin technician.

Eฉidฉin flowers and clean the whole day. ‘Month Movie’ I will be racing in the light shines. Just to name a synonym is real nice there. The fertilizer plant in projecting the profile. Keeps up the quality scale flattery.

c) Case Study 1, translated by the Bing API:

As the cultivation of plant species is very easy. Strong desire to make beautiful films a month name. Yellow and red are really sophin? Flores choetchin prominent pane all day. Last month, the race featured a Moonlight illuminates. Only the name Allied euphonic to life there. Even more sparkling variety fertilizer month projection We are too badly, the excessive praise.

3) Poetry Checking Module: This module processes the Thai poetry in English from the Google and Bing APIs. We analyze the syntax and collect error points against the Klon-Pad prosody for the twenty poems. Figure 4 shows the process of the Poetry Checking Module.

Figure 4. Poetry Checking Module: the Thai poetry in English is checked for line length (number of syllables), rhyme (phonetic relations) and words out of vocabulary; the collected errors are attached as error marks to the output.

a) Check Line Length (Number of Syllables): Thai poetry prescribes the number of syllables per line; each line must contain 7 to 9 syllables. If a line has more than 9 or fewer than 7 syllables, it is marked as a line-length error.
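A rough sketch of this check is given below; the vowel-group syllable counter is an assumption, since the paper does not state how English syllables were counted.

```python
# Sketch of the line-length check: count syllables per line with a simple
# vowel-group heuristic (an assumption; the paper does not describe its counter)
# and flag lines outside the 7-9 syllable range of Klon-Pad.
import re

def count_syllables(word):
    # count groups of consecutive vowels as syllables; drop a silent final 'e'
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def check_line_length(line, low=7, high=9):
    total = sum(count_syllables(w) for w in re.findall(r"[A-Za-z']+", line))
    return total, not (low <= total <= high)   # (syllable count, is_error)

print(check_line_length("A herbaceous plant species growing easy."))  # (11, True)
```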


From Case Study 1, translated by the Google API, we found 7 error lines, as Table I shows.

TABLE I. AN EXAMPLE: THAI POETRY "DEUAN-CHAAI" TRANSLATED BY GOOGLE API

Google version | Syllable count
A herbaceous plant species growing easy. | 11
A good performance ‘in film’ I Tawil. | 10
Some red, some yellow Ospin technician. | 10
Eฉidฉin flowers and clean the whole day. | 9 (a)
‘Month Movie’ I will be racing in the light shines. | 12
Just to name a synonym is real nice there. | 11
The fertilizer plant in projecting the profile. | 13
Keeps up the quality scale flattery. | 10

(a) 9 syllables is not an error under the syllable-count rule.

TABLE II. AN EXAMPLE: THAI POETRY "DEUAN-CHAAI" TRANSLATED BY BING API

Bing version | Syllable count
As the cultivation of plant species is very easy. | 15
Strong desire to make beautiful films a month name. | 12
Yellow and red are really sophin? | 10
Flores choetchin prominent pane all day. | 10
Last month, the race featured a Moonlight illuminates | 13
Only the name Allied euphonic to life there. | 13
Even more sparkling variety fertilizer month projection | 17
We are too badly, the excessive praise. | 10

Tables I and II show the number of syllables in each Wak. With the Google API, only one Wak of this poem has a correct number of syllables. With the Bing API, not a single Wak has the correct number of syllables; all are tagged as errors.

b) Check Rhyme (Phonetic Relations): Thai poetry has rules for rhyme. For Klon-Pad, we present the rhyme rules in Figure 5.

Figure 5. Rhyme Prosody for Thai Poetry Klon-Pad (2 Bot)

Figure 5 shows Thai Klon-Pad poetry of two Bot with 14 rhyme rules, as follows:

• R1: relation of a1 and a2, or a1 and ax
• R2: relation of b1 and b2
• R3: relation of b1 and b3, or b1 and bx
• R4: relation of b2 and b3, or b2 and bx
• R5: relation of b1, b2 and b3, or b1, b2 and bx
• R6: relation of c1 and c2, or c1 and cx
• R7: relation of d1 and d2
• R8: relation of d1 and d3
• R9: relation of d1 and d4, or d1 and dx
• R10: relation of d2 and d3
• R11: relation of d2 and d4, or d2 and dx
• R12: relation of d2, d3 and d4, or d2, d3 and dx
• R13: relation of d3 and d4, or d3 and dx
• R14: relation of d1, d2, d3 and d4, or d1, d2, d3 and dx

In this process, we check the relations of the syllables according to these rules. A relation in Thai poetry means a similar pronunciation, but not a duplicated syllable.

• Example 1: “today” relates to “may”; this is correct under the rhyme rules.
• Example 2: “today” relates to “Monday”; this is an error (duplicate) under the rhyme rules.
• Example 3: “today” relates to “tonight”; this is an error (not related) under the rhyme rules.
• Case Study 1, translated by the Google API: we found errors in 13 rules; only rule R3 is correct.
• Case Study 1, translated by the Bing API: we found errors in 12 rules; rules R1 and R3 are correct.
A sketch of this rhyme-relation check is given after these examples.
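The sketch below reproduces the rhyme-relation check illustrated by the examples above; the tiny phonetic table is a hand-made stand-in (an assumption), since the paper does not name the pronunciation resource it used.

```python
# Sketch of the rhyme-relation check: two words are related if they share a
# rhyming vowel sound but do not repeat the same final syllable.
RHYME_SOUND = {          # word -> (spelling of final syllable, its vowel sound)
    "today":   ("day",   "ay"),   # hand-made entries; a real system would use
    "may":     ("may",   "ay"),   # a pronunciation dictionary instead
    "monday":  ("day",   "ay"),
    "tonight": ("night", "igh"),
}

def rhyme_relation(w1, w2):
    s1, s2 = RHYME_SOUND[w1.lower()], RHYME_SOUND[w2.lower()]
    if s1[0] == s2[0]:               # identical final syllable -> duplicate
        return "error (duplicate)"
    if s1[1] == s2[1]:               # same vowel sound -> valid rhyme relation
        return "correct"
    return "error (not related)"     # different sounds -> no relation

print(rhyme_relation("today", "may"))      # correct
print(rhyme_relation("today", "Monday"))   # error (duplicate)
print(rhyme_relation("today", "tonight"))  # error (not related)
```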

c) Check Words out of Vocabulary: We used a dictionary and a thesaurus to check the meaning of these words. We found that the MT tried to translate such words by writing them as phonemes. Those words may actually have a meaning in Thai, but they are too complex to be translated from Thai to English in only one step; many of them should first be paraphrased from Thai to Thai before being sent to the MT. The words that the MT was not able to translate are called "words out of vocabulary" in this paper, and they are also tagged as errors.

• Case Study 1, translated by the Google API: we found 3 words out of vocabulary: Tawil, Ospin and Eฉidฉin. Tawil means ‘to miss someone’ or ‘to think of someone’; Ospin means ‘beautiful’ and Eฉidฉin means ‘beautiful’.
• Case Study 1, translated by the Bing API: we found 2 words out of vocabulary: sophin and choetchin. Sophin means ‘beautiful’ and choetchin means ‘beautiful’.

4) Poetry Prosody Tuning: To study basic tuning of poetry translated by MT, we applied the approach to the twenty MT-translated poems. Our basic approaches are:

a) Words out of vocabulary: translate them from Thai to Thai before translating with the MT.


b) Number-of-syllable errors: the majority of the errors are lines having more syllables than allowed. We used a dictionary and thesaurus to reduce the length of the sentences with shorter words. Omitting articles such as "a", "an" and "the" was an additional way to decrease the length.

c) Rhyme errors: we tune these errors by using a dictionary and thesaurus to change the words in the rhyme positions.

C. Measurement Design

In this paper we use two major kinds of measurement.

1) Error Percentage: We process the twenty Thai poems and calculate their prosody error percentages as shown in the equations below.

$E_s = \frac{P_s}{T_s} \times 100\%$   (1)

In (1), $E_s$ is the syllable error percentage of a Bot, $P_s$ is the number of syllable errors and $T_s$ is the total number of Wak (8 Waks) in a Bot.

We calculate the rhyme error percentage with equation (2):

$E_r = \frac{P_r}{T_r} \times 100\%$   (2)

In (2), $E_r$ is the rhyme error percentage of a Bot, $P_r$ is the number of rhyme errors and $T_r$ is the total number of rhyme positions (14) in a Bot.

We calculate the error percentage related to wrongly used words with the help of a vocabulary, as in equation (3):

$E_w = \frac{P_w}{T_w} \times 100\%$   (3)

In (3), $E_w$ is the percentage of vocabulary errors per Bot, $P_w$ is the number of wrong words and $T_w$ is the total number of words per Bot (at most 72 words).

Finally, we calculate the average percentage of each error type over all twenty poems. In this way we can create a summary to evaluate the results.
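The three error percentages are straightforward ratios; the sketch below applies (1) and (2) to the aggregate Google counts reported later in Table III.

```python
# Direct implementation of the error percentages in (1)-(3): errors divided by
# the totals (per Bot: 8 Waks, 14 rhyme positions, at most 72 words).
def error_percent(errors, total):
    return 100.0 * errors / total

# Aggregate Google figures from Table III (20 poems = 160 Waks, 280 rhyme positions):
print(round(error_percent(50, 160)))    # 31  (% syllable errors)
print(round(error_percent(271, 280)))   # 97  (% rhyme errors)
```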

2) BLEU Score: BLEU (Bilingual Evaluation Understudy) [12] is an algorithm for evaluating the quality of text which has been machine-translated from one natural language into another. Quality is considered to be the correspondence between a machine's output and that of a human. BLEU uses a modified form of precision to compare a candidate translation against multiple reference translations. The metric modifies simple precision since machine translation systems have been known to generate more words than appear in a reference text. The BLEU score is computed by equation (4):

$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$   (4)

where $p_n$ is the modified n-gram precision (the geometric mean of $p_1, p_2, \ldots, p_N$ is taken through the weighted sum of their logarithms) and BP is the brevity penalty, with c the length of the MT hypothesis (candidate) and r the length of the reference:

$BP = \begin{cases} 1 & \text{if } c > r \\ e^{(1 - r/c)} & \text{if } c \le r \end{cases}$   (5)

In our baseline, we use N = 4 and uniform weights $w_n = 1/N$.
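A minimal sketch of a BLEU computation with N = 4 and uniform weights is given below, using NLTK's implementation as a stand-in (the paper does not say which BLEU tool was used); the sentences are dummy examples.

```python
# Sketch of the BLEU evaluation in (4)-(5) with N = 4 and uniform weights,
# using NLTK's sentence-level BLEU as a stand-in scoring tool.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the cat is on the mat".split()     # dummy reference translation
candidate = "the cat sat on the mat".split()    # dummy MT hypothesis

score = sentence_bleu([reference], candidate,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
```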

IV. EXPERIMENT RESULTS

In our experiments we translated twenty poems with two machine translators, the Google API and the Bing API, both of which are statistical machine translators. In Case Study 2 we use a poem from "Sakura, TaJ Mahal" by Professor Srisurang Poolthupya as the reference and the Google and Bing translations of the Thai poem from this book as candidates; this case study is evaluated with the BLEU score. Finally, we summarize the results in the following parts.

A. Result of Thai poetry in Google and Bing Translator.

In Table III, we show the percentages of the three error types before tuning the results. Most of these errors are rhyme errors, because the MT is not aware of rhyme and meter. The "with Tuning" columns show the percentage of errors after tuning for the three error types.

TABLE III. PERCENT OF LINE-LENGTH ERROR, RHYME ERROR AND WORDS OUT OF VOCABULARY BEFORE AND AFTER TUNING

Items | Google | Google with Tuning | Bing | Bing with Tuning
Total lines: 160
Line-length errors (number of syllables) | 50 | 28 | 87 | 33
Percent of syllable errors | 31% | 18% | 54% | 21%
Total rhymes: 280
Rhyme errors | 271 | 158 | 272 | 147
Percent of rhyme errors | 97% | 56% | 97% | 62%
Total words: 1440
Words out of vocabulary | 50 | 15 | 87 | 22
Percent of words out of vocabulary | 2% | 1% | 3% | 2%

B. Case Study 2: Poetry from "Sakura, TaJ Mahal" and BLEU Evaluation

The original poetry in Thai and English is shown in Table IV.

TABLE IV. THAI POETRY FROM THE BOOK "SAKURA, TAJ MAHAL"

Original Thai poetry:
ขอนอมเกศกราบครกลอนสนทรภ โปรดรบรศษยนขอนบไหว ทานโปรดชวยอานวยพรแตงกลอนใด องกฤษไทยของใหคลองตองกระบวน สอความหมายหลายหลากไมยากเยน ตรงประเดนเปรยบเทยบไดครบถวน จบใจผวจารณอานทงมวล ชวยชชวนใหผอนคลายสบายใจ

Reference (translated by the owner of the poetry):
Sunthon Phu, the great Thai poet,
I pay my respect to you, my guru.
May you grant me the flow of rhyme,
Both in Thai and in English,
That I may express my thoughts,
In a fluent and precise way,
Pleasing the audience and critics,
Inspiring peace and well-being



We use the original English poetry as the reference and compare it with the translations from both Google and Bing. The calculated BLEU scores are shown in Table V.

TABLE V. BLEU SCORES OF THE CANDIDATES FROM THE GOOGLE AND BING TRANSLATORS

Google candidate:
I bow my head respectfully Soonthornphu teachers.
Please get to know us, this makes me respect.
Please help with any poem.
Thai English proficiently to process
Various meanings can be very difficult.
Relevant comparative information.
Reading comprehension and critical mass.
The prospectus provides a relaxed feel.
Per-line BLEU: 0.840, 0.905, 0.000, 0.549, 0.000, 0.000, 0.000, 0.000
Average BLEU score: 0.287

Bing candidate:
We also ketkrap the teacher verse harmonious Mussel
Please recognize this request for a given by the audience.
What a blessing you, help facilitate poem
Fluent in English, Thai, tongkrabuan
Describe the various not complicated
Completely irrelevant comparisons.
Catching someone reviews read all
To help you relax, prospectus
Per-line BLEU: 0.840, 0.000, 0.309, 0.574, 0.000, 0.000, 0.000, 0.000
Average BLEU score: 0.215

V. CONCLUSION AND FUTURE WORK

The generated results show that these machine translators have many problems when translating poetry. MT translates poetry without prosody; it is not able to understand the poetry pattern, difficult original words or the sentences themselves. The reason is the operating principle of MT itself: phrase-based methods are used to translate from the original language into another language, but Thai poetry can be written in incomplete sentences. Moreover, Thai words, especially words in poetry, are very complex, and some of them should be paraphrased from Thai to Thai before being sent to the MT. Poets use such difficult words because of their feeling, the beauty of these words and the beauty of the poetry itself.

The results in this paper show that the error percentage is very high when only MT is used to translate poetry, especially the rhyme error. However, it is possible to decrease the rhyme error rate to about 60% by tuning the MT results. Moreover, the errors caused by words out of vocabulary could be decreased to between 1% and 2% by a preliminary Thai-to-Thai paraphrase of those words.

Regarding the BLEU scores, in this paper we use only one reference for the evaluation. For BLEU, having many references is better than having only a single one; however, it is very difficult to find reliable references for such an evaluation other than verified English translations by the owner of the original Thai poetry.

This paper is the first research dealing with machine translation of Thai poetry into English. In the future, we hope to establish rules and poetry patterns to be used in combination with MT to translate Thai poetry into English while keeping the prosody. The prosody and the meaning of poetry are both very important when translating into other languages, because they present the arts and culture of the country.

ACKNOWLEDGMENT

The poems used for translation in this work were supported by The Contemporary Poet Association and Professor Srisurang Poolthupya. Thanks also go to Google and Bing, the owners of the machine translators used.

REFERENCES

[1] Poetry, How the Language Really Works: The Fundamentals of Critical Reading and Effective Writing, [online], Available: http://www.criticalreading.com/poetry.htm
[2] Ylva Mazetti, Poetry Is What Gets Lost In Translation, [online], Available: http://www.squidproject.net/pdf/09_Mazetti_Poetry.pdf
[3] P.E.N. International Thailand-Centre Under the Royal Patronage of H.M. The King, Anusorn Sunthorn Phu 200 years, Amarin printing, 2529, ISBN 974-87416-1-3.
[4] Tumtavitikul, Apiluck, "Thai Poetry: A Metrical Analysis," Essays in Tai Linguistics, M.R. Kalaya Tingsabadh and Arthur S. Abramson, eds., Bangkok: Chulalongkorn University, 2001, pp. 29-40.
[5] Google Code, Google Translate API v2, [online], Available: http://code.google.com/apis/language/translate/overview.html
[6] Bing Translator, [online], Available: http://www.microsofttranslator.com/
[7] Srisurang Poolthupya, Sakura Taj Mahal, Bangkok, Thailand, 2010, pp. 1-2.
[8] Lixin Wang, Dan Yang, Junguo Zhu, "A Study of Computer Aided Poem Translation Appreciation," Second International Symposium on Knowledge Acquisition and Modeling, 2009.
[9] Dmitriy Genzel, Jakob Uszkoreit, Franz Och, ""Poetic" Statistical Machine Translation: Rhyme and Meter," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, USA, 2010, pp. 158-166.
[10] Erica Greene, Tugba Bodrumlu, Kevin Knight, "Automatic Analysis of Rhythmic Poetry with Applications to Generation and Translation," Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, MIT, Massachusetts, USA, 9-11 October 2010, pp. 524-533.
[11] Syllable rule, [online], Available: http://www.phonicsontheweb.com/syllables.php
[12] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu, "BLEU: a Method for Automatic Evaluation of Machine Translation," Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 311-318.
[13] L. Balasundararaman, S. Ishwar, S.K. Ravindranath, "Context Free Grammar for Natural Language Constructs: an Implementation for Venpa Class of Tamil Poetry," in Proceedings of Tamil Internet, India, 2003.
[14] Martin Tsan Wong and Andy Hon Wai Chun, "Automatic Haiku Generation Using VSM," 7th WSEAS Int. Conf. on Applied Computer & Applied Computational Science (ACACOS '08), Hangzhou, China, April 6-8, 2008.


Keyword Recommendation for Academic Publications using Flexible N-gram

Rugpong Grachangpun, Maleerat Sodanil
Faculty of Information Technology
King Mongkut's University of Technology North Bangkok
Bangkok, Thailand
[email protected], [email protected]

Choochart Haruechaiyasak
Human Language Technology (HLT) Laboratory
National Electronics and Computer Technology Center
Pathumthani, Thailand
[email protected]

Abstract—This paper presents a method of keyword/keyphrase recommendation for academic literature. The proposed method generates phrases of flexible length (a flexible n-gram of keywords/keyphrases) to increase the chance of accurate and descriptive results. Several techniques are applied, such as part-of-speech tagging (POS tagging), term co-occurrence measured by the correlation coefficient, term frequency-inverse document frequency (TF-IDF) and, finally, weighting techniques. The results of the experiment were found to be very interesting. Moreover, comparisons against other keyword/keyphrase extraction algorithms were also investigated by the authors.

Keywords-keywords recommendation; flexible N-gram; information retrieval; POS Tagging

I. INTRODUCTION

At present, in the age of information technology, a great deal of academic literature from many fields is published frequently and offered to readers via the Internet. Thus, searching for a desired document can sometimes be difficult due to the large volume of literature. If there were reliable ways to generate accurate keywords and keyphrases that show the main idea or overall picture of a document, it would be easier for readers to select the particular documents they need. Keywords and keyphrases (combinations of multiple words) quickly tell readers what a document is fundamentally about. They not only briefly convey the main idea of the document but also help people who work with documents professionally. For example, a librarian may take a long time to group an enormous number of documents and arrange them on a shelf or in a database; keywords/keyphrases can be used as part of a tool to classify those documents into groups.

Automatically extracting keywords/keyphrases from a document is challenging because it involves natural language and some other issues that we cover later. Over the last decade, there have been several studies proposing methods of keyword/keyphrase annotation and extraction from different written media such as web pages, political records, etc. Several models have been used to cope with these tasks, such as fuzzy logic and neural network models [1,2,3,6]. However, these two kinds of models suffer from the same disadvantages: lack of speed, difficulty of reuse, and accuracy that is quite low compared to statistical models. Statistical models are also used widely in this area; mathematical and statistical term properties, such as frequency and location, provide reliable and accurate results.

This study relies not only on statistical techniques to solve this problem but also combines them with a natural language processing technique called part-of-speech tagging (POST) [4] in order to filter the answer set generated by our algorithm. The main objective of this experiment is to extract a set of keywords/keyphrases whose length is not firmly fixed. This paper is arranged as follows: Section II describes related work that contributed to our experiment, Section III describes the proposed framework, techniques and methodology used in this experiment, Section IV focuses on the results and evaluation, and finally Section V closes with a conclusion and future work.

II. RELATED WORKS

In this section, previous related work that was helpful to our experiment is described.

A. Term Frequency Inverse Document Frequency (TF-IDF)

Traditionally, TF-IDF is used to measure a term's importance by focusing on the frequency of the term in a topical document and in the corpus. TF-IDF can be computed by (1) [5,7]:

$\text{TF-IDF} = \frac{fre(p,d)}{n} \times \left(-\log_2 \frac{df_p}{N}\right)$   (1)

Herein, fre(p,d) is the number of times phrase p occurs in document d; n is the number of words in document d; $df_p$ is the number of documents in our corpus that contain the topical phrase; and N is the total number of documents in our corpus.
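A minimal sketch of (1) follows; the function and argument names are illustrative, not taken from the paper's code.

```python
# Minimal sketch of (1): TF-IDF of a phrase p in document d over a corpus.
import math

def tf_idf(freq_p_in_d, n_words_in_d, docs_containing_p, n_docs):
    tf = freq_p_in_d / n_words_in_d                 # term frequency in the document
    idf = -math.log2(docs_containing_p / n_docs)    # inverse document frequency
    return tf * idf

# e.g. a phrase occurring 3 times in a 200-word document, found in 5 of 100 documents
print(round(tf_idf(3, 200, 5, 100), 4))   # 0.0648
```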

B. Correlation Coefficient (r)

The correlation coefficient is a statistical measure of the degree of relationship between a pair of objects or variables, here a pair of terms. The correlation coefficient, denoted R or r, can be computed by (2):

$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{n\sum x^2 - (\sum x)^2}\;\sqrt{n\sum y^2 - (\sum y)^2}}$   (2)

where n is the total number of word pairs, and x and y are the counts of elements x and y, respectively. The value of r lies in the range -1 to 1. When r is close to 1, the relationship between the elements is very tight. When r is positive, x and y have a linearly proportional positive relationship; for example, y increases when x increases. When r is negative, x and y have a linear negative relationship; for example, y decreases when x increases. When r is 0 or close to it, x and y are not related to each other.

Usually, in statistical theory, the relationship between two variables is considered strong when r is greater than 0.8 and weak when it is less than 0.5 [7]. In our experiment, the correlation coefficient is used to measure the degree of relationship in order to form a correct phrase. Section IV describes the best value of r and where it is applied in our experiment.
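A minimal sketch of (2) on made-up per-document counts of two terms is given below.

```python
# Sketch of the correlation coefficient in (2) computed from raw counts.
import math

def correlation(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx, syy = sum(x * x for x in xs), sum(y * y for y in ys)
    denom = math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2)
    return (n * sxy - sx * sy) / denom

# illustrative per-document counts of two terms across five documents
print(round(correlation([1, 2, 0, 3, 1], [1, 3, 0, 2, 1]), 3))   # 0.808
```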

C. Part-of-Speech Tagging (POST)

Part-of-speech tagging (POST) is a technique or process used in natural language processing (NLP) [4]. POST is also called grammatical tagging or word-category disambiguation [8]. The process is used to identify word functions such as noun, verb, adjective, adverb, determiner, and so on [9]. Knowing word functions helps us to form machine-generated phrases that are accurate, readable and understandable by a human.

There are two different kinds of tagger [4]. The first is a stochastic tagger, which uses statistical techniques; the second is a rule-based tagger, which focuses on the surrounding words to find a tag and assigns a word function to each term. The rule-based tagger seems to be the better choice, because a word can have many functions, which also affects its meaning. In this study, rule-based tagging is applied in our experiment.

In this study, the algorithm extracts part-of-speech patterns (POS patterns) by tagging all keywords from the training documents with the POS tagger; we then collect those patterns and store them in our repository.

D. Performance Measurement

The performance is measured with three parameters: precision (P), recall (R) and the F score (harmonic mean). These parameters are widely used in the study of information extraction. Precision tells us how many of the extracted words/phrases are correct, recall tells us how many of the relevant words/phrases were found, and the F score measures the balance between precision and recall. All of them can be calculated by the following equations:

$P = \frac{\#\text{Correct Extracted Words/Phrases}}{\#\text{Retrieved Words/Phrases}}$   (3)

$R = \frac{\#\text{Correct Extracted Words/Phrases}}{\#\text{Relevant Words/Phrases}}$   (4)

$F = \frac{2PR}{P + R}$   (5)
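A minimal sketch of (3)-(5) on illustrative keyphrase sets follows.

```python
# Sketch of (3)-(5): precision, recall and F score for extracted keyphrases
# (the sets below are illustrative, not experiment data).
def prf(extracted, relevant):
    correct = len(set(extracted) & set(relevant))
    p = correct / len(extracted)
    r = correct / len(relevant)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

extracted = ["information retrieval", "web search", "ranking model"]
relevant = ["information retrieval", "search engine", "web search", "indexing"]
p, r, f = prf(extracted, relevant)
print(round(p, 3), round(r, 3), round(f, 3))   # 0.667 0.5 0.571
```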

III. PROPOSED FRAMEWORK

In this experiment, the algorithm has two parts, as follows:

• a training phase, which creates the N-gram language model and extracts the POS patterns;

• a testing phase, which extracts candidate phrases from a new document and calculates the degree of phrase importance.

FIGURE 1. Proposed framework

A. Preprocessing

Our experiment focuses on academic literature; thus, the source of the raw documents must come from this area.

All the data used comes from academic papers downloaded from IEEE and SpringerLink. After the documents are collected, they must be transformed into ".txt" format, which is the task of the preprocessing step. Normally, the raw documents are in ".pdf" format and need to be converted into ".txt" format for convenient processing. Raw documents are sectioned into several major parts such as title, abstract, keywords and conclusion. Three of them are required in this experiment: title, abstract and conclusion. Thus, all words from those sections are collected.

In the preprocessing step, the two units are very similar: they only extract the raw content text from the three sections, and nothing more is done in this training phase.

B. N-Gram Process

In this paper, we focus on two major techniques. Uni-gram extraction is a simple step that extracts all words that do not appear in the stopword list. Secondly, the bi-gram list extracts all possible phrases that do not begin or end with a stopword. In each list, additional fields are also added in order to increase the speed of processing: the number of documents that contain the word/phrase and the number of times the word/phrase occurs in the corpus.

The criteria for tokenizing possible words/phrases are:

• A pair of words is not considered a phrase when the words are divided by a punctuation mark; those marks are the full stop, comma, colon, semi-colon, etc.

• Digits and numbers are ignored; any number is treated as a stopword.

C. Candidate Phrases Extraction

All phrases in a converted document are extracted as bi-grams. The bi-gram tokenization process is similar to the N-gram processing of the training phase, but some of the criteria are different: words/phrases that appear only once are ignored [5]; each such word is removed and replaced by a punctuation mark so that the remaining phrases are tokenized correctly. Tokenization across punctuation marks is not allowed. A sketch of this extraction is given below.
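The sketch below illustrates the bi-gram candidate extraction just described; the stopword list is a tiny illustrative subset, and the single-occurrence filter is omitted for brevity.

```python
# Sketch of bi-gram candidate extraction: split on punctuation, ignore digits,
# and keep only bi-grams that neither start nor end with a stopword.
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "are", "is", "for", "most"}

def bigram_candidates(text):
    candidates = []
    for segment in re.split(r"[.,;:]", text.lower()):     # no phrases across punctuation
        words = re.findall(r"[a-z][a-z-]*", segment)       # digits are ignored
        for w1, w2 in zip(words, words[1:]):
            if w1 not in STOPWORDS and w2 not in STOPWORDS:
                candidates.append(f"{w1} {w2}")
    return candidates

print(bigram_candidates("Web search engines are the most visible IR application."))
# ['web search', 'search engines', 'visible ir', 'ir application']
```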

The reason for tokenizing all phrases as bi-grams is that most keywords/keyphrases are already composed of bi-grams. Another reason is that a phrase tokenized as a tri-gram is inconvenient to shorten or lengthen, as is one tokenized as a uni-gram. From our literature review, the ratio of uni-grams to bi-grams to tri-grams is 1:6:3. Table I shows an example of tokenization.

TABLE I. EXAMPLE OF TOKENIZATION

Original: Many universities and public libraries use IR systems to provide access to books, journals and other documents. Web search engines are the most visible IR application.
Uni-gram: universities, public, libraries, IR, system, books, journals, document, web, search, engines, visible, IR and applications
Bi-gram: public library, IR systems, web search, search engine, visible IR, IR applications

D. Weight Calculation and Ranking

Weight calculation is used to score each phrase in our list which is called “Rank”. While the experiment was being conducted, (1) was applied to indentify words/phrases in the document but it did not generate well-enough results. Thus, (1) should be modified to gain a better end result.

In this experiment, two parameters are added: the Area Weight (AW) and the word frequency f, where f is the term/phrase frequency in a document. Thus, (1) is modified as (6):

DI = ( (fre(p,d))² / n ) × IDF × AW    (6)

where DI is the degree of importance of each phrase; fre(p,d) is the frequency of phrase p in document d; dfp is the number of documents in the corpus that contain phrase p; AW is an integer mark assigned to each section of the document; and IDF stands for Inverse Document Frequency, which can be computed by the second term of (1).

AW is assigned arbitrarily and adjusted until it generates the best result. The strategy for assigning the weighted score is to focus intuitively on both the physical and the logical characteristics of each document section, such as the size of its area and the likelihood that it contains important information. For instance, the area sizes of the sections covered in this experiment are obviously different: the title is the smallest but certainly contains the most important information of a paper. The experiment gives the best result when the Title, Abstract and Conclusion sections are weighted at 7, 2 and 1 respectively.

The DI of each phrase is computed in each section and then, finally, the average value is computed. For example, if the phrase "information retrieval" occurs 1, 3 and 2 times in the three sections of a document, its DI is given by (7):

DI = IDF × [ (1² × 7) + (3² × 2) + (2² × 1) ] / (6 × 3)    (7)

The divisor 3 is the number of document sections considered in this experiment.
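As a hedged sketch of Eqs. (6)-(7), the following fragment computes DI for a phrase from its per-section frequencies and a supplied IDF value; it assumes, as in the worked example, that n is the phrase's total frequency in the document, which the text does not state explicitly:

AREA_WEIGHTS = {"title": 7, "abstract": 2, "conclusion": 1}   # the best weights found in the experiment

def degree_of_importance(section_freqs, idf):
    # section_freqs: e.g. {"title": 1, "abstract": 3, "conclusion": 2}
    # idf:           inverse document frequency of the phrase (second term of Eq. (1))
    # Assumption: n is the phrase's total frequency in the document, as in the worked example.
    n = sum(section_freqs.values()) or 1
    per_section = [(f ** 2) / n * idf * AREA_WEIGHTS[s] for s, f in section_freqs.items()]
    return sum(per_section) / len(AREA_WEIGHTS)               # average over the three sections

# The worked example of Eq. (7): frequencies 1, 3, 2 with IDF kept symbolic (idf = 1)
# degree_of_importance({"title": 1, "abstract": 3, "conclusion": 2}, idf=1.0)
#   == ((1 * 7) + (9 * 2) + (4 * 1)) / (6 * 3)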

E. Phrase Filtration

This process filters the phrases before the final result is released as the keywords/keyphrases recommended by the algorithm. A result from the previous step may be lengthened to a tri-gram (the maximum length in this experiment). Since all phrases in the previous processes are computed as bi-grams, some phrases might be incomplete because a word may be missing; when that word is added to such a phrase, the phrase becomes more descriptive.

For example, an expected phrase is "natural language processing" but the phrases in our list are "natural language" and "language processing". In this case, we may concatenate these phrases by using "language" as a joint word. In the example just mentioned the technique works properly, but it may not for other pairs of phrases. The correlation coefficient, a statistical technique, is therefore applied to decide whether to concatenate two phrases that share an identical joint word, rather than the approach of [12].
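A minimal sketch of this concatenation step, assuming a corpus-level correlation function corr(w1, w2) implementing Eq. (2) is available (not shown), could look as follows; it is an illustration, not the authors' implementation:

def concatenate_bigrams(bigram_a, bigram_b, corr, threshold=0.2):
    # Merge e.g. "natural language" + "language processing" into
    # "natural language processing" when both word pairs are sufficiently correlated.
    # corr(w1, w2) is assumed to return the correlation coefficient r of Eq. (2)
    # computed over the corpus (not shown here).
    a1, a2 = bigram_a.split()
    b1, b2 = bigram_b.split()
    if a2 != b1:                                   # the two phrases must share a joint word
        return None
    if corr(a1, a2) > threshold and corr(b1, b2) > threshold:
        return " ".join((a1, a2, b2))              # tri-gram candidate
    return None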

After that, some improper phrases may still remain due to improper word arrangement. Thus, the POS patterns obtained in the previous process are applied: we compare the function of each word in each phrase generated by our algorithm against the patterns extracted from the corpus.

We also have to consider subsets of a word function. For example, the phrase "multiple compound sentences" in the corpus has the POS pattern "JJ, NN, NNS" (adjective, singular noun, plural noun), while our algorithm generates


"multiple compound sentence", whose pattern is "JJ, NN, NN" (adjective, singular noun, singular noun); since NN and NNS belong to the same noun family, those patterns are considered identical. If the positions of the word functions are swapped, the phrase is discarded.
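A hedged sketch of this POS-pattern comparison, collapsing Penn Treebank tags to their base families so that NN and NNS are treated as the same word function, might look like this (illustrative only):

def tag_family(tag):
    # collapse Penn Treebank tags to their base family: NN/NNS -> NN, JJ/JJR/JJS -> JJ, ...
    for base in ("NN", "JJ", "VB", "RB"):
        if tag.startswith(base):
            return base
    return tag

def patterns_match(candidate_tags, corpus_tags):
    # True when the two POS patterns are identical up to tag families,
    # e.g. (JJ, NN, NN) matches (JJ, NN, NNS); swapped positions do not match.
    if len(candidate_tags) != len(corpus_tags):
        return False
    return all(tag_family(a) == tag_family(b) for a, b in zip(candidate_tags, corpus_tags))

# patterns_match(("JJ", "NN", "NN"), ("JJ", "NN", "NNS"))  -> True
# patterns_match(("NN", "JJ", "NN"), ("JJ", "NN", "NNS"))  -> False (positions swapped)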

IV. EXPERIMENTAL RESULT

In our experiment, the algorithm's behavior and results were observed, and the best value of r was found to be 0.2. This means that if the r of "natural" and "language", and also the r of "language" and "processing" (from the example described above), are both greater than 0.2, those phrases are concatenated as "natural language processing". The preset r value in this experiment is lower than the general statistical threshold proposed in Section II B, which could be due to the data set in the corpus being scattered. For instance, the phrase "natural language" occurs 31 times in 18 documents, while "natural" occurs 28 times in 27 documents and "language" occurs 203 times in 53 documents.

This algorithm was trained on 400 documents in the corpus and was applied to 30 academic papers randomly downloaded from the same sources as the training data. All papers were also converted to ".txt" format before processing.

Our algorithm is measured at different numbers of extracted phrases: at 1-10, 15, and "at best" ("at best" means all phrases in the proposed list are considered; precision and recall are calculated from the last matched phrase in the proposed answer list). Fig. 2 shows the Recall and Precision obtained when both the Correlation Coefficient (CC) and the POS pattern (POS) are applied, compared with applying the Correlation Coefficient alone.

Figure 2. Performance comparison with and without POST.

R_1, P_1 represents Recall and Precision when the Correlation Coefficient Technique is applied. R_2, P_2 represents Recall and Precision when the Correlation Coefficient and POST are applied.

Figure 3. Detailed measurement of Precision and Recall

In this paper, the author also determines the best number of phrases that the algorithm should propose (the "maximize spot"). The best number of phrases is calculated from the F score. Considering Figs. 3 and 4, the algorithm should propose no fewer than 5 phrases to the end-user.

Figure 4. Maximize Spot

Finally, our algorithm is compared to a standard method of keyword extraction, TF-IDF, meaning that the degree of importance of each term is calculated by (1). The result is shown in Table II.

TABLE II. PERFORMANCE COMPARISON

Average Performance (%)     Recall    Precision    F score
Standard TF-IDF             47.53     14.37        22.07
Proposed method             60.11     39.62        47.83

Tables III and IV show examples of keyphrases produced by the proposed method.

[Data of Fig. 2 — Recall and Precision (%) at 5, 10, 15 proposed phrases and "at best": R_1 = 29.26, 42.41, 47.57, 59.28; P_1 = 20.77, 15.77, 12.05, 34.19; R_2 = 40.97, 49.82, 56.14, 60.11; P_2 = 28.46, 18.46, 14.36, 39.62]

[Data of Fig. 3 — Precision and Recall (%) when CC and POS are applied, for 1 to 7 proposed phrases: P = 57.69, 40.38, 32.05, 30.77, 28.46, 25.64, 23.08; R = 17.69, 25.10, 28.43, 35.20, 40.97, 43.66, 45.26]

[Data of Fig. 4 — Maximize spot, F-score (%) for 1 to 7 proposed phrases: 27.08, 30.96, 30.13, 32.83, 33.59, 32.31, 30.57]


TABLE III. EXAMPLE OF PROPOSED RESULT 1

Literature title: Concept Detection and Keyframe Extraction Using a Visual Thesaurus

Original keywords: Concept Detection, Keyframe Extraction, Visual thesaurus, region types

Top 7 keywords proposed in bi-gram: Visual Thesaurus; model vector; Concept Detection; region types; keyframe extraction; detection performance; vector representation

Top 3 keywords proposed in tri-gram: Concept detection performance; shot detection scheme; exploiting latent relations

TABLE IV. EXAMPLE OF PROPOSED RESULT 2

Literature title: An Improved Keyword Extraction Method Using Graph Based Random Walk Model

Original keywords: keyword extraction, random walk model, mutual information, term position, information gain

Top 7 keywords proposed in bi-gram: Keyword Extraction; improved keyword; information gain; extraction method; method using; mutual information; extraction using

Top 3 keywords proposed in tri-gram: Random walk model; mutual information gain; using inspect benchmark

V. DISCUSSION AND FUTURE WORK

This paper proposes an algorithm that is able to extract phrases matching more than half of the original keyphrases assigned by the authors, so the result is considered acceptable. Moreover, it uses a comparatively small training set to achieve this result, with an F score of 47.83% (Table II). Furthermore, this study focuses on applying the method to develop an application for real situations, so the proposed model is built as simply as possible. The only disadvantage of the corpus is that the data set is not clustered in a single narrow domain but a broad one; on the other hand, the broad domain of the data set makes the training set rather natural and close to ordinary language, which is its biggest advantage.

Since this experiment is still in progress, some tasks need to be revised. The authors plan to expand the corpus in terms of the number of documents in the training set, and to cover other fields of educational literature, in order to observe a wider range of end results.

ACKNOWLEDGMENT

The authors would like to thank Asst. Prof. Dr. Supot Nitsuwat for sharing good ideas and his consultations, Dr. Gareth Clayton and Dr. Sageemas Na Wichian for statistical techniques and their experience, Mr. Ian Barber for contributing the POST tool, and Acting Sub Lt. Nareudon Khajohnvudtitagoon for his development techniques. Last but not least, the Faculty of Information Technology at King Mongkut's University of Technology.

REFERENCES

[1] Md. R. Islam and Md. R. Islam, "An Improved Keyword Extraction Method Using Graph Based Random Walk Model," 11th Int. Conference on Computer and Information Technology, pp. 225-299, 2008.

[2] Z. Qingguo and Z. Chengzh, "Automatic Chinese Keyword Extraction Based on KNN for Implicit Subject Extraction," Int. Symposium on Knowledge Acquisition and Modeling, pp. 689-602, 2008.

[3] H. Liyanage and G.E.M.D.C. Bandara, "Macro-Clustering: improved information retrieval using fuzzy logic," Proc. of the 2004 IEEE Int. Symposium on Intelligent Control, pp. 413-418, 2004.

[4] E. Brill, "A simple rule-based part of speech tagger," Proc. of the Third Conference on Applied Natural Language Processing, pp. 152-155, 1992.

[5] I. H. Witten, G. W. Paynter, E. Frank, C. Gutwin and C. G. Nevill-Manning, "KEA: Practical Automatic Keyphrase Extraction," Proc. of the Fourth ACM Conference on Digital Libraries, ACM, 1999.

[6] K. Sarkar, M. Nasipuri and S. Ghose, "A New Approach to Keyphrase Extraction Using Neural Networks," Int. Journal of Computer Science Issues, vol. 7, issue 2, 2010.

[7] Mathbits.com, Correlation Coefficient. Available at: http://mathbits.com/mathbits/tisection/statistics2/correlation.htm

[8] Wikipedia, Part-of-Speech Tagging. Available at: http://en.wikipedia.org/wiki/Part-of-speech_tagging

[9] University of Pennsylvania, "Alphabetical list of part-of-speech tags used in the Penn Treebank project." Available at: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

[10] X. Hu and B. Wu, "Automatic Keyword Extraction Using Linguistic Features," Sixth IEEE International Conference on Data Mining - Workshops (ICDMW '06), pp. 19-23, 2006.


Using Example-based Machine Translation for English – Vietnamese Translation

Minh Quang Nguyen, Dang Hung Tran and Thi Anh Le Pham
Software Engineering Department, Faculty of Information Technology, Hanoi National University of Education
quangnm, hungtd, [email protected]

Abstract

Recently, there have been significant advances in machine translation in Vietnam. Most approaches are based on a combination of grammar analysis and a statistics-based or rule-based method. However, their results are still far from human expectation. In this paper, we introduce a new approach which uses example-based machine translation. The idea of this method is to use a set of aligned Vietnamese-English sentence pairs and an algorithm that retrieves, from this data resource, the English sentence most similar to the input sentence. Then, we produce a translation from the retrieved sentence. We applied the method to English-Vietnamese translation using a bilingual corpus of 6000 sentence pairs. The system achieves feasible translation ability and also good performance. Compared to other methods applied to English-Vietnamese translation, our method can obtain higher translation quality.

I. Introduction

Machine translation has been studied and developed for many decades. For Vietnamese, there are some projects which have proposed several approaches. Most approaches use a system based on analyzing and reflecting grammar structure (e.g. rule-based and corpus-based approaches). Among them, the rule-based approach, with a carefully built bilingual corpus and grammatical rules, is the current trend in this field [7].

One of the biggest difficulties in rule-based translation, as in other methods, is data resources. An important resource required for translation is the thesaurus, which needs a lot of effort and work to build [9]. The existing data sets, however, do not yet meet requirements. In addition, most traditional methods also require knowledge about the languages involved, so it takes time to build a system for new languages [5, 6]. Example-Based Machine Translation (EBMT) is a newer method which relies on large corpora and tries, to some extent, to reject traditional linguistic notions [5].

EBMT systems are attractive in that their output translations should be more sensitive to context than those of rule-based systems, i.e. of higher quality in appropriateness and idiomaticity. Moreover, EBMT requires a minimum of prior knowledge beyond the corpora which make up the example set, and can therefore be adapted quickly to many language pairs [5].

EBMT has been applied successfully in Japan and America in some specific fields [1]. In Japan, a system achieving high-quality translation and efficient processing was built for travel expressions [1]. For Vietnamese, however, there is no research following this method, although for English-Vietnamese translation this method does not require many resources or much linguistic knowledge. The only significant data resource is an English-Vietnamese corpus at Ho Chi Minh National University, with 40,000 pairs of sentences (in Vietnamese and English) and about 5,500,000 words [8].

We already have an English thesaurus and an English-Vietnamese dictionary. For the set of aligned corpora, we have prepared 5,500 items for this research.

In this paper, we use EBMT knowledge to build a system for English-Vietnamese translation. We apply the graph-based method of [1] to the Vietnamese language. In this paradigm, we have a set in which each item is a pair of two sentences: one in the source language and one in the target language. From an input sentence, we retrieve from the set the item whose sentence is most similar to the input. Finally, from the example and the input sentence, we make adjustments to produce a final sentence in the target language. Unfortunately, we do not have a Vietnamese thesaurus, so we propose some solutions to this problem. In addition, this paper proposes a method to adapt the example sentence to provide the final translation.

1. EBMT overview: There are 3 components in a conventional example-based system:

- Matching fragment component.
- Word alignment.
- Combination of the input and the retrieved example sentence to produce the final target sentence.

For example:


(1) He buys a book on international politics.
(2) a. He buys a notebook.
    b. Anh ấy mua một quyển sách.
(3) Anh ấy mua một quyển sách về chính trị thế giới.

With the input sentence (1), the translation (3) can be provided by matching and adapting from (2a, b).

One of the main advantages of this method is that we can easily improve translation quality by widening the example set: the more items added, the better the results. It is especially useful in a specific field, because the forms of the sentences in such fields are limited. For example, we can use it to translate product manuals, weather forecasts, or medical diagnoses.

The difficulty of applying EBMT to Vietnamese is that there is no WordNet for Vietnamese, so we propose some new solutions to this problem.

We build a system with 3 steps:

- Form the set of example sentences; the result is a set of graphs.
- Retrieve the example sentence most similar to the input sentence. From an input sentence, using the edit-distance measure, the system finds the sentences most similar to it. Edit distance is used for fast approximate matching between sentences: the smaller the distance, the greater the similarity between sentences.
- Adjust the gap between the example and the input.

2. Data resources: We use 3 data resources:

- Bilingual corpus: the set of example sentences. This set includes pairs of sentences, each sentence represented as a word sequence. Increasing the size of the set will improve translation quality.

- The Thesaurus: A list of words showing similarities, differences, dependencies, and other relationships to each other.

- Bilingual Dictionary: We used the popular English Vietnamese dictionary file provided by Socbay Company.

3. Building the graph of the example set. The sentences are word sequences. We divide the words into 2 groups:

- Functional words: functional words (or grammatical words) are words that have little lexical meaning or have ambiguous meaning, but instead serve to express grammatical relationships with other words within a sentence, or specify the attitude or mood of the speaker.

- Content words: words that are not function words are called content words (or open-class words or lexical words); these include nouns, verbs, adjectives, and most adverbs, although some adverbs are function words (e.g., then and why).

We classify the example set into subsets; each subset includes sentences with the same number of content words and the same number of functional words.

Based on this division, we build a group of graphs, the word graphs (a minimal sketch of the structure follows this list):

- They are directed graphs with a start node and a goal node. They include nodes and edges; an edge is labeled with a word, and each edge has its own source node and destination node.
- Each graph represents one subset, i.e. the sentences with the same total of content words and the same total of functional words.
- Each path from the start node to the goal node represents a candidate sentence. To optimize the system, we have to minimize the size of the word graph: common word sequences in different sentences use the same edge.
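A minimal sketch of such a word graph, simplified to share only common prefixes (the paper additionally minimizes the automaton so that common suffixes also share edges [3, 4]), is shown below; class and variable names are hypothetical:

from collections import defaultdict

class WordGraph:
    # Nodes are integers; node 0 is the start node. Edges are labeled with words.
    # Sentences sharing a word prefix reuse the same edges, which keeps the graph small;
    # a full implementation would also merge common suffixes via automaton minimization.
    def __init__(self):
        self.edges = defaultdict(dict)    # edges[node][word] -> next node
        self.next_id = 1
        self.goal_nodes = set()

    def add_sentence(self, words):
        node = 0
        for w in words:
            if w not in self.edges[node]:
                self.edges[node][w] = self.next_id
                self.next_id += 1
            node = self.edges[node][w]
        self.goal_nodes.add(node)         # the last node of the sentence acts as a goal node

# g = WordGraph()
# g.add_sentence("she closes the door".split())
# g.add_sentence("she closes the window".split())   # shares the edges for "she closes the"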

Figure 1: Example of Word Graph

The word graphs have to be optimized to have the minimum number of nodes; we use the method of converting finite state automata [3, 4]. After preparing all resources, we execute the 2 main steps of the method: example retrieval and adaptation.

4. Example retrieval: We use the A* search algorithm to retrieve the most similar sentences from the word graph. The result of matching two word sequences is a sequence of substitutions, deletions and insertions. The search process in a word graph finds the least distance between the input sentence and all the candidate sentences represented in the graph.

As a result, the matching sequence of a path is represented as records, each consisting of a label and one or two words:

Exact match: E(word)
Substitution: S(word, word)
Deletion: D(word)
Insertion: I(word)

For example: Matching sequence between the input sentence We close the door and the example She closes the teary eyes is:


S(“She”, “We”) – E(“close”) – E(“the”) – I(“teary”) – S(“eyes”).

The problem now is to pick the sentence with the least distance to the input sentence. We first compare the number of E-records in each matching sequence, then the number of S-records, and so on.
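For illustration, the following sketch computes a word-level edit-distance alignment and emits the E/S/D/I records described above; it works on flat word sequences rather than the word graph with A* search, so it is only a simplified stand-in for the retrieval step:

def align(input_words, example_words):
    # Word-level edit distance with backtrace, emitting the records used above:
    # E(word) exact match, S(example, input) substitution,
    # D(word) example word with no counterpart in the input,
    # I(word) input word with no counterpart in the example.
    n, m = len(input_words), len(example_words)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if input_words[i - 1] == example_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # input word unmatched -> I
                          d[i][j - 1] + 1,          # example word unmatched -> D
                          d[i - 1][j - 1] + cost)   # E or S
    records, i, j = [], n, m
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if input_words[i - 1] == example_words[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            records.append(("E", input_words[i - 1]) if cost == 0
                           else ("S", example_words[j - 1], input_words[i - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            records.append(("D", example_words[j - 1]))
            j -= 1
        else:
            records.append(("I", input_words[i - 1]))
            i -= 1
    return list(reversed(records))

# align("this nice book has been bought".split(),
#       "this computer has been bought by him".split())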

5. Adaptation: From the retrieved example, we adapt it to provide the final sentence in the target language for the input sentence, by insertions, substitutions and deletions. To find the meanings of English words, we use a morphological method.

5.1. Substitution, deletion and exact match: We find the right position in the final sentence for each word handled by substitution, deletion and exact match. For deletion we do nothing, but for substitution and insertion records we have to find the meaning of the word, which raises two problems:

- A word can have several different meanings; which one should be chosen?
- Words in the dictionary are all in base (infinitive) form, while they can appear in many other forms in the input sentence.

We solve this problem as follows. Firstly, we find the type of the word (noun, verb, adverb, ...) in the sentence; we use the Penn Treebank tagging system to specify the form of each word. Secondly, based on the form of the word, we look the word up in the dictionary.

If the word is a plural noun (NNS):
- If it ends with "CHES", we try deleting "ES" and then "CHES"; when the deletion produces a base form, we find the meaning in the dictionary. Otherwise, it is a specific noun.
- If it ends with "XES" or "OES", we delete "XES" or "OES" and find the meaning.
- If it ends with "IES": replace "IES" by "Y".
- If it ends with "VES": replace "VES" by "F" or "FE".
- If it ends with "ES": replace "ES" by "S".
- If it ends with "S": delete "S".
After finding the meaning of the plural, we add "những" before its meaning.

If the word is a gerund:
- Delete "ING" at the end of the word. We try two cases: first the word without "ING", and second the word without "ING" but with "IE" appended.

If the word is VBP:
- If the word is "IS": it is "TO BE".
- If it ends with "IES": replace "IES" by "Y".
- If it ends with "SSES": erase "ES".
- If it ends with "S": erase "S".

If the word is in past participle or past form:
- Check whether the word is in the list of irregular verbs. If it is, we use its infinitive form to find the meaning. The list of irregular verbs is stored as a red-black tree to make the search easier and faster.
- If it ends with "IED": erase "IED".
- If it ends with "ED": check the two letters just before "ED"; if they are identical we erase the last 3 letters of the word, otherwise we erase "ED".

If the word is in present continuous form, we find the word in the same way as gerunds; after that we add "đang" after the meaning.

If the word is JJS: delete the last 3 or 4 characters and find the meaning in the dictionary.

Once the base form of the word is found, we use the bilingual dictionary to look up its meaning. The problem is that, since a word of a given type can have many meanings, we have to choose the right one; in our experiment, we take the first meaning in the bilingual dictionary. A sketch of the plural-noun rules is given below.
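A hedged sketch of the plural-noun (NNS) rules above follows; where the text is ambiguous (e.g. the "XES"/"OES" case) the sketch simply drops the "ES" suffix, which is an assumption rather than the authors' exact rule:

def singular_candidates(word):
    # Candidate base forms for a plural noun (NNS), following the rules above.
    # Each candidate is looked up in the bilingual dictionary; the Vietnamese word
    # "những" is then added before the chosen meaning.
    w = word.lower()
    if w.endswith("ches"):
        return [w[:-2], w[:-4]]          # try deleting "es", then "ches"
    if w.endswith("xes") or w.endswith("oes"):
        return [w[:-2]]                  # read here as dropping "es" (assumption)
    if w.endswith("ies"):
        return [w[:-3] + "y"]            # "ies" -> "y"
    if w.endswith("ves"):
        return [w[:-3] + "f", w[:-3] + "fe"]
    if w.endswith("es"):
        return [w[:-2] + "s"]            # as stated: replace "es" by "s"
    if w.endswith("s"):
        return [w[:-1]]                  # delete "s"
    return [w]

# singular_candidates("libraries") -> ["library"]
# singular_candidates("boxes")     -> ["box"]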

5.2. Insertion: The problem here is that we do not know the exact position at which to insert the Vietnamese meaning. If we simply choose the position of the Insertion record in the matching sequence, the final Vietnamese sentence will be of low quality. We use ideas from rule-based machine translation to solve this problem: for some specific phrases we can find a better position than the order of the records.

Firstly, link grammar system will parse the grammatical structure of sentence. The Link Grammar Parser is a syntactic parser of English, based on link grammar, an original theory of English syntax. Given a sentence, the system assigns it a syntactic structure, which consists of a set of labeled links connecting pairs of words. The parser also produces a "constituent" representation of a sentence (showing noun phrases, verb phrases, etc.).

From the grammatical structure of the sentence, we identify some English phrases whose word order needs to change when translating into Vietnamese. For example, for the noun phrase "nice book", with the 2 I-records I(nice) and I(book), we would otherwise translate it as "hay quyển sách" instead of the correct "quyển sách hay". With link grammar, we know the exact order for translation. The phrase patterns we process are listed in Table 1 (shown later), and a sketch of the reordering rules follows that table.



5.3. Example:

Input sentence: This nice book has been bought.

Example retrieval: the most similar example to the input sentence is This computer has been bought by him.

Sequence of records: E("This") – I("nice") – S("computer", "book") – E("has") – E("been") – E("bought") – D("by") – D("him").

With link grammar, there is a noun phrase within the sentence, "This nice book", corresponding to the records E("This"), I("nice"), S("computer", "book"). We reorder the sequence accordingly: S("computer", "book") – I("nice") – E("This") – E("has") – E("been") – E("bought") – D("by") – D("him").

Based on the new record sequence and the example, the adaptation phase is processed:

- Exact match: keep the order and the meaning of the word.
  "" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Substitution: find the meaning of the word in the input sentence and replace the example word with it.
  "Quyển sách" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Deletion: simply erase the word of the example.
  "Quyển sách" – "" – "" – "này" – "được" – "mua" – "" – "" – ""
- Insertion: we now have the right order of records, so we just find the meaning of the word in each Insertion record and put it at the record's position in the sequence.
  "Quyển sách" – "hay" – "" – "này" – "được" – "mua" – "" – "" – ""

After the 4 adaptation steps, we obtain the final sentence: "Quyển sách hay này được mua".

6. Evaluation:

6.1. Experimental Condition:

We manually built an English-Vietnamese corpus of 6000 sentence pairs.

To evaluate translation quality, we employed a subjective measure.

Each translation result was graded into one of four ranks by a bilingual human translator who is a native Vietnamese speaker. The four ranks were:

A: Perfect, no problem with both grammar and information. Translation quality is nearly equal to human translator.

B: Fair. The translation is easy to understand, but there are some grammatical mistakes or some trivial information is missing.

C: Acceptable. The translation is broken but can be understood with effort.

D: Nonsense: Important information was translated incorrectly.

The English-Vietnamese dictionary used includes 70,000 words. To optimize processing time, a threshold is used to limit the result set of the example retrieval phase. Table 2 shows the thresholds used for sentences shorter than 30 words; if the length of the input sentence is greater than or equal to 30, the threshold is 8.

Table 2: Value of threshold

Length of sentence (words):   0-5    5-10    10-15    15-30
Threshold:                    2      3       4        6

Table 1: Some phrases to process with Link Grammar

1. Noun phrase: POS(1, 2) = (JJ, NN). Reorder: (NN, JJ)
2. Noun phrase: POS(1, 2, 3) = (DT, JJ, NN) && word1 = this, that, these, those. Reorder: (NN, JJ, DT)
3. Noun phrase: POS(1, 2) = (NN1, NN2). Reorder: (NN2, NN1)
4. Noun phrase: POS(1, 2) = (PRP$, NN). Reorder: (NN, PRP$)
5. Noun phrase: POS(1, 2, 3) = (JJ1, JJ2, NN). Reorder: (NN, JJ2, JJ1)
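The following sketch applies the reordering rules of Table 1 to a POS-tagged noun phrase; it is illustrative only and, for simplicity, does not check the determiner condition of rule 2:

REORDER_RULES = [                         # (POS pattern, permutation of positions), from Table 1
    (("PRP$", "NN"), (1, 0)),             # my book        -> book my
    (("JJ", "NN"), (1, 0)),               # nice book      -> book nice
    (("NN", "NN"), (1, 0)),               # rule 3: (NN1, NN2) -> (NN2, NN1)
    (("DT", "JJ", "NN"), (2, 1, 0)),      # this nice book -> book nice this
    (("JJ", "JJ", "NN"), (2, 1, 0)),      # (JJ1, JJ2, NN) -> (NN, JJ2, JJ1)
]

def base_tag(tag):
    return "NN" if tag.startswith("NN") else tag

def reorder_noun_phrase(words, tags):
    # Returns the words reordered for Vietnamese when a Table 1 pattern applies,
    # otherwise the original order.
    pattern = tuple(base_tag(t) for t in tags)
    for rule_pattern, perm in REORDER_RULES:
        if pattern == rule_pattern:
            return [words[i] for i in perm]
    return list(words)

# reorder_noun_phrase(["nice", "book"], ["JJ", "NN"])               -> ["book", "nice"]
# reorder_noun_phrase(["this", "nice", "book"], ["DT", "JJ", "NN"]) -> ["book", "nice", "this"]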



6.2. Performance: For the experiment, we created two test sets: a set of random sentences with complex grammatical structure, and a set of 50 sentences edited from the training set. Under these conditions, the average processing time is less than 0.5 second per translation. Although the processing time increases as the corpus size increases, the growth is not linear but roughly a half power of the corpus size. Compared to DP-matching [2], the retrieval method using the word graph and A* search achieves efficient processing. Using the threshold 0.2 with random sentences, the processing time is significantly decreased, but the translation quality is low; the reason is that the bilingual corpus we used is too small, so the retrieved examples are not similar enough to the input sentence. There are two ways to increase translation quality: first, widen the example set; second, since we do not yet have an appropriate way to choose the right meaning from the bilingual dictionary, apply context-based translation to the EBMT system. Tables 3 and 4 illustrate the evaluation results.

Table 3: Set of edited sentences and performance

Rank:                         A     B     C     D
Total:                        25    11    3     11
Average length of sentence:   9.3   6.3   7.8   8.4

Precision: 70%

Table 4: Set of random sentences and performance

Rank:                         A     B     C     D
Total:                        15    10    3     22
Average length of sentence:   5.7   5.6   6.0   8

Precision: 50%

For the set of edited sentences (Table 3), the system

reached high translation quality, with a precision of 70%. Items in this set have grammatical structures and word types similar to the example set, which helps the EBMT system find suitable sentences for translation.

For the set of random sentences (Table 4), because it contains a number of complex sentences, the examples the EBMT system retrieves are not similar enough to the input; consequently, the result has lower quality (only 50% of the sentences are translated at quality rank A or B, the rest at rank C or D).

The system can translate sentences with complex grammatical structure as long as the retrieved example is similar enough to the input sentence.

7. Conclusion:

We report on a retrieval method for an EBMT system using edit distance, and an evaluation of its performance using our corpus. In the performance evaluation experiments, we used a bilingual corpus comprising 6000 sentences from many fields. The reason for some low-quality translations is the small size of the bilingual corpus. The EBMT system will provide better performance when it is applied to a specific field; for example, using EBMT to translate product manuals or introductions in the travel domain. The experimental results show that the EBMT system achieved feasible translation ability, and also efficient processing, by using the proposed retrieval method.

Acknowledgements:

The author’s heartfelt thanks go to Professor Thu Huong Nguyen, Computer Science Department, Hanoi University of Science and Technology for supporting the project, Socbay linguistic specialists for providing resources and helping us to test the system.

References

[1] Takao Doi, Hirofumi Yamamoto and Eiichiro Sumita, 2005. Graph-based Retrieval for Example-based Machine Translation Using Edit-distance.

[2] Eiichiro Sumita, 2001. Example-based machine translation using DP-matching between word sequences.

[3] John Edward Hopcroft and Jeffrey Ullman, 1979. Introduction to Automata Theory, Languages and Computation. Addison-Wesley, Reading, MA.

[4] Janusz Antoni Brzozowski, 1962. Canonical regular expressions and minimal state graphs for definite events. Mathematical Theory of Automata, MRI Symposia Series, Polytechnic Press, Polytechnic Institute of Brooklyn, NY, 12, 529-561.

[5] Steven S. Ngai and Randy B. Gullett, 2002. Example-Based Machine Translation: An Investigation.

[6] Ralf Brown, 1996. "Example-Based Machine Translation in the PanGloss System." In Proceedings of the Sixteenth International Conference on Computational Linguistics, pages 169-174, Copenhagen, Denmark.

[7] Michael Carl, 1999. "Inducing Translation Templates for Example-Based Machine Translation." In Proceedings of MT-Summit VII, Singapore.

[8] Đinh Điền, 2002. Building a training corpus for word sense disambiguation in English-to-Vietnamese Machine Translation.

[9] Chunyu Kit, Haihua Pan and Jonathan J. Webster, 1994. Example-Based Machine Translation: A New Paradigm.

[10] Kenji Imamura, Hideo Okuma, Taro Watanabe, and Eiichiro Sumita, 2004. Example-based Machine Translation Based on Syntactic Transfer with Statistical Models.


Cross-Ratio Analysis for Building up The Robustness of Document Image Watermark

Wiyada Yawai

Department of Computer Science, Faculty of Science King Mongkut’s Institute of Technology Ladkrabang

Bangkok, Thailand E-mail: [email protected]

Nualsawat Hiransakolwong

Department of Computer Science, Faculty of Science King Mongkut’s Institute of Technology Ladkrabang

Bangkok, Thailand E-mail: [email protected]

Abstract—This research presents the application of cross-ratio theory to build up the robustness of invisible watermarks embedded in multi-language (English, Thai, Chinese, and Arabic) grayscale document images against geometric distortion attacks (scaling, rotating, shearing) and other manipulations, such as noise addition, compression, sharpness, blur, and brightness and contrast adjustment, that occur while the embedded watermarks are scanned for verification. These attacks are simulated to test the effectiveness of cross-ratio theory, used here to enhance watermark robustness for document images in any language, which is normally the main limitation of other watermarking methods. The theory uses the 4 corners and two diagonals of a document image as references: watermark embedding lines are located between text lines, crossing the two diagonals and the vertical border lines on both sides according to specified cross-ratio values, and the watermark embedding positions on each line are calculated from another set of cross-ratios. The cross-ratio values of each line differ according to preset patterns. Detecting the watermarks in a document image requires neither converting the image nor comparing it with the original image: our approach detects them through calculation from the four referenced corners of the image and applies the correlation coefficient equation to compare directly against the original watermarks. Testing revealed reasonable robustness against scaling (from 11% upward), shearing (0 - 0.05), rotation (1 - 4 degrees), compression (quality 60% upward), contrast (1 - 45%), sharpness (0 - 100%) and blur filtering with mask sizes smaller than 13x13.

Keywords-Digital watermarking; Document image; Robustness; Geometric distortion; Cross ratio; Collinear points

I. INTRODUCTION

Digital watermarking is one of the processes of hiding data for protecting copyright of digital media either in forms of audio, video, text, etc. There are two categories of watermarking; visible watermarking and invisible watermarking. The major purpose of watermarking is to protect copyright of media through creating various forms of obstructions to violators.

This research particularly emphasizes applying the cross-ratio theory to create robust watermark data embedded in grayscale document images, which must survive and be easily detected even after being attacked in many possible ways, especially by geometric distortion attacks, which have mostly not been explored in other document image watermarking research. Most existing research focuses on watermarking an electronic text or document file instead of a document image, addresses one specific language instead of multiple languages, and emphasizes the watermark embedding technique instead of watermark robustness. Existing document watermarking research can be categorized by watermarking technique into the 3 techniques below.

Technique I: Watermark embedding by rearranging the physical layout/pattern/structure of the text document, such as shifting of lines [1] and words, particularly adjusting word spacing, word shift coding or word classification [2][3][4][5]. This technique can be applied to watermark both electronic document files and document images. However, it has some disadvantages: for instance, the line shifting technique of Brassil et al. has low robustness against document processing, page skewing/rotation between -3 and +3 degrees, noise adding attacks, and short text lines. Another limitation is that it can only be applied to documents with spaces between words, spacing of letters, shifting of baselines or line shift coding. A word shifting algorithm has also been developed by Huang et al. [5]; it is based on adjusting inter-word spaces in a text document so that the mean spaces across different lines show the characteristics of a sine wave, in which information or a watermark can be encoded in both horizontal and vertical directions.

Min Du et al. [6] proposed a text digital watermarking algorithm based on human visual redundancy. Since the human eye is not sensitive to slight changes in text color, watermarks are embedded by changing the low 4 bits of the RGB color components of characters. The proposed method has good invisibility and robustness, which depend on this redundancy; however, the research tested robustness only against word deletion and modification.

Technique II: Embedding a text watermark by modifying character/letter features. For example, Brassil et al. [2] adjusted letters by reducing or increasing the length of letters such as b, d, or h. The hidden data are extracted from a document by comparing the marked document against the original document. The limitation of this process is that the


hidden data have little robustness when the document passes through document processing.

W. Zhang et al. [7] developed a method that applies arithmetic expressions to replace characteristics of letters with occlusive (square-form) components, where hiding is done by adjusting the sizes of those characters in the document file. This process is robust against attacks or destruction and is hard to observe. The tests indicated that this hiding is more durable and more difficult to observe than line-shift coding, word-shift coding, and character coding, but the paper does not present robustness testing information. Moreover, this approach applies only to Chinese characters and needs further research, since it was tested only against the character replacement attack, not against other forms of watermark attack.

Shirali-Shahreza et al. [8] exploited the fact that a number of Persian characters are distinguished by their points (such as the Persian letter NOON), and changed these characters for hiding. Due to defects in using OCR to read Persian and Arabic document images, reading the printed text of these characters to extract hidden data is complicated, especially after attacks, which were not tested.

Suganya et al. [9] proposed modifying perceptually significant portions of an image, hiding the watermark in the locations of the dots of the English letters i and j. The first few bits are used to indicate the length of the hidden bits to be stored; then the cover text is scanned, and to store a one the dot is slightly shifted up, otherwise it remains unchanged. However, this research did not report any robustness testing results.

Technique III: Watermarking with semantic schemes or word/vowel substitution. Topkara et al. [10] developed a technique for embedding secret data without changing the meaning of the text by replacing words in the text with synonyms. This method deteriorates the quality of the document, and a large synonym dictionary is needed.

Samphaiboon et al. [11] proposed a steganographic scheme for electronic Thai plain text documents that exploits redundancies in the way particular vowel, diacritical, and tonal symbols are composed in TIS-620, the standard Thai character set. The scheme is blind in that the original carrier text is not required for decoding. However, it can be used only with Thai text documents, and its watermark data are easily destroyed by re-editing with a word processing program.

The research presented here studies watermarking of text document images (not electronic document files) scanned or copied from an original paper document, applying the cross-ratio theory of collinear points in order to build up watermark robustness against various forms of attack, particularly geometric distortions (scaling, shearing and rotation) and other manipulations (data compression, noise addition, and brightness, contrast, scale, sharpness and blur adjustment).

II. THE CROSS RATIO OF FOUR COLLINEAR POINTS

The cross-ratio is a basic invariant in projective geometry (i.e., all other projective invariants can be derived from it). Here a brief introduction to the cross-ratio invariance property is given.

Let A, B, C, D be four collinear points (three or more points A, B, C, ... are said to be collinear if they lie on a single straight line [12]) as shown in Fig. 1. The cross-ratio is defined as the "double ratio" in Eq. (1):

(A, B; C, D) = (AC · BD) / (BC · AD)    (1)

where all the segments are thought to be signed. The cross-ratio does not depend on the selected direction of the line ABCD, but does depend on the relative position of the points and the order in which they are listed. Based on a fundamental theory, any homography preserves the cross-ratio. Thus central projection, linear scaling, skewing, rotation, and translation preserve the cross-ratio [13].
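As a small numerical illustration (not part of the original paper), the sketch below checks that the cross-ratio of Eq. (1) is unchanged when four collinear points undergo a similarity transform (scaling, rotation and translation), which is the property the watermarking scheme relies on:

import math

def cross_ratio(A, B, C, D):
    # Cross-ratio (A, B; C, D) of four collinear points, Eq. (1),
    # using signed lengths measured along the direction of AD.
    def signed(p, q):
        ux, uy = D[0] - A[0], D[1] - A[1]
        norm = math.hypot(ux, uy)
        return ((q[0] - p[0]) * ux + (q[1] - p[1]) * uy) / norm
    return (signed(A, C) * signed(B, D)) / (signed(B, C) * signed(A, D))

def transform(p, scale=1.3, angle=0.2, tx=5.0, ty=-2.0):
    # a similarity transform: rotation by `angle`, scaling by `scale`, then translation
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * p[0] - s * p[1]) + tx, scale * (s * p[0] + c * p[1]) + ty)

pts = [(0.0, 0.0), (1.0, 0.5), (3.0, 1.5), (4.0, 2.0)]   # four collinear points
before = cross_ratio(*pts)
after = cross_ratio(*[transform(p) for p in pts])
# `before` and `after` agree up to floating-point error (both are 1.125 here).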

Figure 1. Collinear points A, B, C, and D

III. ANALYSIS OF DIGITAL WATERMARKING FOR DOCUMENT IMAGE

To apply the cross-ratio to digital image watermarking, three reference points are required. In this section, a method for deriving such reference points is detailed.

A. Definition

C_aC_b is the line from an origin point C_a to a destination point C_b, where a = 1, 2, 3, 4 and b = 1, 2, 3, 4.

Cr = (CA/CD) : (BA/BD) = (CA/CD)(BD/BA)

R = (AC/CD) / Cr

BA = (AD × R) / (1 + R)

DsB is the distance ratio BA/AD (equal to the value of BA/DA).
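A hedged sketch of how a predefined cross-ratio value Cr yields the distance ratio DsB and an embedding point B on the segment AD (with C the image center), following the definitions above, is given below; it is an illustration under those definitions, not the authors' code:

import math

def distance_ratio(Cr, A, C, D):
    # DsB = BA / AD for a given cross-ratio value Cr and known collinear points
    # A, C (the image center) and D, following the definitions above.
    def dist(p, q):
        return math.hypot(q[0] - p[0], q[1] - p[1])
    R = (dist(A, C) / dist(C, D)) / Cr        # R  = (AC/CD) / Cr
    BA = dist(A, D) * R / (1 + R)             # BA = (AD * R) / (1 + R)
    return BA / dist(A, D)                    # DsB

def embedding_point(Cr, A, C, D):
    # B = A + DsB * (D - A), the form used in Eqs. (4)-(17)
    DsB = distance_ratio(Cr, A, C, D)
    return (A[0] + DsB * (D[0] - A[0]), A[1] + DsB * (D[1] - A[1]))

# Example: A = C1, C = the image center Dc, D = C4 on the left diagonal
# embedding_point(Cr=1.5, A=(0, 0), C=(620, 877), D=(1240, 1754))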

B. Embedding Scheme

Let’s start by considering the embedding part. The method is described algorithmically below.

1) Predefine the set of cross-ratio values, to be used in subsequent steps.

2) Find the image center, denoted by D_c, by using the line intersection formula [14] applied to the two diagonal lines of the image, as described by Eqs. (2) ~ (3) below.


x_c = x_t / x_b    (2)

y_c = y_t / y_b    (3)

where

x_t = | a   x1 − x4 |        x_b = | x1 − x4   y1 − y4 |
      | b   x3 − x2 |              | x3 − x2   y3 − y2 |

y_t = | a   y1 − y4 |        y_b = | x1 − x4   y1 − y4 |
      | b   y3 − y2 |              | x3 − x2   y3 − y2 |

a = | x1   y1 | ,      b = | x3   y3 |
    | x4   y4 |            | x2   y2 |

and (x_i, y_i) is the coordinate of the point C_i, i = 1, ..., 4 (see Fig. 2).

From the above equations, x_c is the x-coordinate of the two-line intersection point D_c (where C1C4 intersects C3C2), y_c is the y-coordinate of the same point, and | · | denotes the determinant operator, as shown in Fig. 2.

3) Find each of the primary-level watermark embedding points (D_{LU,i} and D_{LD,i}) on the left diagonal line (see Fig. 2), as described by Eqs. (4) ~ (7) below. Those points can be identified by using the two corner points of the left diagonal line (C1 and C4), in combination with the image center point D_c as shown in Fig. 2(a) and the predefined cross-ratio values (Cr):

x_{LU,i} = x1 + DsB × (x4 − x1)    (4)

y_{LU,i} = y1 + DsB × (y4 − y1)    (5)

x_{LD,i} = x1 + DsB × (x4 − x1)    (6)

y_{LD,i} = y1 + DsB × (y4 − y1)    (7)

where (x_{LU,i}, y_{LU,i}), i = 1, ..., M_LU, is the coordinate of the point D_{LU,i} with A = C1, B = D_{LU,i}, C = D_c and D = C4. In addition, (x_{LD,i}, y_{LD,i}), i = 1, ..., M_LD, is the coordinate of the point D_{LD,i} with A = C1, B = D_{LD,i}, C = D_c and D = C4.

4) Find each of the watermark embedding points (D_{RU,i} and D_{RD,i}) on the right diagonal line (see Fig. 2(b)) by following steps and equations similar to those detailed in Step 3. However, in Eqs. (8) ~ (11) the point A now represents the point C2, while the point D represents the point C3. With these substitutions, the embedding points are given by:

x_{RU,i} = x2 + DsB × (x3 − x2)    (8)

y_{RU,i} = y2 + DsB × (y3 − y2)    (9)

x_{RD,i} = x2 + DsB × (x3 − x2)    (10)

y_{RD,i} = y2 + DsB × (y3 − y2)    (11)

where (x_{RU,i}, y_{RU,i}), i = 1, ..., M_RU, is the coordinate of the point D_{RU,i} with A = C2, B = D_{RU,i}, C = D_c and D = C3. In addition, (x_{RD,i}, y_{RD,i}), i = 1, ..., M_RD, is the coordinate of the point D_{RD,i} with A = C2, B = D_{RD,i}, C = D_c and D = C3.

Figure 2. Notations of collinear points A, B, C, and D, defined in the cross-ratio equation, on the left (a) and right (b) diagonal lines of the document image.

5) For each pair of levels D_{LU,i}, D_{RU,i} and D_{LD,i}, D_{RD,i}, find the intersection (x_i, y_i) of the crossed line of that level, drawn across the left-side (L_{LU,1} ... L_{LD,1}) and right-side (L_{RU,1} ... L_{RD,1}) border lines of the document image (see Fig. 3(a)), i.e. C1C3 and C2C4, by applying Eqs. (12) ~ (13):


x_i = x_t / x_b    (12)

y_i = y_t / y_b    (13)

where

x_t = | a   x1 − x2 |        x_b = | x1 − x2   y1 − y2 |
      | b   x3 − x4 |              | x3 − x4   y3 − y4 |

y_t = | a   y1 − y2 |        y_b = | x1 − x2   y1 − y2 |
      | b   y3 − y4 |              | x3 − x4   y3 − y4 |

a = | x1   y1 | ,      b = | x3   y3 |
    | x2   y2 |            | x4   y4 |

6) Find each of the watermark embedding points (E_{HU,i,k} and E_{HD,i,k}) on the watermark embedding lines (see Fig. 3(b)). Eqs. (14) ~ (17) give these embedding points:

x_{HU,i,k} = x_{LU,i} + DsB × (x_{RU,i} − x_{LU,i})    (14)

y_{HU,i,k} = y_{LU,i} + DsB × (y_{RU,i} − y_{LU,i})    (15)

x_{HD,i,k} = x_{LD,i} + DsB × (x_{RD,i} − x_{LD,i})    (16)

y_{HD,i,k} = y_{LD,i} + DsB × (y_{RD,i} − y_{LD,i})    (17)

where (x_{LU,i}, y_{LU,i}), i = 1, ..., M_LU, with A = L_{LU,i}, B = E_{HU,i,k}, C = D_{RU,i} and D = L_{RU,i}. In addition, (x_{LD,i}, y_{LD,i}), i = 1, ..., M_LD, with A = L_{LD,i}, B = E_{HD,i,k}, C = D_{RD,i} and D = L_{RD,i}.

7) From all watermark embedding points, embed the watermark patterns by means of a spread-spectrum principle [15] using the following equation. Given the set of watermark embedding points E_k = (x_k, y_k), k = 1, ..., M, and the watermarking pattern bits w_k ∈ {1, -1}, k = 1, ..., M, each watermarking pattern bit is embedded into the original image by Eq. (18):

I_e(x_m^k, y_n^k) = I(x_m^k, y_n^k) + α w_k    (18)

where x_m^k = x_k + m, m = -P, ..., P; y_n^k = y_k + n, n = -Q, ..., Q; and α is the strength of the watermark.
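A minimal sketch of the embedding of Eq. (18), using NumPy (an assumption; the paper does not name an implementation) and the experiment's values α = 3 and 5x5 blocks, is shown below:

import numpy as np

def embed_watermark(image, points, bits, alpha=3, P=2, Q=2):
    # Spread-spectrum embedding of Eq. (18): for each embedding point E_k = (x_k, y_k)
    # and pattern bit w_k in {+1, -1}, add alpha * w_k to the surrounding
    # (2P+1) x (2Q+1) pixel block (the experiment uses alpha = 3 and 5x5 blocks).
    out = image.astype(np.int16)
    h, w = out.shape
    for (x, y), wk in zip(points, bits):
        x0, x1 = max(x - P, 0), min(x + P + 1, w)
        y0, y1 = max(y - Q, 0), min(y + Q + 1, h)
        out[y0:y1, x0:x1] += alpha * wk
    return np.clip(out, 0, 255).astype(np.uint8)

# img = np.full((1754, 1240), 255, dtype=np.uint8)        # hypothetical blank page
# wm  = embed_watermark(img, points=[(100, 200), (300, 400)], bits=[1, -1])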

Figure 3. (a) Notations of horizontal lines intersecting the two diagonal lines and the left and right border lines of a text document image. (b) Notations of collinear points used for embedding 20 invisible actual watermark pattern bits in document images in English, Thai, Chinese, and Arabic.

C. Detection Scheme

To detect a watermark from the document image I'_e, the four image corner points must first be detected. This can be achieved, for example, by using any of the existing corner detection algorithms. Once the four corner points are detected, the watermark embedding points must be identified. Each point


can be calculated using a method similar to that of the embedding stage (see Section A for details). By extracting the values of the pixels corresponding to those watermark embedding points, denoted by I'_e(x_k, y_k), a watermark can be detected by any of the existing watermark detectors. Here, we adopt the correlation coefficient detector [15]. The correlation coefficient value is computed by the following equation.

Z_cc(I'_e, w_k) = (Ĩ_e · W̃_k) / √( (Ĩ_e · Ĩ_e)(W̃_k · W̃_k) )    (19)

where Ĩ_e = I'_e − Ī_e and W̃_k = W_k − W̄_k.

A watermark is detected if the correlation coefficient value is greater than a detection threshold. For example, in the experiment that follows, the detection threshold is 0.5.
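Correspondingly, a hedged sketch of the detector of Eq. (19), again using NumPy, is shown below; for simplicity it extracts a single pixel per embedding point rather than a block, which is a simplification of the scheme:

import numpy as np

def detect_watermark(image, points, bits, threshold=0.5):
    # Correlation-coefficient detector of Eq. (19): `points` are the embedding points
    # recomputed from the four detected corners; `bits` is the candidate pattern W_k.
    I = np.array([float(image[y, x]) for (x, y) in points])
    W = np.array(bits, dtype=float)
    I_t, W_t = I - I.mean(), W - W.mean()                 # I~ and W~ of Eq. (19)
    denom = np.sqrt((I_t @ I_t) * (W_t @ W_t))
    z = float(I_t @ W_t) / denom if denom else 0.0
    return z, z > threshold

# z, found = detect_watermark(wm, points=[(100, 200), (300, 400)], bits=[1, -1])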

IV. EXPERIMENT

In this computer simulation experiment, 35 grayscale multi-language document images of size 1240x1754 pixels were used, and 20 different invisible actual watermark patterns of length 100 bits were added, with α = 3 and a watermark block size of 5x5 pixels per watermark pattern bit. 120 cross-ratio values were used for watermark embedding and detection.

The experiment embedded watermarks in 35 grayscale document images containing text in English, Thai, Chinese, and Arabic, using 20 different watermark patterns (see Fig. 4). Watermark presence was measured with the correlation coefficient using a fixed threshold of 0.5 (a watermarked document image should give a value of 0.5 or above, and an unwatermarked one a value below 0.5). The results revealed a reasonable enhancement of watermark robustness from applying the cross-ratio.

Figure 4. Some examples of the 20 random invisible actual watermark pattern bits created and embedded between the text lines of English, Thai, Chinese, and Arabic text document images.

First, the control document image without a watermark was tested by comparing it against the watermark pattern, giving a correlation coefficient of 0, while the image with the watermark pattern gave a correlation coefficient of 1.

After that, the robustness of the cross-ratio watermarking was tested with 9 attacks: 3 geometric distortions (shearing, scaling and rotating) and 6 manipulations (compression, sharpness, brightness, contrast, blur masking and noise addition). The watermark detection results can be classified into three groups as follows:

Group I: No effect on actual watermark robustness was found under the sharpness attack, which gave a correlation coefficient of 1 for all percentages of sharpness filtering in the range 0 - 100%.

Group II: Very low effect on actual watermark robustness was found under compression at 60 - 100% JPEG compression quality (see Fig. 5), scaling at 11 - 60% scaling factor (see Fig. 6), blur with filtering mask sizes from 3x3 to 13x13 (see Fig. 7), contrast in the range 1 - 45%, shearing in the range 0 - 0.05, and rotation at angles between 1 and 4 degrees (see Fig. 8); these gave acceptable correlation coefficient values between 0.5 and 1 for all attack values specified above.

Group III: High effect on actual watermark robustness was found under brightness adjustment above 5%, Salt & Pepper noise above 1.5%, and Gaussian noise at all levels, all of which gave unacceptable correlation coefficient values, mostly near 0.

Figure 5. Correlation coefficient of watermarked multi-language document images, showing that all invisible actual watermarks are still reasonably detected down to a JPEG compression quality of 60%.

Figure 6. Correlation coefficient of watermarked multi-language document images under scaling, showing robustness for scaling factors between 11% and 120%.


Figure 7. Correlation coefficient of watermarked multi-language document images after blur filtering with mask sizes between 3x3 and 15x15, showing that robustness is maintained up to a mask size of 13x13.

Figure 8. Correlation coefficient values of watermarked multi-language document images with rotation angles from 1 to 4 degrees, which still show robustness.

These results show that noise signals are the most complicated factor affecting watermark detection: the more disturbing signals there are, the more difficult watermark detection becomes.

V. CONCLUSIONS

The correlation coefficient measurement, with acceptable values between 0.5 and 1, used for detecting the invisible grayscale watermarks embedded in multi-language document image files, has shown that applying the cross-ratio theory can effectively build up reasonable watermarking robustness against geometric distortion attacks, i.e. scaling (especially above 11%), shearing (0 - 0.05) and rotation (1 - 4 degrees), and some manipulation attacks, i.e. compression at quality above 60%, contrast (1 - 45%), sharpness (0 - 100%) and blur filtering with mask sizes no greater than 13x13. This built-up robustness is based on four collinear points used both as the watermark embedding pattern and as the reference points for watermark detection. The image does not need to be inversely transformed before detecting the watermark positions; the positions can be detected directly, and no comparison with the original unwatermarked document image is needed. Ownership of a document can be proved directly by comparing against the existing watermark pattern.

The experiment has also shown that the method can be applied to document images in any language, not depending on language-specific attributes like some of the methods mentioned above, which mostly test only one specific language and do not thoroughly explore the possible attacks affecting watermark robustness. This is an original application of the cross-ratio theory to grayscale multi-language document image watermarking. As the next step, we hope to improve it to build significantly higher robustness, especially against noise addition, rotation and brightness attacks.

REFERENCES

[1] J. T. Brassil, et al., ”Electronic Marking and Identification Techniques to Discourage Document Copying”, IEEE Journal on Selected Areas in Communications, Vol.13, No.8, Oct 1995, pp.1495-1504.

[2] S.H. Low, N.F. Maxemchuk, J.T. Brassil, and L. O’Gorman, “Document marking and identification using both line and word shifting”, Proceedings of the Fourteenth Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM’95), vol.2, 1995, pp. 853-860.

[3] Y. Kim, K. Moon, and I. Oh, “A Text Watermarking Algorithm based on Word Classification and Inter-word space Statistics”, Proceedings of the Seventh International Conference on Document Analysis and Recognition (ICDAR’03), 2003, pp. 775-779.

[4] A.M. Alattar and O.M. Alattar, "Watermarking electronic text documents containing justified paragraphs and irregular line spacing", Proceedings of SPIE - Volume 5306, Security, Steganography, and Watermarking of Multimedia Contents VI, 2004, pp. 685-695.

[5] D. Huang, and H. Yan, “Interword Distance Changes Represented by Sine Waves for Watermarking Text Images”, IEEE Trans. Circuits and Systems for Video Technology, Vol. 11, No. 12, pp. 1237-1245, 2001.

[6] Du Min and Zhao Quanyou, “Text Watermarking Algorithm based on Human Visual Redundancy”, AISS Journal, Advanced in Information Sciences and Service Sciences. Vol. 3, No. 5, pp. 229-235, 2011.

[7] W. Zhang, Z. Zeng, G. Pu, and H. Zhu, “ Chinese Text Watermarking Based on Occlusive Components”, IEEE, pp. 1850-1854, 2006.

[8] M.H. Shirali-Shahreza, and M. Shirali-Shahreza, “A New Approach to Persian/ Arabic Text Steganography”, IEEE International Conference on Computer and Information Science, 2006.

[9] Ranganathan Suganya, Johnsha Ahamed, Ali, Kathirvel.K & Kumar, Mohan, “Combined Text Watermarking”, International Journal of Computer Science and Information Technologies, Vol. 1 (5) , pp. 414-416, 2010.

[10] U. Topkara, M. Topkara, M. J. Atallah, "The Hiding Virtues of Ambiguity: Quantifiably Resilient Watermarking of Natural Language Text through Synonym Substitutions", In Proc. of ACM Multimedia and Security Conference, 2006.

[11] Samphaiboon Natthawut, and Dailey Matthew N, "Steganography in Thai text", In Proc. of 5th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, IEEE ECTI-CON 2008, pp. 133-136, 2008.

[12] Coxeter, H. S. M. and Greitzer, S. L. “Collinearity and Concurrence.”, Geometry Revisited, Ch. 3, Math. Assoc. Amer, 1967, pp. 51-79.

[13] R. Mohr and L. Morin, “Relative Positioning from Geometric Invariants,” Proceedings of the Conference on Computer Vision and Pattern Recognition, 1991, pp. 139-144.

[14] Antonio, F. “Faster Line Segment Intersection”, Graphics Gems III, Ch. IV.6, Academic Press, 1999, pp. 199-202 and 500-501.

[15] I. J. Cox, M. L. Miller, and J. A. Bloom, Digital Watermarking, Morgan Kaufmann Publishers, 2002.


PCA Based Handwritten Character Recognition System Using Support Vector Machine & Neural Network

Ravi Sheth1, N C Chauhan2, Mahesh M Goyani3, Kinjal A Mehta4

1Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388120, Gujarat, India
2Information Technology Dept., A.D. Patel Institute of Technology, New V V Nagar-388121, Gujarat, India
3Computer Engineering Dept., L.D. College of Engineering, Ahmadabad, Gujarat, India
4Electronics and Communication Dept., L.D. College of Engineering, Ahmadabad, Gujarat, India

[email protected]

Abstract— Pattern recognition deals with the categorization of input data into one of the given classes based on the extraction of features. Handwritten Character Recognition (HCR) is one of the well-known applications of pattern recognition. For any recognition system, an important part is feature extraction. A proper feature extraction method can increase the recognition ratio. In this paper, a Principal Component Analysis (PCA) based feature extraction method is investigated for developing an HCR system. PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension. The extracted principal components have been used as features of the character image, which have later been used for training and testing with Neural Network (NN) and Support Vector Machine (SVM) classifiers. HCR is also implemented with PCA and Euclidean distance.

Keywords: Pattern recognition, handwritten character recognition, feature extraction, principal component analysis, neural network, support vector machine, Euclidean distance.

I. INTRODUCTION

Handwritten character recognition is an area of pattern recognition that has become the subject of research during the last few decades. Handwriting recognition has always been a challenging task in pattern recognition. Many systems and classification algorithms have been proposed in the past years. Techniques ranging from statistical methods such as PCA and Fisher discriminant analysis [1] to machine learning such as neural networks [2] or support vector machines [3] have been applied to solve this problem. The aim of this paper is to recognize handwritten English characters by using PCA with the three different methods mentioned above. Handwritten characters have an infinite variety of styles, varying from person to person. Due to this wide range of variability, it is very difficult for a machine to recognize a handwritten character; the ultimate target is still out of reach. There is a huge scope for development in the field of handwritten character recognition. Any future progress in this field will be able to increase the communication between machine and man.

Generally, HCR is divided into four major parts, as shown in Fig. 1 [4]. These phases include binarization, segmentation, feature extraction and classification. A major problem faced while dealing with segmented handwritten character recognition is the ambiguity and illegibility of the characters. The accurate recognition of segmented characters is important for the recognition of words based on segmentation [5]. Feature extraction is the most difficult part of an HCR system.

Figure 1: Block diagram of HCR system.

Before recognition, the handwritten characters have to be processed to make them suitable for recognition. Here, we consider the processing of an entire document containing multiple lines and many characters in each line. Our aim is to recognize characters from the entire document. The handwritten document has to be free from noise, skewness, etc. The lines and words have to be segmented. The characters of any word have to be free from any slant angle so that the characters can be separated for recognition. By this assumption, we try to avoid the more difficult case of cursive writing. Segmentation of unconstrained handwritten text lines is difficult because of inter-line distance variability, base-line skew variability, different font sizes and the age of the document [5]. During the next step of this process, features are extracted from the segmented character. Feature extraction is a very important part of the character recognition process. The extracted features are applied to classifiers which recognize characters based on the trained features. In the second section, we describe



the feature extraction method in brief, including the principal component analysis method. In the next section we discuss the neural network, SVM and Euclidean distance methodologies.

II. FEATURE EXTRACTION

Any given image can be decomposed into several features. The term 'feature' refers to similar characteristics. Therefore, the main objective of a feature extraction technique is to accurately retrieve these features. The term "feature extraction" can thus be taken to include a very broad range of techniques and processes for the generation, update and maintenance of discrete feature objects or images [6]. Feature extraction is the most difficult part of an HCR system. This approach gives the recognizer more control over the properties used in identification. The character classification task recognizes a character by comparing it with the reference values obtained from the learned characters; the character should correspond to the document image that matches the document style configured in the document style setting part. Here we have investigated and developed a PCA based feature extraction method.

Principal component analysis

PCA is a useful statistical technique that has found application in fields such as face recognition and image compression, and is a common technique for finding patterns in data of high dimension [7]. It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. Since patterns can be hard to find in data of high dimension, where a graphical representation is not available, PCA is a powerful tool for analyzing data [7]. The other main advantage of PCA is that once you have found these patterns in the data, you can compress the data, i.e. reduce the number of dimensions, without much loss of information [7].

Principal component analysis is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. This transformation is defined in such a way that the first principal component has as high a variance as possible (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it be orthogonal to (uncorrelated with) the preceding components. Principal components are guaranteed to be independent only if the data set is jointly normally distributed. Before presenting the methodology, it is important to discuss the following terms related to PCA [7].

Eigenvectors and Eigenvalues

The eigenvectors of a square matrix are the non-zero vectors that, after being multiplied by the matrix, remain proportional to the original vector (i.e., change only in magnitude, not in direction). For each eigenvector, the corresponding eigenvalue is the factor by which the eigenvector is scaled when multiplied by the matrix. Another property of eigenvectors is that even if we scale the vector by some amount before we multiply it, we still get the same multiple of it as a result. Another important thing to know is that when mathematicians find eigenvectors, they like to find eigenvectors whose length is exactly one. This is because the length of a vector does not affect whether it is an eigenvector or not, whereas the direction does. So, in order to keep eigenvectors standard, whenever we find an eigenvector we usually scale it to have a length of 1, so that all eigenvectors have the same length [7].
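As a small numerical illustration (our own example, not taken from the paper), the eigenvectors of a symmetric 2x2 matrix can be computed and checked as follows:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)  # columns of eigvecs are unit-length eigenvectors
v, lam = eigvecs[:, 0], eigvals[0]
print(np.allclose(A @ v, lam * v))   # True: multiplying by A only rescales the eigenvector
print(np.linalg.norm(v))             # 1.0: eigenvectors are returned scaled to length one
```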

Steps for generating principal components of character and digit images:

Step 1: Get some data and find the mean.
In this work we have used our own data set, consisting of the handwritten characters A-J and the digits 1-5, with 30 samples of each character or digit. The mean is found using equation (1):

$$M = \frac{1}{N}\sum_{k=1}^{N} X_k \qquad (1)$$

where $M$ is the mean, $N$ is the total number of input images and $X_k$ is the k-th input image.

Step 2: Subtract the mean.
For PCA to work properly, we subtract the mean from each of the data dimensions. The mean subtracted is the average across each dimension (see equation (2)), where M is the mean calculated with equation (1). So, every X value has $\bar{X}$ (the mean of the X values of all the data points) subtracted, and every Y value has $\bar{Y}$ subtracted from it. This produces a data set whose mean is zero:

$$X_n - M \qquad (2)$$

Step 3: Calculate the covariance matrix.
The next step is to compute the covariance matrix using equation (3):

$$C = \frac{1}{N}\sum_{k=1}^{N} (X_k - M)(X_k - M)^{T} \qquad (3)$$

Step 4: Calculate the eigenvectors and eigenvalues of the covariance matrix.
Since the covariance matrix is square, we can calculate its eigenvectors and eigenvalues. By this process


of taking the eigenvectors of the covariance matrix, we are able to extract lines that characterize the data. The rest of the steps involve transforming the data so that it is expressed in terms of these lines.

Step 5: Choosing components and forming a feature vector.
Here is where the notion of data compression and reduced dimensionality comes in. In general, once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance. What needs to be done now is to form a feature vector, which is just a fancy name for a matrix of vectors. It is constructed by taking the eigenvectors that we want to keep from the list of eigenvectors, and forming a matrix with these eigenvectors in the columns.

Step 6: Deriving the new data set.
This is the final step in PCA, and it is also the easiest. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of the feature vector and multiply it on the left of the original data set, transposed.
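The following minimal NumPy sketch (our own illustration, not the authors' Matlab code; the function name and data shapes are assumptions) implements steps 1-6 for a data matrix whose columns are vectorized character images:

```python
import numpy as np

def pca_features(X, n_components):
    """Steps 1-6: principal-component features for X of shape (n_pixels, n_samples)."""
    # Steps 1-2: compute the mean image and subtract it from every sample
    mean = X.mean(axis=1, keepdims=True)
    centered = X - mean
    # Step 3: covariance matrix of the mean-subtracted data
    cov = centered @ centered.T / X.shape[1]
    # Step 4: eigenvectors and eigenvalues of the (symmetric) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Step 5: order components by eigenvalue (highest first) and keep the top ones
    order = np.argsort(eigvals)[::-1][:n_components]
    feature_vector = eigvecs[:, order]
    # Step 6: project the mean-subtracted data onto the chosen components
    return feature_vector.T @ centered, feature_vector

# Example on random placeholder "images" of 20x20 pixels (400 values per sample)
X = np.random.default_rng(0).normal(size=(400, 450))
features, components = pca_features(X, n_components=25)
print(features.shape)  # (25, 450)
```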

III. CLASSIFICATION METHODS

A. Neural Network

Artificial neural networks (ANN) provide a powerful simulation of information processing and are widely used in pattern recognition applications. The most commonly used neural network is a multilayer feed-forward network which maps an input layer of nodes onto an output layer through a number of hidden layers. In such networks, a back propagation algorithm is usually used as the training algorithm for adjusting the weights [9]. The back propagation model or multi-layer perceptron is a neural network that utilizes a supervised learning technique. Typically there are one or more layers of hidden nodes between the input and output nodes. Besides, a single network can be trained to reproduce all the visual parameters, or many networks can be trained so that each network estimates a single visual parameter. Many parameters, such as training data, transfer function, topology, learning algorithm, weights and others, can be controlled in the neural network [9].
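As an illustration of training such a back-propagation network on PCA features, the sketch below uses scikit-learn's MLPClassifier on synthetic placeholder data; the layer size and other parameters are our assumptions, not the authors' configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Placeholder standing in for the PCA feature matrix PC_A: 450 samples,
# 25 components, 15 classes (characters A-J and digits 1-5).
rng = np.random.default_rng(0)
PC_A = rng.normal(size=(450, 25))
labels = np.repeat(np.arange(15), 30)

# Multilayer perceptron trained with a supervised (back-propagation style) algorithm
clf = MLPClassifier(hidden_layer_sizes=(30,), activation='logistic',
                    max_iter=2000, random_state=0)
clf.fit(PC_A, labels)
print(clf.predict(PC_A[:5]))
```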

B. Support Vector Machine

The main purpose of any machine learning technique is to achieve the best generalization performance, given a specific amount of time and a finite amount of training data, by striking a balance between the goodness of fit attained on a given training dataset and the ability of the machine to achieve error-free recognition on other datasets [10].

Figure 2: Neural network design.

With this concept as the basis, support vector machines have proved to achieve good generalization performance with no prior knowledge of the data. The main goal of an SVM [10] is to map the input data onto a higher dimensional feature space nonlinearly related to the input space and to determine a separating hyperplane with maximum margin between the two classes in the feature space.

Figure 3: SVM margin and support vectors [10]

The main task of SVM is to find this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors). Let the data D be (Z1, y1), ..., (Z|D|, y|D|), where Zi is a training tuple associated with the class label yi, which has either the value +1 or -1 [11]. There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes the classification error on unseen data). SVM searches for the hyperplane with the largest margin, i.e., the Maximum Marginal Hyperplane (MMH) [11]. The basic concept of SVM can be summarized as follows.

A separating hyperplane can be written as [11]

$$X \cdot Z + C = 0 \qquad (4)$$

where $X = (X_1, X_2, \ldots, X_n)$ is a weight vector and $C$ a scalar (bias).

For 2-D data it can be written as [11]

$$X_0 + X_1 Z_1 + X_2 Z_2 = 0,$$

where $X_0 = C$ is an additional weight.


The hyperplanes defining the sides of the margin are

$$H_1: X_0 + X_1 Z_1 + X_2 Z_2 \ge 1 \ \text{ for } y_i = +1, \text{ and}$$
$$H_2: X_0 + X_1 Z_1 + X_2 Z_2 \le -1 \ \text{ for } y_i = -1.$$

Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors [11].

If the data were 3-D (i.e., with three attributes), we would have to find the best separating plane.

Once we have a trained support vector machine, we use it to classify test (new) tuples. Based on the Lagrangian formulation [11], the MMH can be rewritten as the decision boundary

$$D(Z^T) = \sum_{i=1}^{L} Y_i \alpha_i Z_i \cdot Z^T + C \qquad (5)$$

where $Y_i$ is the class label of support vector $Z_i$, $Z^T$ is a test tuple, $\alpha_i$ is a Lagrangian multiplier and $L$ is the number of support vectors.
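For illustration, the sketch below trains an RBF-kernel SVM with cost 1 and gamma 1 (the parameters later reported in Table 1) using scikit-learn's SVC as a stand-in for the libsvm package used in the paper; the data is synthetic placeholder data, not the paper's dataset:

```python
import numpy as np
from sklearn.svm import SVC

# Placeholders for the feature matrices PC_A (training) and PC_B (testing)
rng = np.random.default_rng(0)
PC_A = rng.normal(size=(450, 25))
train_labels = np.repeat(np.arange(15), 30)
PC_B = rng.normal(size=(30, 25))

svm = SVC(kernel='rbf', C=1.0, gamma=1.0)   # maximum-margin classifier with RBF kernel
svm.fit(PC_A, train_labels)
print(svm.predict(PC_B))
```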

C. Euclidean distance.

Euclidean distance is the most popular technique for finding the distance between two matrices or images. Let X, Y be two n x m images, $X = (X_1, X_2, \ldots, X_{nm})$ and $Y = (Y_1, Y_2, \ldots, Y_{nm})$. The Euclidean distance between X and Y is given by

$$d^2(X, Y) = \sum_{k=1}^{nm} (X_k - Y_k)^2 \qquad (6)$$
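A minimal sketch of recognition by minimum Euclidean distance between the test features and the training features (the matrices PC_A and PC_B are named after the paper; the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
PC_A = rng.normal(size=(450, 25))            # training feature vectors, one per row
train_labels = np.repeat(np.arange(15), 30)
PC_B = rng.normal(size=(30, 25))             # feature vectors of segmented test characters

# Equation (6): squared Euclidean distance between every test and training sample,
# then take the index of the minimum distance as the recognized class.
d2 = ((PC_B[:, None, :] - PC_A[None, :, :]) ** 2).sum(axis=2)
recognized = train_labels[d2.argmin(axis=1)]
print(recognized)
```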

IV. EXPERIMENT AND RESULTS

In this work the PCA method discussed in section II was implemented in the Matlab environment. The extracted data is used as features for two classifiers, namely, a neural network and a support vector machine. We have prepared a real dataset comprising the characters A to J and the digits 1 to 5. The data set was prepared by taking the handwriting of different persons in a specific format. We have taken 30 samples of each character and digit, so our dataset contains a total of 450 samples for the characters A to J and digits 1 to 5. We have applied the PCA method to this database and prepared the feature matrix PC_A. On the other side, for testing purposes, we have taken 30 different images. Binarization and segmentation are applied one by one to the input image. The same kind of feature matrix, PC_B, is prepared for all the segmented characters.

A. Implementation Results of ANN & PCA based character recognition
The prepared PC_A matrix is given as input to the neural network for training. Similarly, the PC_B matrix is given to the trained network for testing. An overall accuracy of 85% was obtained for the test data using ANN.

B. Implementation Results of SVM & PCA based character recognition
As described above, the PC_A matrix is given as input to the SVM for training, and the PC_B matrix is given to the trained model for testing. We have used the libsvm package [12] for classification. An overall accuracy of 92% was obtained for the test data using SVM.

C. Implementation Results of Euclidean distance & PCA based character recognition
In this method, for recognition we compute the Euclidean distance between PC_A and PC_B, find the index of the minimum distance and, based on this index, determine which character is recognized. PC_A and PC_B are prepared using the steps discussed in the previous section. The measured overall accuracy of this method is 90%.

D. Comparison of Recognition using ANN, SVM classifiersand Euclidean distance.

In Table 1 we list the different methods and their accuracy. As shown in the table, the overall accuracy of PCA (SVM) is better than that of the PCA (NN) and PCA (Euclidean distance) methods. If we compare these methods on the basis of training time, the SVM method also requires less time than the neural network and Euclidean distance. A drawback of the SVM method is that we have to generate SVM-format training and testing files, which is not required for the other methods. If we compare individual character accuracy, PCA (SVM) again gives good results compared to the other methods.

Table 1: Comparison of Overall Accuracy

Sr.no  Method                      Structure/Parameter                                      Accuracy
1      PCA (Neural Network)        [25 30 6 25]                                             85%
2      PCA (SVM)                   Kernel: RBF (Radial Basis Function), Cost = 1, Gamma = 1  92%
3      PCA (Euclidean distance)    -                                                        90%


Table 2: Comparison of Individual Character Accuracy

Sr.no  Letter or Digit  Accuracy of PCA-SVM (%)  Accuracy of PCA-ANN (%)  Accuracy of PCA-Euclidean Distance (%)
1      A                96                       80                       98
2      B                99                       80                       98
3      C                99                       100                      96
4      D                95                       70                       96
5      E                96                       80                       95
6      F                97                       80                       95
7      G                96                       90                       95
8      H                95                       80                       98
9      I                98                       75                       96
10     J                97                       80                       96
11     1                97                       80                       95
12     2                96                       90                       95
13     3                95                       80                       95
14     4                99                       80                       98
15     5                96                       80                       95

V. CONCLUSION

A simple and efficient off-line handwritten character recognition system using a new type of feature extraction, namely PCA, has been investigated. The selection of the feature extraction method is the most important factor for achieving a high recognition ratio. In this work, we have implemented a PCA based feature extraction method. With the obtained features, we have trained a neural network as well as an SVM to recognize characters. We have also implemented character recognition with PCA and Euclidean distance. In the investigated work the three methods showed an overall recognition of 85% for the PCA based neural network, 92% for the PCA based SVM and 90% for PCA with Euclidean distance.

REFERENCES

[1] S. Mori, C.Y. Suen and K. Yamamoto, "Historical review of OCR research and development," Proc. of the IEEE, vol. 80, pp. 1029-1058, July 1992.

[2] V.K. Govindan and A.P. Shivaprasad, "Character Recognition – A review," Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.

[3] H. Fujisawa, Y. Nakano and K. Kurino, "Segmentation methods for character recognition: from segmentation to document structure analysis," Proceedings of the IEEE, vol. 80, pp. 1079-1092, 1992.

[4] Ravi K Sheth, N.C. Chauhan, Mahesh M Goyani, "A Handwritten Character Recognition System using Correlation Coefficient," V V P Rajkot, 8-9 April 2011, ISBN: 978-81-906377-5-6, pp. 395-398.

[5] Pal, U. and B.B. Chaudhuri, "Indian script character recognition: A survey," Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, 2004.

[6] Ravi K Sheth, N C Chauhan, M G Goyni, Kinjal A Mehta, "Chain code based Handwritten character recognition system using neural network and SVM," ICRTITCS-11, 9-10 December, Mumbai.

[7] Lindsay I Smith, "A tutorial on Principal Components Analysis," February 26, 2002.

[7] Dewi Nasien, Habibollah Haron, Siti Sophiayati Yuhaniz, "The Heuristic Extraction Algorithms for Freeman Chain Code of Handwritten Character," International Journal of Experimental Algorithms (IJEA), Volume (1): Issue (1).

[8] S. Arora, "Features Combined in a MLP-based System to Recognize Handwritten Devnagari Character," Journal of Information Hiding and Multimedia Signal Processing, Volume 2, Number 1, January 2011.

[9] H. Izakian, S. A. Monadjemi, B. Tork Ladani, and K. Zamanifar, "Multi-Font Farsi/Arabic Isolated Character Recognition Using Chain Codes," World Academy of Science, Engineering and Technology 43, 2008.

[10] C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, 1998, pp. 121-167.

[11] Jiawei Han and Micheline Kamber, "Data Mining Concepts and Techniques," 2nd Ed., MK Publications, 2006, pp. 337-343.

[12] Chih-Jen Lin, "A Library for Support Vector Machines," http://www.csie.ntu.edu.tw/~cjlin/libsvm/


Web Mining Using Concept-based Pattern Taxonomy Model

Sheng-Tang Wu, Dept. of Applied Informatics and Multimedia, Asia University, Taichung, Taiwan, [email protected]

Yuefeng Li, Faculty of Science and Technology, Queensland University of Technology, Brisbane, Australia, [email protected]

Yung-Chang Lin, Dept. of Applied Informatics and Multimedia, Asia University, Taichung, Taiwan, [email protected]

Abstract— In the last decade, most of the current Pattern-based Knowledge Discovery systems have used statistical analyses only (e.g. occurrence or frequency) in the phase of pattern discovery. The downside of these approaches is that two different patterns may have the same statistical features, yet one of them may contribute more to the meaning of the text than the other. Therefore, how to extract concept patterns from the data and then apply these patterns to the Pattern Taxonomy Model becomes the main purpose of this project. In order to analyze the concept of documents, Natural Language Processing (NLP) techniques are used. Moreover, with support from a lexical ontology (e.g. Propbank), a novel concept-based pattern structure called "verb-argument" is defined and incorporated into the proposed Concept-based Pattern Taxonomy Model (CPTM). Hence, by combining techniques from several fields (including NLP, Data Mining, Information Retrieval, and Text Mining), this paper aims to develop an effective and efficient model, CPTM, to address the aforementioned problem. The proposed model is examined by conducting real Web mining tasks and the experimental results show that the CPTM model outperforms other methods such as Rocchio, BM25 and SVM.

Keywords- Concept Pattern; Pattern Taxonomy; Knowledge Discovery; Web Mining; Data Mining

I. INTRODUCTION

Due to the rapid growth of digital data made available in recent years, knowledge discovery and data mining have attracted great attention, with an imminent need for turning such data into useful information and knowledge. Knowledge discovery [3, 5] can be viewed as the process of nontrivial extraction of information from large databases, information that is implicitly present in the data, previously unknown and potentially useful for users. In the whole process of knowledge discovery, this study especially focuses on the phase between the transformed data and the discovered knowledge. As a result, the most important issue is how to mine useful patterns using data mining techniques, and then transform them into valuable rules or knowledge.

The field of Web mining has drawn a lot of attention with the constant development of the World Wide Web. Most Web content mining techniques try to use keywords as representatives to describe the concept of documents [4, 14]. In other words, the semantics of documents can be represented by a set of words that frequently appear in these

articles. Unfortunately, apart from the frequency feature, no other features, such as the relations between words, are even considered. Natural Language Processing (NLP) is one of the sub-fields of Artificial Intelligence (AI). The main objective of NLP is to transform human language or text into a form that a machine can deal with. Generally speaking, the process of analyzing human language or text is very complex for a machine. Firstly, the text is broken into partitions or segments, and then each word is tagged with a label according to its part of speech (POS). Finally, the appropriate representatives are generated using a parser, based on the analysis of the relationships between words, to describe the semantic information. Therefore, the relationship between discovered patterns can then be evaluated instead of using the statistical features of words. The integration of NLP and the pattern taxonomy model (PTM) [17] can be expected to find more useful patterns and construct more effective concept-based pattern taxonomies.

In order to extract and analyze concepts from documents, the statistical mechanism used in the information retrieval model during the phase of pattern discovery is insufficient. One possible solution is to utilize the information provided by an ontology (such as WordNet, Treebank and Propbank [10]). Therefore, a novel Concept-based Pattern Taxonomy Model (CPTM) with support from NLP is proposed in this study for the purpose of overcoming the aforementioned problems caused by the use of statistical methods.

The typical process of Pattern-based Knowledge Discovery (PKD) has two main steps. The first step is to find proper patterns, which can represent the concept or semantics, from training data using machine learning or data mining approaches. The second step is how to effectively use these patterns to meet the user's needs. However, in most cases the relationship between patterns is ignored and not taken into account while dealing with patterns. For example, although two words may have exactly the same statistical properties, the contributions of each word are sometimes not equal [15]. Therefore, the main objective of this work is to extract and quantify the concept from documents using the proposed PTM-based method.


II. LITERATURE REVIEW

The World Wide Web provides rich information on an extremely large amount of linked Web pages. Such a repository contains not only text data but also multimedia objects, such as images, audio and video clips. Data mining on the World Wide Web can be referred to as Web mining, which has gained much attention with the rapid growth in the amount of information available on the internet. Web mining is classified into several categories, including Web content mining, Web usage mining and Web structure mining [9].

Data mining is the process of pattern discovery in a dataset from which noise has been previously eliminated and which has been transformed in such a way as to enable the pattern discovery process. Data mining techniques are developed to retrieve information or patterns to implement a wide range of knowledge discovery tasks. In recent years, several data mining methods have been proposed, such as association rule mining [1], frequent itemset mining [21], sequential pattern mining [20], maximum pattern mining [6] and closed pattern mining [19]. Most of them attempt to develop efficient mining algorithms for the purpose of finding specific patterns within a reasonable period of time. However, how to effectively use this large amount of discovered patterns is still an unsolved issue. Therefore, the pattern taxonomy mechanism [16] was proposed to replace keyword-based methods by using tree-like taxonomies as concept representatives. A taxonomy is a structure that contains information describing the relationship between a sequence and its sub-sequences [18]. In addition, the performance of PTM-based models is improved by adopting closed sequential patterns. The removal of non-closed sequential patterns also results in an increase in the efficiency of the system due to the shrunken dimensionality.

III. CONCEPT-BASED PTM MODEL

The Concept-based PTM (CPTM) model is developed using a sentence-based framework proposed to address text classification problems. CPTM adopts NLP techniques by parsing and tagging each word based on its POS and generating semantic patterns as a result [15]. Different from the traditional approaches, CPTM treats each sentence, rather than the entire article, as a unit during the phase of semantic analysis. In addition, in the traditional methods the weight of terms (words) or phrases is estimated according to their statistical characteristics (such as the number of occurrences). However, words may have different descriptive capabilities even though they have exactly the same statistical value. Therefore, the more effective the conceptual patterns that are obtained, the more precisely the system can determine the concept.

How can we get more useful conceptual patterns by using NLP techniques? Our strategy is described below. An example sentence is stated as follows:

“We have investigated that the Data Mining field, developed for many years, has encountered the issues of low frequency and high dimensionality.”

In this sentence, we can first label the words based on their POS. The verbs, written in bold, then can be used as node in a specific structure to describe the semantic meaning of sentence. By expanding words from each verb, a structure called “Verb-Argument” [10] is formed, which is defined as a conceptual pattern in this study. The following conceptual patterns are obtained from the example sentence using the above definition:

[ARG0 We] have [TARGET investigated] [ARG1 Data Mining field, developed for many years, has encountered the issues of low frequency and high dimensionality] [ARG1 Data Mining field] [TARGET developed] [ARGM-TMP for many years] has encountered the issues of low frequency and high dimensionality [ARG1 Data Mining field developed for many years] has [TARGET encountered] [ARG2 the issues of low frequency and high dimensionality]

TARGET denotes the verb in the sentence. ARG0, ARG1 and ARGM-TMP are arguments that appear around the TARGET. Therefore, a set of "Verb-Argument" patterns can be discovered when applying this to a whole document. After the above process, our proposed CPTM can then analyze these conceptual patterns in the next phase.

From the data mining point of view, the conceptual patterns are defined as two types: sequential patterns and non-sequential patterns. The definition is described as follows. Firstly, let T = {t1, t2, ..., tk} be a set of terms, which can be viewed as words or keywords in a dataset. A non-sequential pattern is a non-ordered list of terms which is a subset of T, denoted as {s1, s2, ..., sm} (si ∈ T). A sequential pattern, defined as S = ⟨s1, s2, ..., sn⟩ (si ∈ T), is an ordered list of terms. Note that the duplication of terms is allowed in a sequence. This is different from the usual definition where a pattern consists of distinct terms.

After mining conceptual patterns, the relationship between patterns has to be defined in order to establish the pattern taxonomies. Sub-sequence is defined as follows: if there exist integers 1 ≤ i1 < i2 < ... < in ≤ m such that a1 = bi1, a2 = bi2, ..., an = bin, then the sequence ⟨a1, a2, ..., an⟩ is a sub-sequence of the sequence ⟨b1, b2, ..., bm⟩. For example, the sequence ⟨s1, s3⟩ is a sub-sequence of the sequence ⟨s1, s2, s3⟩. However, the sequence ⟨s3, s1⟩ is not a sub-sequence of ⟨s1, s2, s3⟩ since the order of terms is considered. In addition, we can also say the sequence ⟨s1, s2, s3⟩ is a super-sequence of ⟨s1, s3⟩. The problem of mining sequential patterns is to find a complete set of sub-sequences from a set of sequences whose support is greater than a user-predefined threshold (minimum support).
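A minimal sketch of this sub-sequence test (our own illustration of the definition, not the authors' mining code):

```python
def is_subsequence(sub, seq):
    """True if `sub` is a sub-sequence of `seq`: order preserved, gaps allowed."""
    it = iter(seq)
    return all(term in it for term in sub)

print(is_subsequence(['s1', 's3'], ['s1', 's2', 's3']))  # True
print(is_subsequence(['s3', 's1'], ['s1', 's2', 's3']))  # False, the order of terms matters
```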

We can then acquire a set of frequent sequential concept patterns CP for all documents d ∈ D+, such that CP = {p1, p2, ..., pn}. The absolute support suppa(pi) for all pi ∈ CP is obtained as well. We first normalize the absolute support of each discovered pattern based on the following equation:


$$\mathrm{support}: CP \rightarrow [0, 1] \qquad (1)$$

such that

$$\mathrm{support}(p_i) = \frac{\mathrm{supp}_a(p_i)}{\sum_{p_j \in CP} \mathrm{supp}_a(p_j)} \qquad (2)$$
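A one-line illustration of equation (2), with a hypothetical dictionary of absolute supports:

```python
def normalize_support(abs_support):
    """Equation (2): normalize absolute supports of concept patterns into [0, 1]."""
    total = sum(abs_support.values())
    return {p: s / total for p, s in abs_support.items()}

print(normalize_support({'p1': 3, 'p2': 1, 'p3': 4}))  # {'p1': 0.375, 'p2': 0.125, 'p3': 0.5}
```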

As aforementioned, statistical properties (such as support and confidence) are usually adopted to evaluate the patterns when using data mining techniques to mine frequent patterns. However, these properties are not effective in the stage of pattern deployment and evolution [17]. The reason is that short patterns will always be the major factors affecting the performance due to their high frequency. Therefore, what we need is to adopt long patterns, which provide more descriptive information. Another effective way is to construct a new pattern structure that gathers related information by using the above-mentioned NLP techniques. Figure 1 shows the flowchart of the proposed CPTM model.

Figure 1. The flow chart of CPTM Web mining model.

The pattern evolution shown in Figure 1 is used to map the pattern taxonomies into a feature space for the purpose of solving the low frequency problem of long patterns. There are two approaches proposed in order to achieve the goal: Independent Pattern Evolving (IPE) and Deployed Pattern Evolving (DPE). IPE and DPE provide different representing manners for pattern evolving as shown in Figure 2. IPE deals with patterns at the early state of individual form, instead of manipulating patterns in deployed form at the late state. DPE is constructed by compounding discovered patterns from PTM into a hypothesis space, which means this action

involves all the features including some that may come from the other patterns at the “P Level”. Therefore, both methods can be used for pattern evolution and evaluation in CPTM model.

Figure 2. Two types of Pattern Evolving.

Once the CPTM model is established, we apply it to the Web mining task using a real Web dataset for performance evaluation. Several standard benchmark datasets are available for experimental purposes, including the Reuters Corpora, OHSUMED and the 20 Newsgroups Collection. The dataset used in our experiment in this study is the Reuters Corpus Volume 1 (RCV1) [13]. An example RCV1 document is illustrated in Figure 3.

Figure 3. An example RCV1 document.

RCV1 includes 806,791 English language news stories which were produced by Reuters journalists for the period between 20 August 1996 and 19 August 1997. These documents were formatted using a structured XML scheme. Each document is identified by a unique item ID and corresponded with a title in the field marked by the tag <title>. The main content of the story is in a distinct <text> field consisting of one or several paragraphs. Each paragraph is enclosed by the XML tag <p>. In our experiment, both the “title” and “text” fields are used and each paragraph in the “text” field is viewed as a transaction in a document.


Figure 4 shows the primary result of pattern analysis using the Propbank scheme. The marked terms in brackets are the verbs defined by Propbank. All of the conceptual patterns can then be generated on a "Verb-Argument" frame basis. At the next stage, the IPE and DPE methods are used for pattern evolving. Figure 5 illustrates the output of pattern discovery using CPTM as an example.

Sentence no. 1 [polic] [search] properti [own] marc dutroux chief [suspect] belgium child sex [abus] [murder] [scandal] tuesdai [found] decompos bodi two adolesc adult medic sourc Sentence no. 2 [found] two bodi [advanc] [state] decomposit sourc told [condit] anonym : : : Sentence no. 7 fate two girl [remain] mysteri Sentence no. 8 belgian girl gone [miss] recent year

Figure 4. The primary result of pattern analysis.

Figure 5. The output of pattern discovery.

In addition, the effect of the patterns derived from negative examples cannot be ignored due to their useful information [11]. There is no doubt that negative documents contain much useful information for identifying ambiguous patterns during concept learning. Therefore, it is necessary for a CPTM system to exploit these ambiguous patterns from the negative examples in order to reduce their influence. Algorithm NDP is shown as follows.

Algorithm NDP(Ω, D+, D-)
Input: A list of deployed patterns Ω; a list of positive and negative documents D+ and D-.
Output: A set of term-weight pairs d.
Method:
1: d ← Ø
2: τ = Threshold(D+)
3: foreach negative document nd in D-
4:   if Threshold(nd) > τ
5:     ∆p = {dp in Ω | termset(dp) ∩ nd ≠ Ø}
6:     Weight shuffling for each P in ∆p
7:   end if
8:   foreach deployed pattern dp in Ω
9:     d ← d ∪ pattern merging of dp
10:  end for
11: end for

IV. EXPERIMENTAL RESULTS

The effectiveness of the proposed CPTM Web mining model is evaluated by performing an information filtering task with the real Web dataset RCV1. The experimental results of CPTM are compared to those of other baselines, such as TFIDF, Rocchio, BM25 [12] and support vector machines (SVM) [2, 7, 8], using several standard measures. These measures include Precision, Recall, Top-k (k = 20 in this study), Breakeven Point (b/e), Fβ-measure, Interpolated Average Precision (IAP) and Mean Average Precision (MAP).

Table 1. Contingency table.

The precision is the fraction of retrieved documents that are relevant to the topic, and the recall is the fraction of relevant documents that have been retrieved. For a binary classification problem the judgment can be defined within a contingency table as depicted in Table 1. According to the definition in this table, the measures of Precision and Recall are denoted as TP/(TP+FP) and TP/(TP+FN) respectively, where TP (True Positives) is the number of documents the system correctly identifies as positives; FP (False Positives) is the number of documents the system falsely identifies as positives; FN (False Negatives) is the number of relevant documents the system fails to identify.
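For illustration, a small sketch computing these measures from the contingency counts (the numbers are hypothetical):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP), Recall = TP/(TP+FN), F1 = their harmonic mean."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=40, fp=10, fn=20))  # (0.8, 0.666..., 0.727...)
```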

The precision of the top-K returned documents refers to the proportion of relevant documents among the first K returned documents. The value of K we use in the experiments is 20, denoted as "t20". The breakeven point (b/e) is used to provide another measurement for performance evaluation. It indicates the point where the value of precision equals the value of recall for a topic.

Both the b/e and the F1-measure are single-valued measures in that they use only a single figure to reflect the performance over all the documents. However, we need more figures to evaluate the system as a whole. Therefore,


another measure, Interpolated Average Precision (IAP), is introduced. This measure is used to compare the performance of different systems by averaging precisions at 11 standard recall levels (i.e., recall = 0.0, 0.1, ..., 1.0). The 11-points measure used in our comparison tables indicates the first of the 11 points, where recall equals zero. Moreover, Mean Average Precision (MAP) is used in our evaluation, which is calculated by measuring precision at each relevant document first, and then averaging precisions over all topics.

The decision function of SVM is defined as

$$h(x) = \mathrm{sign}(w \cdot x + b) = \begin{cases} +1 & \text{if } w \cdot x + b \ge 0 \\ -1 & \text{else} \end{cases} \qquad (3)$$

where $x$ is an element of the input space, $b \in \mathbb{R}$ is a threshold and $w = \sum_{i=1}^{l} y_i \alpha_i x_i$ for the given training data

$$(x_1, y_1), \ldots, (x_l, y_l) \qquad (4)$$

where $x_i \in \mathbb{R}^n$ and $y_i = +1\,(-1)$ if document $x_i$ is labeled positive (negative). $\alpha_i \in \mathbb{R}$ is the weight of the training example $x_i$ and satisfies the following constraints:

$$\forall i: \alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \qquad (5)$$

Since all positive documents are treated equally before the process of document evaluation, the value of $\alpha_i$ is set to 1.0 for all positive documents, and thus the $\alpha_i$ for the negative documents can be determined by using equation (5).

Figure 5. The results of CPTM compared to other methods.

Figure 5 shows the interpolated 11-point precision-recall curves of CPTM compared to the other methods. It indicates that CPTM outperforms the others at both low and high recall values. Figure 6 reveals a similar result: CPTM has better performance in all measures compared to the other approaches, including the data mining method and the traditional probability method.

Figure 6. The comparison results shown in several standard measures.

V. CONCLUSION

In general, a significant number of patterns can be retrieved by using data mining techniques to extract information from Web data. However, how to effectively use these discovered patterns is still an unsolved problem. Another typical issue is that only statistical properties (such as support and confidence) are used when evaluating the effectiveness of patterns. The useful information hidden in the relationships between patterns is still not utilized. The drawback of traditional methods is that longer patterns usually have a lower support, resulting in low performance. Therefore, NLP techniques can be adopted to help define and generate the conceptual patterns. In this paper, a novel concept-based PTM Web mining model, CPTM, has been proposed. CPTM provides effective solutions to the aforementioned problems by integrating NLP techniques and a lexical ontology. The experimental results show that the CPTM model outperforms other methods such as Rocchio, BM25 and SVM.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in ACM-SIGMOD, 1993, pp. 207-216.

[2] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, pp. 273-297, 1995.

[3] V. Devedzic, "Knowledge discovery and data mining in databases," in Handbook of Software Engineering and Knowledge Engineering. vol. 1, S. K. Chang, Ed., ed: World Scientific Publishing Co., 2001, pp. 615-637.

[4] L. Edda and K. Jorg, "Text categorization with support vector machines. how to represent texts in input space?," Machine Learning, vol. 46, pp. 423-444, 2002.

[5] W. J. Frawley, G. Piatetsky-Shapiro, and C. J. Matheus, "Knowledge discovery in databases: an overview," AI Magazine, vol. 13, pp. 57-70, 1992.

[6] K. Gouda and M. J. Zaki, "Genmax: An efficient algorithm for mining maximal frequent itemsets," Data Mining and Knowledge Discovery, vol. 11, pp. 223-242, 2005.


[7] T. Joachims, "A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization," in ICML, 1997, pp. 143-151.

[8] T. Joachims, "Transductive inference for text classification using support vector machines," in ICML, 1999, pp. 200-209.

[9] C. Kaur and R. R. Aggarwal, "Web Mining Tasks and Types: A Survey," IJRIM, vol. 2, pp. 547-558, 2012.

[10] P. Kingsbury and M. Palmer, "Propbank: the next level of Treebank," in Treebanks and Lexical Theories, 2003.

[11] Y. Li, X. Tao, A. Algarni, and S.-T. Wu, "Mining Specific and General Features in Both Positive and Negative Relevance Feedback," in TREC, 2009.

[12] S. E. Robertson, S. Walker, and M. Hancock-Beaulieu, "Experimentation as a way of life: Okapi at trec," Information Processing and Management, vol. 36, pp. 95-108, 2000.

[13] T. Rose, M. Stevenson, and M. Whitehead, "The reuters corpus volume1- from yesterday's news to today's language resources," in Inter. Conf. on Language Resources and Evaluation, 2002, pp. 29-31.

[14] F. Sebastiani, "Machine learning in automated text categorization," ACM Computing Surveys, vol. 34, pp. 1-47, 2002.

[15] S. Shehata, F. Karray, and M. Kamel, "A concept-based model for enhancing text categorization," in KDD, 2007, pp. 629-637.

[16] S.-T. Wu, Y. Li, and Y. Xu, "An effective deploying algorithm for using pattern-taxonomy," in iiWAS05, 2005, pp. 1013-1022.

[17] S.-T. Wu, Y. Li, and Y. Xu, "Deploying approaches for pattern refinement in text mining," in ICDM, 2006, pp. 1157-1161.

[18] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic pattern-taxonomy extraction for web mining," in IEEE/WIC/ACM International Conference on Web Intelligence, 2004, pp. 242-248.

[19] X. Yan, J. Han, and R. Afshar, "Clospan: mining closed sequential patterns in large datasets," in SIAM Int. Conf. on Data Mining (SDM03), 2003, pp. 166-177.

[20] C.-C. Yu and Y.-L. Chen, "Mining sequential patterns from multidimensional sequence data," IEEE Transactions on Knowledge and Data Engineering, vol. 17, pp. 136-140, 2005.

[21] S. Zhang, X. Wu, J. Zhang, and C. Zhang, "A decremental algorithm for maintaining frequent itemsets in dynamic databases," in International Conference on Data Warehousing and Knowledge Discovery (DaWaK05), 2005, pp. 305-314.


A New Approach to Cluster Visualization Methods Based on Self-Organizing Maps

Marcin Zimniak, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany, [email protected]

Johannes Fliege, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany, [email protected]

Wolfgang Benn, Department of Computer Science, Chemnitz University of Technology, Chemnitz, Germany, [email protected]

Abstract — The Self-Organizing Map (SOM) is one of the artificial neural networks that perform vector quantization and vector projection simultaneously. Due to this characteristic, a SOM can be visualized in two ways: through the output space, which means considering the vector projection perspective, and through the input data space, emphasizing the vector quantization process.

This paper pursues the idea of presenting high-dimensional clusters that are 'disjoint objects' as groups of pairwise disjoint simple geometrical objects – like 3D spheres, for instance. We expand current cluster visualization methods to gain a better overview of and insight into the existing clusters. We analyze the classical SOM model, focusing on the topographic product as a measure of the degree of topology preservation, and treat that measure as a criterion for the admissible neural net dimension in the dimension reduction process. To achieve better performance and more precise results we use the SOM batch algorithm with a toroidal topology. Finally, a software solution of the approach for mobile devices like the iPad is presented.

Keywords- Self-organizing maps (SOM); topology preservation; clustering; data-visualisation; dimension reduction; data-mining

I. INTRODUCTION

Neural maps are biologically inspired data representations that combine aspects of vector quantization with the property of function continuity. Self-Organizing Maps (SOMs) have been successfully applied as a tool for visualization, for clustering of multidimensional datasets, for image compression, and for speech and face recognition.

A SOM is basically a method of vector quantization, i.e. this technique is obligatory in a SOM. Regarding dimensionality reduction, a SOM models data in a nonlinear and discrete way by representing it in a deformed lattice. The mapping, however, is given explicitly and well defined only for the prototypes, and in most cases only offline algorithms implement SOMs. For our purpose we consider the so-called 'batch' version of the SOM, which can easily be derived from the basic model: instead of updating prototypes one by one, they are all moved simultaneously at the end of each run, as in a standard gradient descent. In order to reduce border effects in the neural network we use a toroidal topology. For more details concerning the degree of organization we refer the reader to [1]. Applying this approach, we work with a so-called well-organized neural grid. One of our main tasks concerning the application of Self-Organizing Maps is to implement a suitable mapping procedure that should result in a topology preserving projection of high-dimensional data onto a low dimensional lattice.

In our project we consider only three admissible dimensions of the output space, namely $d_A = 1, 2, 3$ for a given neuronal grid $A$. However, in general, the choice of the dimension for the neural net does not guarantee to produce a topology-preserving mapping. Thus, the interpretation of the resulting map may fail. Therefore, we introduce the very important concept of a topologically preserving mapping, which means that similar data vectors are mapped onto the same or neighbored locations in the lattice and vice versa.

In this paper we propose a new concept of cluster visualization; we illustrate clusters as pairwise disjoint simple geometrical objects, like spheres in 3D, centered at the best matching unit (BMU) coordinates within a neural network of admissible dimension.

Our paper is organized as follows: in section 2 we give a precise mathematical description of the SOM, including the topology preservation measure (topographic product) as a measure for an admissible dimension of the output space. In section 3 we present existing methods of cluster visualization, followed by the extension of a graphical visualization method providing a new solution. In section 4 we demonstrate a software realization approach for our new visualization concept. Finally, we outline our conclusion and emerging further work in section 5.


II. MATHEMATICAL BACKGROUND OF THE SOM

One of the powerful approaches to adopt our cluster considerations within SOM is the application of Self-Organizing Maps to implement a suitable mapping procedure, which should result in a topology-preserving projection of the high-dimensional data onto a low dimensional lattice. In most applications a two- or three-dimensional SOM lattice is the common choice of lattice structure because of its easy visualization. However, in general, this choice does not guarantee to produce a topology-preserving mapping. Thus, the interpretation of the resulting map may fail. Topology preserving mapping means that similar data vectors are mapped onto the same or neighbored locations in the lattice and vice versa.

A. SOM Algorithm and Topology Preservation

Within the framework of dimensionality reduction, SOM can be interpreted intuitively as a kind of nonlinear but discrete PCA. Formally, Self-Organizing Maps, as a special kind of artificial neural network, project data from some (possibly high-dimensional) input space $V \subseteq \Re^{D_V}$ onto a position in some output space (neural map) $A$, such that a continuous change of a parameter of the input data should lead to a continuous change of the position of a localized excitation in the neural map. This property of neighborhood preservation depends on an important feature of the SOM, its output space topology, which has to be predefined before the learning process is started. If the topology of $A$ (i.e. its dimensionality and its edge length ratios) does not match that of the data shape, neighborhood violations will occur. This can be written in a formal way by defining the output space positions as $r = (i_1, i_2, \ldots, i_m)$, $1 \le i_k \le n_k$, with $N = n_1 \times n_2 \times \ldots \times n_m$, where $n_k$, $k = 1..m$, represents the dimension of $A$ (i.e. the length of the edge of the lattice) in the k-th direction. In general, other arrangements are possible, e.g. the definition of a connectivity matrix. Nevertheless, we consider hypercubes in our project. We associate a weight vector or pointer $w_r \in V$ with each neuron $r \in A$.

The mapping $\Psi_{V \to A}$ is realized by the winner-takes-all (WTA) rule. It updates only one prototype (the BMU) at each presentation of a datum. WTA is the simplest rule and includes classical competitive learning as well as frequency-sensitive competitive learning:

$$\Psi_{V \to A}: v \mapsto s = \arg\min_{r \in A} \| v - w_r \| \qquad (1)$$

where the corresponding reverse mapping is defined as $\Psi_{A \to V}: r \mapsto w_r$. These two functions together determine the map

$$M = (\Psi_{V \to A}, \Psi_{A \to V}) \qquad (2)$$

realized by the SOM network. All data points $v \in \Re^{D_V}$ that are mapped onto the neuron $r$ make up its receptive field. The masked receptive field of neuron $r$ is defined as the intersection of its receptive field with $V$, namely

$$\Omega_r = \{ v \in V : r = \Psi_{V \to A}(v) \}. \qquad (3)$$

Therefore, the masked receptive fields $\Omega_r$ are closed sets. All masked receptive fields form the Voronoi tessellation (diagram) of $V$. If the intersection of two masked receptive fields $\Omega_r, \Omega_{r'}$ is non-vanishing ($\Omega_r \cap \Omega_{r'} \neq \emptyset$), we call $\Omega_r$ and $\Omega_{r'}$ neighbored. The neighborhood relations form a corresponding graph structure $G_V$ in $A$: two neurons are connected in $G_V$ if and only if their masked receptive fields are neighbored. The graph $G_V$ is called the induced Delaunay graph. For further details we refer the reader to [2]. Due to the bijective relation between neurons and weight vectors, $G_V$ also represents the Delaunay graph of the weights (Fig. 1).

To achieve the map $M$, the SOM adapts the pointer positions during the presentation of a sequence of data points $v \in V$ selected from a data distribution $P(V)$ as follows:

$$\Delta w_r = \varepsilon \cdot h_{sr}(v - w_r), \qquad (4)$$

where $0 \le \varepsilon \le 1$ denotes the learning rate, and $h_{sr}$ is the neighborhood function, usually chosen to be of Gaussian shape:

$$h_{sr} = \exp\left( -\frac{\| r - s \|^2}{2\sigma^2} \right). \qquad (5)$$

We note that $h_{sr}$ depends on the best matching neuron $s$ defined in (1).

Topology preservation in SOMs is defined as the preservation of the continuity of the mapping from the input space onto the output space. More precisely, it is equivalent to the continuity of $M$ (in the mathematical topological sense) between the topological spaces with a properly chosen metric in both $A$ and $V$. Thus, to indicate topographic violations we need metric and topological conditions; e.g. in Fig. 2 a) a perfect topographic map is indicated, whereas in Fig. 2 b) topography is violated: the pair of nearest neighbors $w_1, w_3$ is mapped onto the neurons 1 and 3, which are not nearest neighbors, and the distance relation is inverted as well, $d_V(w_1, w_2) > d_V(w_1, w_3)$ but $d_A(1, 2) < d_A(1, 3)$. Thus, topological and metric conditions are violated. For detailed considerations we refer to [3]. The topology preserving property can be used for immediate evaluations of the resulting map, e.g. for an interpretation as a color space, which we apply in Sec. 3.

As we already pointed out in the introduction, violations of the topographic mapping may lead to false interpretations. Several approaches have been developed to measure the degree of topology preservation for a given map. We chose the topographic product $P$, which relates the sequence of input space neighbors to the sequence of output space neighbors for each neuron. Instead of using the Euclidean distances between the

Figure 1. The Delaunay triangulation and Voronoi diagram are dual to each other in the graph theoretical sense.


weight vectors, this measure applies the respective distances $d_{G_V}(w_r, w_{r'})$ of minimal path lengths in the induced Delaunay graph $G_V$ of the $w_r$. During the computation of $P$ the sequences $n_m^A(r)$, describing the m-th neighbor of $r$ in $A$, and $n_m^V(r)$, describing the m-th neighbor of $w_r$, have to be determined for each node $r$. These sequences and further averaging over neighborhood orders $m$ and nodes $r$ finally lead to

$$P = \frac{1}{N(N-1)} \sum_{r} \sum_{m=1}^{N-1} \frac{1}{2m} \log \left( \prod_{l=1}^{m} \frac{d_{G_V}\!\left(w_r, w_{n_l^A(r)}\right)}{d_{G_V}\!\left(w_r, w_{n_l^V(r)}\right)} \cdot \frac{d^A\!\left(r, n_l^A(r)\right)}{d^A\!\left(r, n_l^V(r)\right)} \right) \qquad (6)$$

The sign of $P$ approximately indicates the relation between the input and output space topology: $P < 0$ corresponds to a too low-dimensional output space, $P \approx 0$ indicates an approximate match, and $P > 0$ corresponds to a too high-dimensional output space.

In the definition of $P$, topological and metric properties of a map are mixed. This mixture provides a simple mathematical characterization of what $P$ actually measures. However, for the case of perfect preservation of an order relation, identical sequences $n_l^A(r)$ and $n_l^V(r)$ result in $P$ taking on the value $P = 0$.

Application of SOMs to very high-dimensional data can produce difficulties that result from the so-called 'curse of dimensionality': the problem of sparse data caused by the high data dimensionality. We refer to the approach proposed by Kaski in [4].

B. Application of the Topographic Product to Real-World Data

The data set consists of speech feature vectors ($D_V = 19$, the dimension of the input space) obtained from several speakers uttering the German numerals1. We see (Fig. 3) that in this case the topographic product singles out $d_A \approx 3$.

C. Batch Version of Kohonen's Self-Organizing Map

Depending on the application, data observations may arrive consecutively or, alternatively, the whole data set may be available at once. In the first case, an online algorithm is applied. In the second case, an offline algorithm suffices. More precisely, offline or batch algorithms cannot work until the whole set of observations is known. On the contrary, online algorithms typically work with no more than a single

1 The data is available at the III. Physikalisches Institut Goettingen; previously investigated in [8], [9].

For most methods the choice of the model largely orients the implementation towards one or the other type of algorithm. Generally, the simpler the model, the more freedom is left to the implementation. In our project we apply the batch version of the SOM described in the following algorithm (a minimal code sketch is given after the steps):

1) Define the lattice by assigning the low-dimensional coordinates of the prototypes in the embedding space.

2) Initialize the coordinates of the prototypes in the data space.

3) Assign to ε and to the neighborhood function h their scheduled values for epoch q.

4) For all points 𝑣 in the data set, compute all prototypes as in (1) and update them according to (4).

5) Continue with step 3 until convergence is reached (i.e. updates of the prototypes become negligible).
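The sketch below follows steps 1) to 5), assuming a Gaussian neighborhood function and the standard batch update rule; it is not claimed to be identical to equations (1) and (4) of this paper, and all names are our own.

```python
import numpy as np

def batch_som(data, grid_shape=(10, 10), epochs=30, sigma0=3.0, seed=0):
    """Minimal batch SOM sketch following steps 1)-5) above.

    data : (n_samples, D_V) array; grid_shape defines the lattice A.
    """
    rng = np.random.default_rng(seed)
    # 1) lattice coordinates of the prototypes in the embedding space A
    grids = np.meshgrid(*[np.arange(s) for s in grid_shape], indexing="ij")
    lattice = np.column_stack([g.ravel() for g in grids]).astype(float)
    n_units = len(lattice)
    # 2) initialize the prototypes in the data space (random data points)
    protos = data[rng.choice(len(data), size=n_units, replace=False)].copy()

    for q in range(epochs):
        # 3) scheduled neighborhood width for epoch q
        sigma = sigma0 * np.exp(-q / epochs)
        # 4) best matching units for all data points ...
        dist = np.linalg.norm(data[:, None, :] - protos[None, :, :], axis=-1)
        bmu = dist.argmin(axis=1)
        # ... and batch update of the prototypes
        lat_d2 = ((lattice[:, None, :] - lattice[None, :, :]) ** 2).sum(-1)
        h = np.exp(-lat_d2 / (2.0 * sigma ** 2))    # (n_units, n_units)
        w = h[:, bmu]                               # neighborhood weight per sample
        new_protos = (w @ data) / w.sum(axis=1, keepdims=True)
        # 5) stop when the prototype updates become negligible
        if np.linalg.norm(new_protos - protos) < 1e-6:
            protos = new_protos
            break
        protos = new_protos
    return protos, lattice, bmu
```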

III. DATA MINING WITH SOM

If a proper SOM is trained according to the above-mentioned criteria, several methods for representation and post-processing can be applied. In the case of a two-dimensional lattice of neurons, many visualization approaches are known. The most common method for visualizing SOMs is to project the weight vectors onto the space spanned by the first principal components of the data and to connect those units whose respective nodes in the lattice are neighbored. However, if the shape of the SOM lattice is hypercubical, there are several more ways to visualize the properties of the map. For our purpose we focus only on those that are of interest in our application. An extensive overview can be found in [6].

A. Current Cluster Visualization Methods of SOM

An interesting evaluation is the so-called U-matrix introduced in [5] (Fig. 4). The elements of the matrix U represent the distances between the weight vectors of units that are neighbors in the neural network A. The matrix U can be used to determine clusters within the weight vector set and, hence, within the data space. Assuming that the map is topology preserving, large values indicate cluster boundaries. If the lattice is a two-dimensional array, the U-matrix can easily be viewed and provides a powerful tool for cluster analysis. Another visualization technique can be used if the lattice is three-dimensional: each data point is then mapped onto a neuron r, which can be identified by the combination of the colors red, green and blue (Fig. 5) assigned to the location r. In such a way we are able to assign a color to each data point according to equation (1), and similar colors encode groups of input patterns that were mapped close to one another in the lattice A.

Figure 2. Metric vs. topological conditions for map topography: a) output space neurons 1-4 and their weight vectors w1-w4 in the input space for a perfect topographic map; b) the same for a map with violated topography.

Figure 3. Values of the topographic product for the speech data (abscissa: lattice dimension dA = 1, ..., 4; ordinate: P between about -0.3 and 0.2).


It should be emphasized that for a proper interpretation of this color visualization, as well as for the analysis of the U-matrix, topology preservation of the map M is a strict requirement. Therefore, the topology preserving property of M must be proven prior to any evaluation of the map.
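As an illustration of the color-based visualization described above, the following sketch assigns an RGB color to each lattice position by normalizing its coordinates; data points then inherit the color of their best matching unit. The helper name and the normalization scheme are our own assumptions, not the authors' implementation.

```python
import numpy as np

def lattice_to_rgb(lattice):
    """Map 3-D lattice coordinates r = (i, j, k) to colors c = (r, g, b) in [0, 1].

    Nearby units in the lattice A receive similar colors, so data points
    mapped close to one another are shown in similar colors.
    """
    lattice = np.asarray(lattice, dtype=float)
    lo, hi = lattice.min(axis=0), lattice.max(axis=0)
    return (lattice - lo) / np.where(hi > lo, hi - lo, 1.0)

# example usage with the batch_som sketch above:
# point_colors = lattice_to_rgb(lattice)[bmu]
```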

B. A New Concept for Cluster Visualization

We provide a new idea for visualizing clusters as pairwise disjoint simple objects, such as 3D spheres, independently of the resulting admissible output space. In this manner, in addition to the existing visualization methods, we are able to distinguish and illustrate the "volume" of each cluster by the radius of the constructed spheres.

In the following steps we describe our visualization approach in further detail. At the very beginning, the input data set is partitioned into clusters by the GNG [11] learning process. Afterwards, the batch version of the SOM algorithm is performed and the BMUs of all input clusters are computed. Finally, the dimension reduction of the input space is achieved by utilizing the topographic product as a judgment tool for an admissible output space.

Affine spaces provide a suitable framework for doing geometry. In particular, it is possible to deal with points, curves, surfaces, etc. in an intrinsic manner, i.e. independently of any specific choice of a coordinate system. Naturally, coordinate systems have to be chosen to finally carry out computations, but one should resist the temptation to resort to them until it becomes necessary. We therefore treat the admissible output space as an affine space in an intrinsic manner, where no special origin is predefined, and set the origin at the neuron numbered 1 (Fig. 6). For simplicity, the distances between all directly neighboring neurons in the grid are set to 1.

Let |C_i| denote the cardinality of a cluster C_i (the number of entities in C_i). We aim to construct a presentation space that is homogeneous with respect to the space dimension for any value of d_A. The radius of the spheres2 centered on the corresponding BMUs is calculated as follows:

2 In our considerations we use the term sphere for all cases of d_A, regarding the topology amongst them.

r_i = 0.5 \left( 1 - \frac{|C_i|}{\sum_j |C_j|} \right) .  (7)

Obviously, spheres constructed in this manner in the output space of dimension d_A do not have any point in common. In our calculations we apply a parametric equation of a sphere. In order to keep the presentation space homogeneous with dimension 3 (Fig. 7), with no relative topology present, we extend the output space as described below.

In the case of d_A = 3 we perform no operation, since no extension is needed (identity map). In the case of d_A = 2, the mapping

(r \cos x, r \sin x) \mapsto \left( r \cos x, r \sin x, \pm\sqrt{r_i^2 - r^2} \right),  (8)

where 0 ≤ x < 2π and 0 ≤ r ≤ r_i, needs to be applied. Finally, in the case of d_A = 1 (a composition of functions), the application of

r \mapsto (r \cos x, r \sin x) \mapsto \left( r \cos x, r \sin x, \pm\sqrt{r_i^2 - r^2} \right),  (9)

where 0 ≤ x < 2π and 0 ≤ r ≤ r_i, becomes necessary. In our method we describe clusters as disjoint spheres whose centers are located at the respective BMU positions after the batch SOM algorithm has finished. For any outcome of the topology preservation criterion (an admissible neuronal net dimension of 1, 2 or 3 after the dimension reduction process) we are thus able to construct a group of disjoint spheres in 3D.
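The sketch below shows one possible realization of Eqs. (7)-(9): radii are computed from the cluster cardinalities, the BMU positions are embedded into the homogeneous 3D presentation space, and sphere surfaces are generated from the parametric form used in Eqs. (8) and (9). All function names are ours, and padding missing coordinates with zeros is an assumption.

```python
import numpy as np

def cluster_spheres(bmu_coords, cluster_sizes, d_A):
    """Sphere centers and radii for clusters, following Eqs. (7)-(9).

    bmu_coords    : lattice coordinates of each cluster's BMU, shape (k, d_A)
    cluster_sizes : number of entities |C_i| per cluster, shape (k,)
    """
    sizes = np.asarray(cluster_sizes, dtype=float)
    radii = 0.5 * (1.0 - sizes / sizes.sum())                 # Eq. (7)
    bmu = np.asarray(bmu_coords, dtype=float).reshape(len(sizes), d_A)
    centers = np.zeros((len(sizes), 3))                       # homogeneous 3-D space
    centers[:, :d_A] = bmu                                    # pad missing axes with 0
    return centers, radii

def sphere_points(center, r_i, n=30):
    """Surface points of one sphere from the parametric form of Eqs. (8)/(9)."""
    x = np.linspace(0.0, 2.0 * np.pi, n)
    r = np.linspace(0.0, r_i, n)
    X, R = np.meshgrid(x, r)
    z = np.sqrt(np.maximum(r_i ** 2 - R ** 2, 0.0))
    upper = np.stack([R * np.cos(X), R * np.sin(X), z], axis=-1)
    lower = upper * np.array([1.0, 1.0, -1.0])                # mirror hemisphere
    return np.concatenate([upper, lower]).reshape(-1, 3) + np.asarray(center)
```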

C. Comments

The novelty of our approach is to present clusters via suitably separated objects, namely spheres, in a homogeneous 3D presentation space. In contrast to the k-clustering concept [12], we apply the Growing Neural Gas unsupervised learning process, which returns separated objects in the form of a clustered probability distribution for a given input data set of possibly high dimension. Finally, we link this concept with the Self-Organizing Map framework in order to illustrate clusters in a space of admissible reduced dimension.

Figure 4. Representation of the positions of neurons in the three-dimensional neuron lattice A as a vector c = (r, g, b) in the color space C, where r, g, b denote the intensities of the colors red, green and blue. Thus, colors are assigned to categories (winner neurons).

Figure 5. Cluster visualization via U-Matrix.


For a comprehensive source on the dimension reduction of high-dimensional data the reader is referred to [13].

IV. VISUALIZING CLUSTER INFORMATION VIA SOM ON MOBILE DEVICES

The following example describes a realization of a SOM-based cluster visualization technique for information visualization, i.e. the display of the cluster structure of a semantic-based database index on mobile platforms. The aim was to represent the internal organization structure of the database index intuitively to a user. Our realization had to meet several requirements.

A. Requirements

The implementation of a SOM-based cluster visualization platform to display a database index's cluster data on mobile devices had to fulfill certain requirements. First of all, running our application on mobile devices with potentially low computational power was a challenge. Second, the functionality of our application had to be ensured for any type of network connection provided by the mobile device, including mobile networks with low bandwidth. As a functional requirement, clusters were to be visualized as spheres, where the number of data tuples contained in each cluster should be represented implicitly.

B. Requirements Analysis

Due to the computational limitations of mobile platforms, running the SOM transformations on a mobile device could not be regarded as feasible. Thus, a separation of our application into a client and a server part was regarded as the most promising solution. Based on the analysis of our first requirement, we did not consider it suitable to transmit all cluster data required for the SOM computations. We decided to transmit only the results of the SOM process, since this also guarantees a smaller data volume compared to the SOM's input data. Furthermore, we intended to reduce possible error sources with this decision regarding the possible necessity of different implementations for different mobile platforms. Finally, the requirements analysis led us to centralize the computational effort, thus utilizing the application on a mobile device only as an interface for visualization and user interaction.

C. Realization

We separated our application into two parts: a server application and a client application for mobile devices. As described in our requirements analysis, we decided to centralize the computational effort on the server side, thus realizing the SOM computations there. For the SOM computations we made use of the SOM Toolbox for Matlab® and built a bridge to C++ to enable our server application to run the necessary SOM transformations easily. Using this tool chain allowed us to efficiently prepare the cluster data for visualization by dimension reduction through SOM.

The mobile application was designed to run on mobile platforms with touch interfaces but comparably low compu-tational resources. An example screen shot of our user inter-face is given in Fig. 8 showing clusters, i.e. spheres, that were transformed from n-dimensional space to 3-dimensional output space using SOM.

As shown in Fig. 8, the spheres are of different sizes. We decided to use a sphere's size to implicitly visualize the number of data tuples contained in its corresponding cluster. For determining a sphere's actual size we put the number of data tuples in a cluster into relation to the number of data tuples contained in all clusters. To prevent the spheres from intersecting each other we decided to limit their size by taking the minimum Euclidean distance δmin over all pairs of spheres into consideration. At first glance we made the radius of a sphere proportional to the number of data tuples contained in the underlying cluster. However, data is contained in a cluster's volume, which leads us to the volume of the spheres: we therefore decided to represent the number of tuples in a cluster by making a sphere's volume proportional to it. Thus, we were able to implicitly represent the amount of data contained in a cluster.
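A small sketch of the sizing rule described above is given below; it makes the sphere volume proportional to the cluster's share of data tuples and caps all radii by half of the minimum pairwise center distance δmin so that no two spheres intersect. The helper name and the concrete capping rule are our own assumptions, not the authors' implementation.

```python
import numpy as np

def sphere_radii_by_volume(tuple_counts, centers):
    """Radii with volume proportional to the cluster's tuple count,
    capped so that no two spheres intersect (hypothetical helper)."""
    counts = np.asarray(tuple_counts, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # volume share -> radius via V = 4/3 * pi * r^3
    volume_share = counts / counts.sum()
    radii = (3.0 * volume_share / (4.0 * np.pi)) ** (1.0 / 3.0)
    # minimum pairwise center distance delta_min limits the radii
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    delta_min = dist[dist > 0].min() if np.any(dist > 0) else 1.0
    scale = min(1.0, 0.5 * delta_min / radii.max())
    return radii * scale
```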

Our example was based on a data set with 998 dimen-sions in input space.

D. Capabilities of our Example

The software system presented in our example is capable of visualizing information on the clustering state of a semantic-based database index and allows the user to navigate through the index's cluster structure. This may be performed either by using the visualization feature for the index's hierarchy or by utilizing the realized SOM-based visualization feature.

Figure 6. Neurons and best matching units in a chosen admissible output space with the origin neuron intrinsically numbered with 1 (lattice nodes numbered 1 to 27; BMU1, BMU2, BMU3 with sphere radii r1 > r3 > r2 and r1 < 0.5).

Figure 7. Expansion of the output space A to the presentation space depending on the admissible output space dimension dA (cases dA = 3, dA = 2, dA = 1).


In future development our aim is to present more detailed information and to increase the user interaction possibilities, potentially influencing the clustering process.

V. CONCLUSION AND FURTHER WORK

In this paper we have described the SOM in detail from the mathematical point of view, giving a precise description of this kind of neural net and emphasizing the role of the topographic product as a criterion for admissible neural net dimensions in the dimension reduction process.

We have proposed a new illustration method for cluster visualization, linking existing color-based (RGB) visualization methods with methods based on separated objects such as 3D spheres, providing a better understanding of clusters as disjoint objects. Finally, the software realization approach has been presented.

In our further research we will consider a data-driven version of the SOM, the so-called growing SOM (GSOM). Its output is a structure-adapted hypercube A, produced by adapting both the dimensions and the respective edge length ratios of A during learning, in addition to the usual adaptation of the weights. In comparison to the standard SOM, the overall dimensionality and the dimensions along the individual directions in A are variables that evolve into the hypercube structure most suitable for the input space topology.

REFERENCES

[1] G. Andreu, A. Crespo, and J. M. Valiente, "Selecting the toroidal self-organizing feature maps (TSOFM) best organized to object recognition," in Proc. International Conference on Artificial Neural Networks, Houston (USA), vol. 1327 of Lecture Notes in Computer Science, pp. 1341-1346, June 1997.

[2] T. Martinetz and K. Schulten, "Topology representing networks," Neural Networks, vol. 7, no. 3, pp. 507-522, 1994.

[3] T. Villmann, R. Der, M. Herrmann, and T. Martinetz, "Topology preservation in self-organizing feature maps: exact definition and measurement," IEEE Transactions on Neural Networks, vol. 8, no. 2, pp. 256-266, 1997.

[4] S. Kaski, J. Nikkilä, and T. Kohonen, "Methods for interpreting a self-organized map in data analysis," in Proc. European Symposium on Artificial Neural Networks (ESANN'98), pp. 185-190, Brussels, Belgium, 1998. D facto publications.

[5] A. Ultsch, "Self organized feature maps for monitoring and knowledge acquisition of a chemical process," in S. Gielen and B. Kappen, editors, Proc. ICANN'93, Int. Conf. on Artificial Neural Networks, pp. 864-867, London, UK, 1993. Springer.

[6] J. Vesanto, "SOM-based data visualization methods," Intelligent Data Analysis, vol. 3, no. 2, pp. 111-126, 1999.

[7] T. Kohonen, Self-Organizing Maps. Springer, Berlin, Heidelberg, 1995 (second extended edition 1997).

[8] H.-U. Bauer and K. Pawelzik, "Quantifying the neighborhood preservation of self-organizing feature maps," IEEE Transactions on Neural Networks, vol. 3, no. 4, pp. 570-579, 1992.

[9] T. Gramss and H. W. Strube, "Recognition of isolated words based on psychoacoustics and neurobiology," Speech Communication, vol. 9, pp. 35-40, 1990.

[10] T. Kohonen, Self-Organization and Associative Memory, 2nd ed. Berlin, Germany: Springer-Verlag, 1988.

[11] B. Fritzke, "A growing neural gas network learns topologies," in G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pp. 625-632. MIT Press, Cambridge, MA, 1995.

[12] F. P. Preparata and M. I. Shamos, Computational Geometry: An Introduction. Springer-Verlag, 1985.

[13] J. A. Lee and M. Verleysen, Nonlinear Dimensionality Reduction. Springer, 2007.

Figure 8. Visualization of clusters in three-dimensional output space after applying SOM (spheres for the clusters C107, C108, C110, C112 and C113 on x, y, z axes).


Detecting Source Topics using Extended HITS

Mario Kubek Faculty of Mathematics and Computer Science

FernUniversität in Hagen Hagen, Germany

Email: [email protected]

Herwig Unger Faculty of Mathematics and Computer Science

FernUniversität in Hagen Hagen, Germany

Email: [email protected]

Abstract—This paper describes a new method to determine the sources of topics in texts by analysing their directed co-occurrence graphs using an extended version of the HITS algorithm. This method can also be used to identify characteristic terms in texts. In order to obtain the needed directed term relations to cover asymmetric real-life relationships between concepts it is described how they can be calculated by statistical means. In the experiments, it is shown that the detected source topics and characteristic terms can be used to find similar documents and those that mainly deal with the source topics in large corpora like the World Wide Web. This approach also offers a new way to follow topics across multiple documents in such corpora. This application will be elaborated on as well.

Keywords-Source topic detection; Co-occurrence analysis; Extended HITS; Text Mining; Web Information Retrieval

I. INTRODUCTION AND MOTIVATION

The selection of characteristic and discriminating terms in texts through weights, often referred to as keyword extraction or terminology extraction, plays an important role in text mining and information retrieval. In [1] it has been pointed out that graph-based methods are well suited for the analysis of co-occurrence graphs, e.g. for the purpose of keyword extraction, and deliver results comparable to classic approaches like TF-IDF [2] and difference analysis [3]. In particular, the proposed extended version of the PageRank algorithm, which takes into account the strength of the semantic term relations in these graphs, is able to return such characteristic terms and does not rely on reference corpora. In this paper, the authors extend this approach by introducing a method to determine not only these keywords, but also terms in texts that can be referred to as source topics. These terms strongly influence the main topics in texts, yet are not necessarily important keywords themselves. They are helpful for applications like following topics to their roots by analysing documents that cover them primarily. This process can span several documents.

In order to automatically determine the source topics of single texts, the authors present the idea of applying an extended version of the HITS algorithm [4] on directed co-occurrence graphs. This solution will not only return the most characteristic terms of texts, like the extended PageRank algorithm, but also the source topics in them. Usually, co-occurrence graphs are undirected, which is suitable for the flat visualisation of term relations and for applications like query expansion via spreading activation techniques. However, real-life associations are mostly directed; e.g. an Audi is a German car, but not every German car is an Audi. The association of Audi with German car is therefore much stronger than the association of German car with Audi. Therefore, it actually makes sense to deal with directed term relations.

The HITS algorithm [4], which was initially designed to evaluate the relative importance of nodes in web graphs (which are directed), returns two lists of nodes: authorities and hubs. Authorities, which are also determined by the PageRank algorithm [5], are nodes that are linked to by many other nodes. Hubs are nodes that link to many other nodes. Each node is assigned both an authority score and a hub score. For undirected graphs the authority and the hub score of a node would be the same, which is naturally not the case for the web graph. Applied to the analysis of directed co-occurrence graphs with HITS, the authorities are the characteristic terms of the analysed text, whereas the hubs represent its source topics. Therefore, it is necessary to describe the construction of directed co-occurrence graphs before getting into the details of the method to determine the source topics and its applications.

Hence, the paper is organised as follows: the next section explains the methodology used. It is outlined how directed term relations can be calculated from texts using co-occurrence analysis in order to obtain directed co-occurrence graphs. Afterwards, section three presents a method that applies an extended version of the HITS algorithm, which considers the strength of these directed term relations, to calculate the characteristic terms and source topics in texts. Section four focuses on the conducted experiments using this method. It is also shown that the results of this method can be used to find similar and related documents in the World Wide Web. Section five concludes the paper and provides a look at options to employ this method in solutions to follow topics in large corpora like the World Wide Web.

II. METHODOLOGY

Well-known measures to obtain co-occurrence significance values on the sentence level are, for instance,


the mutual information measure [6], the Dice [7] and Jaccard [8] coefficients, the Poisson collocation measure [9] and the log-likelihood ratio [10]. While these measures return the same value for the relation of a term A with another term B and vice versa, an undirected relation of both terms often does not represent real-life relationships very well, as has been pointed out in the introduction. Therefore, it is sensible to deal with directed relations of terms. To measure the directed relation of term A with term B, which can also be regarded as the strength of the association of term A with term B, the following formula of the conditional relative frequency can be used, whereby n_{AB} is the number of times terms A and B co-occurred in the text on the sentence level and n_A is the number of sentences term A occurred in:

sig(A \to B) = \frac{n_{AB}}{n_A} .  (1)

Often, this significance differs greatly between the two directions of the relation when the difference of the involved term frequencies is high. The association of a less frequently occurring term A with a frequently occurring term B could reach a value of 1.0 when A always co-occurs with B; however, B's association with A could be almost 0. This means that B's occurrence with term A is insignificant in the analysed text. That is why it is sensible to only take into account the direction of the dominant association (the one with the higher value) to generate a directed co-occurrence graph for the further considerations. However, the dominant association should be additionally weighted. In the example above, term A's association with B is 1.0. If another term C, which appears more frequently in the text than A, also co-occurs with term B each time it appears, then its association value with B would be 1.0, too. Yet, this co-occurrence is more significant than the co-occurrence of A with B. An additional weight that influences the association value and accounts for this fact could be determined by

• the (normalised) number of sentences in which both terms co-occur, or

• the (normalised) frequency of term A. The normalisation basis could be the maximum number of sentences in which any term of the text has occurred.

The association Assn of term A with term B can then be calculated using the second approach by:

Assn(A \to B) = \frac{n_{AB}}{n_A} \cdot \frac{n_A}{n_{max}} .  (2)

Hereby, n_{max} is the maximum number of sentences in which any term has occurred. A relation of term A with term B obtained in this way with a high association strength can be interpreted as a recommendation of A for B. Relations gained by this means are more specific than undirected relations between terms because of their direction. They resemble a hyperlink from one website to another one; in this case, however, it has not been set manually and explicitly, and it carries an additional weight that indicates the strength of the term association. The set of all such relations obtained from a text represents a directed co-occurrence graph. The next step is to analyse such graphs with an extended version of the HITS algorithm that regards these association strengths in order to find the source topics in texts. Therefore, in the next section the extension of the HITS algorithm is explained and a method that employs it for the analysis of directed co-occurrence graphs is outlined.
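A minimal sketch of this construction is given below. It assumes that equations (1) and (2) take the form reconstructed above (conditional relative frequency weighted by the normalised term frequency) and keeps only the dominant direction of each co-occurrence; all function and variable names are our own.

```python
from collections import defaultdict
from itertools import combinations

def directed_associations(sentences):
    """Directed term associations on the sentence level (sketch of Eqs. (1)/(2)).

    sentences : list of token lists (stopwords assumed to be removed already).
    Returns {(A, B): Assn(A -> B)} containing only the dominant directions.
    """
    n_term = defaultdict(int)      # n_A: number of sentences containing term A
    n_pair = defaultdict(int)      # n_AB: number of sentences containing A and B
    for tokens in sentences:
        terms = set(tokens)
        for t in terms:
            n_term[t] += 1
        for a, b in combinations(sorted(terms), 2):
            n_pair[(a, b)] += 1

    n_max = max(n_term.values())
    graph = {}
    for (a, b), n_ab in n_pair.items():
        sig_ab = n_ab / n_term[a]           # Eq. (1): A -> B
        sig_ba = n_ab / n_term[b]           # Eq. (1): B -> A
        # keep the dominant direction, weighted by the normalised frequency
        if sig_ab >= sig_ba:
            graph[(a, b)] = sig_ab * (n_term[a] / n_max)   # Eq. (2)
        else:
            graph[(b, a)] = sig_ba * (n_term[b] / n_max)
    return graph
```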

III. THE ALGORITHM

With the knowledge of how to generate directed co-occurrence graphs it is now possible to introduce a new method to analyse them in order to find source topics in the texts they represent. For this purpose, the application of the HITS algorithm on these graphs is sensible due to its working principle, which has been outlined in the introduction. The list of hub nodes in these graphs returned by HITS contains the terms that can be regarded as the source topics of the analysed texts, as they represent their inherent concepts. Their hub value indicates their influence on the most important topics and terms that can be found in the list of authorities.

For the calculation of these lists using HITS, it is sensible to also include the strength of the associations between the terms. These values should influence the calculation of the authority and hub values. The idea behind this approach is that a random walker is likely to follow links in co-occurrence graphs that lead to terms that can easily be associated with the currently visited term. Nodes containing terms that are linked with a low association value, however, should not be visited very often. This also means that nodes lying on paths with links of high association values should be ranked highly, as they can be reached easily. Therefore, the update rules of the HITS algorithm can be modified to include the association values Assn. The authority value of node x can then be determined using the following formula:


a(x) = \sum_{y \to x} Assn(y \to x) \cdot h(y) .  (3)

Accordingly, the hub value of node x can be calculated using the following formula:

h(x) = \sum_{x \to y} Assn(x \to y) \cdot a(y) .  (4)

The following steps are necessary to obtain a list for the authorities and hubs based on these update rules:

1. Remove stopwords and apply a stemming algorithm on all terms in the text. (Optional)

2. Determine the dominant association for all co-occurrences using formula 1, apply the additional weight on it according to formula 2 and use the set of all these relations as a directed co-occurrence graph G.

3. Determine the authority value a(x) and the hub value h(x) iteratively for all nodes x in G using the formulas 3 and 4 until convergence is reached (the calculated values do not change significantly in two consecutive iterations) or a fixed number of iterations has been executed.

4. Order all nodes in descending order according to their authority and hub values and return these two ordered lists with the terms and their authority and hub values (a sketch of this procedure is given below).
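The sketch below illustrates steps 2 to 4 on a weighted adjacency matrix, assuming the update rules (3) and (4) as reconstructed above and a simple norm-based convergence test; it is a sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def extended_hits(graph, iterations=50, tol=1e-8):
    """Extended HITS on a weighted, directed co-occurrence graph.

    graph : {(A, B): Assn(A -> B)}, e.g. from directed_associations() above.
    Returns (authorities, hubs) as dicts mapping term -> score, sorted descendingly.
    """
    terms = sorted({t for edge in graph for t in edge})
    idx = {t: i for i, t in enumerate(terms)}
    W = np.zeros((len(terms), len(terms)))
    for (a, b), assn in graph.items():
        W[idx[a], idx[b]] = assn                 # weighted edge a -> b

    auth = np.ones(len(terms))
    hub = np.ones(len(terms))
    for _ in range(iterations):
        new_auth = W.T @ hub                     # Eq. (3): weighted sum over in-links
        new_hub = W @ new_auth                   # Eq. (4): weighted sum over out-links
        new_auth /= np.linalg.norm(new_auth) or 1.0
        new_hub /= np.linalg.norm(new_hub) or 1.0
        converged = (np.abs(new_auth - auth).max() < tol and
                     np.abs(new_hub - hub).max() < tol)
        auth, hub = new_auth, new_hub
        if converged:
            break

    authorities = dict(sorted(zip(terms, auth), key=lambda x: -x[1]))
    hubs = dict(sorted(zip(terms, hub), key=lambda x: -x[1]))
    return authorities, hubs
```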

Now, the effectiveness of this method will be illustrated by experiments.

IV. EXPERIMENTS

A. Detection of Authorities and Hubs

The following tables show, for two documents of the English Wikipedia, the lists of the 10 terms with the highest authority and hub values. To conduct these experiments the following parameters have been used:

• removal of stopwords
• restriction to nouns
• baseform reduction
• activated phrase detection

TABLE I. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA ARTICLE "LOVE":

Term         Authority value   Term/Phrase          Hub value
love         0.54              friendship           0.19
human        0.30              intimacy             0.17
god          0.29              passion              0.14
attachment   0.26              religion             0.14
word         0.21              attraction           0.14
form         0.21              platonic love        0.13
life         0.20              interpersonal love   0.13
feel         0.18              heart                0.13
people       0.17              family               0.13
buddhism     0.14              relationship         0.12

TABLE II. TERMS AND PHRASES WITH HIGH AUTHORITY AND HUB VALUES OF THE WIKIPEDIA ARTICLE "EARTHQUAKE":

Term         Authority value   Term/Phrase        Hub value
earthquake   0.48              movement           0.18
earth        0.30              plate              0.16
fault        0.27              boundary           0.15
area         0.23              damage             0.15
boundary     0.18              zone               0.15
plate        0.16              landslide          0.14
structure    0.16              seismic activity   0.14
rupture      0.15              wave               0.13
aftershock   0.15              ground rupture     0.13
tsunami      0.14              propagation        0.12

The examples show that the extended HITS algorithm can determine the most characteristic terms (authorities) and source topics (hubs) in texts by analysing their directed co-occurrence graphs. Especially the hubs provide useful information for finding suitable terms that can be used as search words in queries when background information on a specific topic is needed. However, the terms found in the authority lists can also be used as search words in order to find similar documents. This will be shown in the next subsection.

B. Search Word Extraction

The suitability of these terms as search words will now be shown. For this purpose, the five most important authorities and the five most important hubs of the Wikipedia article "Love" have been combined as search queries and sent to Google. The following results have been obtained using the determined authorities:


Figure 1: Search results for the authorities of the Wikipedia article ”Love”

The search query containing the hubs of this article will lead to these results:

Figure 2: Search results for the hubs of the Wikipedia article ”Love”

These search results clearly show that they primarily deal with either the authorities or the hubs. More experiments confirm this correlation. Using the authorities as queries to Google, it is possible to find documents in the Web that are similar to the analysed one. Usually, the analysed document itself is found among the first search results, which is not surprising; however, it shows that this approach could be a new way to detect plagiarised documents. It is also interesting to point out the topic drift in the results when the hubs have been used as queries. This observation indicates that the hubs of documents can be used as a means to follow topics across several related documents with the help of Google. This possibility will be elaborated on in more detail in the next and final section of this paper.

V. CONCLUSION

In this paper, a new graph-based method to determine source topics in texts, based on an extended version of the HITS algorithm, has been introduced and described in detail. Its effectiveness has been shown in the experiments. Furthermore, it has been demonstrated that the characteristic terms and the source topics that this method finds in texts can be used as search words to find similar and related documents in the World Wide Web. Especially the determined source topics can lead users to documents that primarily deal with these important aspects of their originally analysed texts. This goes beyond a simple search for similar documents, as it offers a new way to search for related documents; yet it is still possible to find similar documents when the source topics are used in queries. This functionality can be seen as a useful addition to Google Scholar (http://scholar.google.com/), which offers users the possibility to search for similar scientific articles. Additionally, interactive search systems can employ this method to provide their users with functions to follow topics across multiple documents. The iterative use of source topics as search words in found documents can provide a basis for a fine-grained analysis of the topical relations that exist between the search results of two consecutive queries. Documents found in later iterations of such search sessions can give users valuable background information on the content and topics of their originally analysed documents. Another interesting application of this method is the automatic linking of related documents in large corpora. If a document A primarily deals with the source topics of another document B, then a link from A to B can be set. This way, the herein described approach to obtain directed term associations is transferred to the document level, namely to calculate recommendations for specific documents. These automatically determined links can be very useful in terms of positively influencing the ranking of search results, because they represent semantic relations between documents that have been verified, in contrast to manually set links, e.g. on websites, which additionally can be automatically evaluated regarding their validity using this approach.


Also, these automatically determined links provide a basis to rearrange the returned search results based on the semantic relations between them. These approaches will be examined in detail in later publications.

REFERENCES

[1] M. Kubek and H. Unger, "Search word extraction using extended PageRank calculations," in H. Unger, K. Kyamaky, and J. Kacprzyk, editors, Autonomous Systems: Developments and Trends, vol. 391 of Studies in Computational Intelligence, pp. 325-337, Springer Berlin / Heidelberg, 2012.

[2] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing," Commun. ACM, 18:613-620, November 1975.

[3] G. Heyer, U. Quasthoff, and Th. Wittig, Text Mining - Wissensrohstoff Text, W3L Verlag, Bochum, 2006.

[4] J. M. Kleinberg, "Authoritative sources in a hyperlinked environment," in Proc. of the ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, pp. 668-677, January 1998.

[5] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank citation ranking: Bringing order to the web," Technical report, Stanford Digital Library Technologies Project, 1998.

[6] M. Buechler, "Flexibles Berechnen von Kookkurrenzen auf strukturierten und unstrukturierten Daten," Master's thesis, University of Leipzig, 2006.

[7] L. R. Dice, "Measures of the amount of ecologic association between species," Ecology, 26(3):297-302, July 1945.

[8] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, 37:547-579, 1901.

[9] U. Quasthoff and Chr. Wolff, "The Poisson collocation measure and its applications," in Proc. Second International Workshop on Computational Approaches to Collocations, Wien, 2002.

[10] T. Dunning, "Accurate methods for the statistics of surprise and coincidence," Computational Linguistics, 19(1):61-74, 1994.


Blended value based e-business modeling approach: A sustainable approach using QFD

Mohammed Naim A. Dewan Curtin Graduate School of Business

Curtin University Perth, Australia

[email protected]

Mohammed A. Quaddus Curtin Graduate School of Business

Curtin University Perth, Australia

[email protected]

Abstract—'E-business' and 'sustainability' are two current major global trends, but surprisingly none of the e-business modeling ideas covers the sustainability aspects of the business. Recently, researchers have been introducing the 'green IS/IT/ICT' concept, but none of them clearly explains how those concepts will be accommodated inside e-business models. This research approach, therefore, aims to develop an e-business model in conjunction with sustainability aspects. The model explores and determines the optimal design requirements in developing an e-business model. This research approach also investigates how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is unique in the sense that, in developing the model, the sustainability concept is integrated with the customer's value requirements, the business's value requirements, and the process's value requirements instead of only the customer's requirements. QFD, AHP, and the Delphi method are used for the analysis of the data. Besides developing the blended value based e-business model, this research approach also develops a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context.

Keywords- E-business, Business model, Sustainability, Blended value, QFD, AHP.

I. INTRODUCTION

Business modeling is not new and has had significant impacts on the way businesses are planned and operated today. Whilst business models exist for several narrow areas, broad comprehensive e-business models are still very informal and generic. The majority of business modeling ideas consider only the economic value aspects of the business and do not focus on social or environmental aspects. It is surprising that although 'e-business' and 'sustainability' are two current major global trends, none of the e-business modeling ideas covers the sustainability aspects of the business. Researchers are now introducing the 'green IS/IT/ICT' concept, but none of them clearly explains how those concepts will be accommodated inside e-business models. Therefore, this research approach aims to develop an e-business model in conjunction with sustainability aspects. The model will be based on 'blended value' and will explore and determine the optimal design requirements in developing an e-business model. This research approach will also investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. This modeling approach is distinct in the sense that, in developing the model, the sustainability concept is integrated with the customer's value requirements, the business's value requirements, and the process's value requirements instead of only the customer's requirements. For the analysis of the data, Quality Function Deployment (QFD), the Analytic Hierarchy Process (AHP), and the Delphi method are used. Besides developing the blended value based e-business model, this research approach will also develop a framework for modeling e-business in conjunction with blended value and sustainability which can be implemented by almost any other business in consideration of the business context. The following section clarifies the purpose of the approach. Definitions of the terms used in the approach are given in Section 3. An extensive literature review is covered in Section 4. Sections 5 and 6 explicate the research methodology and the research process, respectively. The research analysis is explained in Section 7, and finally, Section 8 concludes the article with a discussion.

II. PURPOSE OF THE APPROACH

The majority of research into business models in the IS field has been concerned with e-business and e-commerce [1]. There exist a number of ideas about e-business models, but most of them provide only a conceptual overview and concentrate only on the economic aspects of the business. None of the e-business modeling ideas exclusively considers the sustainability aspects. Similarly, there is a growing body of literature about the sustainability of businesses which does not focus on e-business. But the intersection of these two global trends, e-business and sustainability, needs to be addressed. Although a very few researchers have recently talked about the green IT/IS/ICT concept, none of them clearly explains how that concept will fit in an e-business model to make it sustainable and, at the same time, protect the interests of the customers. This research approach will develop an e-business model based on 'blended value' which will be sustainable and will safeguard the interests of the customers. The 'blended value' requirements will identify and select the 'optimal design requirements' necessary to be implemented for the sustainability of the businesses. Therefore, the main research questions of the approach are as follows:

Q1. What are the optimal/appropriate design requirements in developing an e-business model?


Q2. How can the sustainability dimensions be integrated with the value dimensions in developing an e-business model?

Based on the above research questions, this research approach consists of the following objectives:

• To explore and determine the optimal design requirements of an e-business model.

• To investigate how the concept of blended value dimensions can be used in developing an e-business model.

• To investigate how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model.

• To develop a ‘value-sustainability’ framework for modeling e-business in conjunction with blended value and sustainability concepts.

III. DEFINITION OF TERMS

A. Blended Value

Blended value is the integration of economic value, social value, and environmental value for customers, businesses, and value processes. It is different from CSR value in the sense that CSR value is separate from profit maximization and its agenda is determined by external reporting, whereas blended value is integral to profit maximization and its agenda is company specific and internally generated.

B. Value Requirements

Value requirements are the demands for value by customers (for satisfaction), businesses (for profit), and business processes (for efficient value processes). Value can be economic and/or social and/or environmental, demanded by customers and/or businesses and/or business processes to fulfill the customer's requirements and/or to achieve strategic goals and/or to ensure efficient value processes.

C. Design Requirements

Design requirements, also known as HOWs, are the requirements needed to fulfill the 'blended value' requirements in the QFD process. After the needs are revealed, the company's technicians or product development team develop a set of design requirements in measurable and operable technical terms [2] to fulfill the value requirements.

IV. LITERATURE REVIEW

A. Business model and e-business model

Scholars have referred to the business model as a statement, a description, a representation, an architecture, a conceptual tool or model, a structural template, a method, a framework, a pattern, and as a set, as found by Zott et al. [3]. A study by Zott et al. [3] found that of a total of 49 conceptual studies in which the business model is clearly defined, almost one fourth are related to e-business. The majority of research into business models in the IS field has been concerned with e-business and e-commerce, and there have been some attempts to develop convenient classification schemas [1]. For example, definitions, components, and classifications of e-business models have been suggested [4, 5]. Timmers [6] was the first to define the e-business model in terms of its elements and their interrelationships. Applegate [7] introduces six e-business models: focused distributors, portals, producers, infrastructure distributors, infrastructure portals, and infrastructure producers. Weill and Vitale [8] suggest a subdivision into so-called atomic e-business models, which are analyzed according to a number of basic components. There exist a few more e-business modeling approaches, such as Rappa [9], Dubosson-Torbay et al. [10], Tapscott, Ticoll and Lowy [11], Gordijn and Akkermans [12], and more. But the sustainability concept is still entirely absent from all of the e-business modeling ideas.

B. Sustainability of Business

A sustainable business means a business with a "dynamic balance among three mutually inter dependent elements: (i) protection of ecosystems and natural resources; (ii) economic efficiency; and (iii) consideration of social wellbeing such as jobs, housing, education, medical care and cultural opportunities" [13]. Even though many scholars have focused their studies on sustainability incorporating economic, social, and environmental perspectives, still "most companies remain stuck in social responsibility mind-set in which societal issues are at the periphery, not the core. The solution lies in the principle of shared (blended) value, which involves creating economic value in a way that also creates value for society by addressing its needs and challenges" [14]. Moreover, most scholars mainly express the need for blended value and very few of them provide more than hypothetical ideas for maintaining sustainability. A complete business model for sustainability with operational directions is still lacking.

C. E-business and Sustainability

E-business is the point where economic value creation and information technology/ICT come together [15]. ICT can have both positive and negative impacts on society and the environment. Positive impacts can come from dematerialization and online delivery, transport and travel substitution, a host of monitoring and management applications, greater energy efficiency in production and use, and product stewardship and recycling; negative impacts can come from energy consumption and the materials used in the production and distribution of ICT equipment, energy consumption in use directly and for cooling, short product life cycles and e-waste, and exploitative applications [16]. Technology is a source of environmental contamination during product manufacture, operation, and disposal [17-19]. "Corporations have the knowledge, resources, and power to bring about enormous positive changes in the earth's ecosystems" [20]. Consistent with the definition of the environmental sustainability of IT [21], the sustainability of e-business can be defined as the activities within the e-business domain to minimize the negative impacts and maximize the positive impacts on society and the environment through the design, production, application, operation, and disposal of information technology and information technology-enabled products and services throughout their life cycle.


V. RESEARCH METHODOLOGY

In this approach, 'a sustainable e-business modeling approach based on blended value' is initially proposed after considering the previous literature and the research objectives. This proposed model can be tested with sample data to justify its capability and validity along with the progress of the research. Any business can be chosen for data collection. Sample data can be collected from a field study by conducting semi-structured interviews with the customers and through focus group meetings with the department heads. Once the model's capability is proven, a large volume of data will be collected from the customers and the organisations by organizing surveys and focus group meetings to test the comprehensive model. Therefore, both qualitative and quantitative methods will be used in this research approach for data collection and analysis.

A. Research Elements

This research approach uses 'blended value requirements' and 'sustainability' as the main elements. According to our approach, blended value consists of three values: customer value, business value, and process value. Sustainability of business includes economic value, social value, and environmental value. Therefore, to be competitive in the market, the value needs to be measured from three dimensions:

− What total value is demanded by the customers?

− What total value is required by the businesses, based on their strategy, to reach their goals?

− What process value is required by the businesses to have efficient and sustainable value processes?

Consequently, based on the measurement from these three dimensions, the blended value requirements can be categorised into nine groups, which will be used as the main elements of this approach. They are as follows:

1) Economic value for customer requirements: This means any of the customer's value requirements which is economically related, directly or indirectly, to the product or service that is to be delivered to the customer. In other words, these requirements cover all types of economic benefits that the customers are looking for. For example, the price of the product or service, quality, after-sales service, availability or ease of access, delivery, etc. appear under this category.

2) Social value for customer requirements: Social value requirements for the customer include any value delivered by the business for the customer's society. These social value requirements are not the social responsibilities that the business organisations are planning to perform; rather, they are the requirements that the customers are expecting or indirectly demanding for their society from the products or services or from the supplier of the products or services.

3) Environmental value for customer requirements: Environmental value requirements stand for all the environmental factors related, directly or indirectly, to the product or service delivered to the customer, or somehow related to the operations of the supplier of the product or service, such as emissions (air, water, and soil), waste, radiation, noise, vibration, energy intensity, material intensity, heat, direct intervention on nature and landscape, etc. [22]. This environmental value is demanded or expected by the customers.

4) Economic value for business requirements: These are the requirements which add some economic value to the business, directly or indirectly, if they are fulfilled. These economic requirements are not demanded by the customers; instead, they are identified by the businesses to be fulfilled in order to achieve the planned future goals. Examples are reducing the cost of production, increasing sales and/or profit, getting cheaper raw materials, minimizing packaging and delivery costs, replacing employees with more efficient machinery, etc.

5) Social value for business requirements: Social value requirements are to add some value to society from the business's point of view if they are fulfilled. These value requirements reflect what social value the business is planning and willing to deliver to the customers' society over time, regardless of the customers' demand. For instance, Lever Bros Ltd. uses a few principles to focus on social value, such as emphasising employees' personal development, training, health, and safety; improving the well-being of the society at large; etc. [23].

6) Environmental value for business requirements: Adding environmental value can be a competitive advantage for businesses, since businesses can differentiate themselves by creating products or processes that offer environmental benefits. By implementing environmentally friendly operations, businesses may achieve cost reductions, too. For example, reduced contamination, recycling of materials, improved waste management, minimized packaging, etc. reduce both the impact on the environment and the costs.

7) Economic value for process requirements: These are mainly related to cost savings within the existing value processes which can later be transferred to the customers. The managers identify these value-creating inefficiencies within the existing processes and try to correct them, which results in some sort of economic benefit. For example, up-to-date technologies, an adequate amount of training, using efficient energies, improved supply chain management systems, etc. can increase the efficiency of the value processes, which can certainly add some economic value to the organisation.

8) Social value for process requirements: To identify these requirements, managers look at the whole value process of the organisation and see whether there is any scope to add some value to the society they operate in within the existing value process systems. For instance, educating disadvantaged children, organising skills training for unemployed people, employing disabled people, establishing schools and colleges, sponsoring social events, organising social gatherings, organising awareness programs, etc. can add value to the society, and most of these requirements can be fulfilled by the businesses without or with little investment or effort.


Figure 1: Research approach.

9) Environmental value for process requirements: To fulfil these requirements, the businesses try to find and implement all the necessary steps within the existing value processes that will stop or reduce the chances of negative impacts and facilitate positive impacts on the environment, thus adding some value to the environment. For example, leakage of water/oil/heat, inefficient disposal and recycling of materials, unplanned pollution (air, water, sound) management, heating and lighting inefficiencies, etc. within the existing value processes result in damage to the environment. Thus these requirements need to be fulfilled to minimize the impact of the current value processes on the environment.

B. Research Tools

1) Quality Function Deployment (QFD): QFD supports the product design and development process and was laid out in the late 1960s to early 1970s in Japan by Akao [24]. QFD is based on collecting and analysing the voice of the customer, which helps to develop products with higher quality that meet customer needs [25]. Therefore, it can also be used to analyse business needs and value process needs. The popular application fields of QFD are product development, quality management and customer needs analysis; however, the utilisation of the QFD method has spread to other manufacturing fields over time [26]. Recently, companies have been successfully using QFD as a powerful tool that addresses strategic and operational decisions in businesses [27]. This tool is used in various fields for determining customer needs, developing priorities, formulating annual policies, benchmarking, environmental decision making, etc. Chan and Wu [26] and Mehrjerdi [27] provide a long list of areas where QFD has been applied. QFD, in this approach, will be applied as the main tool to analyse customer needs, business needs, and process value needs. It will also be used to develop and select optimised design requirements, based on the organisation's capability, to satisfy the blended value requirements for the sustainability of the businesses.

QFD, in this approach, will be applied as the main tool to address customers' requirements (CRs) and integrate those requirements into design requirements (DRs) to meet the sustainability requirements of buyers and stakeholders. In QFD modeling, 'customer requirements' are referred to as WHATs and 'how to fulfil the customer's requirements' is referred to as HOWs. The process of using appropriate HOWs to meet the given WHATs is represented as a matrix (Fig. 2). Different users build different QFD models involving different elements, but the simplest and most widely used QFD model contains at least the customer requirements (WHATs) and their relative importance, the technical measures or design requirements (HOWs) and their relationships with the WHATs, and the importance ratings of the HOWs. Six sets of input information are required in a basic QFD model: (i) WHATs: attributes of the product as demanded by the customers; (ii) IMPORTANCE: relative importance of the above attributes as perceived by the customers; (iii) HOWs: design attributes of the product or the technical descriptors; (iv) Correlation Matrix: interrelationships among the design requirements; (v) Relationship Matrix: relationships between WHATs and HOWs (strong, medium or weak); and (vi) Competitive Assessment: assessment of customer satisfaction with the attributes of the product under consideration against the product produced by its competitor or the best manufacturer in the market [32]. The following steps are followed in a QFD analysis:

Step 1: Customers are identified and their needs are collected as WHATs;
Step 2: Relative importance ratings of the WHATs are determined;
Step 3: Competitors are identified, a customer competitive analysis is conducted, and customer performance goals for the WHATs are set;
Step 4: Design requirements (HOWs) are generated;
Step 5: Correlations between the design requirements (HOWs) are determined;
Step 6: Relationships between WHATs and HOWs are determined;
Step 7: Initial technical ratings of the HOWs are determined;
Step 8: A technical competitive analysis is conducted and technical performance goals for the HOWs are set;
Step 9: Final technical ratings of the HOWs are determined.

Lastly, the design requirements are selected based on the ranked weights of the HOWs.

Figure 2: QFD layout.

2) Analytic Hierarchy Process (AHP): Saaty [28] developed the analytic hierarchy process, an established multi-criteria decision-making approach that employs a unique


method of hierarchical structuring of a problem and subsequent ranking of alternative solutions by a pairwise comparison technique. The strength of AHP lies in its robust and well-tested method of solution and its capability of incorporating both quantitative and qualitative elements in evaluating alternatives [29]. AHP is a powerful and widely used multi-criteria decision-making technique for prioritizing decision alternatives of interest [30]. AHP is frequently used in the QFD process, for instance by Han et al. [31], Das and Mukherjee [29], Park and Kim [30], Mukherjee [32], Bhattacharya et al. [33], Chan and Wu [34], Xie et al. [35], Wang et al. [36] and others. In this research approach, AHP will be used to prioritize the blended value requirements (customer value requirements, business value requirements, and process value requirements) before the design requirements are developed in the QFD process.
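As a rough illustration of how AHP pairwise comparisons can be turned into priority weights, the following sketch uses the geometric-mean approximation of the principal eigenvector; the comparison values, the three hypothetical requirements and the variable names are our own assumptions, not part of the proposed approach.

# A minimal sketch of deriving AHP priority weights from a reciprocal
# pairwise-comparison matrix (Saaty's 1-9 scale); all values are illustrative.
comparisons = [
    [1.0, 3.0, 5.0],      # economic value vs (economic, social, environmental)
    [1/3, 1.0, 2.0],      # social value
    [1/5, 1/2, 1.0],      # environmental value
]

geo_means = []
for row in comparisons:
    prod = 1.0
    for value in row:
        prod *= value
    geo_means.append(prod ** (1.0 / len(row)))    # geometric mean of each row

total = sum(geo_means)
weights = [g / total for g in geo_means]          # normalized priority weights
print([round(w, 3) for w in weights])             # approx. [0.648, 0.23, 0.122]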

3) Delphi Method: The Delphi method, originally developed in the 1950s by Dalkey and his associates at the RAND Corporation [43], has proven a popular tool in information systems (IS) research [37-42]. Okoli and Pawlowski [42] and Grisham [43] provide lists of examples of research areas where Delphi was used as the major tool. This research approach will use the Delphi method for designing and selecting optimised design requirements for the company in the QFD process to develop the blended value based e-business model.

VI. RESEARCH PROCESS
Data will be collected from face-to-face interviews and structured focus group meetings. In this stage, the blended value requirements (economic, social, and environmental value for customer requirements, business requirements and value process requirements) for particular products will be identified based on the existing value proposition, value process and value delivery. Customer requirements will be identified through open-ended, semi-structured questionnaires. Business requirements and value process requirements will be identified through focus group meetings with the departments in charge. The required number of questionnaires will be collected from the customers and, based on the customers' feedback, the necessary data will be collected from structured focus group meetings. The collected data will be analyzed using AHP and QFD. A few steps will be used to complete the data analysis: (i) the blended value requirements will be grouped and categorized into classifications based on the type of requirement and then prioritized using AHP to find the importance level of each requirement; (ii) the target level for each total requirement will be set depending on its importance level and the organisation's capability and strategy; after prioritizing, the total requirements will be benchmarked, if necessary, to set their target levels; (iii) based on the target levels of the requirements, design requirements will be developed through the Delphi method after structured discussions or focus group meetings with the related departments in charge; the design requirements will be benchmarked, if necessary, before setting their target values, and the cost of elevating each design requirement will be determined; (iv) a relationship matrix between the blended value requirements and the design requirements will be developed using QFD to obtain the weight of each design requirement, and based on these weights (how much each design requirement contributes to meeting each of the total requirements) certain design requirements will be selected initially; (v) then trade-offs among the initially selected design requirements will be identified for cost savings, since improving one design requirement may have a positive, negative, or no effect on other design requirements; (vi) finally, design requirements will be chosen based on the following criteria:
− the initial technical ratings based on the relationship matrix between the total requirements and the design requirements;
− the technical priorities depending on the organisation's capability; and
− the trade-offs among the design requirements.

VII. RESEARCH ANALYSIS

In the QFD process the relationship between a blended value requirement (BVR) and a design requirement (DR) is described as Strong, Moderate, Little, or No relationship; these labels are later replaced by weights (e.g. 9, 3, 1, 0) to give the relationship values needed for the design requirement importance weight calculations. These weights represent the degree of importance attributed to the relationship. Thus, as shown in Table 1, the importance weight of each design requirement can be determined by the following equation:

D_w = Σ_{i=1..m} I_i · R_{iw},  for all w = 1, …, n        (1)

where D_w is the importance weight of the wth design requirement, I_i is the importance weight of the ith blended value requirement, R_{iw} is the relationship value between the ith blended value requirement and the wth design requirement, n is the number of design requirements, and m is the number of blended value requirements.

In Table 1, customer requirements, business requirements, and process requirements are considered as part of the blended value requirements. The importance weights of the blended value requirements will be calculated using AHP after getting data from the customers and businesses, and the importance weights of the design requirements will be decided by the managers through the Delphi method. According to the QFD matrix, the absolute importance of a blended value requirement can be determined by the following equation:

AI_i = Σ_{w=1..n} R_{iw} · D_w,  for all i = 1, …, m        (2)

where AI_i is the absolute importance of the ith blended value requirement (BVR_i), R_{iw} is the relationship value between BVR_i and the wth design requirement, and D_w is the importance weight of the wth design requirement.

Therefore, the absolute importance of the 1st blended value requirement (BVR_1) will be:

AI_1 = R_{11}·D_1 + R_{12}·D_2 + ….. + R_{1n}·D_n


TABLE I. QFD MATRIX

Requirements            DR_1         DR_2         .....   DR_n         A.I.    R.I.
CRs    BVR_1            R_11·D_1     R_12·D_2     .....   R_1n·D_n     AI_1    RI_1
       ....             ....         ....         .....   ....         ....    ....
       BVR_j            R_j1·D_1     R_j2·D_2     .....   R_jn·D_n     AI_j    RI_j
BRs    BVR_j+1          R_(j+1)1·D_1 R_(j+1)2·D_2 .....   R_(j+1)n·D_n AI_j+1  RI_j+1
       ....             ....         ....         .....   ....         ....    ....
PRs    ....             ....         ....         .....   ....         ....    ....
       BVR_m            R_m1·D_1     R_m2·D_2     .....   R_mn·D_n     AI_m    RI_m
A. I.                   AI_D1        AI_D2        ....    AI_Dn
R. I.                   RI_D1        RI_D2        ....    RI_Dn

Note: A.I. = Absolute importance; R.I. = Relative importance; DR = Design requirement; CR = Customer requirement; BR = Business requirement; PR = Process requirement; BVR = Blended value requirement.

Thus, the relative importance of the 1st blended value requirement (BVR_1) will be:

RI_1 = AI_1 / Σ_{i=1..m} AI_i        (3)

where RI_1 is the relative importance of the 1st blended value requirement (BVR_1) and AI_1 is its absolute importance.

Similarly, the absolute importance and the relative importance of all the other blended value requirements can be determined by following Equations (2) and (3). Therefore, the absolute importance of the first design requirement (DR_1) will be:

AI_D1 = I_1·R_{11} + I_2·R_{21} + ….. + I_m·R_{m1}

In the same way, the relative importance of the 1st design requirement can be determined by the following equation:

RI_D1 = AI_D1 / Σ_{w=1..n} AI_Dw        (4)

where RI_D1 is the relative importance of the 1st design requirement (DR_1) and AI_D1 is its absolute importance.

The absolute importance and the relative importance of all the other design requirements can be determined by following Equations (1) and (4). Once the absolute and relative importance of the blended value requirements and design requirements are determined, the cost trade-offs will be identified through the correlation matrix of QFD as mentioned in Section IV. The trade-offs among the selected design requirements are identified based on whether improving one design requirement has a positive, negative, or no effect on other design requirements. Finally, after considering the initial technical ratings obtained from the absolute and relative importance of the blended value requirements and design requirements, the organisation's capability, and the cost trade-offs, the optimized design requirements will be selected to develop the blended value based sustainable e-business model.
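To make Equations (1)-(4) concrete, the following small sketch runs the importance-weight calculations on a hypothetical relationship matrix; the matrix size, the numbers and the variable names are illustrative assumptions only, not data from the study.

# Sketch of Equations (1)-(4) on a hypothetical 3 x 2 relationship matrix:
# three blended value requirements (rows) and two design requirements (columns).
I = [0.65, 0.23, 0.12]          # AHP importance weights of the BVRs (illustrative)
R = [[9, 1],                    # relationship values on the 9/3/1/0 scale
     [3, 3],
     [0, 9]]
m, n = len(R), len(R[0])

D = [sum(I[i] * R[i][w] for i in range(m)) for w in range(n)]        # Eq. (1)
AI_bvr = [sum(R[i][w] * D[w] for w in range(n)) for i in range(m)]   # Eq. (2)
RI_bvr = [ai / sum(AI_bvr) for ai in AI_bvr]                         # Eq. (3)
RI_dr = [d / sum(D) for d in D]                                      # Eq. (4), D taken as absolute importance

print([round(d, 2) for d in D])
print([round(x, 2) for x in RI_bvr], [round(x, 2) for x in RI_dr])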

VIII. CONCLUSION AND DISCUSSION
There are a number of ideas and proposals about business modeling and e-business modeling, but there is no clear proposal or idea about sustainable e-business modeling. Similarly, there are only a few thoughts in the literature about 'blended value' or shared value, and all of them consider blended value only from the customer's value requirements point of view. In this approach, all of the value requirements (customer, business, and process) are taken into consideration to develop the model. Therefore, this modeling approach is significant for four reasons. Firstly, a few modeling approaches exist for 'e-business' and 'sustainable business' separately, but no approach is available that combines e-business modeling and sustainability. Secondly, 'blended (economic, social, environmental) value' is considered not only from the customer's point of view but also from the business's and the value process's points of view, since the fulfilment of only the customer's requirements cannot guarantee long-run sustainability. Thirdly, what was not shown before is how the sustainability dimensions can be integrated with the value dimensions in developing an e-business model. Fourthly, this modeling approach shows the way to an efficient allocation of resources for the businesses by indicating the importance level of the value requirements for sustainability.

We have shown, after providing an extensive review of the literature in this field, how the proposed model is to be implemented, with detailed formulas. We have also identified the necessary tools for this approach and explained the whole research process step by step. Our further research will be directed at the implementation of this approach in real-life businesses. There should not be much difficulty in implementing this approach in any real-life business, other than accommodating the elements of this approach in different business contexts.

REFERENCES
[1] Al-Debei, M.M. and D. Avison, Developing a unified framework of the business model concept. European Journal of Information Systems, 2010. 19: p. 359-376.
[2] Chan, L.-K. and M.-L. Wu, A systematic approach to quality function deployment with a full illustrative example. Omega: The International Journal of Management Science, 2005. 33(2): p. 119-139.


[3] Zott, C., R. Amit, and L. Massa, The Business Model: Recent Developments and Future Research. Journal of Management, 2011. 37(4): p. 1019-1042.

[4] Alt, R. and H. Zimmerman, Introduction to Special Section - Business Models. Electronic Markets, 2001. 11(1): p. 3-9.

[5] Afua, A. and C. Tucci, eds. Internet Business Models and Strategies. International Editions ed. 2001, McGraw-Hill: New York.

[6] Timmers, P., Business Models for Electronic Markets. Electronic Markets, 1998. 8(2): p. 3-8.

[7] Applegate, L.M., Emerging e-business models: lessons learned from the field. Harvard Business Review, 2001.

[8] Weill, P. and M. Vitale, What IT Infrastructure capabilities are needed to implement e-business models? MIS Quarterly, 2002. 1: p. 17-34.

[9] Rappa, M. Managing the digital enterprise - Business models on the web. 1999 [cited 4 April 2011]; Available from: http://digitalenterprise.org/models/models.html.

[10] Dubosson-Torbay, M., A. Osterwalder, and Y. Pigneur, Ebusiness model design, classification and measurements. Thunderbird International Business Review, 2001. 44(1): p. 5-23.

[11] Tapscott, D., A. Lowy, and D. Ticoll, Digital capital: Harnessing the power of business webs. Thunderbird International Business Review, 2000. 44(1): p. 5-23.

[12] Gordijn, J. and H. Akkermans. e3 Value: A Conceptual Value Modeling Approach for e-Business Development. in First International Conference on Knowledge Capture, Workshop Knowledge in e-Business. 2001.

[13] Bell, S. and S. Morse, Sustainability Indicators: measuring the immeasurable. 2009, London: Earthscan Publications.

[14] Porter, M.E., The big idea: creating shared value. Harvard business review, 2011. 89(1-2).

[15] Akkermans, H., Intelligent e-business: from technology to value. Intelligent Systems, IEEE, 2001. 16(4): p. 8-10.

[16] Houghton, J., ICT and the Environment in Developing Countries: A Review of Opportunities and Developments What Kind of Information Society? Governance, Virtuality, Surveillance, Sustainability, Resilience, J. Berleur, M. Hercheui, and L. Hilty, Editors. 2010, Springer Boston. p. 236-247.

[17] Brigden, K., et al., Cutting edge contamination: A study of environmental pollution during the manufacture of electronic products, 2007, Greenpeace International. p. 79-86.

[18] Greenpeace, Guide to Greener Electronics. 2009.
[19] WWF/Gartner, WWF-Gartner assessment of global low-carbon IT leadership, 2008, Gartner Inc.: Stamford CT.
[20] Shrivastava, P., The Role of Corporations in Achieving Ecological Sustainability. The Academy of Management Review, 1995. 20(4): p. 936-960.

[21] Elliot, S., Transdisciplinary perspectives on environmental sustainability: a resource base and framework for IT-enabled business transformation. MIS Q., 2011. 35(1): p. 197-236.

[22] Figge, F., et al., The Sustainability Balanced Scorecard – linking sustainability management to business strategy. Business Strategy and the Environment, 2002. 11(5): p. 269-284.

[23] Zairi, M. and J. Peters, The impact of social responsibility on business performance. Managerial Auditing Journal, 2002. 17(4): p. 174 - 178.

[24] Akao, Y., Quality Function Deployment (QFD): Integrating customer requirements into Product Design. 1990, Cambridge, MA: Productivity Press.

[25] Delice, E.K. and Z. Güngör, A mixed integer goal programming model for discrete values of design requirements in QFD. International journal of production research, 2010. 49(10): p. 2941-2957.

[26] Chan, L.-K. and M.-L. Wu, Quality function deployment: A literature review. European Journal of Operational Research, 2002. 143(3): p. 463-497.

[27] Mehrjerdi, Y.Z., Applications and extensions of quality function deployment. Assembly Automation, 2010. 30(4): p. 388-403.

[28] Saaty, T.L., AHP: The Analytic Hierarchy Process. 1980, New York: McGraw-Hill.

[29] Das, D. and K. Mukherjee, Development of an AHP-QFD framework for designing a tourism product. International Journal of Services and Operations Management, 2008. 4(3): p. 321-344.

[30] Park, T. and K.-J. Kim, Determination of an optimal set of design requirements using house of quality. Journal of Operations Management, 1998. 16(5): p. 569-581.

[31] Han, S.B., et al., A conceptual QFD planning model. The International Journal of Quality & Reliability Management, 2001. 18(8): p. 796.

[32] Mukherjee, K., House of sustainability (HOS) : an innovative approach to achieve sustainability in the Indian coal sector, in Handbook of corporate sustainability : frameworks, strategies and tools, M.A. Quaddus and M.A.B. Siddique, Editors. 2011, Edward Elgar: Massachusetts, USA. p. 57-76.

[33] Bhattacharya, A., B. Sarkar, and S.K. Mukherjee, Integrating AHP with QFD for robot selection under requirement perspective. International journal of production research, 2005. 43(17): p. 3671-3685.

[34] Chan, L.K. and M.L. Wu, Prioritizing the technical measures in Quality Function Deployment. Quality engineering, 1998. 10(3): p. 467-479.

[35] Xie, M., T.N. Goh, and H. Wang, A study of the sensitivity of ‘‘customer voice’’ in QFD analysis. International Journal of Industrial Engineering, 1998. 5(4): p. 301-307.

[36] Wang, H., M. Xie, and T.N. Goh, A comparative study of the prioritization matrix method and the analytic hierarchy process technique in quality function deployment. Total Quality Management, 1998. 9(6): p. 421-430.

[37] Brancheau, J.C., B.D. Janz, and J.C. Wetherbe, Key Issues in Information Systems Management: 1994-95 SIM Delphi Results. MIS Quarterly, 1996. 20(2): p. 225-242.

[38] Hayne, S.C. and C.E. Pollard, A comparative analysis of critical issues facing Canadian information systems personnel: a national and global perspective. Information & Management, 2000. 38(2): p. 73-86.
[39] Holsapple, C.W. and K.D. Joshi, Knowledge manipulation activities: results of a Delphi study. Information & Management, 2002. 39(6): p. 477-490.
[40] Lai, V.S. and W. Chung, Managing international data communications. Commun. ACM, 2002. 45(3): p. 89-93.
[41] Paul, M., Specification of a capability-based IT classification framework. Information & Management, 2002. 39(8): p. 647-658.

[42] Okoli, C. and S.D. Pawlowski, The Delphi method as a research tool: an example, design considerations and applications. Information & Management, 2004. 42(1): p. 15-29.

[43] Grisham, T., The Delphi technique: a method for testing complex and multifaceted topics. International Journal of Managing Projects in Business, 2009. 2(1): p. 112-130.


Protein Structure Prediction in 2D Triangular Lattice

Model using Differential Evolution Algorithm

Aditya Narayan Hati, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Nanda Dulal Jana, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Sayantan Mandal, IT Department, NIT Durgapur, Durgapur, India. Email: [email protected]
Jaya Sil, Dept. of CS and Tech., Bengal Eng. and Sc. University, West Bengal, India. Email: [email protected]

Abstract—Protein structure prediction from the primary structure of a protein is a very complex and hard problem in computational biology. Here we propose a differential evolution (DE) algorithm on the 2D triangular hydrophobic-polar (HP) lattice model for predicting the structure of a protein from its primary structure. We propose an efficient and simple backtracking algorithm to avoid overlapping of the given sequence. This methodology is tested on several benchmark sequences and compared with other similar implementations. We find that the proposed DE performs better and more consistently than the previous approaches.

Keywords—2D triangular lattice model; hydrophobic-polar model; evolutionary computation; differential evolution; protein; backtracking; protein structure prediction.

I. INTRODUCTION

Proteins play a key role in all biological processes. A protein is a long sequence of the 20 basic amino acids [2]. The exact way proteins fold just after being synthesized in the ribosome is unknown. As a consequence, the prediction of protein structure from its amino acid sequence is one of the most prominent problems in bioinformatics. There are several experimental methods for protein structure determination, such as NMR (nuclear magnetic resonance) spectroscopy and X-ray crystallography, but these methods are expensive in terms of equipment, computation and time. Therefore, computational approaches to protein structure prediction are considered. The HP lattice model [1] is the simplest and most widely used model. In this paper, the 2D triangular lattice model is used for protein structure prediction because it resolves the parity problem of the 2D HP lattice model and gives better structures.

From a computational point of view, protein structure prediction in the 2D HP model is NP-complete [6]. It can be transformed into an optimization problem. Recently, several methods have been proposed to solve the protein structure prediction problem, but there is still no efficient method because it is an NP-hard problem. Here we introduce a Differential Evolution algorithm with a simple backtracking correction to make the sequence self-avoiding. The objective of this work is to evaluate the applicability of DE to PSP using the 2D triangular HP model and to compare its performance with other contemporary methods.

II. 2D TRIANGULAR LATTICE MODEL

The HP model is the most widely used lattice model. It was introduced by Dill et al. in 1987 [1]. Here the 20 basic amino acids are classified into two categories, (I) hydrophobic and (II) polar, according to their affinity towards water.

When a peptide bond occurs between two amino acids, those two amino acids are said to be consecutive; otherwise they are non-consecutive. When two non-consecutive amino acids are placed side by side in the lattice, we say that they are in topological contact. The model must be designed in such a way that the sequence is self-avoiding and the non-consecutive H amino acids form a hydrophobic core. However, the 2D HP square lattice model possesses a flaw referred to as the parity problem: two residues at an even distance from one another in the sequence can never be placed in topological contact.

In order to solve this parity problem, the 2D triangular HP lattice model was introduced [10]. In the triangular lattice model, let X and Y be the two primary axes of the square lattice. Take an auxiliary axis W = X + Y along the diagonal (Fig 1) and skew it until the angle between the axes becomes 120° (Fig 2). In this way we obtain the 2D triangular HP lattice model. For example (as in Fig 3), the lattice point P = (x, y) has six neighbors: (x+1, y) as R, (x-1, y) as L, (x, y+1) as LU, (x, y-1) as RD, (x+1, y+1) as RU and (x-1, y-1) as LD.

Fig 1: Adding an auxiliary axis along the diagonal of a square lattice.
Fig 2: Skewing the square lattice into a 2D triangular lattice.
Fig 3: The 2D triangular lattice model: neighbors of vertex (x, y).
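A minimal sketch of the six relative moves on the triangular lattice described above; the numeric codes assigned to the moves are an assumption (the paper only states that moves are numbered 1 to 6), and the function names are ours.

# The six neighbor moves of a point (x, y) on the 2D triangular lattice,
# as listed in the text; the numeric codes 1..6 are an assumed assignment.
MOVES = {
    1: ("R",  ( 1,  0)),
    2: ("L",  (-1,  0)),
    3: ("LU", ( 0,  1)),
    4: ("RD", ( 0, -1)),
    5: ("RU", ( 1,  1)),
    6: ("LD", (-1, -1)),
}

def fold(moves, start=(0, 0)):
    """Turn a sequence of move codes into lattice coordinates."""
    x, y = start
    coords = [start]
    for m in moves:
        dx, dy = MOVES[m][1]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

print(fold([1, 5, 3]))   # [(0, 0), (1, 0), (2, 1), (2, 2)]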


III. PREVIOUS WORK

In the 2D HP square lattice model, a lot of work has been done using evolutionary algorithms, such as the simple genetic algorithm and its variations [9] and a differential evolution algorithm with a vector encoding scheme for initialization [11], [12]. But the parity problem in this lattice model is a severe bottleneck, and therefore the 2D triangular lattice model is considered [13]. Recently, SGA, HHGA and ERS-GA have been applied in this model [10]. Tabu search, particle swarm optimization and hybrid algorithms (hybrids of GA and PSO) have also been applied in this model.

IV. METHODOLOGY

In this section, the strategies proposed to improve the performance of the DE algorithm applied to protein structure prediction using the 2D triangular lattice model are described.

A. Differential Evolution Algorithm

The differential evolution algorithm was introduced by Storn & Price [3], [4]. It is an evolutionary algorithm used for optimization problems, and it is particularly useful if the gradient of the problem is difficult or even impossible to derive. Consider a fitness (objective) function f : R^n → R. To minimize the function f, find a ∈ R^n such that f(a) ≤ f(b) for all b ∈ R^n. Then a is called a global minimum of the function f. It is usually not possible to pinpoint the global minimum exactly, so candidate solutions with sufficiently good fitness values are acceptable for practical purposes.

There are several variants of DE proposed by Storn. We consider DE/rand/1/bin and DE/best-to-rand/2/exp for this problem. At first, the first strategy is used. When stagnation occurs for 100 generations, the second strategy is used; when stagnation occurs again, the first strategy is used once more, and so on. The algorithm (Fig 4) is described below.

After initialization of the population, the algorithm performs a sequence of operations and calculates the fitness value of each individual. First, mutation is applied; the two mutation strategies alternate after every 100 stagnant generations as described above. When the first strategy is chosen, binomial crossover is applied; when the second strategy is chosen, exponential crossover is applied. After that, a repair function is called to convert infeasible solutions into feasible ones. Then the selection procedure is performed based on a greedy strategy. The initialization, mutation, crossover and selection operators are described in the following subsections.

1) Initialization

In DE, for each individual component the upper bound and lower bound are stored in a 2 × D matrix, called the initialization matrix, where D is the dimension of each individual. The vector components are created in the following way:

x_{j,i,0} = rand_j(0,1) · (b_{j,U} − b_{j,L}) + b_{j,L}

where 0 ≤ rand_j(0,1) ≤ 1 and b_{j,L}, b_{j,U} are the lower and upper bounds of the jth component.

There are basically three types of coordinates to represent the amino acids in the lattice: Cartesian coordinates, internal coordinates and relative coordinates. The proposed DE uses relative coordinates. Based on this model, there are six possible movements L, R, LU, LD, RU, RD defined from a point P(x, y) as follows: (x, y+1) as LU, (x-1, y) as L, (x-1, y-1) as LD, (x, y-1) as RD, (x+1, y) as R, (x+1, y+1) as RU. If the number of amino acids in the given sequence is n, then the total number of moves in the amino acid sequence is (n-1). For each target vector, we randomly choose (n-1) moves from 1 to 6. In this way we initialize the whole population matrix and calculate the energy of each target vector using the fitness function. If a target vector is infeasible we set its fitness value to 1. The number of individuals in the population, np, is a parameter of DE; we take np as 5 times the dimension of the target vector.
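The move-based initialization described above could be sketched as follows; the function and variable names are ours, and the HP string in the usage line is only an example.

import random

def init_population(sequence):
    """Initialize DE target vectors as random move sequences (codes 1..6)."""
    dim = len(sequence) - 1        # one relative move per peptide bond
    np_size = 5 * dim              # population size: 5 x dimension, as in the text
    return [[random.randint(1, 6) for _ in range(dim)] for _ in range(np_size)]

population = init_population("HPHPPHHPHPPHPHHPPHPH")   # an example 20-residue HP string
print(len(population), len(population[0]))             # 95 19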

2) Mutation

Mutation is a kind of exploration technique which can explore very rapidly. It creates np donor vectors. The mutation process of the first strategy (DE/rand/1) is as follows:

V_{i,g} = X_{r0,g} + F·(X_{r1,g} − X_{r2,g}),  where r0 ≠ r1 ≠ r2 ≠ i.

The mutation process of the second strategy (DE/best-to-rand/2) is as follows:

V_{i,g} = X_{r0,g} + F·(X_{best,g} − X_{r1,g}) + F·(X_{best,g} − X_{r2,g}),  where r0 ≠ r1 ≠ r2 ≠ i.

Here X_best is the best target vector in the current generation and F ∈ (0, 1) is a parameter called the weighting factor. We have taken F = 0.25·(1 + rand), where rand is a uniform random number generator.
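A sketch of the two mutation strategies as reconstructed above; the real-valued donor vectors would still have to be mapped back to integer move codes and repaired, and the function and variable names are our own.

import random

def mutate(pop, i, best, F, strategy="rand/1"):
    """Build a donor vector for target i with one of the two strategies."""
    r0, r1, r2 = random.sample([k for k in range(len(pop)) if k != i], 3)
    X0, X1, X2, Xb = pop[r0], pop[r1], pop[r2], pop[best]
    if strategy == "rand/1":                       # first strategy: DE/rand/1
        return [a + F * (b - c) for a, b, c in zip(X0, X1, X2)]
    # second strategy: best-to-rand, two difference terms toward the best vector
    return [a + F * (xb - b) + F * (xb - c)
            for a, b, c, xb in zip(X0, X1, X2, Xb)]

F = 0.25 * (1 + random.random())                   # F = 0.25*(1+rand), as in the text
pop = [[random.uniform(1, 6) for _ in range(5)] for _ in range(10)]
print(mutate(pop, i=0, best=3, F=F))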



3) Crossover

Crossover is also an exploration technique, but it explores in a restricted manner. In DE there are two basic crossover techniques: (i) binomial crossover and (ii) exponential crossover. As we use DE/rand/1/bin and DE/best-to-rand/2/exp, we consider both.

Binomial crossover is as follows:

U_{j,i,g} = V_{j,i,g} if rand(0,1) ≤ Cr or j = j_rand; otherwise U_{j,i,g} = X_{j,i,g}.

The exponential crossover is as follows:

U_{j,i,g} = V_{j,i,g} for j = <n>_D, <n+1>_D, …, <n+L−1>_D; U_{j,i,g} = X_{j,i,g} for all other j ∈ [1, D].

Here Cr ∈ (0,1) is a parameter called the crossover probability, and in the exponential crossover the <>_D operator means the modulo-D operator. We have taken Cr = 0.8. j_rand is a random index chosen from 1 to D, where D is the dimension of the target vector. The pseudo codes of the exponential and binomial crossover are given below (Fig 5 and Fig 6). Here U_{j,i} is the trial vector, V_{j,i} is the donor vector, and X_{j,i} is the target vector.

4) Selection

Selection is an exploitation technique which drives the population from local minima towards the global minimum. After mutation and crossover, we apply a repair function to repair the infeasible solutions. Then we calculate the fitness function from the coordinates. The fitness function is the free energy of the sequence calculated from the model; the minimum energy of a sequence implies more stability of the molecule. Here we consider the hydrophobic-polar model; the procedure of the free energy calculation and the repair process are described later. Selection is a tournament (greedy) procedure based on the fitness value:

X_{i,g+1} = U_{i,g} if f(U_{i,g}) ≤ f(X_{i,g}); otherwise X_{i,g+1} = X_{i,g}.

B. Repair function

After the mutation and crossover operations, the target vector may become infeasible, i.e. the sequence may become non-self-avoiding. There are three ways to handle this problem: discarding the infeasible solution, using a penalty function, or repairing the solution using backtracking. We propose the repair function using backtracking; this third option is illustrated in Fig 7.

The random movement is stored in 'S'. Each node has a value 'back' which stores the number of invalid directions. Whenever the back value becomes greater than 5 it causes a backtrack. A pointer 'i' keeps track of the current working node. Every time a backtrack occurs, the 'i' pointer is decreased and the back value of that node is reset to 0. For a particular node, each placement attempt increases its back value by 1; if a particular direction is not available, a fixed strategy is followed to place that amino acid in a new direction: if right is not available it tries the down direction, if down is not available then left, if left is not available then up, and if up is not available then right. The number of attempts to place a particular amino acid with respect to a particular coordinate is 4. When the value of 'i' equals the length of the amino acid sequence, the whole folding has been repaired.

Fig 7: Repair function using backtracking
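The repair step could be sketched roughly as follows. This is a simplified illustration of the backtracking idea (detect a revisited lattice point, try alternative moves, and step back on dead ends), not a line-by-line transcription of the authors' routine; all names are ours.

# Simplified backtracking repair of a non-self-avoiding move sequence.
MOVES = {1: (1, 0), 2: (-1, 0), 3: (0, 1), 4: (0, -1), 5: (1, 1), 6: (-1, -1)}

def repair(moves):
    """Return a self-avoiding move sequence close to `moves`, or None."""
    path = [(0, 0)]
    occupied = {(0, 0)}
    fixed = list(moves)
    tried = [set() for _ in fixed]          # directions already rejected per step
    i = 0
    while i < len(fixed):
        # prefer the original move, then any direction not yet tried at this step
        candidates = [fixed[i]] + [m for m in MOVES if m != fixed[i]]
        candidates = [m for m in candidates if m not in tried[i]]
        placed = False
        for m in candidates:
            tried[i].add(m)
            dx, dy = MOVES[m]
            nxt = (path[-1][0] + dx, path[-1][1] + dy)
            if nxt not in occupied:
                fixed[i] = m
                path.append(nxt)
                occupied.add(nxt)
                placed = True
                break
        if placed:
            i += 1
        elif i == 0:
            return None                     # cannot repair from the start
        else:                               # dead end: backtrack one step
            tried[i] = set()
            occupied.discard(path.pop())
            i -= 1
    return fixed

print(repair([1, 2, 1, 1]))                 # [1, 1, 1, 1]: move 2 would fold back onto the start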


jr = floor(rand(0,1)*D);              //0 ≤ jr < D
j = jr;
do
    Uj,i = Vj,i;                      //Child inherits a mutant parameter.
    j = (j+1) % D;                    //Increment j modulo D.
while (rand(0,1) < Cr && j != jr);    //Take another mutant parameter?
while (j != jr)                       //Take the rest, if any, from the target.
    Uj,i = Xj,i;
    j = (j+1) % D;

Fig 5: Pseudo code of exponential crossover.

jr = floor(rand(0,1)*D);              //0 ≤ jr < D
for j = 1 to D
    if rand(0,1) < Cr or j = jr
        Uj,i = Vj,i;
    else
        Uj,i = Xj,i;

Fig 6: Pseudo code of binomial crossover.



C. The free energy calculation procedure

The free energy of an amino acid sequence is determined by the topological contacts between non-consecutive H residues. The free energy of a protein can be calculated by the following formula:

E = Σ_{i,j} ε_{ij} · r_{ij}

where the parameters are

ε_{ij} = −1.0 for a pair of H and H residues, 0.0 otherwise;
r_{ij} = 1 if s_i and s_j are adjacent in the lattice but not connected in the chain, 0 otherwise.
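Using the reconstruction above, the fitness evaluation could be sketched as follows; the move table, the penalty value 1 for infeasible conformations and the contact counting follow the text, while the implementation details and names are our own.

MOVES = {1: (1, 0), 2: (-1, 0), 3: (0, 1), 4: (0, -1), 5: (1, 1), 6: (-1, -1)}

def free_energy(sequence, moves):
    """HP free energy of a conformation given as relative moves; 1 if infeasible."""
    coords = [(0, 0)]
    for m in moves:
        dx, dy = MOVES[m]
        coords.append((coords[-1][0] + dx, coords[-1][1] + dy))
    if len(set(coords)) != len(coords):        # not self-avoiding: infeasible
        return 1
    index = {c: i for i, c in enumerate(coords)}
    energy = 0
    for i, c in enumerate(coords):
        if sequence[i] != 'H':
            continue
        for dx, dy in MOVES.values():
            j = index.get((c[0] + dx, c[1] + dy))
            if j is not None and j > i + 1 and sequence[j] == 'H':
                energy -= 1                     # one non-consecutive H-H contact
    return energy

print(free_energy("HHPH", [1, 3, 2]))           # -1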

V. RESULTS AND COMPARISON

For the experiments, benchmark sequences of 8 synthetic proteins from the 2D HP lattice model have been chosen [10]. The minimum energy of these sequences in the 2D triangular lattice model is still unknown. In this model, the Simple Genetic Algorithm (SGA), the Hybrid Genetic Algorithm (HGA) and the Elite-based Reproduction Strategy with Genetic Algorithm (ERS-GA) have been proposed earlier. Comparing our results with the results of these algorithms, it is seen that the DE scheme outperforms the previous algorithms and also works more consistently in the 2D triangular lattice model.

For these experiments, a machine with a Pentium Core 2 Duo (1.6 GHz) and 2 GB RAM running Linux is used, with Octave as the testing platform. We ran the algorithm 20 times for each benchmark sequence and compared the outcomes with the available results. Table 1 shows the benchmark sequences on which we apply our algorithm. Table 2 shows the comparison between the minimum energies found by SGA, HGA, ERS-GA and DE. Table 3 shows the comparison of the minimum energies (avg/best) between ERS-GA and DE.

From Table 2 it can be observed that the results obtained from DE are better than the results of the others for the second and eighth benchmark sequences. The results in bold represent the better ones. For the first, third and fourth sequences both ERS-GA and DE work equally well. For the rest of the benchmark sequences, ERS-GA gives better results. However, from Table 3 it can be concluded that DE works most consistently on these benchmark sequences, as it gives better average results for 6 of the 8 benchmark sequences.

TABLE I. LIST OF 8 BENCHMARK SEQUENCES

Sequence  Length  Amino acid sequence
1         20      (HP)2PH(HP)2(PH)2HP(PH)2
2         24      H2P2(HP2)6H2
3         25      P2HP2(H2P4)3H2
4         36      P(P2H2)2P5H5(H2P2)2P2H(HP2)2
5         48      P2H(P2H2)2P5H10P6(H2P2)2HP2H5
6         50      H2(PH)3PH4PH(P3H)2P4(HP3)2HPH4(PH)3PH2
7         60      P(PH3)2H5P3H10PHP3H12P4H6PH2PHP
8         64      H12(PH)2((P2H2)2P2H)3(PH)2H11

TABLE II. THE BEST FREE ENERGY OBTAINED BY EACH ALGORITHM

Sequence  SGA   HGA   ERS-GA   DE
1         -11   -15   -15      -15
2         -10   -13   -13      -15
3         -10   -10   -12      -12
4         -16   -19   -20      -20
5         -26   -32   -32      -28
6         -21   -23   -30      -27
7         -40   -46   -55      -49
8         -33   -46   -47      -50

TABLE III. COMPARISON OF THE RESULTS (AVG/BEST) BETWEEN ERS-GA AND DE

Sequence  ERS-GA (avg/best)   DE (avg/best)
1         -12.5/-15           -14.8/-15
2         -10.2/-13           -13.4/-15
3         -8.47/-12           -11.2/-12
4         -16/-20             -19.4/-20
5         -28.13/-32          -27/-28
6         -25.3/-30           -26/-27
7         -49.43/-55          -47.2/-49
8         -42.37/-47          -48.2/-50

VI. FUTURE WORK AND CONCLUSION

There has been little exploration in the field of protein structure prediction on the 2D triangular lattice using evolutionary strategies. In this paper, a Differential Evolution algorithm is implemented for protein structure prediction, and a new type of encoding scheme is also proposed. Invalid conformations are repaired using a backtracking method to produce valid conformations. Our experimental results show that the approach is promising and more efficient than other evolutionary algorithms. In the future, better results may be found by upgrading this DE strategy; there is plenty of scope for improving the initialization, mutation, crossover and selection operations, which can lead to better sub-optimal solutions for this problem.

REFERENCES

[1] Lau, K. and Dill, K. A., "A lattice statistical mechanics model of the conformation and sequence spaces of proteins," Macromolecules, vol. 22, pp. 3986–3997, 1989.

[2] Charles J. Epstein, Robert F. Goldberger, and Christian B. Anfinsen. “The genetic control of tertiary protein structure: Studies with model systems” In Cold Spring Harbor Symposium on Quantitative Biology, pages 439–449, 1963. Vol. 28.

[3] R. Storn and K. Price, "Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces", ftp.ICSI.Berkeley.edu/pub/techreports/1995/tr-95-012.ps.Z

[4] R. Storn, "On the usage of differential evolution for function optimization " Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS), IEEE, Berkeley, 1996, pp. 519-523.

[5] R. Agarwala, S. Batzoglou, V. Dancik, S. Decatur, M. Farach, S. Hannenhali, S. Muthukrishnan, and S. Skiena. Local rules for protein folding on a triangular lattice and generalized hydrophobicity in the HP model. Journal of Computational Biology, 4(2):275-296, 1997.



[6] B. Berger, T. Leighton. Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. Journal of Computational Biology, 5(1), 27-40, 1998.

[7] Huang C, Yang X, He Z: Protein folding simulations of 2D HP model by the genetic algorithm based on optimal secondary structures. Computational Biology and Chemistry 2010, 34:137-142.

[8] Joel G, Martin M, Minghui J: RNA folding on the 3D triangular lattice. BMC Bioinformatics 2009, 10:369.

[9] Hoque MT, Chetty M, Dooley LS: A hybrid genetic algorithm for 2D FCC hydrophobic–hydrophilic lattice model to predict protein folding. Advances in Artificial Intelligence, Lecture Notes in Computer Science 2006, 4304:867-876.

[10] Shih-Chieh Su, Cheng-Jian Lin, Chuan-Kang Ting: An effective hybrid of hill climbing and genetic algorithm for 2D triangular protein structure prediction. International Workshop on Computational Proteomics, Hong Kong, China, 18-21 December 2010.
[11] H. S. Lopes, Reginaldo Bitello: A Differential Evolution Approach for protein folding using a lattice model. Journal of Computer Science and Technology, 22(6): 904-908, Nov. 2007.
[12] N. D. Jana and Jaya Sil. "Protein Structure Prediction in 2D HP lattice model using differential evolutionary algorithms". In S. C. Satapathy et al. (Eds.), Proc. of the Incon INDIA 2012, AISC 132, pp. 281-290, 2012.

[13] William E. Hart and Alantha Newman, “Protein Structure Prediction with lattice models”, 2001 by CRC Press.


Elimination of Materializations from Left/Right Deep Data Integration Plans

Janusz R. Getta
School of Computer Science and Software Engineering, University of Wollongong, Wollongong, NSW, Australia
Email: [email protected]

Abstract—Performance of distributed data processing is one of the central problems in the development of modern information systems. This work considers a model of distributed system where a user application running at a central site submits data processing requests to the remote sites. The results of processing at the remote sites are transmitted back to the central site and are simultaneously integrated into the final outcomes of the application.

Due to external factors like network congestion or high processing load at the remote sites, transmission of data to the central site can be delayed or completely stopped. Then, it is necessary to dynamically change the original data integration plan to a new one, which allows for more efficient data integration in the changed environment.

This work uses a technique of elimination of materializations from data integration plans to create alternative data integration plans. We propose an algorithm which finds all possible data integration plans for a given sequence of data transmitted to a central site. We show how a data integration plan can be dynamically changed in reply to the dynamically changing frequencies of data transmission.

I. INTRODUCTION

Distributed data processing faces an ever increasing demand for more efficient processing of user applications that access data at numerous different locations and integrate the partial results at a central site. A distributed system based on a global view of data processes the information resources available at the remote sites through the applications running at a central site. A typical user application submits a data processing request against a global view of data, which integrates the data resources available at the remote sites. The request is automatically decomposed into elementary requests, which later on are submitted for processing at the remote sites. The results of processing at the remote sites are transmitted back to the central site and integrated with the data already available there. Data integration is performed according to a data integration plan, which is prepared when a request issued by a user application is decomposed into the individual requests, each one related to a different remote site. A data integration plan determines the order in which the individual requests are submitted for processing at the remote sites and the way the partial results of processing are combined into the final results. Due to factors beyond the control of the central system, the transmissions of partial results can be delayed or even completely stopped.

Then, the current data integration plan must be dynamically adjusted to the changing conditions. This work investigates when and how the current data integration plan must be changed in reply to the increasing or decreasing intensity of transmission of data from the remote sites.

The individual requests obtained from the decomposition of a global request are submitted for processing at the remote sites according to an entirely sequential, an entirely parallel, or a hybrid, i.e. mixed sequential and parallel, strategy. According to an entirely sequential strategy, a request qi can be submitted for processing at a remote site only when all results of the requests q1, ..., qi−1 are available at the central site. An entirely sequential strategy is appropriate when the results received so far can be used to reduce the complexity of the remaining requests qi, ..., qi+k. According to an entirely parallel strategy, all requests q1, ..., qi, ..., qi+k are submitted simultaneously for parallel processing at the remote sites. An entirely parallel strategy is beneficial when the computational complexity and the amounts of data transmitted are more or less the same for all requests. According to a mixed sequential and parallel strategy, some requests are submitted sequentially while the others are submitted in parallel. Optimization of data integration plans is either static, when the plans are optimized before the data integration stage, or dynamic, when the plans are changed during the processing of the requests. A static optimization of data integration plans is more appropriate for the parallel strategy than for the sequential strategy, because the plans cannot be changed after submission to the remote sites. A dynamic optimization of data integration plans allows for the modification of the individual requests and a change of their order during the processing of an entire request. This work considers dynamic optimization of data integration plans for the entirely parallel processing strategy of the individual requests.

The problem of dynamic optimization of data integration plans in the entirely parallel processing model can be formulated in the following way. Consider a global information system that integrates a number of remote and independent sources of data. A user request q is decomposed into the elementary requests q1, ..., qn such that q = E(q1, ..., qn). The requests q1, ..., qn are simultaneously submitted for processing at the remote sites. Let r1, ..., rn be the individual


results obtained from the processing of the respective requests q1, ..., qn. Then, the final result of the request q can be obtained from the evaluation of an expression E(r1, ..., rn). If the evaluation of E(r1, ..., rn) can be performed in many different ways, for example by changing the order of operations, then the integration of the individual results r1, ..., rn can also be performed in many different ways. If some of the individual results are not available due to network congestion, then the way E(r1, ..., rn) is evaluated can be changed to avoid a deadlock. The problem investigated in this paper is how to dynamically adjust the evaluation of E(r1, ..., rn) in response to the changing parameters of data transmission.

One of the specific approaches to data integration is online data integration. In online data integration we consider an individual reply ri as a sequence of data packets ri1, ri2, ..., rik−1, rik, and we re-compute E(r1, ..., rn) each time a new packet of data is received at the central site. Such an approach to data integration is more efficient because there is no need to wait for the complete results to start the evaluation of E(r1, ..., rn). Instead, each time a new packet of data is received at the central site, it is immediately integrated into the intermediate result, no matter which remote site it comes from. To perform online data integration, an expression E(r1, ..., rn) must be transformed into a collection of sequences of elementary operations called data integration plans, pr1, ..., prn. Each one of the data integration plans pri determines how E(r1, ..., rn) is recomputed for the sequence of packets ri1, ri2, ..., rik−1, rik, where i = 1, ..., n. If an expression E(r1, ..., rn) can be computed in many different ways, then it is possible to find many online data integration plans. Dynamic optimization of online data integration plans finds the best processing plan for the sequences of packets of data obtained in the latest period of time. If the frequencies of transmission of the individual results r1, ..., rn change in time, then dynamic optimization finds a data integration plan which is the best for the most recent frequencies of data transmission.

A starting point for the dynamic optimization is a data integration expression E(r1, ..., rn). Next, the data integration expression is transformed into a set of data integration plans, where each plan represents an integration procedure for the increments of one argument of E(r1, ..., rn). Some of the plans assume that temporary results of the processing must be stored in so-called materializations, while the other plans allow for processing of the same data integration expression without the materializations. A data integration system stores all plans and starts data integration according to a plan with the largest possible number of materializations. Then, whenever the frequency of data transmission of a given individual result grows beyond a given threshold, the dynamic optimizer finds a better data integration plan and continues data integration according to the new plan.

The paper is organized in the following way. Section II overviews the related works in the area of optimization of data integration in distributed systems based on a global data model. Next, Section III shows how online data integration plans can be transformed into data integration plans that include the largest possible number of materializations. In Section IV we show when and how materializations can be eliminated from left/right deep data integration plans and when and how to dynamically change the current data integration plan. Finally, Section VI concludes the paper.

II. PREVIOUS WORKS

The early works [1], [2] on optimization of query processing in distributed database systems, multidatabase systems, and federated database systems are a starting point of research on efficient processing of data integration.

Reactive query processing starts from a pre-optimized plan, and whenever external factors like network problems, unexpected congestion at a local site, or unavailability of data make the current plan ineffective, further processing is continued according to an updated plan. The early works on reactive query processing techniques were based either on partitioning [3], [4] or on dynamic modification of query processing plans [5], [6], [7]. If further computations are no longer possible, then partitioning decomposes a query execution plan into a number of sub-plans and attempts to continue processing according to the sub-plans. Dynamic modification of query processing plans finds a new plan equivalent to the original one and such that it is possible to continue the integration of the available data. The techniques of query scrambling [8], [9], dynamic scheduling of operators [10], and Eddies [11] dynamically change the order in which the join operations are executed depending on the join arguments available on hand.

As data integration requires efficient processing of sequences of data items, an important research direction has been the improvement of pipelined implementations of the join operation. These works include new versions of the pipelined join operation such as the pipelined join operator XJoin [12], ripple join [13], double pipelined join [14], and hash-merge join [15].

A technique of redundant computations simultaneously integrates data according to a number of plans [16]. A concept of state modules described in [17] allows for concurrent processing of the tuples through the dynamic division of data integration tasks.

An adaptive and online processing of data integration plans proposed in [18], and later on in [19], considers the sets of elementary operations for data integration and the best integration plan for recently transmitted data. The recent work [20] considers an integration model where the packets of data coming from the external sites are simultaneously integrated into the final result. Another work [21] describes a system of data integration where the initial simultaneous data integration plans are automatically transformed into hybrid plans, where some tasks are processed sequentially while the others are processed simultaneously.

This work concentrates on simultaneous processing of a specific class of data integration plans whose syntax trees are only left/right deep and involve the operations of join and


antijoin from the relational algebra. We show when and how data integration plans must be dynamically changed due to the changing frequencies of data transmission.

Reviews of the most important data integration techniques proposed so far are included in [22], [23], [24], [25], [26].

III. DATA INTEGRATION EXPRESSIONS

This work applies a relational model of data to formally represent the data containers at the remote systems. Let x be a nonempty set of attribute names, later on called a schema, and let dom(a) denote the domain of an attribute a ∈ x. A tuple t defined over a schema x is a full mapping t : x → ∪a∈x dom(a) such that ∀a ∈ x, t(a) ∈ dom(a). A relational table created on a schema x is a set of tuples over the schema x.

Let r, s be relational tables such that schema(r) = x and schema(s) = y respectively, and let z ⊆ x, v ⊆ (x ∩ y), and v ≠ ∅. The symbols σφ, πz, ⋊⋉v, ∼v, ⋉v, ∪, ∩, − denote the relational algebra operations of selection, projection, join, antijoin, semijoin, and the set algebra operations of union, intersection, and difference. All join operations are considered to be equijoin operations over a set of attributes v.

A modification of a relational table r is denoted by δ and is defined as a pair of disjoint relational tables <δ−, δ+> such that r ∩ δ− = δ− and r ∩ δ+ = ∅. A data integration operation that applies a modification δ to a relational table r is denoted by r ⊕ δ and is defined as the expression (r − δ−) ∪ δ+.

Let E(r1, ..., ri, ..., rn) be a data integration expression. In order to perform data integration simultaneously with data transmission, each time a data packet δi arrives at the central site and is integrated with an argument ri, the expression E(r1, ..., ri ⊕ δi, ..., rn) must be recomputed.

Obviously, processing the entire expression from the very beginning is too time consuming. It is faster to do it in an incremental way, through processing of the increment δi together with the previous result of the expression E(r1, ..., ri, ..., rn).

Let P(r, s) be an operation of relational algebra. Then, the incremental processing of P(r ⊕ δ, s) can be computed as P(r, s) ⊕ αP(δ, s), where αP(δ, s) is the incremental/decremental operation (id-operation) for the argument r. The incremental processing of P(r, s ⊕ δ) can be computed as P(r, s) ⊕ βP(r, δ), where βP(r, δ) is the id-operation for the argument s. The id-operations αP(δ, s) and βP(r, δ) for the union, join and antijoin operations of the relational algebra are as follows [20]:

α∪(δ, s) =< δ− − s, δ+ − s > (1)

β∪(r, δ) =< δ− − r, δ+ − r > (2)

α⋊⋉(δ, s) =< δ− ⋊⋉v s, δ+⋊⋉v s > (3)

β⋊⋉(r, δ) =< δ− ⋊⋉v r, δ+⋊⋉v r > (4)

α∼(δ, s) =< δ− ∼v s, δ+ ∼v s > (5)

β∼(r, δ) =< r ⋉ vδ+, r ⋉ vδ− > (6)
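As a small illustration of how such an id-operation could be evaluated over sets of tuples at a central site, the following sketch computes α⋊⋉(δ, s) from equation (3); the row encoding, attribute names and helper names are our own simplification, not the paper's implementation.

# Sketch of the join id-operation of equation (3):
#   alpha_join(delta, s) = < delta_minus JOIN_v s, delta_plus JOIN_v s >
def join(left, right, attrs):
    index = {}
    for row in right:
        index.setdefault(tuple(row[a] for a in attrs), []).append(row)
    return [dict(l, **r) for l in left
            for r in index.get(tuple(l[a] for a in attrs), [])]

def alpha_join(delta_minus, delta_plus, s, attrs):
    return join(delta_minus, s, attrs), join(delta_plus, s, attrs)

s = [{"z": 1, "b": "p"}, {"z": 2, "b": "q"}]
removed, added = alpha_join([], [{"z": 1, "a": "x"}], s, ["z"])
print(removed, added)        # [] [{'z': 1, 'a': 'x', 'b': 'p'}]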

Fig. 1. A syntax tree of a data integration expression (v ∪ (r ⋊⋉x s) ∼y t) ⋊⋉z δw.

In this work we consider data integration expressions where the operation of projection (π) is applied only to the final result of the computations and the operation of selection (σ) is performed together with the binary operations of join (⋊⋉) and antijoin (∼). The operation of union is distributive over the operations of join and antijoin: it is true that (r ∪ s) ⋊⋉x t = (r ⋊⋉x t) ∪ (s ⋊⋉x t), that (r ∪ s) ∼x t = (r ∼x t) ∪ (s ∼x t), and that t ∼x (r ∪ s) = (t ∼x r) ∼x s. It means that union operations can always be processed at the end of the computation of a data integration expression. Hence, without losing generality, we consider only data integration expressions built over the operations of join and antijoin. A sample data integration expression (v ∪ (r ⋊⋉x s) ∼y t) ⋊⋉z δw has the syntax tree given in Figure (1).

As a simple example consider a data integration expression E(r, s, t) = t ∼v (r ⋊⋉z s). Assume that we would like to find how an increment δs = <∅, δs+> of an argument s can be processed in an incremental way, i.e. we would like to recompute the expression E(r, s ⊕ δs, t) using the previous result E(r, s, t) and the increment δs. Application of equations (4) and (6) provides the solution E(r, s, t) − (t ⋉v (δs+ ⋊⋉ r)).

Next, we consider the processing of an increment δt = <∅, δt+> of a remote data source t. In this case we need either a materialization of the intermediate result of the subexpression (r ⋊⋉z s), or a transformation of the data integration expression into an equivalent one with either a left- or right-deep syntax tree and with the argument t in the leftmost or rightmost position of the tree.

If a materialization mrs = r ⋊⋉ s is maintained during the processing of the data integration expression, then from equation (4) we get δrst = <∅, δt+ ∼v mrs> and the incremental processing is performed according to E(r, s, t) ∪ (δt+ ∼v mrs). Maintenance of the materialization mrs decreases the performance of data integration because each time the increments δr+ and δs+ are processed, the materialization mrs must be integrated with the results δr+ ⋊⋉ mrs and δs+ ⋊⋉ mrs. If the increments δr+ and δs+ arrive frequently at the data integration site, then the materialization mrs must be frequently integrated with the partial results.

If the schema of δt has common attributes x with r, then it is possible to transform the expression δt+ ∼v (r ⋊⋉ s) into δt+ ∼v ((r ⋉x δt+) ⋊⋉ s). Then the computation of (r ⋉x δt+) ⋊⋉ s can be performed faster than (r ⋊⋉ s) because


the increment δt is small and it can be kept all the time in fast transient memory. We shall call such a transformation an elimination of materialization from a data integration expression.

IV. DATA INTEGRATION PLANS

A data integration expression is transformed into a set of data integration plans, where each plan represents an integration procedure for the increments of one argument of the original expression. In our approach a data integration plan is a sequence of so-called id-operations on the increments or decrements of data containers and other fixed-size containers. In order to reduce the size of the arguments, static optimization of data integration plans moves the unary operations towards the beginning of a plan. Additionally, the frequently updated materializations are eliminated from the plan, and constant arguments and subexpressions are replaced with pre-computed values. A data integration plan is a sequence of assignment statements s1, ..., sm, where the right hand side of each statement is either an application of a modification to a data container (mj := mj ⊕ δi) or an application of a left or right id-operation (δj := αj(δi, mk)). A transformation of a data integration expression into data integration plans is described in [20]. As a simple example consider a data integration expression E(r, s, t) = t ∼v (r ⋊⋉z s). The data integration plans pr and ps for the increments of the arguments r and s are the following.

pr : δrs := δr ⋊⋉z s;
     mrs := mrs ⊕ δrs;
     δrst := t ⋉ δrs;
     result := result − δrst;

ps : δrs := r ⋊⋉z δs;
     mrs := mrs ⊕ δrs;
     δrst := t ⋉ δrs;
     result := result − δrst;

A data integration plan for the argument t is the following.

δrst := δt ∼v mrs;
result := result ∪ δrst;
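To illustrate how such a plan could be executed when a packet for t arrives, here is a small sketch with relations held as lists of dictionaries; the join/antijoin helpers, the single join attribute and the sample data are our own simplified assumptions, not the system described in the paper.

# Sketch: applying the plan for an increment of t,
#     delta_rst := delta_t ANTIJOIN m_rs;   result := result UNION delta_rst
# where m_rs = r JOIN_z s is the maintained materialization.
def join(left, right, attrs):
    index = {}
    for row in right:
        index.setdefault(tuple(row[a] for a in attrs), []).append(row)
    return [dict(l, **r) for l in left
            for r in index.get(tuple(l[a] for a in attrs), [])]

def antijoin(left, right, attrs):
    keys = {tuple(row[a] for a in attrs) for row in right}
    return [row for row in left if tuple(row[a] for a in attrs) not in keys]

r = [{"z": 1, "a": "x"}]
s = [{"z": 1, "b": "y"}, {"z": 2, "b": "w"}]
m_rs = join(r, s, ["z"])                           # materialization m_rs

result = []
delta_t = [{"z": 1, "v": 10}, {"z": 3, "v": 30}]   # a new packet for argument t
delta_rst = antijoin(delta_t, m_rs, ["z"])         # delta_t antijoined with m_rs
result.extend(delta_rst)
print(result)                                      # only the t-tuple with z = 3 survives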

A data integration plan can also be represented as an extended syntax tree where the materializations are represented as square boxes attached to the edges of the tree, see for example Figure (2) or Figure (4). We say that a data integration plan is a left/right deep data integration plan if it has a left/right deep syntax tree, i.e. a tree such that every non-leaf node has at most one non-leaf descendant node, see Figures (2) or (3). In this work we consider only left/right deep data integration plans.

V. ELIMINATION OF MATERIALIZATIONS

Elimination of materializations from data integration plans is motivated by performance reasons. When a stream of data passing through an operation of integration with a materialized view, for example in the statement mrs := mrs ⊕ δrs in the example above, is too large, then the integration takes too much time. Integration of the increments of data with a materialization is needed in a left/right deep data integration plan when its incremented argument is not one of the two arguments at the lowest level of its syntax tree. A simple solution to this problem would be to transform a left/right deep data integration plan such that the incremented argument is located at the bottom level of the syntax tree. Such a transformation is always possible when a data integration plan is built only over the join operations. When a data integration expression is built over join and antijoin operations, then in some cases the materializations cannot be removed. For example, it is impossible to eliminate the materialization mrs from the data integration plan whose syntax tree is given in Figure (2), because the increments δ(de) have no common attributes with the arguments r(ab) and s(ac). Elimination of materializations from data integration plans is controlled by the following algorithm.

Fig. 2. A case when a materialization mrs cannot be removed from a data integration plan for δ(de).

Algorithm (1)

(1) Consider a fragment of a data integration plan where an increment δ(z) and a materialization m(y) are involved in an operation α(δ(z), m(y)), see Figure (3). The operation is performed over a set of attributes xα. The objective is to eliminate the materialization m(y) from the computations of the operation α ∈ {⋊⋉, ∼, ⋉}. A materialization m(y) is the result of an operation β(r(v), s(w)) where β ∈ {⋊⋉, ∼, ⋉}. At most one of the arguments of the operation β(r(v), s(w)) is a materialization.

(2) If r(v) is not a materialization and xα ∩ v is not empty then r(v) can be reduced to r(v) ⋉ δ(xα), where δ(xα) is the projection of δ(z) on xα.

(3) If s(w) is not a materialization and xα ∩ w is not empty then s(w) can be reduced to s(w) ⋉ δ(xα).

(4) If both r(v) and s(w) are reduced then no more materializations can be eliminated because the leaf level of the left/right deep syntax tree of the data integration expression has been reached.

(5) If either r(v) or s(w) is a materialization then consider the subtree with the operation ⋉ in the root node as the operation α. Next, consider δ(z) as one of the arguments of operation α and either r(v) or s(w) as the second argument of operation α. Finally, consider the operation β whose result is either r(v) or s(w). Then, re-apply the algorithm from step (1).

Fig. 3. Syntax tree of a fragment of a data integration plan.

Fig. 4. Syntax tree of a data integration plan.

Correctness of Algorithm (1) comes from the following observations. A result of the operation αxα(δ(z), m(y)), where α is performed over the attributes xα, does not change if m(y) is reduced by δ(z), i.e. it is the same as the result of αxα(δ(z), m(y) ⋉ δ(z)) for any α ∈ {⋊⋉, ∼, ⋉}. As m(y) is a result of βxβ(r(v), s(w)) and the operation ⋉ is distributive over β ∈ {⋊⋉, ∼, ⋉}, δ(z) can be used to reduce either one or both of r(v) and s(w). Hence the operation β can be effectively recomputed on the reduced arguments and there is no need to store the materialization m(y).

As an example consider the syntax tree of a data integration plan, together with the materializations required for its computation, given in Figure (4). Application of steps (1), (2) and (3) of Algorithm (1) provides the reductions of r(ac) to r(ac) ⋉a δ(a) and of m1(bc) to m1(bc) ⋉b δ(b). It allows for elimination of the materialization m2(abc). Application of step (5) and later the repetition of steps (1), (2), and (3) provides the reductions s(bc) ⋉b δ(b) and t(bd) ⋉b δ(b). It allows for elimination of the materialization m1(bc).

Algorithm (1) can be used for the generation of all alternative data integration plans for processing all arguments of data integration expressions. An important problem is when a materialization should be removed from a data integration plan or, in other words, when a plan that uses a materialization should be replaced with another plan that does not use the materialization when processing the increments of the same argument.

A decision whether a materialization must be deleted depends on the time spent on its maintenance, i.e. the time spent on recomputation of the materialization after one of its arguments has changed. A more efficient way to "refresh" a materialization is to integrate its previous state with the increments of data "passing" through the "materialization node" in the syntax tree of the data integration plan. Then, elimination of a materialization simply depends on the amounts of increments of data to be integrated with the materialization. If such amounts of data exceed a given threshold in a given fixed period of time then an alternative plan that does not use the materialization must be considered. If due to the large processing costs a materialization must be removed from a left/right deep data integration plan then all materializations "located" above the considered materialization in the left/right deep syntax tree must be eliminated. This is because in left/right deep syntax trees there is only one path of data processing from the leaf nodes to the root node, and the increments of the arguments located at the higher levels of the tree add to the increments coming from the arguments at the lower levels of the tree. It means that at any moment of the data integration process there is a topmost materialization in the syntax tree that is still beneficial for the processing, and whenever it is possible all materializations above it are not used for data integration.

Of course it may happen that the amounts of data passing through a "materialization node" drop below the threshold and a plan that involves such a materialization and all materializations above it must be restored. In order to quickly restore the present state of a materialization without recomputing it from scratch we record the increments of data passing through the "materialization nodes". The saved increments are integrated with the latest state of the materialization to get its most up-to-date state. Elimination and restoration of materializations is controlled by the following algorithm.

Algorithm (2)

(1) Consider a left/right deep syntax tree of a data integration plan where the materializations m1, m2, . . . , mk are located along the edges of the tree starting from the lowest edge in the tree. The amounts of data that have to be integrated with the materializations in a given period of time τ are recorded at each materialization node. Initially, all materializations are empty and all materializations are maintained in the data integration plans.

(2) At the end of every period of time τ check whether the amounts of data to be integrated with m1, m2, . . . , mk do not exceed a threshold value dmax. If the amounts of data that have to be integrated with a materialization mi exceed dmax in the latest period of time τ then, whenever it is possible, the plans that use the materializations mi, mi+1, . . . , mk are replaced with the plans that do not use these materializations. Additionally, the increments of data δ(1)i, δ(2)i, . . . , δ(1)i+1, δ(2)i+1, . . . , δ(1)k, δ(2)k, . . . passing through the "materialization nodes" mi, mi+1, . . . , mk are recorded by the system.

(3) If the amounts of data passing through a "materialization node" mj, i > j, increase above dmax then the materializations mj, mj+1, . . . , mi−1 must be removed from the data integration plans in the same way as in step (2).

(4) If the amounts of data passing through a "materialization node" mj, i < j, increase above dmax then the plans are not changed.

(5) If the amounts of data passing through a "materialization node" mj, j > i, decrease below dmax then the current states of the materializations mj, mj−1, . . . , mi must be restored from the recorded sequences of increments δ(1)j, δ(2)j, . . . , δ(1)j−1, δ(2)j−1, . . . , δ(1)i, δ(2)i, . . . and the old states of the materializations.

(6) If the amounts of data passing through a "materialization node" mj, j < i, decrease below dmax and the amounts of data passing through the materialization nodes mj and above do not change then the plans are not changed.

Time complexity of the algorithm is O(n) where n is the total number of operation nodes in the left/right deep syntax tree of the data integration expression. The algorithm sequentially updates the total amounts of data passing through the materialization nodes at the end of each period of time τ. Whenever a change of the execution plan is required the new plans are taken from a table, which is also sequentially searched.
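As a rough illustration of step (2) of Algorithm (2), the hypothetical C sketch below scans the traffic recorded at the materialization nodes m1, . . . , mk in the latest period τ and reports the lowest node whose traffic exceeded dmax; by the left/right deep argument above, that node and all materializations above it would then be dropped from the plans. The array layout and all names are illustrative assumptions, not the authors' implementation.

#include <stdio.h>

/* traffic[i] holds the amount of data that had to be integrated with
   materialization m(i+1) during the latest period tau; index 0 is the
   lowest materialization node in the left/right deep syntax tree.      */
int lowest_overloaded_node(const long traffic[], int k, long dmax)
{
    for (int i = 0; i < k; i++)
        if (traffic[i] > dmax)
            return i;   /* m(i+1) and every materialization above it
                           should be dropped from the plans            */
    return -1;          /* all nodes stayed below the threshold        */
}

int main(void)
{
    long traffic[4] = { 120, 5400, 300, 90 };   /* made-up per-period volumes */
    int first = lowest_overloaded_node(traffic, 4, 1000);
    if (first >= 0)
        printf("eliminate materializations m%d..m4 from the plans\n", first + 1);
    else
        printf("keep all materializations\n");
    return 0;
}

Restoration (steps (5) and (6)) would additionally require replaying the recorded increments onto the stored states of the materializations, which is outside this small sketch.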

VI. SUMMARY, OPEN PROBLEMS

Elimination of materializations from data integration plans is required when the maintenance of materializations becomes too time consuming due to the increased intensity of data increments passing through the "materialization nodes" in the data integration tree. Then, it is worth replacing the current plans with new ones that do not use the materializations. This work shows how to construct the left/right deep data integration plans that do not use given materializations and when construction of such plans is possible. In particular, we describe an algorithm that generates all data integration plans for a given data integration expression and a given set of arguments. We also show when a materialization cannot be removed from a data integration plan. Next, we propose a procedure that dynamically changes data integration plans in reply to the increasing costs of maintenance of the selected materializations.

Data integration plans considered in this work are limited to left/right deep plans, i.e. the plans whose syntax tree is a left/right deep syntax tree. In a general case, some of the distributed database applications do not have left/right deep plans or their "bushy" plans cannot be transformed into equivalent left/right deep plans. More research is needed to consider elimination of materializations from "bushy" data integration plans.

Another area that still needs more research is a more precise evaluation of the costs and benefits coming from the elimination of materializations. The algorithm proposed in this work considers only the benefits coming from elimination of data integration at the materialization "maintenance nodes". The costs include the additional operations that must be performed on the increments and other arguments of data integration plans. An interesting problem is what happens when a materialization must be restored due to the changing intensity of arriving increments of data. The costs involved are not included in the balance of costs and benefits in the current model. It is also interesting how the materializations can be restored to the most up to date state in a more efficient way than by re-applying the stored modifications.

REFERENCES

[1] V. Srinivasan and M. J. Carey, “Compensation-based on-line query processing,” in Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data, 1992, pp. 331–340.

[2] F. Ozcan, S. Nural, P. Koksal, C. Evrendilek, and A. Dogac, “Dynamic query optimization in multidatabases,” Bulletin of the Technical Committee on Data Engineering, vol. 20, pp. 38–45, March 1997.

[3] R. L. Cole and G. Graefe, “Optimization of dynamic query evaluation plans,” in Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, 1994.

[4] N. Kabra and D. J. DeWitt, “Efficient mid-query re-optimization of sub-optimal query execution plans,” in Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, 1998.

[5] J. Chudziak and J. R. Getta, “On efficient query evaluation in multidatabase systems,” in Second International Workshop on Advances in Database and Information Systems, ADBIS’95, 1995, pp. 46–54.

[6] J. R. Getta and S. Sedighi, “Optimizing global query processing plans in heterogeneous and distributed multi database systems,” in 10th Intl. Workshop on Database and Expert Systems Applications, DEXA 1999, 1999, pp. 12–16.

[7] J. R. Getta, “Query scrambling in distributed multidatabase systems,” in 11th Intl. Workshop on Database and Expert Systems Applications, DEXA 2000, 2000.

[8] T. Urhan, M. J. Franklin, and L. Amsaleg, “Cost based query scrambling for initial delays,” in SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, 1998, pp. 130–141.

[9] L. Amsaleg, J. Franklin, and A. Tomasic, “Dynamic query operator scheduling for wide-area remote access,” Journal of Distributed and Parallel Databases, vol. 6, pp. 217–246, 1998.

[10] T. Urhan and M. J. Franklin, “Dynamic pipeline scheduling for improving interactive performance of online queries,” in Proceedings of International Conference on Very Large Databases, VLDB 2001, 2001.

[11] R. Avnur and J. M. Hellerstein, “Eddies: Continuously adaptive query processing,” in Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, 2000, pp. 261–272.

[12] T. Urhan and M. J. Franklin, “XJoin: A reactively-scheduled pipelined join operator,” IEEE Data Engineering Bulletin 23(2), pp. 27–33, 2000.

[13] P. J. Haas and J. M. Hellerstein, “Ripple joins for online aggregation,” in SIGMOD 1999, Proceedings ACM SIGMOD Intl. Conf. on Management of Data, 1999, pp. 287–298.

[14] Z. G. Ives, D. Florescu, M. Friedman, A. Y. Levy, and D. S. Weld, “An adaptive query execution system for data integration,” in Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999, pp. 299–310.

[15] M. F. Mokbel, M. Lu, and W. G. Aref, “Hash-merge join: A non-blocking join algorithm for producing fast and early join results,” 2002.

[16] G. Antoshenkov and M. Ziauddin, “Query processing and optimization in oracle rdb,” VLDB Journal, vol. 5, no. 4, pp. 229–237, 2000.

[17] V. Raman, A. Deshpande, and J. M. Hellerstein, “Using state modules for adaptive query processing,” in Proceedings of the 19th International Conference on Data Engineering, 2003, pp. 353–.

[18] J. R. Getta, “On adaptive and online data integration,” in Intl. Workshop on Self-Managing Database Systems, 21st Intl. Conf. on Data Engineering, ICDE’05, 2005, pp. 1212–1220.

[19] ——, “Optimization of online data integration,” in Seventh International Conference on Databases and Information Systems, 2006, pp. 91–97.

[20] ——, “Static optimization of data integration plans in global information systems,” in 13th International Conference on Enterprise Information Systems, June 2011, pp. 141–150.

[21] ——, “Optimization of task processing schedules in distributed information systems,” in International Conference on Informatics Engineering and Information Science, November 2011.

[22] L. Bouganim, F. Fabret, and C. Mohan, “A dynamic query processing architecture for data integration systems,” Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 42–48, June 2000.

[23] G. Graefe, “Dynamic query evaluation plans: Some course corrections?” Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 3–6, June 2000.

[24] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. A. Shah, “Adaptive query processing: Technology in evolution,” Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 7–18, June 2000.

[25] Z. G. Ives, A. Y. Levy, D. S. Weld, D. Florescu, and M. Friedman, “Adaptive query processing for internet applications,” Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 2, pp. 19–26, June 2000.

[26] A. Gounaris, N. W. Paton, A. A. Fernandes, and R. Sakellariou, “Adaptive query processing: A survey,” in Proceedings of 19th British National Conference on Databases, 2002, pp. 11–25.


A variable neighbourhood search heuristic for the design of codes

R. Montemanni, M. Salani
Istituto Dalle Molle di Studi sull’Intelligenza Artificiale
Scuola Universitaria Professionale della Svizzera Italiana
Galleria 2, 6928 Manno, Canton Ticino, Switzerland
Email: roberto, [email protected]

D. H. Smith, F. Hunt
Division of Mathematics and Statistics
University of Glamorgan
Pontypridd CF37 1DL, Wales, United Kingdom
Email: dhsmith, [email protected]

Abstract—Codes play a central role in information theory. A code is a set of words of a given length from a given alphabet of symbols. The words of a code have to fulfil some application-dependent constraints in order to guarantee some form of separation between the words. Typical applications of codes are for error-correction following transmission or storage of data, or for modulation of signals in communications. The target is to have codes with as many codewords as possible, in order to maximise efficiency and provide freedom to engineers designing the applications. In this paper a variable neighbourhood search framework, used to construct codes in a heuristic fashion, is described. Results on different types of codes of practical interest are presented, showing the potential of the new tool.

Index Terms—Code design, heuristic algorithms, variable neighbourhood search.

I. INTRODUCTION

Code design is a central problem in the field of information theory. A code is a set of words of a given length, composed from a given alphabet, and with some application-dependent characteristics that typically guarantee some form of separation between the words. Codes are usually adopted for error correction of data, or for modulation of signals in communications. Codes have also found use in some biological applications recently [1]. The target is normally to have codes with as many words as possible. In the engineering applications this maximises efficiency and provides engineers with the maximum possible freedom when designing communication systems or other specific applications, as described in the second part of this paper. This choice of target makes it natural to formalise the problem as a combinatorial optimisation problem. Depending on the underlying real applications, several types of code can be of practical interest.

Many approaches to solve these problems have been proposed in recent decades. So far, most research effort to construct good codes has been based on abstract algebra and group theory [2], [3], while only a marginal exploration of heuristic algorithms has been carried out. In fact codes for error correction do need an algebraic construction to ensure efficient decoding. This is not the case in some other applications, for which heuristic techniques can be used.

In this paper a set of heuristic algorithms will be described, and results obtained with them on some code design problems will be summarised. These problems, which have been previously studied in the literature, are used in different practical applications. The paper demonstrates that heuristics are a valuable additional tool that can be successfully used in designing good codes.

II. CODE DESIGN PROBLEMS

A code is a set of words of a given length defined over a given alphabet that fulfils some defined properties. The most typical constraint is on the Hamming distance between each pair of words. The Hamming distance d(x,y) between two words x and y is defined as the number of positions in which they differ. The minimum distance of a code is the minimum Hamming distance between any pair of words of the code. Side-constraints, which depend on the specific application for which the codes are defined, are also present. The objective of the problem is to find a code that fulfils all the constraints and contains the maximum possible number of words.
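To make the definitions concrete, the small C sketch below (a toy example, not part of the algorithms discussed later) computes the Hamming distance between two equal-length words and the minimum distance of a code stored as an array of strings.

#include <stdio.h>

/* Hamming distance: number of positions in which two equal-length words differ. */
int hamming(const char *x, const char *y)
{
    int d = 0;
    for (int i = 0; x[i] != '\0' && y[i] != '\0'; i++)
        if (x[i] != y[i])
            d++;
    return d;
}

/* Minimum distance of a code: smallest Hamming distance over all pairs of words. */
int minimum_distance(const char *code[], int m)
{
    int dmin = -1;
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < m; j++) {
            int d = hamming(code[i], code[j]);
            if (dmin < 0 || d < dmin)
                dmin = d;
        }
    return dmin;
}

int main(void)
{
    const char *code[] = { "ACGT", "AATT", "GGCC" };   /* toy quaternary code */
    printf("minimum distance = %d\n", minimum_distance(code, 3));
    return 0;
}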

Code design problems can easily be described in terms of combinatorial optimisation, making it possible to apply heuristic optimisation algorithms to them.

III. A VARIABLE NEIGHBOURHOOD SEARCH FRAMEWORK

A Variable Neighbourhood Search (VNS) algorithm [4] that combines a set of local search routines is presented. First, the local searches embedded in the algorithm are briefly described. The interested reader is referred to [5] for more details.

A. Seed Building

A simple heuristic method to build codes examines all possible words in a given order, and incrementally accepts words that are feasible with respect to already accepted ones. The Seed Building method is built on these orderings, which are combined with the concept of seed words [6]. These seed words are an initial set of feasible words to which words are added in a given (problem dependent) order if they satisfy the necessary criteria.

The set of seeds is initially empty, and one feasible random seed is added at a time. If the new seed set leads to good results when a code is built from it, the seed is kept and a new random seed is designated for testing. This increases the size of the seed set. The same rationale, which is based on some simple statistics, is used to decide whether to keep subsequent seeds or not. If after a given number of iterations the quality of the solutions provided by a set of seeds is judged to be not good enough, the most recent seed is eliminated from the set, which therefore results in a reduction in the size of the seed set. In this way the set of seeds is expanded or contracted depending on the quality of the solutions built using the set itself. What happens in practice is that the size of the seed set oscillates through a range of small values. The algorithm is stopped after a given time has elapsed.
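A minimal sketch of the greedy construction that Seed Building is based on is given below in hypothetical C: words over a q-ary alphabet are scanned in lexicographic order and accepted when they keep a minimum distance of at least d to everything accepted so far, starting from a given seed set. The statistics used to grow and shrink the seed set are omitted, and the parameters N, Q and D are illustrative.

#include <stdio.h>
#include <string.h>

#define N 6           /* word length (illustrative)        */
#define Q 2           /* alphabet size (illustrative)      */
#define D 3           /* required minimum Hamming distance */
#define MAXCODE 4096

static int hamming(const char *x, const char *y)
{
    int d = 0;
    for (int i = 0; i < N; i++)
        if (x[i] != y[i]) d++;
    return d;
}

/* Advance w to the next word in lexicographic order; return 0 after the last one. */
static int next_word(char *w)
{
    for (int i = N - 1; i >= 0; i--) {
        if (++w[i] < Q) return 1;
        w[i] = 0;
    }
    return 0;
}

int main(void)
{
    static char code[MAXCODE][N];
    int m = 0;

    memset(code[m++], 0, N);       /* seed set: here a single seed, the all-zero word */

    char w[N] = { 0 };
    do {                           /* examine all words in lexicographic order        */
        int ok = 1;
        for (int i = 0; i < m && ok; i++)
            if (hamming(w, code[i]) < D) ok = 0;   /* feasible w.r.t. accepted words? */
        if (ok && m < MAXCODE)
            memcpy(code[m++], w, N);
    } while (next_word(w));

    printf("greedy code size for n=%d, q=%d, d=%d: %d\n", N, Q, D, m);
    return 0;
}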

B. Clique Search

The idea at the basis of this local search method is that it is possible to complete a partial code in the best possible way by solving a maximum clique problem (see [7], [8]). More precisely, given a code, a random subset of the words is removed, leaving a partial code. It is possible to identify all the feasible words compatible with those already in the code, and build a graph from these words, where words are represented by vertices. Two vertices are connected if and only if the pair of words respects all of the constraints considered. It is then possible to run a maximum clique algorithm on the graph in order to complete the partial code in the best possible way. Heuristic or exact methods can be used to solve the maximum clique problem. In the implementation described here an exact or truncated version of the algorithm presented in [9] is used. The search is run repeatedly, with different random subsets. The algorithm is stopped after a given time has elapsed.

C. Hybrid Search

This Hybrid Search method merges together the main concepts at the basis of the two local search algorithms described in Sections III-A and III-B. There is a (small) set SeedSet of words that play the role of the seeds of algorithm Seed Building. A set of words which are compatible with the elements of SeedSet (as in Clique Search), and which are also compatible with each other in a weaker form, are identified and saved in a set V. More precisely, there is a parameter µ which is used to model the concept of weak compatibility: the words in V have to be compatible with each other according to a relaxed distance d′ = d − µ.

Weak compatibility is used to keep the set V at a reasonable size, even in the case of a very small set of seeds. A compatibility graph, for the creation of which the original distance d is used, is built on the vertex set V, as described in Section III-B. A maximum clique problem is solved on this graph. The mechanism described in Section III-A for the management of seed words is adopted unchanged here. In this way the set SeedSet is expanded and contracted during the computation. The algorithm is stopped after a given time has elapsed.

D. Iterated Greedy Search

This method is different from the local search approaches previously described in the way solutions are handled. The algorithms described in the previous sections maintain a set of feasible words (i.e. respecting all the constraints) and try to enlarge this set. The Iterated Greedy Search method, which is inspired by the method discussed in [10], works on an infeasible set of words (i.e. not all of the words are compatible with each other, according to the constraints). The method evolves by modifying words of a current solution S with the target of reducing a measure Inf(S) of the constraint violations. If no violation remains, then a feasible solution has been retrieved.

In more detail, the local search method works as follows. An (infeasible) solution S is created by replacing a given percentage of the words of a given feasible solution by randomly generated words. A random word is added to solution S. The following operations are then repeated until a feasible solution has been retrieved (i.e. Inf(S) = 0), or a given number of iterations has been spent without improvement. A word cw of solution S is selected at random, and the change of one of its coordinates that guarantees the maximum decrease in the infeasibility measure is selected (ties among possible modifications of cw are broken randomly). The word cw is then modified accordingly.

When a feasible solution is retrieved, it is saved and the procedure is repeated, starting from the new solution; otherwise the last feasible solution is restored and the procedure is applied to this solution. The algorithm is stopped after a given time has elapsed.
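The core move of Iterated Greedy Search can be sketched as follows (hypothetical C, with illustrative parameters): the infeasibility measure Inf(S) counts the pairs of words violating the distance constraint, and one coordinate of a randomly chosen word is changed to the symbol that lowers this count the most. The random tie-breaking, the partial restart from a feasible solution and the stopping rule described above are omitted for brevity.

#include <stdlib.h>

#define N 8      /* word length (illustrative)    */
#define Q 4      /* alphabet size (illustrative)  */
#define D 5      /* required minimum distance     */

static int hamming(const char *x, const char *y)
{
    int d = 0;
    for (int i = 0; i < N; i++)
        if (x[i] != y[i]) d++;
    return d;
}

/* Inf(S): number of pairs of words whose Hamming distance is below D. */
int infeasibility(char S[][N], int m)
{
    int v = 0;
    for (int i = 0; i < m; i++)
        for (int j = i + 1; j < m; j++)
            if (hamming(S[i], S[j]) < D) v++;
    return v;
}

/* One local move: pick a random word and apply the coordinate/symbol change
   that yields the largest decrease of the infeasibility measure. */
void improve_once(char S[][N], int m)
{
    int w = rand() % m;
    int best_pos = -1, best_sym = 0, best_val = infeasibility(S, m);

    for (int pos = 0; pos < N; pos++) {
        char old = S[w][pos];
        for (char sym = 0; sym < Q; sym++) {
            if (sym == old) continue;
            S[w][pos] = sym;                 /* try the modification          */
            int v = infeasibility(S, m);
            if (v < best_val) { best_val = v; best_pos = pos; best_sym = sym; }
            S[w][pos] = old;                 /* undo it                       */
        }
    }
    if (best_pos >= 0)
        S[w][best_pos] = (char)best_sym;     /* commit the best change found  */
}

int main(void)
{
    static char S[16][N];
    int m = 16;
    for (int i = 0; i < m; i++)              /* random (infeasible) start     */
        for (int j = 0; j < N; j++)
            S[i][j] = (char)(rand() % Q);
    for (int it = 0; it < 10000 && infeasibility(S, m) > 0; it++)
        improve_once(S, m);
    return infeasibility(S, m) == 0 ? 0 : 1;
}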

E. A Variable Neighbourhood Search approach

Variable Neighbourhood Search (VNS) methods have been demonstrated to perform well and are robust (see [4]). Such algorithms work by applying different local search algorithms one after the other, aiming at differentiating the characteristics of the search-spaces visited (i.e. changing the neighbourhood). The rationale behind the idea is that combining together different local search methods, that use different optimisation logics, can lead to an algorithm capable of escaping from the local optima identified by each local search algorithm, with the help of the other local search methods.

In our context, some of the local search methods previously described are applied in turn, starting each time from the best solution retrieved since the beginning (or from an empty solution, in the case of Seed Building). The algorithm is stopped after a given time has elapsed.

IV. CONSTANT WEIGHT BINARY CODES

A constant weight binary code is a set of binary vectors of length n, weight w and minimum Hamming distance d. The weight of a binary vector (or word) is the number of 1’s in the vector. The minimum distance of a code is the minimum Hamming distance between any pair of words. The maximum possible number of words in a constant weight code is referred to as A(n,d,w).
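For constant weight binary codes it is convenient to store words as bit masks, as in the hypothetical C sketch below: the weight is the popcount of a word, the Hamming distance is the popcount of the XOR of two words, and a simple greedy scan over all weight-w words yields a (generally weak) lower bound on A(n,d,w). It is meant only to make the definitions concrete, not to reproduce the algorithms of Section III.

#include <stdio.h>

static int popcount(unsigned v)          /* number of 1 bits */
{
    int c = 0;
    while (v) { v &= v - 1; c++; }
    return c;
}

/* Greedy lower bound on A(n,d,w): scan all weight-w words of length n and
   accept a word when its distance to every accepted word is at least d. */
int greedy_constant_weight(int n, int d, int w)
{
    enum { MAXC = 4096 };
    static unsigned code[MAXC];
    int m = 0;

    for (unsigned x = 0; x < (1u << n); x++) {
        if (popcount(x) != w) continue;            /* constant weight constraint */
        int ok = 1;
        for (int i = 0; i < m && ok; i++)
            if (popcount(x ^ code[i]) < d) ok = 0; /* Hamming distance via XOR   */
        if (ok && m < MAXC) code[m++] = x;
    }
    return m;
}

int main(void)
{
    printf("greedy bound for A(10,4,5): %d\n", greedy_constant_weight(10, 4, 5));
    return 0;
}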

Apart from their important role in the theory of error-correcting codes, constant weight codes have also found application in fields as diverse as the design of demultiplexers for nano-scale memories [11] and the construction of frequency hopping lists for use in GSM networks [12].


TABLE I
IMPROVED CONSTANT WEIGHT BINARY CODES.

Problem Old LB New LB Problem Old LB New LB Problem Old LB New LB Problem Old LB New LB Problem Old LB New LB

A(41,6,5) 755 779a A(48,6,6) 7845 7869a A(31,8,7) 363 375f A(51,10,6) 60 64d A(40,10,8) 318 324d

A(42,6,5) 817 841b A(49,6,6) 8568 8605b A(32,8,7) 403 418f A(52,10,6) 60 67a A(41,10,8) 353 362d

A(43,6,5) 874 910b A(50,6,6) 9348 9380b A(33,8,7) 444 466a A(53,10,6) 63 70d A(42,10,8) 390 398a

A(44,6,5) 941 975b A(51,6,6) 10175 10210b A(34,8,7) 498 516a A(54,10,6) 65 73a A(43,10,8) 432 445f

A(45,6,5) 1009 1030b A(33,8,5) 44 45d A(35,8,7) 555 570a A(55,10,6) 68 73h A(44,10,8) 484 487a

A(46,6,5) 1097 1114a A(38,8,6) 231 236b A(36,8,7) 622 637b A(56,10,6) 70 79d A(45,10,8) 532 544c

A(47,6,5) 1172 1181b A(39,8,6) 252 254b A(37,8,7) 696 718a A(57,10,6) 70 83d A(46,10,8) 590 595a

A(48,6,5) 1254 1269b A(40,8,6) 275 281b A(38,8,7) 785 795b A(58,10,6) 72 85b A(47,10,8) 642 656e

A(49,6,5) 1343 1347c A(41,8,6) 294 297f A(39,8,7) 869 893a A(59,10,6) 77 87c A(48,10,8) 711 720e

A(50,6,5) 1429 1459a A(43,8,6) 343 347f A(40,8,7) 977 999a A(60,10,6) 79 91a A(49,10,8) 776 785e

A(51,6,5) 1517 1543b A(44,8,6) 355 381f A(41,8,7) 1095 1110a A(61,10,6) 83 94a A(50,10,8) 852 858e

A(52,6,5) 1617 1654a A(45,8,6) 381 403f A(42,8,7) 1206 1227b A(62,10,6) 84 98c A(51,10,8) 929 934e

A(53,6,5) 1719 1758c A(46,8,6) 411 432f A(43,8,7) 1347 1365a A(29,10,7) 37 39d A(52,10,8) 1007 1018e

A(54,6,5) 1822 1840f A(47,8,6) 440 463c A(44,8,7) 1478 1503a A(36,10,7) 75 78c A(55,10,8) 1289 1296e

A(55,6,5) 1936 1948b A(48,8,6) 477 494b A(45,8,7) 1639 1653f A(42,10,7) 133 137c A(56,10,8) 1405 1408a

A(32,6,6) 1353 1369a A(49,8,6) 501 527f A(46,8,7) 1795 1813f A(56,10,7) 351 358b A(32,12,7) 9 10
A(33,6,6) 1528 1560a A(50,8,6) 542 567f A(47,8,7) 1987 2001f A(57,10,7) 366 374b A(33,12,7) 10 11
A(34,6,6) 1740 1771b A(51,8,6) 576 606f A(48,8,7) 2173 2197f A(58,10,7) 394 399b A(36,12,7) 15 16
A(35,6,6) 1973 1998b A(52,8,6) 609 640c A(49,8,7) 2376 2399b A(59,10,7) 414 423b A(37,12,7) 16 17
A(36,6,6) 2240 2264b A(53,8,6) 650 687c A(50,8,7) 2603 2615f A(60,10,7) 431 449b A(39,12,7) 19 21
A(37,6,6) 2539 2560f A(54,8,6) 682 726b A(51,8,7) 2839 2866f A(61,10,7) 458 474b A(40,12,7) 20 22b

A(38,6,6) 2836 2860b A(55,8,6) 729 768f A(52,8,7) 3101 3118f A(62,10,7) 486 497b A(37,12,8) 40 42d

A(39,6,6) 3167 3208a A(56,8,6) 766 815f A(53,8,7) 3376 3384f A(63,10,7) 514 526b A(38,12,8) 40 45a

A(40,6,6) 3545 3575a A(57,8,6) 830 866f A(54,8,7) 3651 3667b A(30,10,8) 92 93d A(43,14,8) 10 12
A(41,6,6) 3964 3983a A(58,8,6) 872 912f A(55,8,7) 3941 3989b A(33,10,8) 134 140d A(44,14,8) 12 13
A(42,6,6) 4397 4419b A(59,8,6) 935 965f A(56,8,7) 4270 4318b A(34,10,8) 156 162a A(45,14,8) 12 15
A(43,6,6) 4860 4890b A(60,8,6) 982 1019f A(59,8,7) 5384 5386f A(35,10,8) 176 182d A(46,14,8) 13 17
A(44,6,6) 5378 5414a A(61,8,6) 1028 1077f A(45,10,6) 49 50a A(36,10,8) 198 205d A(47,14,8) 15 18
A(45,6,6) 5933 5959b A(62,8,6) 1079 1130f A(48,10,6) 56 57a A(37,10,8) 223 230d A(48,14,8) 18 19
A(46,6,6) 6521 6552a A(63,8,6) 1143 1195f A(49,10,6) 56 59f A(38,10,8) 249 259a

A(47,6,6) 7160 7194a A(30,8,7) 327 340f A(50,10,6) 56 62f A(39,10,8) 285 291d

a Later improved in [13] by a specific group of automorphisms or a combinatorial construction.
b Later improved in [13] by heuristic polishing of a group code or a combinatorial construction.
c Later improved in [13] by shortening a code of length n+1 and weight w.
d Later improved in [13] by a cyclic group.
e Later improved in [13] by shortening a code of length n+1 and weight w or w+1 and a heuristic improvement.
f Later improved in [13] by an unspecified method.

A VNS algorithm combining Seed Building and Clique Search (see Section III) was proposed in [8], and it was shown to be able to improve best-known results from the literature for many instances with parameter settings appropriate to the frequency hopping applications (29 ≤ n ≤ 63 and 5 ≤ w ≤ 8 with d = 2w−2, d = 2w−4 or d = 2w−6), for which mathematical constructions were not very well developed previously (the interested reader is referred to [8] for a detailed description of parameter tuning and experimental settings). The instances improved with respect to the state-of-the-art are summarised in Table I, where the new lower bounds provided by the VNS method (New LB) are compared with the previously best-known results (Old LB). Many results were improved, especially for large values of n. These were instances for which the previous methods used were not particularly effective. Notice that most of the instances reported in the table were later further improved by other methods, many of which again make use of heuristic criteria. See [13] for full details.

V. QUATERNARY DNA CODES

Quaternary DNA codes are sets of words of fixed length n over the alphabet {A, C, G, T}. The words of a code have to satisfy the following combinatorial constraints. For each pair of words, the Hamming distance has to be at least d (constraint HD); a fixed number (here taken as ⌊n/2⌋) of letters of each word have to be either G or C (constraint GC); the Hamming distance between each word and the Watson-Crick complement (or reverse-complement) of each word has to be at least d (constraint RC), where the Watson-Crick complement of a word x1x2 . . . xn is defined as the reversed word of complemented bases x̄n x̄n−1 . . . x̄1, with Ā = T, T̄ = A, C̄ = G, Ḡ = C. If the number of letters which are G or C in each word is ⌊n/2⌋, then AGC4(n,d,⌊n/2⌋) is used to denote the maximum number of words in a code satisfying constraints HD and GC. AGC,RC4(n,d,⌊n/2⌋) is used to denote the maximum number of words in a code satisfying constraints HD, GC and RC.
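The three constraints can be made concrete with the hypothetical C sketch below, which checks whether a given set of words over {A, C, G, T} satisfies HD, GC and RC for given n and d; it follows the definitions above and is not the authors' code.

#include <stdio.h>

static int hamming(const char *x, const char *y, int n)
{
    int d = 0;
    for (int i = 0; i < n; i++)
        if (x[i] != y[i]) d++;
    return d;
}

static char complement(char b)            /* Watson-Crick base complement */
{
    switch (b) {
    case 'A': return 'T';
    case 'T': return 'A';
    case 'C': return 'G';
    default:  return 'C';                 /* 'G' -> 'C' */
    }
}

/* Reverse-complement of x: complement the bases and reverse the word. */
static void wc_complement(const char *x, char *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = complement(x[n - 1 - i]);
    out[n] = '\0';
}

static int gc_content(const char *x, int n) /* number of G or C letters */
{
    int c = 0;
    for (int i = 0; i < n; i++)
        if (x[i] == 'G' || x[i] == 'C') c++;
    return c;
}

/* Returns 1 if the code satisfies constraints HD, GC and RC, 0 otherwise. */
int check_dna_code(const char *code[], int m, int n, int d)
{
    char rc[64];
    for (int i = 0; i < m; i++) {
        if (gc_content(code[i], n) != n / 2) return 0;                 /* GC */
        for (int j = 0; j < m; j++) {
            if (i != j && hamming(code[i], code[j], n) < d) return 0;  /* HD */
            wc_complement(code[j], rc, n);
            if (hamming(code[i], rc, n) < d) return 0;                 /* RC */
        }
    }
    return 1;
}

int main(void)
{
    const char *code[] = { "AACG", "GTGA" };   /* toy example, n = 4, d = 3 */
    printf("feasible: %d\n", check_dna_code(code, 2, 4, 3));
    return 0;
}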

Quaternary DNA codes have applications to information storage and retrieval in synthetic DNA strands. They are used in DNA computing, as probes in DNA microarray technologies and as molecular bar codes for chemical libraries [5]. Constraints HD and RC are used to make unwanted hybridisations less likely, while constraint GC is imposed to ensure uniform “melting temperatures”, where DNA melting is the process by which double-stranded DNA unwinds and separates into single strands through the breaking of hydrogen bonding between the bases. Such constraints have been used, for example, in [2], [14], where more detailed technical motivations for the constraints can be found. Lower bounds for quaternary DNA codes obtained using different tools such as mathematical constructions, stochastic searches, template-map strategies, genetic algorithms and lexicographic searches have been proposed (see [7], [14], [2], [15], [10], [5], [16], [17] and [18]).

A VNS method embedding all the local search routines described in Section III was implemented in [5], [16]. Experiments were conducted for AGC4(n,d,w) and AGC,RC4(n,d,w) with 4 ≤ n ≤ 20, 3 ≤ d ≤ n and 21 ≤ n ≤ 30, 13 ≤ d ≤ n. In Table II the new lower bounds (New LB) retrieved by VNS during the experiments that improved the previous state-of-the-art results (Old LB) are summarised. It is interesting to observe how substantial the improvements for this family of codes sometimes are (see, for example, AGC4(19,10,9) and AGC4(19,11,9)). The reader is referred to [16] for a detailed description of parameter tuning and experimental settings.

VI. PERMUTATION CODES

A permutation code is a set of permutations in the symmetric group Sn of all permutations on n elements. The words are the permutations and the code length is n. The ability of a permutation code to correct errors is related to the minimum Hamming distance of the code. The minimum Hamming distance d is then the minimum distance taken over all pairs of distinct permutations. The maximum number of words in a code of length n with minimum distance d is denoted by M(n,d).

Permutation codes (sometimes called permutation arrays) have been proposed in [19] for use with a specific modulation scheme for powerline communications. An account of the rationale for the choice of permutation codes can be found in [3]. Permutations are used to ensure that power output remains as constant as possible. As well as white Gaussian noise the codes must combat permanent narrow band noise from electrical equipment or magnetic fields, and impulsive noise.

A central practical question in the theory of permutation codes is the determination of M(n,d), or of good lower bounds for M(n,d). The most complete contribution to this question is in [3]. More recently, different methods, both based on permutation groups and heuristic algorithms, have been presented in [20]. In this paper a VNS approach involving Clique Search only (basically an Iterated Clique Search method) was introduced among other approaches. In some cases the method was run on cycles of words of length n or n−1 instead of words. This reduces the complexity of the problem, making it tractable by the VNS approach.

TABLE IIIMPROVED QUATERNARY DNA CODES.

Problem Old New Problem Old NewLB LB LB LB

AGC4 (7,3,3) 280 288 AGC,RC

4 (12,7,6) 83 87AGC

4 (7,4,3) 72 78g AGC,RC4 (12,8,6) 28 29

AGC4 (8,5,4) 56 63 AGC,RC

4 (12,9,6) 11 12AGC

4 (8,6,4) 24 28 AGC,RC4 (13,5,6) 3954 3974

AGC4 (9,6,4) 40 48 AGC,RC

4 (13,7,6) 205 206AGC

4 (9,7,4) 16 18 AGC,RC4 (13,8,6) 61 62

AGC4 (10,4,5) 1710 2016g AGC,RC

4 (13,9,6) 22 23AGC

4 (10,7,5) 32 34 AGC,RC4 (13,10,6) 9 10

AGC4 (11,7,5) 72 75 AGC,RC

4 (14,9,7) 46 49AGC

4 (11,9,5) 10 11 AGC,RC4 (14,10,7) 16 20

AGC4 (12,7,6) 179 183 AGC,RC

4 (14,11,7) 7 8AGC

4 (12,8,6) 68 118 AGC,RC4 (15,6,7) 6430 6634

AGC4 (12,9,6) 23 24 AGC,RC

4 (15,8,7) 343 347AGC

4 (13,9,6) 44 46 AGC,RC4 (15,9,7) 102 109

AGC4 (14,11,7) 16 17 AGC,RC

4 (15,10,7) 35 37AGC

4 (15,9,7) 225 227 AGC,RC4 (16,9,8) 230 243

AGC4 (15,11,7) 30 34 AGC,RC

4 (16,10,8) 74 83AGC

4 (15,12,7) 13 15 AGC,RC4 (17,9,8) 549 579

AGC4 (17,13,8) 22 24 AGC,RC

4 (17,10,8) 164 175AGC

4 (18,11,9) 216 282 AGC,RC4 (17,11,8) 56 62

AGC4 (18,13,9) 38 46 AGC,RC

4 (17,13,8) 11 12AGC

4 (18,14,9) 18 20 AGC,RC4 (18,9,9) 1403 1459

AGC4 (19,10,9) 1326 2047 AGC,RC

4 (18,10,9) 387 407AGC

4 (19,11,9) 431 615 AGC,RC4 (18,11,9) 104 133

AGC4 (19,12,9) 163 213 AGC,RC

4 (18,12,9) 43 49AGC

4 (19,13,9) 71 83 AGC,RC4 (18,13,9) 19 21

AGC4 (19,14,9) 33 38 AGC,RC

4 (18,14,9) 9 10AGC

4 (19,15,9) 15 17 AGC,RC4 (19,9,9) 3519 3678

AGC4 (20,13,10) 130 167 AGC,RC

4 (19,10,9) 909 960AGC

4 (20,14,10) 58 69 AGC,RC4 (19,11,9) 215 285

AGC4 (20,15,10) 31 33 AGC,RC

4 (19,12,9) 80 99AGC

4 (20,16,10) 13 16 AGC,RC4 (19,13,9) 35 39

AGC,RC4 (9,6,4) 20 21 AGC,RC

4 (19,14,9) 16 18AGC,RC

4 (10,5,5) 175 176 AGC,RC4 (19,15,9) 7 8

AGC,RC4 (10,7,5) 16 17 AGC,RC

4 (20,13,10) 64 77AGC,RC

4 (11,7,5) 36 37 AGC,RC4 (20,14,10) 29 33

AGC,RC4 (11,8,5) 13 14 AGC,RC

4 (20,15,10) 14 15AGC,RC

4 (12,5,6) 1369 1381 AGC,RC4 (20,16,10) 6 7

g Later improved in [16] by a heuristic approach based on anEvolutionary Algorithm.

Experimental results (see [20] for details on parameter tuning and experimental settings) were discussed for 6 ≤ n ≤ 18 and 4 ≤ d ≤ 18, plus M(19,17) and M(20,19). The new best-known results retrieved by VNS (New LB) are summarised in Table III, where they are compared with the previous state-of-the-art results (Old LB). Superscripts reflect the domain on which the VNS method was run. Besides providing the first non-trivial bound for some of the instances, the algorithm was also able to provide substantial improvements over the previous best-known results (see, for example, M(15,13)).

VII. PERMUTATION CODES WITH SPECIFIED PACKING RADIUS

Using the notation introduced in the previous section, a ball of radius e surrounding a word w ∈ Sn is composed of all the permutations of Sn with Hamming distance from w at most e. Given a permutation code C, the packing radius of C is defined as the maximum value of e such that the balls of radius e centred at words of C do not overlap. The maximum number of permutations of length n with packing radius at least e is denoted by P[n,e].

TABLE III
IMPROVED PERMUTATION CODES.

Problem Old LB New LB Problem Old LB New LB
M(13,8) - 27132h M(15,14) - 56
M(13,9) 3588 4810 M(16,13) - 1266
M(13,10) - 906 M(16,14) - 269
M(13,11) - 195i M(18,17) 54 70
M(14,13) - 52 M(19,17) - 343
M(15,11) - 6076h M(20,19) - 78
M(15,13) 84 243h

h VNS on cycles of words of length n−1 instead of words.
i VNS on cycles of words of length n instead of words.

From a practical point of view, a permutation code (see Section VI) with d = 2e+1 or d = 2e+2 can correct up to e errors. On the other hand, it is known that in an (n,2e) permutation code the balls of radius e surrounding the codewords may all be pairwise disjoint, but usually some overlap. Thus an (n,2e) permutation code is generally unable to correct e errors using nearest neighbour decoding. On the other hand, a permutation code with packing radius e (denoted [n,e]) can always correct e errors. Thus, the packing radius more accurately specifies the requirement for an e-error-correcting permutation code than does the minimum Hamming distance [21].

A basic VNS algorithm involving Clique Search only (Iterated Clique Search) was presented, among other methods, in [21]. The method was tested on instances with 4 ≤ n ≤ 15 and 2 ≤ e ≤ 6 (all parameter tunings and experimental settings are described in the paper). The new best-known lower bounds retrieved by the VNS method (New LB) are summarised in Table IV, comparing them with the previous state-of-the-art bounds (Old LB). Notice that also in this case superscripts reflect the domain on which the VNS method was run. As in Section III, for complexity reasons, it was sometimes convenient to run the method on cycles of words of length n or n−1 instead of words. From the results of Table IV it can be observed how the improvements over the previous state-of-the-art are sometimes remarkable (see, for example, P[14,4]).

VIII. CONCLUSIONS

A heuristic framework based on Variable Neighbourhood Search for code design has been described. Experimental results carried out on four different code families, used in different applications, have been presented. Parameter tuning has been carried out for all algorithms used for these applications, and is described in the referenced papers. However, it has been observed that the exact choice of parameters is not particularly critical. From the experiments it is clear that heuristics are a valuable additional tool in the design of new improved codes.

TABLE IV
IMPROVED PERMUTATION CODES WITH A SPECIFIED PACKING RADIUS.

Problem Old LB New LB Problem Old LB New LB
P[5,2] 5 10 P[12,5] 60 144j
P[6,2] 18 30 P[12,6] - 12
P[6,3] - 6 P[13,4] 4810 15120k
P[7,2] 77 126 P[13,5] 195 612k
P[7,3] 7 22 P[13,6] 13 40
P[8,4] - 8 P[14,4] 6552 110682k
P[9,4] 9 25 P[14,5] 2184 3483
P[10,4] 49 110j P[14,6] 52 169k
P[10,5] - 10 P[15,6] 243 769
P[11,5] 11 33j

j VNS on cycles of words of length n instead of words.
k VNS on cycles of words of length n−1 instead of words.

ACKNOWLEDGMENT

R. Montemanni and M. Salani acknowledge the support of the Swiss Hasler Foundation through grant 11158: “Heuristics for the design of codes”.

REFERENCES

[1] M. K. Gupta, “The quest for error correction in biology,” IEEE Engineering in Medicine and Biology Magazine, vol. 25, no. 1, pp. 46–53, 2006.

[2] O. D. King, “Bounds for DNA codes with constant GC-content,” Electronic Journal of Combinatorics, vol. 10, #R33, 2003.

[3] W. Chu, C. J. Colbourn and P. Dukes, “Constructions for permutation codes in powerline communications,” Designs, Codes and Cryptography, vol. 32, pp. 51–64, 2004.

[4] P. Hansen and N. Mladenovic, “Variable neighbourhood search: principles and applications,” European Journal of Operational Research, vol. 130, pp. 449–467, 2001.

[5] R. Montemanni and D. H. Smith, “Construction of constant GC-content DNA codes via a variable neighbourhood search algorithm,” Journal of Mathematical Modelling and Algorithms, vol. 7, pp. 311–326, 2008.

[6] A. E. Brouwer, J. B. Shearer, N. J. A. Sloane, and W. D. Smith, “A new table of constant weight codes,” IEEE Transactions on Information Theory, vol. 36, pp. 1334–1380, 1990.

[7] Y. M. Chee and S. Ling, “Improved lower bounds for constant GC-content DNA codes,” IEEE Transactions on Information Theory, vol. 54, no. 1, pp. 391–394, 2008.

[8] R. Montemanni and D. H. Smith, “Heuristic algorithms for constructing binary constant weight codes,” IEEE Transactions on Information Theory, vol. 55, no. 10, pp. 4651–4656, 2009.

[9] R. Carraghan and P. Pardalos, “An exact algorithm for the maximum clique problem,” Operations Research Letters, vol. 9, pp. 375–382, 1990.

[10] D. C. Tulpan, H. H. Hoos, and A. E. Condon, “Stochastic local search algorithms for DNA word design,” Lecture Notes in Computer Science, Springer, Berlin, vol. 2568, pp. 229–241, 2002.

[11] P. J. Kuekes, W. Robinett, R. M. Roth, G. Seroussi, G. S. Snider, and R. S. Williams, “Resistor-logic demultiplexers for nanoelectronics based on constant-weight codes,” Nanotechnology, vol. 17, pp. 1052–1061, 2006.

[12] J. N. J. Moon, L. A. Hughes, and D. H. Smith, “Assignment of frequency lists in frequency hopping networks,” IEEE Transactions on Vehicular Technology, vol. 54, no. 3, pp. 1147–1159, 2005.

[13] A. E. Brouwer, “Bounds for binary constant weight codes,” http://www.win.tue.nl/∼aeb/codes/Andw.html.

[14] P. Gaborit and O. D. King, “Linear construction for DNA codes,” Theoretical Computer Science, vol. 334, pp. 99–113, 2005.

[15] D. C. Tulpan and H. H. Hoos, “Hybrid randomised neighbourhoods improve stochastic local search for DNA code design,” Lecture Notes in Computer Science, Springer, Berlin, vol. 2671, pp. 418–433, 2003.

[16] R. Montemanni, D. H. Smith, and N. Koul, “Three metaheuristics for the construction of constant GC-content DNA codes,” in: S. Voss and M. Caserta (eds.), Metaheuristics: Intelligent Decision Making (Operations Research / Computer Science Interface Series), Springer-Verlag New York, 2011.

[17] D. H. Smith, N. Aboluion, R. Montemanni, and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d and constant GC-content,” Discrete Mathematics, vol. 311, no. 14, pp. 1207–1219, 2011.

[18] N. Aboluion, D. H. Smith and S. Perkins, “Linear and nonlinear constructions of DNA codes with Hamming distance d, constant GC-content and a reverse-complement constraint,” Discrete Mathematics, vol. 312, no. 5, pp. 1062–1075, 2012.

[19] N. Pavlidou, A. J. Han Vinck, J. Yazdani and B. Honary, “Power line communications: state of the art and future trends,” IEEE Communications Magazine, vol. 41, no. 4, pp. 34–40, 2003.

[20] D. H. Smith and R. Montemanni, “A new table of permutation codes,” Designs, Codes and Cryptography, Online First, 2011, DOI 10.1007/s10623-011-9551-8.

[21] D. H. Smith and R. Montemanni, “Permutation codes with specified packing radius,” Designs, Codes and Cryptography, Online First, 2012, DOI: 10.1007/s10623-012-9623-4.


Spatial Join with R-Tree on Graphics Processing Units

Tongjai Yampaka
Department of Computer Engineering
Chulalongkorn University
Bangkok, Thailand
[email protected]

Prabhas Chongstitvatana
Department of Computer Engineering
Chulalongkorn University
Bangkok, Thailand
[email protected]

Abstract: Spatial operations such as spatial join combine two objects on spatial predicates. Spatial join is different from relational join because the objects are multi-dimensional, and the operation consumes a large amount of execution time. Recently, many studies have tried to find methods to improve the execution time. Parallel spatial join is one such method: the comparisons between objects can be done in parallel. Spatial datasets are large, and the R-Tree data structure can improve the performance of spatial join.

In this paper, a parallel spatial join on the Graphics Processing Unit (GPU) is introduced. The capacity of the GPU, which has many processors to accelerate the computation, is exploited. An experiment is carried out to compare the spatial join between a sequential implementation in the C language on the CPU and a parallel implementation in the CUDA C language on the GPU. The result shows that the spatial join on the GPU is faster than on a conventional processor.

Keyword: Spatial Join, Spatial Join with R-tree, Graphics processing unit

I. INTRODUCTION

The evolution of the Graphics Processing Unit is driven by the demand for real-time, high-definition and 3-D graphics. The requirement for efficient and fast computation has been met by parallel computation [1]. In addition, the GPU architecture that supports parallel computation is programmable to solve other problems. This new trend is called General Purpose computing on Graphics processors (GPGPU). Developers can use the capacity of the GPU to solve problems besides graphics and can improve the execution time by parallel computation. In a spatial database, storing and managing complex and large datasets such as Geographic Information System (GIS) and Computer-Aided Design (CAD) data is time consuming. The characteristics of a spatial database are different from those of a relational database because of the data types. Spatial data types are point, line and polygon. The type of data depends on the characteristics of the objects; for example a road is represented by a line and a city is represented by a polygon. An object shape is created by x, y and z coordinates. Therefore, spatial operations in a spatial database are not the same as operations in a relational database. There are specific techniques for spatial operations.

Spatial join combines two objects on spatial predicates, for example, finding the intersection between two objects. It is an expensive operation because spatial datasets can be complex and very large, and their processing cost is very high. To solve this problem an R-Tree is used to improve the performance of accessing data in spatial join. Spatial objects are indexed by spatial indexing [2] [3]. The objects are represented by the minimum bounding rectangles which cover them. An internal node points to children nodes that are covered by their parents. A leaf node points to real objects. The join with an R-Tree begins with a minimum bounding rectangle. The test for an overlap is performed from the root node down to the leaf nodes. It is possible that there are overlaps in sub-trees too.

The previous work [4] introduces a technique for spatial join that can be divided into two steps.

• Filter Step: This step computes an approximation of each spatial object, its minimum bounding rectangle. This step produces rectangles that cover all objects.

• Refinement Step: In this step, spatial join predicates are performed over each object.

Recently, spatial join techniques have been proposed in many works. In a survey [5], many techniques to improve spatial join are described. One technique shows a parallel spatial join that improves the execution time for this operation.

This paper presents a spatial join with R-Tree on Graphics Processing Units. The parallel step is executed for testing an overlap. The paper is organized as follows. Section 2 explains the background and reviews related works. Section 3 describes the spatial join with R-Tree on Graphics Processing Units. Section 4 explains the experiment. The results are presented in Section 5. Section 6 concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Spatial join with R-Tree

Spatial join combines two objects with spatial predicates. Objects are multi-dimensional, so it is important to retrieve data efficiently. In a survey [5], techniques of spatial join are presented. Indexing data, such as with an R-Tree, is one method which improves I/O time. In [6], the R-Tree is used for spatial join. Before executing a spatial join predicate at the leaf level, an overlap between two objects from the parent nodes is tested. When parent nodes overlap, the search continues into the sub-trees that are covered by these parents. The sub-trees whose parent nodes do not overlap are ignored. The reason is that overlapped parent nodes are probably overlapped with leaf nodes too. In the next step, the overlap test function is called on the sub-trees recursively. This algorithm is shown in Figure 1.

SpatialJoin(R, S):
  For (all ptrS ∈ S) Do
    For (all ptrR ∈ R with ptrR.rect ∩ ptrS.rect ≠ ∅) Do
      If (R is a leaf node) Then
        Output (ptrR, ptrS)
      Else
        Read (ptrR.child); Read (ptrS.child)
        SpatialJoin(ptrR.child, ptrS.child)
      End
    End
  End SpatialJoin;

Figure 1 Spatial join with R-Tree

The work [6] presents a spatial join with R-Tree that improves the execution time. However, this algorithm is designed for a single-core processor. The proposed algorithm is based on this work but the implementation is on Graphics Processing Units.

B. Parallel spatial join with R-Tree

To reduce the execution time of a spatial join, a parallel algorithm can be employed. The work in [7] describes a parallel algorithm for a spatial join. A spatial join has two steps: the filter step and the refinement step. The filter step uses an approximation of the spatial objects, e.g. the minimum bounding rectangle (MBR). The filter admits only objects that can possibly satisfy the predicate. A spatial object is defined in the form (MBRi, IDi) where IDi is a key-pointer to the data of object i. The output of this step is the set of pairs [(MBRi, IDi), (MBRj, IDj)] such that MBRi intersects MBRj. Each pair is called a candidate pair. The next step is the refinement step. Each pair of candidate objects is retrieved from the disk for performing the join predicate. To retrieve the data, the pointers IDi and IDj are followed. The algorithm creates tasks for testing an overlap in the filter step in parallel. For example, in Figure 2, R and S denote spatial relations. The set {R1, R2, R3, R4, R5, R6, ..., RN} is in the R root and the set {S1, S2, S3, S4, S5, S6, ..., SN} is in the S root. In the algorithm described here the filter step is done in parallel.

R root = {R1, R2, R3, R4, R5}
S root = {S1, S2, S3, S4}
Tasks created: Task1 (R1,S1), Task2 (R1,S2), Task3 (R1,S3), Task4 (R1,S4), ..., TaskN (RN,SN)

Figure 2 Filter task creation and distribution in parallel for R-tree join

The algorithm is designed for parallel operation on a CPU. In this paper we use the same idea for the algorithm but it is implemented on a GPU.

In other research [8], the R-Tree is used in parallel search. The algorithm distributes objects to separate sites and creates index data objects from the leaves to the parents. Every parent has entries to all sites. A search query such as a window query can then be performed in parallel.

C. Spatial query on GPU

For a parallel operation on the GPU, the work in [9] implements a spatial indexing algorithm to perform a parallel search. A linear-space search algorithm is presented that is suitable for the CUDA [1] programming model. Before the search query begins, a preparation of the data arrays is required for the R-Tree. This is done on the CPU. Then the data arrays are loaded into device memory. The search query is launched on GPU threads. The data structure has two data arrays represented in bits, and arithmetic at the bit level is exploited. The first array stores the MBR co-ordinates, referring to the bottom-left and top-right co-ordinates of the i-th MBR in the index. The second array is an array of R-Tree nodes. R-Tree nodes store the set {MBRi, childNode|t|}, where childNode|t| is an index into the array representing the children of node i. When the search query is called, the GPU kernel creates threads to execute the tasks. Then the two data arrays are copied to memory on the device. Finally the main function on the GPU is called. The algorithm is shown in Figure 3. The result is copied back to the CPU when the execution on the GPU is finished.

Clear memory array (in parallel).
For each thread i:
  if Search[i] is set:
    For each child node j of Search[i] that overlaps with the query MBR:
      If the child node j is a leaf, mark it as part of the output.
      If the child node j is not a leaf, mark it in the Next Search array.
Sync Threads
Copy Next Search array into Search[i] (in parallel).

Figure 3 R-Tree Searches on GPU

III. IMPLEMENTATION

A. Overview of the algorithm

Most works have focused on the improvement of the filter step. The first filter step assumes that the computation is done with the MBRs of the spatial objects. In this paper, this step is performed on the CPU and the data set is assumed to be in the data arrays. The algorithm begins by filtering objects in parallel on the GPU. The steps of the algorithm are as follows.

• Step 1: The data arrays required for the R-Tree are mapped to the device memory. The data arrays are prepared on the CPU before sending them to the device.


• Step 2: Filtering step. A function to find an overlap between two MBR objects is called. Threads are created on the GPU for execution in parallel. The results are the set of MBRs which are overlapping.

• Step 3: Find leaf nodes. The results of step 2, the set of MBRs, are checked whether they are in the leaf nodes or not. If they are leaf nodes, the set is returned as the result and sent to the host. If they are not leaf nodes then they are used as input again recursively until reaching the leaf nodes.

B. Data Structure in the algorithm

Assume the MBR objects are stored in a table or a file. In the join operation there are two relations, denoted R and S. The MBR structures (shown in C syntax) have the form:

struct MBR_object {
    int min_x, max_x, min_y, max_y;    /* x, y coordinates of the object rectangle */
};

struct MBR_root {
    int min_x, max_x, min_y, max_y;    /* x, y coordinates of the root rectangle */
    int child[numberOfchild];          /* entries of the children covered by the root */
};

struct MBR_root rootR[numberOfrootR];
struct MBR_root rootS[numberOfrootS];        /* arrays of roots of relations R and S */
struct MBR_object objectR[numberOfobjectR];
struct MBR_object objectS[numberOfobjectS];  /* arrays of objects of relations R and S */

C. R-Tree Indexing

An R-Tree is similar to a B-Tree in that the index is recorded in a leaf node, which points to the data object [4]. All minimum bounding rectangles are created from the x, y coordinates of the objects. The index of the data is created by the packed R-Tree technique [10]. The technique is divided into three steps:

1) Find the number of objects per pack. The number of children is between a lower bound (m) and an upper bound (M).

2) Sort the data on the x or y coordinate of the rectangles.

3) Assign rectangles from the sorted list to a pack successively until the pack is full. Then find the min x, y and max x, y of each pack to create the root node.

Figure 4 MBRs before split node R-Tree

An example is shown in Figure 4. It shows five object rectangles. The objects are ordered according to the x coordinate of their rectangles; the sorted list is A, D, B, E, C. The number of objects per pack is defined as three, so the assignment of objects into packs is:

Pack1 = {A, D, B}
Pack2 = {C, E}

In the next step, a root is created by computing min x, min y and max x, max y.

Figure 5 MBRs after split node R-Tree

The root node of pack1 is R1 and the root node of

pack2 is R2. R1 points to three objects: A, D and B. R2

points to two objects: C and E. The root coordinate is

computed from min x, min y max x, max y of all objects

which the root covers them. In the example, only one

relation is shown.

R-Tree creation is done on CPU. The difference is in

the spatial join operation. The spatial join on CPU is

sequential and on GPU is parallel.
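For concreteness, the following is a minimal C sketch of the packing procedure described above, assuming the MBR_object and MBR_root structures of Section III-B (with numberOfchild at least M) and sorting on the x coordinate; the function and variable names are illustrative rather than the authors' exact code.

#include <stdlib.h>

static int compare_x(const void *a, const void *b)
{
    const struct MBR_object *oa = (const struct MBR_object *)a;
    const struct MBR_object *ob = (const struct MBR_object *)b;
    return (oa->min_x > ob->min_x) - (oa->min_x < ob->min_x);   /* step 2: sort on x */
}

int pack_rtree(struct MBR_object *obj, int n, int M, struct MBR_root *root)
{
    int packs = 0;
    qsort(obj, n, sizeof(struct MBR_object), compare_x);
    for (int i = 0; i < n; i += M, packs++) {                   /* step 3: fill packs */
        struct MBR_root *rt = &root[packs];
        rt->min_x = obj[i].min_x; rt->max_x = obj[i].max_x;
        rt->min_y = obj[i].min_y; rt->max_y = obj[i].max_y;
        for (int j = i; j < n && j < i + M; j++) {
            rt->child[j - i] = j;                               /* index of the child object */
            if (obj[j].min_x < rt->min_x) rt->min_x = obj[j].min_x;
            if (obj[j].max_x > rt->max_x) rt->max_x = obj[j].max_x;
            if (obj[j].min_y < rt->min_y) rt->min_y = obj[j].min_y;
            if (obj[j].max_y > rt->max_y) rt->max_y = obj[j].max_y;
        }
    }
    return packs;   /* number of root entries (packs) created */
}

For the example above, with M = 3, the sorted list A, D, B, E, C yields the two packs {A, D, B} and {C, E} and their root rectangles R1 and R2.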

D. Spatial join on GPU

To parallelize the spatial join, the data preparation, such as MBR calculation and R-Tree node splitting, is carried out on the CPU. On the GPU, the overlap function and the intersection join function are executed in parallel.

1) Algorithm

• Overlap: This step is the filter step for testing the

overlap between root nodes R and S.

1. Load MBR data arrays (R and S) to GPU.

2. Test the overlap of Ri and Sj in parallel.

3. The overlap function call is:

Overlap ((Sj.x_min < Ri.x_max)
     and (Sj.x_max > Ri.x_min)
     and (Sj.y_min < Ri.y_max)
     and (Sj.y_max > Ri.y_min))

4. For each pair (Ri, Sj) that overlaps:

5. Find the children nodes of Ri and Sj.

• Find children: Find the children nodes which are covered by the roots Ri and Sj.

a) The information from the MBRs indicates the children that are covered by the root.

b) Load the children data and send them to the overlap function.


• Test intersection: This is the refinement step. Compute

the join predicate on all children of Ri and Sj using the

overlap function above.
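As an illustration, a minimal CUDA kernel for this parallel filter step might look as follows; it assumes the MBR_object structure of Section III-B, and the result matrix overlap and the one-block-per-Ri, one-thread-per-Sj mapping are simplifying assumptions, not the exact implementation.

/* Hedged sketch: each thread tests one (Ri, Sj) pair and writes a 0/1 flag. */
__global__ void overlapKernel(const struct MBR_object *r,
                              const struct MBR_object *s,
                              int numS, int *overlap)
{
    int i = blockIdx.x;      /* index into relation R (outer loop) */
    int j = threadIdx.x;     /* index into relation S (inner loop) */
    if (j < numS) {
        /* Overlap predicate from step 3 of the algorithm above. */
        overlap[i * numS + j] =
            (s[j].min_x < r[i].max_x) && (s[j].max_x > r[i].min_x) &&
            (s[j].min_y < r[i].max_y) && (s[j].max_y > r[i].min_y);
    }
}

The host then only has to scan the flags to obtain the overlapping pairs whose children are passed to the next step.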

2) GPU Program Structure

The CUDA C language is used; it extends the C language for programming graphics processing units. A CUDA program has two phases [11]. In the first phase, the program on the CPU, called the host code, performs data initialization and transfers data from host to device or from device to host. In the second phase, the program on the GPU, called the device code, makes use of the CUDA runtime system to generate threads for executing functions. All threads execute the same code but operate on different data at the same time. A CUDA function uses the keyword "__global__" to define a kernel function. When the kernel function is called from the host, CUDA generates a grid of threads on the device.

In the spatial join, the overlap function is distributed to

different blocks and is executed at the same time with

different data objects. To divide the task, every block has

a block identity called blockIdx.

For example:

Objects: Relation R = {Robject0, Robject1, Robject2, ..., RobjectN},
Relation S = {Sobject0, Sobject1, Sobject2, ..., SobjectN}

Overlap function: Compare all objects. Find x and y

coordinates in the intersection predicate.

The sequential program on the CPU compares only one pair of objects at a time:

Robject0 compare Sobject0

Robject0 compare Sobject1

Robject0 compare Sobject2

...

RobjectN compare SobjectN (at time N)

On the GPU, the CUDA code on the device generates blocks so that all the pairs are compared on different blocks at the same time:

Block0 = Robject0 compare Sobject0

Block1 = Robject0 compare Sobject1

Block2 = Robject0 compare Sobject2

...

BlockN = RobjectN compare SobjectN

Memory is allocated for the execution on both the CPU and the GPU. First, memory is allocated for the data structures of the R-Tree roots and the MBRs of the objects. Second, memory is allocated for the data arrays that store the results. When the task is done, the result arrays are copied back to the host.

The nested loop is transformed to run in parallel. The object rectangles are mapped to a two-dimensional arrangement of blocks and threads on the GPU: the outer loop is mapped to blockIdx.x and the inner loop to threadIdx.y. The call to the kernel function is:

kernel<<<number of outer loop iterations, number of inner loop iterations>>>

The CUDA kernel then generates the blocks and threads for execution.
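The host code for this launch could be sketched as follows, under the assumption of the overlapKernel above; the variable names (hostR, hostS, numR, numS) are illustrative, and a real implementation would tile the inner dimension because a block is limited to 1024 threads.

/* Hedged host-side sketch: allocate device memory, copy the MBR arrays,
   launch one block per R entry with one thread per S entry, copy results back. */
void run_overlap_filter(const struct MBR_object *hostR, int numR,
                        const struct MBR_object *hostS, int numS,
                        int *hOverlap /* numR*numS ints allocated by the caller */)
{
    struct MBR_object *dR, *dS;
    int *dOverlap;
    size_t bytesR = numR * sizeof(struct MBR_object);
    size_t bytesS = numS * sizeof(struct MBR_object);
    size_t bytesO = (size_t)numR * numS * sizeof(int);

    cudaMalloc((void **)&dR, bytesR);
    cudaMalloc((void **)&dS, bytesS);
    cudaMalloc((void **)&dOverlap, bytesO);

    cudaMemcpy(dR, hostR, bytesR, cudaMemcpyHostToDevice);    /* host to device */
    cudaMemcpy(dS, hostS, bytesS, cudaMemcpyHostToDevice);

    overlapKernel<<<numR, numS>>>(dR, dS, numS, dOverlap);    /* kernel<<<outer, inner>>> */

    cudaMemcpy(hOverlap, dOverlap, bytesO, cudaMemcpyDeviceToHost);  /* device to host */

    cudaFree(dR); cudaFree(dS); cudaFree(dOverlap);
}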

IV. EXPERIMENTATION

A. Platform

The sequential version of the spatial join is coded in the C language; the parallel version uses CUDA C. Both versions run on an Intel Core i3 2.93 GHz with 2048 MB of DDR3 memory and an NVIDIA GT440 GPU (1092 MHz, 1024 MB, 96 CUDA cores).

B. Dataset

In the experiment, the datasets are retrieved from the R-Tree portal [12]. In the data preparation step the minimum bounding rectangles are pre-computed. The dataset pairs are Rivers join Roads in Greece and Streets join Real roads in Germany.

TABLE I DATASET IN EXPERIMENTATION

Pair of dataset                     Number of MBRs    Data size
Greece: Rivers join Roads           47,918            0.7 MB
Germany: Streets join Real roads    67,008            0.6 MB

Table 1 shows the number of MBRs and the size of each dataset. All datasets are text files; a C function is used to read the data from a text file into the data arrays.

V. RESULT

The spatial join is tested with the datasets in Table 1 using two functions (the overlap function on root nodes and the intersection function on children nodes). In the experiment, the time to read data from the text files and store them in data arrays is ignored. The execution times of the spatial join operation on CPU and GPU are compared. The R-Tree generation is done on the CPU in both the sequential and the parallel version; only the spatial join operations differ.

A. Performance comparison between sequential and

parallel

The results are divided into two functions: overlap

and intersect.

TABLE II EXECUTION TIME ON GPU AND CPU

                                    Overlap (ms)       Intersection (ms)    Total (ms)
Pair of dataset                     CPU      GPU       CPU       GPU        CPU       GPU
Greece: Rivers join Roads           18       4         72.67     22.33      90.67     26.33
Germany: Streets join Real roads    5.33     4         74.00     39.67      79.33     43.67

The results in Table 2 show that the execution time on the GPU is shorter than on the CPU. For dataset 1, the overlap function on the GPU is 77.78% faster (4 ms versus 18 ms, or about 4x); the intersection function is 69.27% faster (3x). The total execution time on the GPU is 70.96% faster (3.4x). For dataset 2, the overlap function on the GPU is 25% faster (1.3x); the intersection function is 46.40%


faster (1.8x). The total execution time on the GPU is 44.96% faster (1.8x). The speedup also depends on the data type: if the data values have more digits, the execution time is longer. In the experiment, dataset 1 contains floating-point data with six digits per element, so its execution time is higher than that of dataset 2, which contains integer data with four digits per element.

The time to transfer data is significant and affects the execution time. The total running time in Table 2 includes the data transfer time from host to device and from device to host.

Figure 6 Transfer rate for dataset 1 and dataset 2

Figure 6 shows the data transfer rate on the GPU. Dataset 1 has 47,918 records and its size is 0.7 MB; the data transfer time of this dataset is 59.53% of the execution time. Dataset 2 has 67,008 records and is 0.6 MB; the data transfer time of this dataset is 76.83% of the execution time.
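As a side note, the host-to-device transfer time reported here can be isolated, for example, with CUDA events; the following minimal fragment reuses the illustrative variable names from the launch sketch above and would be placed around the host-to-device copies.

/* Hedged sketch: measuring the host-to-device copy time with CUDA events. */
cudaEvent_t start, stop;
float ms = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy(dR, hostR, bytesR, cudaMemcpyHostToDevice);
cudaMemcpy(dS, hostS, bytesS, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);   /* transfer time in milliseconds */

cudaEventDestroy(start);
cudaEventDestroy(stop);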

VI. CONCLUSION

This paper describes how a spatial join operation with an R-Tree can be implemented on a GPU. It uses the many processing units of the GPU to accelerate the computation. The process starts by splitting objects and indexing data in an R-Tree on the host (CPU) and copying them to the device (GPU). The spatial join then makes use of the parallel execution of functions to perform the calculation over many processing units on the GPU.

However, using a graphics processing unit to perform general-purpose tasks has limitations. The cooperation between CPU and GPU is complicated: data must be transferred back and forth between them, and the data transfer time is significant. Therefore, the data transfer time may dominate the total execution time if the task and the data are not carefully divided.

Future work will address how to automate and coordinate the tasks between the CPU and the GPU. Other database management functions are also suitable for implementation on the GPU; they are worth investigating as GPUs become ubiquitous.

REFERENCES

[1] NVIDIA CUDA Programming Guide, 2010. Retrieved from http://developer.download.nvidia.com

[2] Y. Manolopoulos, A. Nanopoulos, A. N. Papadopoulos, and Y. Theodoridis, R-Trees: Theory and Applications, Springer, 2006.

[3] Xiang Xiao and Tuo Shi, "R-Tree: A Hardware

Implemention," Int. Conf. on Computer Design, Las

Vegas, USA, July 14-17, 2008.

[4] Gutman A., "R-tree:A Dinamic Index Structure for

Spatial Searching," ACM SIGMOD Int. Conf. , 1984.

[5] E.H. Jacox and H. Samet, "Spatial Join Techniques,"

ACM Trans. on Database Systems, Vol.V, No.N,

November 2006, Pages 1–45.

[6] Hans P. Kriegel and B. Seeger T. Brinkhoff, "Efficient

Processing of Spatial Joins Using R-tree," SIGMOD

Conference, 1993, pp.237-246.

[7] L. Mutenda and M. Kitsuregawa, "Parallel R-tree

Spatial Join for a Shared-Nothing Architecture," Int. Sym.

on Database Applications in Non-Traditional

Environments, Japan, 1999, pp.423-430.

[8] H. Wei, Z. Wei, Q. Yin, "A New Parallel Spatial

Query Algorithm for Distributed Spatial Database," Int.

Conf. on Machine Learning and Cybernetics, 2008, Vol.3,

pp.1570-1574.

[9] M. Kunjir and A. Manthramurthy, "Using Graphics

Processing in Spatial Indexing Algorithm", Research

report, Indian Institute of Science, 2009.

[10] K. Ibrahim and F. Cristos, "On Packing R-tree," Int.

Conf. on Information and knowledge management, ACM,

USA, 1993, pp.490-499.

[11] David B. Kirk and Wen-mei W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach, Morgan Kaufmann, 2010.

[12] R-tree Portal. [Online]. http://www.rtreeportal.org


Ontology Driven Conceptual Graph Representation of Natural Language

Supriyo Ghosh
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]

Prajna Devi Upadhyay
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]

Animesh Dutta
Department of Information Technology
National Institute of Technology, Durgapur
Durgapur, West Bengal, 713209, India
Email: [email protected]

Abstract—In this paper we propose a methodology to convert a sentence of natural language into a conceptual graph, which is a graph representation for logic based on the semantic networks of Artificial Intelligence and existential graphs. A human being can express the same meaning in different forms of sentences. Although many natural language interfaces (NLIs) have been developed, they are domain specific and require heavy customization for each new domain. With our approach, a casual user gets a more flexible interface to communicate with the computer, and less customization is required to shift from one domain to another. First, a parsing tree is generated from the input sentence. From the parsing tree, each lexeme of the sentence is found and the basic concepts matching the ontology are sorted out. Then the relationships between them are found by consulting the domain ontology, and finally the conceptual graph is built.

I. INTRODUCTION

Nowadays it is a challenging task to develop a methodology by which a human being can communicate with a computer. A human communicates by natural language, but a computer understands a formalized data structure such as a conceptual graph. Both can therefore communicate with proper semantics if they share a common vocabulary or ontology and there exists an interface which can convert natural language into a formalized data structure such as a conceptual graph, and vice versa.

A. Conceptual Graph

A conceptual graph (CG) [1] is a graph representation for logic based on the semantic networks of artificial intelligence and the existential graphs of Charles Sanders Peirce [2]. Many versions of conceptual graphs have been designed and implemented over the last thirty years. The first published paper on CGs [3] used them to represent the conceptual schemas used in database systems. The first book on CGs [1] applied them to a wide range of topics in artificial intelligence and computer science. The work in [3] developed a version of conceptual graphs as an intermediate language for mapping natural language to a relational database.

A conceptual graph is a bipartite graph in which concept vertices alternate with (conceptual) relation vertices, and edges connect relation vertices to concept vertices [4]. Each concept vertex, drawn as a box and labelled by a pair of a concept type and a concept referent, represents an entity whose type and referent are respectively defined by the concept type and the concept referent in the pair. Each relation vertex, drawn as a circle and labelled by a relation type, represents a relation of the entities represented by the concept vertices connected to it. Concepts connected to a relation are called neighbour concepts of the relation.

B. Ontology

An ontology [5] is a conceptualization of an application domain in a human-understandable and machine-readable form. It is used to reason about the properties of that domain and may be used to define that domain. By definition, an ontology defines the basic terms and relations comprising the vocabulary of a topic area, as well as the rules for combining terms and relations to define extensions to the vocabulary [6], [7]. A survey of Web tools [8] showed that extraction ontologies natively provide resilience and scalability, whereas in other approaches to information extraction the problem of resilience and scalability still remains. One serious difficulty in creating an ontology manually is that it needs a lot of time and effort and may contain errors. It also requires a high degree of knowledge in both database theory and Perl regular-expression syntax. Professional groups are building metadata vocabularies, or ontologies. Large hand-built ontologies exist, for example, for medical and geographic terminology. Researchers are rapidly working to build systems that automate extracting them from huge volumes of text. A further problem is that there is no formalized rule to define and build an ontology. In our work we have assumed that all lexemes such as nouns, verbs, and adjectives of our experimental domain are defined as a concept or an instance of a concept in the domain ontology.


This paper is structured as follows. The related work and the scope of the work are presented in Sections II and III. Our proposed system overview and a demonstration through examples are given in Section IV. Case studies for different sentences are given in Section V. Finally, we conclude and draw future directions in Section VI.

II. RELATED WORK

A lot of methodologies have been developed to capture the meaning of a sentence by converting natural language into its corresponding conceptual graph. But as there is no formalized rule to build an ontology, it is still challenging to convert the whole of natural language into a formalized, machine-understandable language.

In [9], the authors build the conceptual graph from natural language. But they do not define any grammar or rule to generate the parsing tree of a sentence. They also do not provide any way to keep the same semantics by building a unique conceptual graph for different sentences with the same meaning.

In [10], [11], [12], the authors propose a methodology for a tool that overcomes the negative effects of paraphrases by converting a complex sentence into a simple format. In this approach, a complex-format query of the domain of interest which cannot be recognized by the system is rearranged into a simple, machine-understandable format. But they do not convert different forms of a sentence into a single data structure, such as a conceptual graph, by consulting its domain ontology.

In [13], [4], the authors build a query conceptual graph from a natural sentence by identifying the concepts, which incurs a high computational cost. As they do not parse the sentence, the cost of searching for the proper ontological concept is very high.

In [14], the author's approach for building a conceptual graph from natural language depends only on the semantics of verbs, which is not feasible in all cases. In many existing ontologies, nouns and verbs both play a very important role in capturing the semantics of a sentence.

In [15], [16], a natural language query is converted into the SPARQL query language and, by consulting its domain ontology, the system generates an answer. But this approach cannot always capture the semantics of the question, as the system does not consult the ontology concepts when the SPARQL query is built.

The work presented in [17], [18] is related to our proposed approach. Here, after syntactic parsing of the sentence, the system generates the ontological concepts. For unrecognized concepts the system generates suggestions, and from the user's selection it learns how to rank the suggestions in the future. But this approach does not give any notion of building the same conceptual graph from various forms of sentences with the same semantics.

III. SCOPE OF WORK

Nowadays people are trying to use the computer, or a software tool such as an agent, for task delegation. Therefore the language of the human being must be converted into some formal data structure which can be understood by the computer. A number of methodologies have been developed to convert natural language into a conceptual graph, but the semantics of the language cannot be defined by the conceptual graph unless a common vocabulary is used between user and computer. This common vocabulary can be expressed in the form of an ontology. A casual user can express a single sentence in various ways even though all of them have the same semantics. So our approach is to present a methodology which can convert a natural language sentence into its corresponding conceptual graph by consulting its domain ontology, so that both user and computer can understand the semantics of the conversation. Our approach builds a unique conceptual graph for various sentences in different forms but with the same semantics. Similarly, a single word has a number of synonyms and not all synonyms may be defined in the domain ontology. So if a particular concept cannot be found in the ontology, our system must check whether any of its synonyms is defined in the domain ontology as a concept. The synonyms are identified from the WordNet [19] ontology.

IV. SYSTEM OVERVIEW

In this work we develop a methodology by which, from a natural language query or sentence, a conceptual graph is generated using the defined concepts and the relationships between the concepts of the domain ontology. Using this approach, a casual user and a computer can communicate if they share a common ontology or vocabulary. We develop the methodology of converting a sentence into a conceptual graph in the following four steps:

1. Grammar for accepting natural language.
2. Parsing tree generation.
3. Recognizing the ontological concepts.
4. Creating the conceptual graph by using the ontological concepts.

A. Grammar for accepting natural language

In this section we define a grammar which can recognize simple, compound, and complex sentences. This grammar restricts the user to giving the input sentence in a correct grammatical format. We define a context-free grammar G where

G = (VN, Σ, P, S)

Non-terminals: VN = {S, VP, NP, PP, V, AV, NN, P, ADJ, CONJ, DW, D}, where
S = Sentence, VP = Verb Phrase,
NP = Noun Phrase, PP = Preposition Phrase,
AV = Auxiliary Verb, V = Verb,
NN = Noun, P = Preposition,
ADJ = Adjective, CONJ = Conjunction,
D = Determiner, DW = Dependent word for a complex sentence.


Terminals: Σ = any valid English word such as a noun, verb, auxiliary verb, adjective, preposition, or determiner, and also null (ε).

Production rules: P is the set of productions of the grammar. Every sentence recognized by this grammar must follow these production rules. P consists of:

1) For simple sentences:
S ⇒ <NP><VP>
VP ⇒ <V><NP>
V ⇒ <AV><V><P> | <V><CONJ><V>
NP ⇒ <D><ADJ><NN><PP>
PP ⇒ <P><NP>
V ⇒ verb | ε
NN ⇒ noun | ε
AV ⇒ auxiliary-verb | ε
P ⇒ preposition | ε
ADJ ⇒ adjective | ε
D ⇒ determiner | ε

2) For compound sentences:
S ⇒ <S><CONJ><S>
since a compound sentence is built by joining two simple sentences with a conjunction.

3) For complex sentences:
S ⇒ <S><DW><S> | <DW><S><,><S>
since a complex sentence is formed by an independent simple sentence and a dependent sentence, where every dependent sentence starts with a dependent word. A complex sentence is formed in two ways:

1. If the dependent sentence comes first, then the sentence starts with a dependent word and the two sentences must be separated by a comma (,).
2. If the dependent sentence comes last, then the dependent word must separate the two sentences.

In the case of both complex and compound sentences, each individual simple sentence (S) follows all the production rules of a simple sentence.

Start symbol: the grammar has a single nonterminal (the start symbol) from which all sentences are derived. All sentences are derived from S by successive replacement using the productions of the grammar.

Null symbol: it is sometimes useful to specify that a symbol can be replaced by nothing at all. To indicate this, we use the null symbol ε, e.g., A ⇒ B | ε. In our grammar, every non-terminal symbol except S, VP, and NP has a null production.

B. Parsing Tree Generation

Whenever a user gives a sentence as input to the system, a parsing tree is generated using our defined grammar. From this parsing tree we can recognize the nouns, verbs, adjectives, and prepositions of the given input sentence. For example, if the input sentence is "John is going to Boston by bus", a parsing tree is generated as in Figure 1 by these production rules:

Fig. 1. Parse tree for simple sentence

Fig. 2. Parse tree for compound sentence

S ⇒ <NP><VP>
S ⇒ <DET><ADJ><NN><PP><VP>
S ⇒ <NN><V><NP>
S ⇒ <NN><AV><V><P><NP>
S ⇒ John is going to <DET><ADJ><NN><PP>
S ⇒ John is going to <NN><P><NP>
S ⇒ John is going to Boston <P><DET><ADJ><NN><PP>
S ⇒ John is going to Boston by <NN>
S ⇒ John is going to Boston by bus

Now if a user gives a compound sentence as input, the generation of the parse tree starts from the production of the compound sentence, shown as

S ⇒ <S><CONJ><S>

Therefore, if a compound sentence like "The beautiful apple is red but it is rotten." comes as input, the generated parse tree is like Figure 2, where each simple sentence is generated by the production rules of a simple sentence.

Now if a complex sentence comes as input to the system, the generation of the parse tree follows the production rule of a complex sentence, where both the dependent and the independent sentence must follow the production rules of a simple sentence. The production starts from the basic production rule of a complex sentence, shown as:

S ⇒ <S><DW><S> | <DW><S><,><S>

So, if a complex sentence like "After the student go to the class, he can give attendance." comes as input to the system, the generated parsing tree is like Figure 3.

From this parse tree we can easily identify the type of the sentence and each POS (part of speech) such as nouns, verbs, adjectives, prepositions, determiners, etc.


Fig. 3. Parse tree for complex sentence

Fig. 4. Flowchart for finding the ontological concepts of each lexeme of a sentence (legend: IC = identified concept, OC = ontology concept, SC = synonym of IC, IO = instance or individual of the ontology)

In the next section we deal with these identified lexemes.

C. Recognizing the Ontological Concepts

This step involves finding the ontological concepts used in the given input sentence. Our general assumption is that each lexeme in the sentence is represented by a separate concept in the ontology; therefore all nouns, adjectives, verbs, and pronouns are represented by identified concepts, while determiners, numbers, prepositions, and conjunctions are used as referents of the relevant concepts. We have defined an algorithm (Figure 4) for finding the ontological concepts and instances used in a given input sentence by syntactically mapping each lexeme to the predefined ontological concepts.

As we have assumed that the nouns, verbs, and adjectives of a particular domain are defined as concepts in its domain ontology, we identify the nouns, verbs, and adjectives from the parsing tree and put them into the identified concept (IC) list. For each identified concept there are four possible cases:

1. The identified concept is identical to some domain ontology concept (OC).

2. The IC cannot be mapped to any OC, but a synonym of that IC is defined as an OC in the ontology.

3. The IC is defined as an instance or individual of a concept in the ontology.

4. The IC is not in the domain of the experiment, so it is not recognized by the domain ontology.

In the next step the identified concepts must be converted into ontological concepts, as the computer can understand only the vocabulary of the ontology. For each identified concept in the IC list, a different operation is performed for each of the four cases above:

1. As the IC is syntactically equal to an OC, this OC is added to the ontological concepts list.

2. If the IC cannot be syntactically mapped to any OC, the system checks all the synonyms of the IC from WordNet, where WordNet (Fellbaum, 1998) is an English lexical database containing about 120,000 entries of nouns, verbs, adjectives, and adverbs, hierarchically organized in synonym groups (called synsets) and linked with relations such as hypernym, hyponym, holonym, and others. For each synonym concept (SC) of the corresponding IC, the system tries to syntactically map the SC to an OC. If a syntactically mapped OC is found, that OC is added to the ontological concepts list.

3. If the IC is an instance or individual of an ontological concept, the system finds the corresponding OC of the instance and adds that OC to the ontological concepts list.

4. If the IC is not in the domain of the experiment, the system continues the loop with the next IC from the identified concepts list.

After obtaining the ontological concepts list, the system builds the conceptual graph as described in the next section.
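A minimal C sketch of this four-case lookup is given below; the helper functions ontology_has_concept, wordnet_synonyms, and ontology_concept_of_instance are hypothetical stubs standing in for the domain ontology and WordNet access described above, not an existing API.

#include <stddef.h>

extern int ontology_has_concept(const char *name);                 /* stub: ontology lookup */
extern const char **wordnet_synonyms(const char *name);            /* stub: NULL-terminated list */
extern const char *ontology_concept_of_instance(const char *name); /* stub: instance lookup */

const char *map_to_ontology_concept(const char *ic)
{
    /* Case 1: the identified concept matches an ontology concept. */
    if (ontology_has_concept(ic))
        return ic;

    /* Case 2: try the WordNet synonyms of the identified concept. */
    const char **syn = wordnet_synonyms(ic);
    for (size_t k = 0; syn != NULL && syn[k] != NULL; k++)
        if (ontology_has_concept(syn[k]))
            return syn[k];

    /* Case 3: the identified concept is an instance of a concept. */
    const char *oc = ontology_concept_of_instance(ic);
    if (oc != NULL)
        return oc;

    /* Case 4: out of the experimental domain; skip this lexeme. */
    return NULL;
}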

D. Creating Conceptual Graph from Ontological Concepts List

In this section we propose an algorithm for generating the conceptual graph from the generated ontological concepts list, which consists of the following four steps:

Step 1: If the same concept occurs twice in the concept list with the same instance name or with no instance, we keep the concept with its instance name and discard the other. But if the same concept comes with different instance names, we keep both concepts.

Let us consider the sentence "India is a large country" as input. After parsing the sentence, three concepts are added to the ontological concepts list: 'country:india', 'large', and 'country:*'. The system should then merge the two country concepts into one and update the ontological concepts to 'Country:india' and 'large'.

But if the sentence "John is playing with Bob." comes as input, parsing the sentence adds three concepts, 'Person:John', 'play', and 'Person:Bob', to the ontological concepts list. Although two concepts here have the same name, Person, we keep both, as they have different instance names.

Step 2: A conceptual graph consists of concepts and relationships between the concepts. From the domain ontology, the system finds the exact relationship between the two concepts of each consecutive pair in the ontological concepts list.


Fig. 5. Forming of subconceptual graphs: [CAT]→(Agent)→[Sitting] and [Sitting]→(Location)→[MAT]

Fig. 6. Forming of the desired final conceptual graph: [CAT]→(Agent)→[Sitting]→(Location)→[MAT]


As an example, if the sentence "Cat is sitting on the mat." comes as input, parsing the sentence adds three concepts, 'CAT', 'Sitting', and 'MAT', to the ontological concepts list. In this step the system finds the relationship between each pair of consecutive concepts from the domain ontology. Suppose the ontology defines the relationships as follows: 'Agent' is the relationship between 'CAT' and 'Sitting', and 'Location' is the relationship between 'Sitting' and 'MAT'.

Step 3: Make a subconceptual graph for each consecutive pair of concepts in the ontological concepts list by connecting them with the identified relationship defined in the ontology.

In our previous example of "Cat is sitting on the mat", there are two pairs of consecutive concepts, so two subconceptual graphs are formed, as shown in Figure 5.

Step 4: Merge the subconceptual graphs by their common concept name and develop the final desired conceptual graph, which can be recognized by any system that has the common domain ontology.

In the previous example of "Cat is sitting on the mat", the two subconceptual graphs are merged by their common concept 'Sitting' to build the desired final conceptual graph, shown in Figure 6.
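One simple way to realize Steps 2-4 is to store the conceptual graph as (concept, relation, concept) triples, as in the following hedged C sketch; find_relation is a hypothetical lookup into the domain ontology, and merging on a common concept is implicit because edges that share a concept name refer to the same vertex.

#include <string.h>

struct cg_edge {
    char concept1[32];
    char relation[32];
    char concept2[32];
};

extern const char *find_relation(const char *c1, const char *c2);  /* stub: ontology lookup */

/* Build one edge per consecutive pair in the ontological concepts list. */
int build_edges(char concepts[][32], int n, struct cg_edge *edges)
{
    int count = 0;
    for (int i = 0; i + 1 < n; i++) {
        const char *rel = find_relation(concepts[i], concepts[i + 1]);
        if (rel == NULL)
            continue;                       /* no relation defined for this pair */
        strcpy(edges[count].concept1, concepts[i]);
        strcpy(edges[count].relation, rel);
        strcpy(edges[count].concept2, concepts[i + 1]);
        count++;
    }
    return count;
}

For the example "Cat is sitting on the mat", the concept list {CAT, Sitting, MAT} yields the two edges (CAT, Agent, Sitting) and (Sitting, Location, MAT), which together form the graph of Figure 6.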

V. CASE STUDY

In this work our basic goal is to develop a conceptual graph from a natural language sentence given by a casual user while keeping the actual semantics intact, since a computer can understand the semantics of a sentence when it is represented as a conceptual graph. The main problem is that a single sentence can be represented in various ways even though all of them have the same semantics, so the conceptual graph for every sentence with the same semantics must be identical. We present here some examples of this problem with the three types of sentences; in each case the same conceptual graph is formed.

Fig. 7. Conceptual graph for the simple sentences: [APPLE]→(hasColor)→[Color:Red]

1) Case 1: Simple sentences: A casual user can give a simple sentence with the same semantics in various formats:

1. The apple is red.
2. The color of the apple is red.
3. The apple is of red color.

These three simple sentences have the same semantics but different forms. After parsing the first sentence we identify two concepts, 'Apple' and 'Red', where 'red' is an instance of the ontological concept 'Color' and 'Apple' is another ontological concept. So the OC list contains the concepts 'Apple' and 'Color:red'. For the second and third sentences the identified concepts are 'Color', 'Apple', and 'red', where 'red' is an instance of the 'Color' concept. The concepts 'Color:*' and 'Color:red' must therefore be merged into a single concept, so we keep 'Color:red' and discard 'Color:*'. Finally the OC list contains two concepts, 'Apple' and 'Color:red'. Since the OC lists contain the same elements, the developed conceptual graph is also the same for the three sentences, as shown in Figure 7.

2) Case 2: Compound sentences: A casual user can give a compound sentence with the same semantics in different formats:

1. John is happy and he is lucky.
2. John is lucky and he is glad too.

These two compound sentences have the same semantics but different forms. After parsing the first sentence we identify two simple sentences joined by 'and'. From the first simple sentence we identify two concepts, 'John' and 'Happy', where 'John' is an instance of the ontological concept 'Person' and 'Happy' is another ontological concept. So the OC list contains the concepts 'Happy' and 'Person:John'. From the second simple sentence we identify two concepts: 'he', which represents 'John', an instance of the ontological concept 'Person', and another ontological concept, 'Lucky'. After joining the concepts by their defined relationships, we get the conceptual graph shown in Figure 8. Here the two individual conceptual graphs are joined by the conjunction 'AND'.

The second sentence is also a collection of two simple sentences joined by 'and'. From the first simple sentence the identified concepts are 'John', an instance of the ontological concept 'Person', and another ontological concept, 'Lucky'. From the second simple sentence we identify the concept 'he', which represents John, an instance of the ontological concept 'Person'. But we cannot map the identified concept 'Glad' to any ontological concept, so the system checks WordNet for synonyms of glad and finds that glad is identical to 'Happy', which is a domain ontological concept. From the ontological concepts list, the system builds the conceptual graph, which is also identical to Figure 8. The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept 'Person:John'.


Fig. 8. Conceptual graph for the compound sentences: [Person:John]→(hasAttribute)→[Happy] and [Person:John]→(hasAttribute)→[Lucky], joined by AND


3) Case 3: Complex sentences: A casual user can give a complex sentence with the same semantics in various formats:

1. Because Ram forgot the time, he missed the test.
2. Ram missed the test as he forgot the time.

These two complex sentences have the same semantics but different forms. After parsing the first sentence, we identify two simple sentences: the first is dependent and the second is an independent simple sentence. From the dependent sentence we identify three concepts, 'Ram', 'Forget', and 'Time', where 'Ram' is an instance of the ontological concept 'Person' and 'Forget' and 'Time' are other ontological concepts. So the OC list contains the concepts 'Person:Ram', 'Forget', and 'Time'. From the independent simple sentence we identify three concepts: 'he', which represents 'Ram', an instance of the ontological concept 'Person', and the ontological concepts 'Miss' and 'Exam'. After joining the concepts by their defined relationships, we get the conceptual graph shown in Figure 9. The dotted line represents that the two individual conceptual graphs are joined by the same ontological concept 'Person:Ram'.

The second sentence is also a collection of two sentences: the first is independent and the second is dependent. For the independent simple sentence the identified concepts are 'Ram', an instance of the ontological concept 'Person', and the concepts 'Miss' and 'Test'. Now 'Miss' is a defined ontological concept, but 'Test' cannot be mapped to any concept, so the system checks the synonyms of 'Test' in WordNet. It finds that a synonym of 'Test' is 'Exam', which can be mapped to the ontological concept 'Exam', so 'Exam' is added to the OC list. The final OC list contains the three concepts 'Person:Ram', 'Miss', and 'Exam'. For the dependent sentence we identify the concept 'he', which represents 'Ram', an instance of the ontological concept 'Person', while 'Forget' and 'Time' are two defined ontological concepts. From the ontological concepts list the system builds the conceptual graph, which is also identical to Figure 9.

VI. CONCLUSION

In this work, we have defined a formal methodology to convert a natural language sentence into its corresponding conceptual graph form by consulting its common domain ontology.

Fig. 9. Conceptual graph for the complex sentences: [Person:Ram]→(agent)→[Forget]→(clock)→[Time] and [Person:Ram]→(agent)→[Miss]→(topic)→[Exam]

Thus a casual user and a computer can interact with each other while keeping the semantics of the communication. However, this approach cannot deal with complex sentences properly. The issues related to complex sentences are how to break them into simple sentences, whether the simple sentences are causally related, and, if they are, which sentence must be executed first. The future prospect of this work is therefore to define a formal method by which the problem of representing complex sentences can be overcome. Another prospect is to define a methodology which can deal with any kind of domain ontology, where all the verbs, nouns, and adjectives of a particular domain may not be defined as a concept or an instance of a concept. In other words, a methodology has to be developed which is independent of how the ontology is defined.

ACKNOWLEDGMENT

We are really grateful to the Department of Information Technology of NIT Durgapur for giving us a perfect environment and all the facilities to do this work.

REFERENCES

[1] Sowa, J. F.: Conceptual Structures: Information Processing in Mind and Machine, Addison-Wesley, Reading (1984).

[2] F. van Harmelen, V. Lifschitz, and B. Porter: Handbook of Knowledge Representation, Elsevier, 2008, pp. 213-237.

[3] Sowa, John F. (1976): Conceptual graphs for a database interface, IBM Journal of Research and Development 20:4, 336-357.

[4] Cao, T. H., Cao, T. D., Tran, T. L.: A Robust Ontology-Based Method for Translating Natural Language Queries to Conceptual Graphs, In: Domingue, J., Anutariya, C. (eds.) ASWC 2008. LNCS, vol. 5367, pp. 479-492. Springer, Heidelberg (2008).

[5] C. Snae and M. Brueckner: Ontology-Driven E-Learning System Based on Roles and Activities for Thai Learning Environment, Interdisciplinary Journal of Knowledge and Learning Objects, Volume 3, 2007.

[6] Nicholas Gibbins, Stephen Harris, Nigel Shadbolt: Agent-based Semantic Web Services, May 20-24, 2003, Budapest, Hungary. ACM 2003.

[7] A. G. Perez, M. F. Lopez and O. Corcho: Ontological Engineering, Springer.

[8] Alberto H. F., Berthier A., Altigran S., Juliana S.: A Brief Survey of Web Data Extraction Tools, ACM SIGMOD Record, v. 31, n. 2, June 2002.

[9] Wael Salloum: A Question Answering System based on Conceptual Graph Formalism, IEEE Computer Society Press, New York, 2009.

[10] D. Moll and M. Van Zaanen: Learning of Graph Rules for Question Answering, Proc. ALTW05, Sydney, December 2005.

[11] F. R. James, J. Dowdall, K. Kaljur, M. Hess, and D. Moll: Exploiting Paraphrases in a Question Answering System, In: Proc. Workshop in Paraphrasing at ACL2003, 2003.


[12] F. D. France, F. Yvon and O. Collin: Learning Paraphrases to Improve a Question-Answering System, In: Proceedings of the 10th Conference of EACL, Workshop on Natural Language Processing for Question-Answering, 2003.

[13] Tru H. Cao and Anh H. Mai: Ontology-Based Understanding of Natural Language Queries Using Nested Conceptual Graphs, Lecture Notes in Computer Science, Springer-Verlag, Volume 6208/2010, pp. 70-83 (2010).

[14] Svetlana Hensman: Construction of Conceptual Graph Representation of Texts, In: Proceedings of the Student Research Workshop at HLT-NAACL, Boston, 2004, pp. 49-54.

[15] Damljanovic, D., Tablan, V., Bontcheva, K.: A text-based query interface to OWL ontologies, In: 6th Language Resources and Evaluation Conference (LREC). ELRA, Marrakech (May 2008).

[16] Tablan, V., Damljanovic, D., Bontcheva, K.: A natural language query interface to structured information, In: Bechhofer, S., Hauswirth, M., Hoffmann, J., Koubarakis, M. (eds.) ESWC 2008. LNCS, vol. 5021, pp. 361-375. Springer, Heidelberg (2008).

[17] Danica Damljanovic, Milan Agatonovic, and Hamish Cunningham: Natural Language Interfaces to Ontologies: Combining Syntactic Analysis and Ontology-Based Lookup through the User Interaction, In: Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010). Lecture Notes in Computer Science, Springer-Verlag, Heraklion, Greece (June 2010).

[18] Damljanovic, D., Agatonovic, M., Cunningham, H.: Identification of the Question Focus: Combining Syntactic Analysis and Ontology-based Lookup through the User Interaction, In: 7th Language Resources and Evaluation Conference (LREC). ELRA, La Valletta (May 2010).

[19] George A. Miller: WordNet: An On-line Lexical Database, International Journal of Lexicography, Vol. 3, No. 4, 1990.


Web Services Privacy Measurement Based on Privacy

Policy and Sensitivity Level of Personal Information

Punyaphat Chaiwongsa and Twittie Senivongse

Computer Science Program, Department of Computer Engineering

Faculty of Engineering, Chulalongkorn University

Bangkok, Thailand

[email protected], [email protected]

Abstract—Web services technology has been in the mainstream of

today’s software development. Software designers can select Web

services with certain functionality and use or compose them in

their applications with ease and flexibility. To distinguish

between different services with similar functionality, the

designers consider quality of service. Privacy is one aspect of

quality that is largely addressed since services may require

service users to reveal personal information. A service should

respect the privacy of the users by requiring only the information

that is necessary for its processing as well as handling personal

information in a correct manner. This paper presents a privacy

measurement model for service users to determine privacy

quality of a Web service. The model combines two aspects of

privacy. That is, it considers the degree of privacy principles

compliance of the service as well as the sensitivity level of user

information which the service requires. The service which

complies with the privacy principles and requires less sensitive

information would be of high quality with regard to privacy. In

addition, the service WSDL can be augmented with semantic

annotation using SAWSDL. The annotation specifies the

semantics of the user information required by the service, and

this can help automate privacy measurement. We also present a

measurement tool and an example of its application.

Keywords-privacy; privacy policy; personal information;

measurement; Web services; ontology

I. INTRODUCTION

Web services technology has been in the mainstream of software development since it allows software designers to use Web services with certain functionality in their applications with ease and flexibility. Software designers study service information that is published on service providers’ Web sites or through service directories and select the services that have the functionality as required by the application requirements. For those with similar functionality, different aspects of quality of service (QoS) are usually considered to distinguish them.

Privacy is one aspect of quality that is largely addressed since Web services may require service users to reveal personal information. An online shopping Web service may ask a user to give personal information such as name, address, phone number, and credit card number when buying products, and a student registration Web service of a university would also ask for students’ personal information to maintain student records. A Web service should respect the privacy of service users by

requiring only the information that is necessary for its processing as well as handling personal information in a correct manner. From a view of a service user, proper handling of the disclosed personal information is highly expected. From a view of a software designer who is developing a service-based application, it is desirable to select a Web service with privacy quality into the application since the privacy quality of the service contributes to that of the application. The application itself should also respect the privacy of the application users.

In this paper, we present a privacy measurement model for service users to determine privacy quality of a Web service. The model combines two aspects of privacy. That is, it considers the degree of privacy principles compliance of the service as well as the sensitivity level of user information which the service requires. The model follows the approach by Yu et al. [1] which assesses if the privacy policy of a Web service complies with a set of privacy principles. We enhance it by also considering sensitivity level of users’ personal information. The approach by Jang and Yoo [2] is adapted to determine sensitivity level of personal information that is exchanged with the service. According to our privacy measurement model, a service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy. In addition, we develop a supporting tool for the model. The tool relies on augmenting WSDL data elements of the service with semantic annotation using the SAWSDL mechanism [3]. The annotation specifies the meaning of WSDL data elements based on personal information ontology, i.e., a semantic term associated with a data element indicates which personal information the data element represents. Semantic annotation is useful for disambiguating user information that may be named differently by different Web services. As a result, it helps automate privacy measurement and facilitates the comparison of privacy quality of different Web services. Combining these two aspects of privacy, the model is considered practical for service users since the assessment is based on the privacy policy and service WSDL which can be easily accessed.

Section II of this paper discusses related work. Section III describes an assessment of privacy policy of a Web service based on privacy principles and Section IV presents measurement of sensitivity level of personal information. The privacy measurement model combining these two aspects of


privacy is proposed in Section V. The supporting tool is described in Section VI and the paper concludes in Section VII.

II. RELATED WORK

W3C has stated in the Web Services Architecture Requirements [4] that Web services architecture must enable privacy protection for service consumers. Web services must express privacy policy statements which comply with the Platform for Privacy Preferences (P3P), and the policy statements must be accessible to service consumers. Service providers generally publish privacy policy statements which follow privacy protection guidelines proposed by governmental or international organizations, and these statements are the basis for privacy protection measurement.

A. Related Work in Privacy Measurement Based on Privacy

Policy

Following Canadian Standards Association Privacy Principles, Yee [5] specifies how to define privacy policy, and a method to measure how well a service protects user privacy based on measurement of violations of the user’s privacy policy. The work is extended to consider compliances between E-service provider privacy policy and user privacy policy using a privacy policy agreement checker [6]. Similarly, Xu et al. [7] provide for a composite service and its user a policy compliance checker which considers sensitivity levels of personal data that flow in the service together with trust levels and data flow permission given to the services in the composition. Tavakolan et al. [8] propose a model for privacy policy and a method to match and rank privacy policies of different services with user’s privacy requirements. We are particularly interested in the work by Yu et al. [1] which follows 10 privacy principles defined in the Australia National Privacy Principles (Privacy Amendment Act 2000). The work proposes a checklist to rate privacy protection of a Web service with regard to each privacy principle. A privacy policy checker which can be plugged into the Web service application is also developed to check for privacy principles compliance.

B. Related Work in Privacy Measurement Based on

Sensitivity Level of Personal Information

Yu et al. [9] present a QoS model to derive privacy risk in service composition. The privacy risk is computed using the percentage of private data the users have to release to the services. The users can define weights that quantify a potential damage if the private data leak. Hewett and Kijsanayothin [10] propose privacy-aware service composition which finds an executable service chain that satisfies a given composite service I/O requirements with minimum number of services and minimum information leakage. To quantify information leakage, sensitivity levels are assigned to different types of personal information that flows in the composition. The composition also complies with users’ privacy preferences and providers’ trust. We are particularly interested in the comprehensive view of privacy sensitivity level of Jang and Yoo [2]. They address four factors of sensitivity, i.e. degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors which we can adapt for the work.

III. ASSESSMENT OF WEB SERVICE PRIVACY POLICY

For the privacy policy aspect, we simply adopt a privacy

principles compliance assessment by Yu et al. [1]. According

to the Australia National Privacy Principles (Privacy

Amendment Act 2000), there are 10 privacy principles for

proper management of personal information. For each

principle, Yu et al. list a number of criteria to rate privacy

compliance of a service. For full detail of the compliance

checklist, see [1]. Here we show a small part of the checklist

through our supporting tool in Fig. 1. For instance, there are 3

criteria that a service has to follow to comply with the

collection principle, i.e., the privacy policy statements must

state (1) the kind of data being collected, (2) the method of

data collection, and (3) the purpose of data collection. The

service user can check with the published privacy policy how

many of these criteria the service satisfies, and then give the

compliance rating score. Thus for the collection principle, the

maximum rating is 3; the rating ranges between 0-3. The

service user can also define a weighted score for each privacy

principle denoting the relative importance of each principle.

The total privacy principle compliance (Pcom) score of a

service is computed by (1) [1]:

P_{com} = \sum_{i=1}^{10} \frac{r_i}{r_{i\,max}} \, p_i \qquad (1)

where

r_i = rating for principle i assessed by the service user,
r_{i\,max} = maximum rating for principle i,
p_i = weighted score for principle i assigned by the service user, and \sum_{i=1}^{10} p_i = 100.

P_{com} ranges between 0 and 100. Instead, we will later use the normalized NP_{com} of (2), which ranges between 0 and 1, in our privacy measurement model in Section V:

NP_{com} = \sum_{i=1}^{10} \frac{r_i}{r_{i\,max}} \, \frac{p_i}{100} = \frac{P_{com}}{100}. \qquad (2)

As an example, a user of a Register service of a university, which registers student information, rates and gives a weight for each privacy principle as in Table I. Pcom of this service then is 87.08 and NPcom is 0.87.
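A minimal C sketch of equations (1) and (2) is shown below; the array layout and the function name are illustrative assumptions, not part of the proposed tool.

#define NUM_PRINCIPLES 10

/* Returns Pcom as in (1) and writes NPcom as in (2) through np_com. */
double compliance_score(const double r[NUM_PRINCIPLES],
                        const double r_max[NUM_PRINCIPLES],
                        const double p[NUM_PRINCIPLES],
                        double *np_com)
{
    double p_com = 0.0;
    for (int i = 0; i < NUM_PRINCIPLES; i++)
        p_com += (r[i] / r_max[i]) * p[i];   /* Pcom, equation (1) */
    *np_com = p_com / 100.0;                 /* NPcom, equation (2) */
    return p_com;
}

/* With the values of Table I (r = {2,2,2,2,2,3,2,0,2,1}, r_max =
   {3,2,2,2,2,4,2,1,2,1}, p = {20,10,5,10,5,5,2,5,8,30}) this yields
   Pcom = 87.08 and NPcom = 0.87. */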

Figure 1. Assessing privacy principles compliance using our tool.


TABLE I. EXAMPLE OF PRIVACY PRINCIPLES COMPLIANCE RATING

No.  Privacy Principles       Rating ri   Max Rating rimax   Weight pi   Score ri/rimax*pi
1    Collection               2           3                  20          13.33
2    Use and Disclosure       2           2                  10          10
3    Data Quality             2           2                  5           5
4    Data Security            2           2                  10          10
5    Openness                 2           2                  5           5
6    Access and Correction    3           4                  5           3.75
7    Identifiers              2           2                  2           2
8    Anonymity                0           1                  5           0
9    Transborder Data Flows   2           2                  8           8
10   Sensitive Information    1           1                  30          30
     Total                                                   100         Pcom = 87.08
                                                                          NPcom = 0.87

IV. ASSESSMENT OF SENSITIVITY LEVEL OF PERSONAL

INFORMATION

The motivation for assessing sensitivity level of personal

information is that, for different Web services with similar

functionality, a service user would prefer one to which

disclosure of personal information is limited. It is therefore

desirable that less number of personal data items is required by

the service and the data items that are required are also less

sensitive. We adapt from the approach by Jang and Yoo [2]

which analyzes sensitivity level of personal information based

on personal information classification.

A. Formal Concept Analysis and Ontology of Personal

Information

Jang and Yoo represent personal information classification using a formal concept analysis (FCA) [11]. The formal definition of a data group, i.e., personal information in this case, is given as

DG = (G, N, R)

where G is a finite set of concepts and can be described as G = {g1, g2, ..., gn},

N is a finite set of attributes which describe the concepts and can be described as N = {n1, n2, ..., nm}, and

R is a binary relation between G and N, i.e., R ⊆ G × N. For example, g1 R n1, or (g1, n1) ∈ R, represents that the concept g1 has an attribute n1.

The formal concepts can also be described using a cross

table. We extend the cross table of [2] to create one as shown

in Table II. Here personal information is classified into 7

concepts, i.e., G = {Basic, Career, …, Finance}, and there are 37 personal information attributes, i.e., N = {BirthPlace, BirthDay, …, CreditcardNumber}. The cross table shows the

relation, marked by an x, between each concept and attributes

of the concept. For example, BirthPlace belongs in the Basic

and Private concepts while the Basic concept has 15 attributes,

i.e., BirthPlace, BirthDay, …, DrivingLicenseNumber.

For a Web service, its WSDL interface document defines what users' personal information is required for the processing of the service. However, different services with similar functionality may name the exchanged data elements differently. A service, for example, may require a data element called Address whereas another requires Addr. In order to infer that the two services require the same personal data, both the Address and Addr elements in the two WSDLs can be annotated with the same semantic information. To disambiguate user information that may be named differently by different services, we augment the WSDL data elements of a service with semantic annotation using the SAWSDL mechanism [3]. The annotation specifies the meaning of WSDL data elements based on a personal information ontology. We represent the personal information concepts and attributes in the cross table (Table II) as an OWL-based personal information ontology as in Fig. 2. The attribute sawsdl:modelReference is associated with a data element in the WSDL document to reference a semantic term in the ontology. In the WSDL of the Register service in Fig. 3, for example, the meaning of the data element called Name is the term PersonName in the ontology in Fig. 2. Semantic annotation is useful for automating privacy measurement and facilitates comparison of the privacy quality of different services.

TABLE II. CROSS TABLE OF PERSONAL INFORMATION, ADAPTED FROM [2]


Figure 2. Part of personal information ontology.

Figure 3. Part of semantics-annotated WSDL document.

B. Sensitivity Level of Personal Information

Jang and Yoo [2] address four factors of privacy sensitivity for personal information, i.e., degree of conjunction, principle of identity, principle of privacy, and value of analogism. They also give a guideline to evaluate these sensitivity factors, which we adapt for our work. We define formulas to compute the scores of these factors based on the cross table (Table II) as follows.

1) Degree of conjunction of an attribute (personal data item) n is derived from the number of concepts which the attribute n describes. This means n is associated with these concepts and the disclosure of n may lead to other information belonging in these concepts. The degree of conjunction of n, or DC(n), is determined by (3):

DC(n) = (number of concepts in which n belongs) / (total number of concepts).  (3)

For example, from Table II, PersonName is associated with 5 out of 7 concepts, i.e., Basic, Career, Health, School, and Finance. Therefore DC(PersonName) = 5/7.

2) Principle of identity of an attribute n indicates that n is an identity attribute of the concept with which it is associated, i.e., n is used as key information to access other attributes in that concept. Disclosure of n may then lead to more problems than disclosure of other attributes. The principle of identity of n, or IA(n), is determined by (4):

IA(n) = 0, if n is not an identity attribute;
IA(n) = (number of attributes in the concept for which n is the identity attribute) / (total number of attributes), if n is an identity attribute.  (4)

For example, from Table II, StudentID is an identity attribute (i.e., it belongs in the concept Identity) for the concept School. There are 10 attributes associated with School and there are 37 attributes in total. Therefore IA(StudentID) = 10/37. HomeAddress, in contrast, is not an identity attribute and IA(HomeAddress) = 0.

3) Principle of privacy of an attribute n indicates that n is private information. Note that this is subjective to the service users, e.g., some users may consider Age as private information whereas others may not. We let the service users customize the cross table by specifying which attributes are considered private, i.e., belong in the concept Private. The principle of privacy of n, or PA(n), is determined by (5):

PA(n) = 0, if n does not belong in the concept Private;
PA(n) = 1, if n belongs in the concept Private.  (5)

For example, from Table II, CellphoneNumber is private and PA(CellphoneNumber) = 1, whereas PersonalEmailAddress is not and PA(PersonalEmailAddress) = 0.

4) Value of analogism of an attribute n indicates that n can be used to derive other attributes. This means the knowledge of n can also reveal other personal information. The value of analogism of n, or AA(n), is determined by (6):

AA(n) = 0, if n cannot derive other attributes;
AA(n) = 1, if n can derive other attributes.  (6)

The analogy between attributes has to be defined and associated with the cross table and the personal information ontology. For example, SocialSecurityNumber can derive other attributes such as BirthPlace, so AA(SocialSecurityNumber) = 1, whereas Age cannot and AA(Age) = 0.

<xs:element name="RegisterRequest"> <xs:complexType> <xs:sequence> <xs:element name="Name" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonName"/> <xs:element name="Address" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#HomeAddress"/> <xs:element name="MobilephoneNo" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#CellphoneNumber"/> <xs:element name="Email" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#PersonalEmailAddress"/> <xs:element name="StdID" type="xs:string" sawsdl:modelReference="http://localhost/ws/ontology/PI#StudentID"/> </xs:sequence> </xs:complexType> </xs:element>


All four sensitivity factor scores range between 0 and 1. Based on these scores, Jang and Yoo suggest that the sensitivity level of an attribute n, or SL(n), be determined by (7) [2]:

SL(n) = DC(n) + IA(n) + PA(n) + AA(n). (7)

We propose to compute the sensitivity level of all personal information exchanged with a Web service using (8):

SLws = \sum_{i=1}^{k} SLi  (8)

where k = the number of exchanged personal data elements and SLi = the sensitivity level of personal data element i, computed by (7). We will later use a normalized NSLws, as in (9), which ranges between 0 and 1, in our privacy measurement model in Section V:

NSLws = SLws / (4k) = (1/(4k)) \sum_{i=1}^{k} SLi.  (9)

As an example, suppose a Register service of a university requires the following personal information: Name, Address, MobilephoneNo, Email, and StdID. In the WSDL in Fig. 3, these data elements are annotated with semantic terms described in the personal information ontology in Fig. 2. We can determine the sensitivity level of each data element by calculating the sensitivity level of the associated semantic term using (7), and the total sensitivity level of all personal data required by the service using (8) and (9) as in Table III.
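The calculation in (7)-(9) for this example can be sketched in a few lines of Python. The per-factor values are copied from the worked example (Table III) rather than derived automatically from the cross table and ontology, so this is only a hand-checkable illustration of the formulas.

# Sketch: sensitivity level of the personal data required by the Register service.
factors = {
    # semantic term: (DC, IA, PA, AA), as in Table III
    "PersonName":           (5/7, 0,     0, 0),
    "HomeAddress":          (1/7, 0,     0, 0),
    "CellphoneNumber":      (6/7, 0,     1, 0),
    "PersonalEmailAddress": (3/7, 0,     0, 0),
    "StudentID":            (2/7, 10/37, 0, 0),
}

def sl(term):                    # (7): SL(n) = DC(n) + IA(n) + PA(n) + AA(n)
    return sum(factors[term])

sl_ws = sum(sl(t) for t in factors)      # (8)
nsl_ws = sl_ws / (4 * len(factors))      # (9): each SL(n) is at most 4
print(round(sl_ws, 2), round(nsl_ws, 3)) # 3.7 0.185 (Table III rounds the latter to 0.19)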

V. WEB SERVICES PRIVACY MEASUREMENT MODEL

We combine the two privacy aspects in Sections III and IV into a privacy measurement model. The normalized privacy principles compliance NPcom of a service is a positive aspect. A service user would prefer a service with a high compliance rating. The service provider is encouraged to follow privacy principles, provide proper management of users' personal information, and publish a clear privacy policy that can facilitate compliance rating by the service users. On the contrary, the normalized sensitivity level NSLws of the service is a negative aspect. Using a service which exchanges highly sensitive personal data could mean a high risk of privacy violation if these data are disclosed or not protected properly.

TABLE III. EXAMPLE OF SENSITIVITY LEVEL MEASUREMENT

Data Element    Semantic Annotation n    DC(n) (3)   IA(n) (4)   PA(n) (5)   AA(n) (6)   SL(n) (7)
Name            PersonName               5/7         0           0           0           0.71
Address         HomeAddress              1/7         0           0           0           0.14
MobilephoneNo   CellphoneNumber          6/7         0           1           0           1.86
Email           PersonalEmailAddress     3/7         0           0           0           0.43
StdID           StudentID                2/7         10/37       0           0           0.56
Total                                                                                    SLws = 3.7
                                                                                         NSLws = 3.7/(4*5) = 0.19

The privacy quality P of a service is computed by (10). The service user can also define weighted scores α and β to denote the relative importance of the two privacy aspects; α and β are in [0, 1] and α + β = 1. A service which complies with the privacy principles and requires less sensitive information would be of high quality with regard to privacy.

P = α NPcom + β (1 − NSLws).  (10)

As an example, given equal weights to the two privacy aspects and the assessment in Tables I and III, the privacy quality of the Register service is

P = (0.5)(0.87) + (0.5)(1 − 0.19) = 0.435 + 0.405 = 0.84.

The Register service has a high privacy principles compliance level and requires personal data that are relatively not so sensitive. It is therefore desirable in terms of privacy.
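Equation (10) and the example above can be checked directly; the following sketch only plugs in the example values (NPcom = 0.87, NSLws = 0.19, α = β = 0.5) and is not part of the supporting tool described next.

# Sketch of (10): privacy quality as a weighted combination of the two aspects.
def privacy_quality(np_com, nsl_ws, alpha=0.5, beta=0.5):
    assert abs(alpha + beta - 1.0) < 1e-9     # alpha + beta must equal 1
    return alpha * np_com + beta * (1 - nsl_ws)

print(round(privacy_quality(0.87, 0.19), 2))  # 0.84 for the Register service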

VI. DEVELOPMENT OF SUPPORTING TOOL

Besides the proposed model, we have developed a Web-based tool called a privacy measurement system to support the model. To be able to automate privacy measurement, the tool relies on the service WSDL being annotated with semantic terms described in the personal information ontology. The usage scenario of the privacy measurement system is depicted in Fig. 4 and can be described as follows.

1) The privacy measurement system obtains the cross table and personal information ontology from a privacy domain expert. In the prototype of the tool, the cross table in Table II and a personal information ontology that corresponds to the cross table are used.

2) A service user specifies the Web service whose privacy is to be measured. Together with the service WSDL URL, the user uses the tool to specify the following:

a) The privacy principles compliance rating ri and weight pi for each privacy principle; the user will have to check the privacy policy of the service in order to rate it.

Figure 4. Usage scenario of privacy measurement system.


b) Personal data attributes that are considered private; these attributes will be associated with the concept Private of the cross table.

c) Weights α and β for the privacy measurement model.

The users of the tool could be end users of the services or software designers who are assessing the privacy quality of services to be aggregated in service-based applications. Additionally, service providers may use the tool for self-assessment; the measurement can be used for comparison with competing services and as a guideline for improving privacy protection.

3) The tool imports the WSDL document of the service. It is assumed that the service provider annotates the WSDL based on the personal information ontology.

4) The tool calculates the privacy score of the service and informs the user.

As an example, a screenshot reporting privacy measurements of the Register service is shown in Fig. 5.

Figure 5. Example of measurements screen.

VII. CONCLUSION

This paper presents a privacy measurement model which combines and enhances existing privacy measurement approaches. The model considers both privacy principles compliance and the sensitivity level of personal information. The basis of the measurement is the privacy policy published by the service provider and the user's personal information that is exchanged with the service. The model can be applied even in the absence of either kind of information. We also present a supporting tool which can automate privacy measurement based on semantic annotation added to WSDL data elements.

Generally a service user can consider the privacy score as one of the QoS scores used to distinguish services with similar functionality. As discussed earlier, the privacy score is subjective to the users who assess the service. The score may vary depending on how the service provider provides a proof of privacy principles compliance, the expectation of the user when rating the compliance, and the user's personal view on private data. Also, the cross table presented in Table II is an example and is not intended to be exhaustive. A privacy measurement system can adjust the concepts, attributes, and their relations within the cross table as well as the corresponding personal information ontology.

Since the measurement tool makes use of semantics-enhanced WSDLs, a limitation is that we require the service providers to specify semantics. However, semantic information only helps automate the calculation; the measurement model itself does not rely on semantic annotation. The approach can still be followed and the measurement model can still be used even when WSDL documents are not semantics-annotated.

At present, we target privacy of single Web services. The approach can be extended to composite services. We are planning an empirical evaluation of the model by service users and an experiment with real-world Web services as well as cloud services.

REFERENCES

[1] W. D. Yu, S. Doddapaneni, and S. Murthy, “A privacy assessment approach for service oriented architecture applications,” in Procs. of 2nd IEEE Int. Symp. on Service-Oriented System Engineering (SOSE 2006), 2006, pp. 67-75.

[2] I. Jang and H. S. Yoo, “Personal information classification for privacy negotiation,” in Procs. of 4th Int. Conf. on Computer Sciences and Convergence Information Technology (ICCIT 2009), 2009, pp. 1117-1122.

[3] W3C, Semantic Annotations for WSDL and XML Schema, http://www.w3.org/TR/2007/REC-sawsdl-20070828/, 28 August 2007.

[4] W3C, Web Services Architecture Requirements, http://www.w3.org/TR/wsa-reqs/, 11 February 2004.

[5] G. Yee, “Measuring privacy protection in Web services,” in Procs. of IEEE Int. Conf. on Web Services, 2006, pp. 647-654.

[6] G. O. M. Yee, “An automatic privacy policy agreement checker for E-services,” in Procs. of Int. Conf. on Availability, Reliability and Security, 2009, pp. 307-315.

[7] W. Xu, V. N. Venkatakrishnan, R. Sekar, and I. V. Ramakrishnan, “A framework for building privacy-conscious composite Web services,” in Procs. of IEEE Int. Conf. on Web Services, 2006, pp. 655-662.

[8] M. Tavakolan, M. Zarreh, and M. A. Azgomi, “An extensible model for improving the privacy of Web services,” in Procs. of Int. Conf. on Security Technology, 2008, pp. 175-179.

[9] T. Yu, Y. Zhang, and K. J. Lin, “Modeling and measuring privacy risks in QoS Web services,” in Procs. of 8th IEEE Int. Conf. on E-Commerce Technology and 3rd IEEE Int. Conf. on Enterprise Computing, E-Commerce, and E-Services, 2006.

[10] R. Hewett and P. Kijsanayothin, “On securing privacy in composite web service transactions,” in Procs. of 5th Int. Conf. for Internet Technology and Secured Transactions (ICITST’09), 2009, pp. 1-6.

[11] Uta Priss, “Formal Concept Analysis,” http://www.upriss.org.uk/fca/fca.html, Last accessed: 24 February 2012.


Measuring Granularity of Web Services with Semantic Annotation

Nuttida Muchalintamolee and Twittie Senivongse

Computer Science Program, Department of Computer Engineering
Faculty of Engineering, Chulalongkorn University
Bangkok, Thailand
[email protected], [email protected]

Abstract— Web services technology has been one of the mainstream technologies for software development since Web services can be reused and composed into new applications or used to integrate software systems. Granularity or size of a service refers to the functional scope or the amount of detail associated with service design, and it has an impact on the ability to reuse or compose the service in different contexts. Designing a service with the right granularity is a challenging issue for service designers and mostly relies on designers' judgment. This paper presents a granularity measurement model for a Web service with semantics-annotated WSDL. The model supports different types of service design granularity, and semantic annotation helps with the analysis of the functional scope and amount of detail associated with the service. Based on granularity measurement, we then develop a measurement model for service reusability and composability. The measurements can assist in service design and the development of service-based applications.

Keywords- service granularity; measurement; reusability; composability; semantic Web services; ontology

I. INTRODUCTION

Web Services technology has been one of the mainstream technologies for software development since it enables rapid flexible development and integration of software systems. The basic building blocks are Web services which are software units providing certain functionalities over the Web and involving a set of interface and protocol standards, e.g. Web Service Definition Language (WSDL) as a service contract, SOAP as a messaging protocol, and Business Process Execution Language (WS-BPEL) as a flow-based language for service composition [1]. The technology promotes service reuse and service composition as the functionalities provided by a service should be reusable or composable in different contexts of use. Granularity of a service impacts on its reusability and composability.

Erl [1] defines granularity in the context of service design as "the level of (or absence of) detail associated with service design." The service contract or service interface is the primary concern in service design since it represents what the service is designed to do and gives detail about its scope or size. Erl classifies four types of service design granularity: (1) Service granularity refers to the functional scope or the quantity of potential logic the service could encapsulate based on its context. (2) Capability granularity refers to the functional scope of a specific capability (or operation). (3) Data granularity is the amount of data to be exchanged in order to carry out a capability. (4) Constraint granularity is the amount of validation constraints associated with the information exchanged by a capability.

Different types of granularity impact service reusability and composability in different ways. Erl differentiates between these two terms: reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service compositions [1]. A coarse-grained service with a broad functional context should be reusable in different situations, while a fine-grained service capability can be composable in many service assemblies. Coarse-grained data exchanged by a capability could be a sign that the capability has a large scope of work and should be good for reuse, while a capability with very fine-grained (detailed) data validation constraints should be more difficult to reuse or compose in different contexts with different data formats. Inappropriate granularity design affects not only reusability and composability but also the performance of the service. Fine-grained capabilities, for example, may incur invocation overheads since many calls have to be made to perform a task [2]. Designing a service with the right granularity is a challenging issue for service designers and mostly relies on designers' judgment.

To help determine service design granularity, we present a granularity measurement model for a Web service with semantics-annotated WSDL. The model supports all four types of granularity, and semantic annotation is based on the domain ontology of the service, which is expressed in OWL [3]. The motivation is that semantic annotation should give more information about the functional scope of the service and other detail which would help to determine granularity more precisely. Semantic concepts from the domain ontology can be annotated to different parts of a WSDL document using Semantic Annotations for WSDL and XML Schema (SAWSDL) [4]. Based on granularity measurement, we then develop a measurement model for service reusability and composability.

Section II of the paper discusses related work. Section III introduces a Web service example which will be used throughout the paper. The granularity measurement model and the reusability and composability measurement models are presented in Sections IV and V. Section VI gives an evaluation of the models and the paper concludes in Section VII.

II. RELATED WORK

Several research efforts have addressed the importance of granularity to service-oriented systems. Haesen et al. [5] propose a classification of service granularity types which consists of data granularity, functionality granularity, and business value granularity. The impact of these types on architectural issues, e.g., reusability, performance, and flexibility, is discussed. In their approach, the term "service" refers more to an operation than to a service with a collection of capabilities as defined by Erl. Feuerlicht [6] discusses that service reuse is difficult to achieve and uses composability as a measure of service reuse. He argues that the granularity of services and the compatibility of service interfaces are important to composability, and presents a process of decomposing coarse-grained services into fine-grained services (operations) with normalized interfaces to facilitate service composition.

On granularity measurement, Shim et al. [7] propose a design quality model for SOA systems. The work is based on a layered model of design quality assessment. Mappings are defined between design metrics, which measure service artifacts, and design properties (e.g., coupling, cohesion, complexity), and between design properties and high-level quality attributes (e.g., effectiveness, understandability, reusability). Service granularity and parameter granularity are among the design properties. Service granularity considers the number of operations in the service system and the similarity between them (based on similarity of their messages). Parameter granularity considers the ratio of the number of coarse-grained parameter operations to the number of operations in the system. Our approach is inspired by this work but we focus only on granularity measurement for a single Web service, not on system-wide design quality, and will link granularity to reusability and composability attributes. We notice that their granularity measurement relies on the designer’s judgment, e.g., to determine if an operation has fine-grained or coarse-grained parameters. We thus use semantic annotation to better understand the service. Another approach to granularity measurement is by Alahmari et al. [8]. They propose metrics for data granularity, functionality granularity, and service granularity. The approach considers not only the number of data and operations but also their types which indicate whether the data and operations involve complicated logic. The impact on service operation complexity, cohesion, and coupling is discussed. Khoshkbarforoushha et al. [9] measure reusability of BPEL composite services. The metric is based on analyzing description mismatch and logic mismatch between a BPEL service and requirements from different contexts of use.

III. EXAMPLE

An online booking Web service will be used to demonstrate our idea. It provides service for any product booking and includes several functions such as viewing product information and creating and managing bookings. Fig. 1 shows the WSDL 2.0 document of the service. Suppose the WSDL is enhanced with semantic descriptions. The figure shows the use of SAWSDL tags [4] to reference the semantic concepts in a service domain ontology to which different parts of the WSDL correspond. Here the meaning of the data type named ProductInfo is the term ProductInfo in the domain ontology OnlineBooking in Fig. 2, and the meaning of the operation named viewProduct is the term SearchProductDetail.

IV. GRANULARITY MEASUREMENT MODEL

Granularity measurement considers the schema and semantics of the WSDL description. Semantic granularity is determined first and then applied to different granularity types.

Figure 1. WSDL of online booking Web service with SAWSDL annotation.

<?xml version="1.0" encoding="UTF-8"?>
<wsdl:description targetNamespace="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
    xmlns="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:wsdl="http://www.w3.org/ns/wsdl"
    xmlns:sawsdl="http://www.w3.org/ns/sawsdl">
  <wsdl:types>
    <xs:schema targetNamespace="http://localhost:8101/GranularityMeasurement/wsdl/OnlineBooking#"
        elementFormDefault="qualified">
      <xs:element name="viewProductReq" type="productId"/>
      <xs:element name="viewProductRes" type="productInfo"/>
      ...
      <xs:simpleType name="productId">
        <xs:restriction base="xs:string">
          <xs:pattern value="[0-9]{4}"/>
        </xs:restriction>
      </xs:simpleType>
      <xs:complexType name="productInfo"
          sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#ProductInfo">
        <xs:sequence>
          <xs:element name="productName" type="xs:string"/>
          <xs:element name="productType" type="productType"/>
          <xs:element name="description" type="xs:string"/>
          <xs:element name="unitPrice" type="xs:float"/>
        </xs:sequence>
      </xs:complexType>
      <xs:simpleType name="productType">
        <xs:restriction base="xs:string">
          <xs:pattern value="[A-Z]"/>
        </xs:restriction>
      </xs:simpleType>
      ...
    </xs:schema>
  </wsdl:types>
  <wsdl:interface name="OnlineBookingWSService"
      sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#OrderManagement">
    <wsdl:operation name="viewProduct" pattern="http://www.w3.org/ns/wsdl/in-out"
        sawsdl:modelReference="http://localhost:8101/GranularityMeasurement/ontology/OnlineBooking#SearchProductDetail">
      <wsdl:input element="viewProductReq"/>
      <wsdl:output element="viewProductRes"/>
    </wsdl:operation>
    ...
  </wsdl:interface>
</wsdl:description>


Figure 2. A part of domain ontology for online booking (in OWL).

A. Semantic Granularity

When a part of WSDL is annotated with a semantic term, we determine the functional scope and amount of detail associated with that WSDL part through the semantic information that can be derived from the annotation. Class-subclass and whole-part property are semantic relations that are considered. Class-subclass is a built-in relation in OWL but whole-part is not. We define an ObjectProperty part (see Fig. 2) to represent the whole-part relation, and any whole-part relation between classes will be defined as a subPropertyOf part. Then, semantic granularity of a term t which is in a class-subclass/whole-part relation is computed by (1):

SemanticGranularity(t) = number of terms under t in either the class-subclass relation or the whole-part relation, including t itself.  (1)

Figure 3. Semantic granularity of ProductInfo and related terms.

Using (1), Fig. 3 shows semantic granularity of the semantic term ProductInfo and its related terms with respect to class-subclass and whole-part property relations. When an ontology term is annotated to a WSDL part, it transfers its semantic granularity to the WSDL part.
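The counting in (1) can be sketched as follows. The relation is given as a plain child-to-parent map and is passed one relation at a time, matching how the class-subclass and whole-part counts are used separately later as ac and ap; the subclass names below are hypothetical, since the paper's tool would read the relations from the OWL ontology (e.g., with Jena) rather than hard-code them.

# Sketch of (1): semantic granularity of a term counts the terms under it,
# including the term itself, in one relation (class-subclass or whole-part).
def descendants(term, parent_of):
    """All terms that transitively point to `term` in the given relation."""
    found = set()
    frontier = [term]
    while frontier:
        current = frontier.pop()
        for child, parent in parent_of.items():
            if parent == current and child not in found:
                found.add(child)
                frontier.append(child)
    return found

def semantic_granularity(term, parent_of):
    return len(descendants(term, parent_of)) + 1   # number of terms under it, plus itself

# Toy data: ProductInfo with hypothetical subclasses (names illustrative only).
subclass_of = {"HotelInfo": "ProductInfo", "RoomInfo": "HotelInfo"}
print(semantic_granularity("ProductInfo", subclass_of))  # 3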

B. Constraint Granularity

A service capability (or operation) needs to operate on correct input and output data, so constraints are put on the exchanged data for validation purposes. Constraint granularity considers the number of control attributes and restrictions (other than defaults) that are assigned to the schema of WSDL data, e.g.,

• attributes of <xs:element/> such as "fixed", "nillable", "maxOccurs" and "minOccurs"

• <xs:restriction/>, which contains a restriction on the element content.

Constraint granularity R of a capability o is computed by (2):

Ro = \sum_{i=1}^{n} \sum_{j=1}^{mi} Constraint_ij  (2)

where n = the number of parameters of the operation o,

mi = the number of elements/attributes of the i-th parameter, and

Constraint_ij = the number of constraints of an element/attribute of a parameter.

In Fig. 1, the operation viewProduct has two constraints on two out of five input/output data elements, i.e., constraints on productId and productType. So its constraint granularity is 2.
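A rough sketch of the counting behind (2) is given below. It scans a WSDL's schema with Python's xml.etree, counting non-default control attributes on <xs:element> and the facets inside each <xs:restriction>; mapping elements to the parameters of individual operations is omitted here, and the file name is hypothetical.

# Sketch: count validation constraints in the <types> schema of a WSDL.
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"
CONTROL_ATTRS = {"fixed", "nillable", "maxOccurs", "minOccurs"}

def constraint_count(wsdl_path):
    root = ET.parse(wsdl_path).getroot()
    count = 0
    # non-default control attributes on element declarations
    for elem in root.iter(XS + "element"):
        count += sum(1 for a in elem.attrib if a in CONTROL_ATTRS)
    # each facet inside an <xs:restriction> (pattern, length, ...) counts once
    for restriction in root.iter(XS + "restriction"):
        count += len(list(restriction))
    return count

print(constraint_count("OnlineBooking.wsdl"))  # 2 for the fragment shown in Fig. 1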

C. Data Granularity

A WSDL document normally describes the detail of the data elements, exchanged by a service capability, using the XML schema in its <types> tag.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" ... >
  <owl:Ontology />
  <owl:ObjectProperty rdf:ID="part"/>
  ...
  <owl:Class rdf:ID="OrderManagement" />
  ...
  <owl:Class rdf:ID="ProductInfo" />
  <owl:Class rdf:ID="HotelInfo">
    <rdfs:subClassOf rdf:resource="#ProductInfo" />
  </owl:Class>
  ...
  <owl:Class rdf:ID="ProductName">
    <rdfs:subClassOf rdf:resource="#Name" />
  </owl:Class>
  <owl:FunctionalProperty rdf:ID="hasProductID">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo" />
    <rdfs:range rdf:resource="#ID" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductName">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo" />
    <rdfs:range rdf:resource="#ProductName" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductPrice">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo" />
    <rdfs:range rdf:resource="#Price" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasProductType">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#ProductInfo" />
    <rdfs:range rdf:resource="#Type" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  ...
  <owl:Class rdf:ID="SearchProductDetail" />
  <owl:Class rdf:ID="SearchProductInfo">
    <rdfs:subClassOf rdf:resource="#SearchProductDetail" />
  </owl:Class>
  <owl:Class rdf:ID="SearchRelatedProductInfo">
    <rdfs:subClassOf rdf:resource="#SearchProductDetail" />
  </owl:Class>
  <owl:Class rdf:ID="GetProductUpdate" />
  <owl:Class rdf:ID="GetProductPriceUpdate" />
  <owl:FunctionalProperty rdf:ID="hasGetProductUpdate">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#SearchProductDetail" />
    <rdfs:range rdf:resource="#GetProductUpdate" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  <owl:FunctionalProperty rdf:ID="hasGetProductPriceUpdate">
    <rdfs:subPropertyOf rdf:resource="#part"/>
    <rdfs:domain rdf:resource="#SearchProductDetail" />
    <rdfs:range rdf:resource="#GetProductPriceUpdate" />
    <rdf:type rdf:resource="&owl;ObjectProperty" />
  </owl:FunctionalProperty>
  ...
</rdf:RDF>


With semantic annotation to a data element, semantic detail is additionally described. If the semantic term is defined in a class-subclass relation (i.e., it has subclasses), then the term will transfer its generalization, encapsulating several specialized concepts, to the data element that it annotates. If the semantic term is defined in a whole-part relation (i.e., it has parts), it will transfer its whole concept, encapsulating different parts, to the data element that it annotates.

For a data element with no sub-elements (i.e., lowest-level element), we determine its granularity DGLE by its class-subclass and whole-part relations. For whole-part, if the element has an associated whole-part semantics, we determine the parts from the semantic term; otherwise the part is 1, denoting the lowest-level element itself (see (3)). For a data element with sub-elements, we compute its granularity DGE by a summation of the data granularity of all its immediate sub-elements DGSE together with the semantic granularity of the element itself (see (4)). Note that (4) is recursive. Finally, for data granularity Do of a capability o, we compute a summation of data granularity of all parameter elements (see (5)).

DGLE = acp + max(1, app)  (3)

DGE = \sum_{j=1}^{m} DGSE_j + acp + app  (4)

Do = \sum_{i=1}^{n} DGE_i  (5)

where n = the number of parameters of the operation o,

DGE = data granularity of an element with sub-elements/attributes,

m = the number of sub-elements/attributes of an element,

DGSE = data granularity of an immediate sub-element/attribute of an element,

DGLE = data granularity of a lowest-level element/attribute,

acp = semantic granularity in the class-subclass relation of an element/attribute, computed by (1), and

app = semantic granularity in the whole-part property relation of an element/attribute, computed by (1).

In Fig. 1, the input viewProductReq of the operation viewProduct has no sub-elements or semantic annotation, so its granularity as a DGLE is 1 (0 + max(1, 0)). In contrast, the output viewProductRes is of type productInfo, which is also annotated with the ontology term ProductInfo. From the schema in Fig. 1, this output has four sub-elements (productName, productType, description, unitPrice). Each sub-element has no further sub-elements or semantic annotation, so its granularity as a DGLE is 1 as well. In Fig. 3, the semantic term ProductInfo has three direct subclasses and three indirect subclasses as well as four parts. The granularity of the output data viewProductRes as a DGE would be 16 (i.e., (1+1+1+1)+7+5). Therefore the data granularity Do of the operation viewProduct is 17 (1+16).

D. Capability Granularity

The functional scope of a service capability can be derived from data granularity and semantic annotation. If large data are exchanged by the capability, it can be inferred that the capability involves a big task in the processing of such data. We can additionally infer that the capability is broad in scope if its semantics involves other specialized functions (i.e., having a class-subclass relation) or other sub-tasks (i.e., having a whole-part relation). Capability granularity Co of a capability o is then computed by (6):

Co = Do + aco + apo  (6)

where Do = data granularity of the operation o,

aco = semantic granularity in the class-subclass relation of the operation o, computed by (1), and

apo = semantic granularity in the whole-part property relation of the operation o, computed by (1).

From the previous calculation, data granularity of the operation viewProduct in Fig. 1 is 17. This operation is annotated with the semantic term SearchProductDetail. In Fig. 2, this semantic term is a generalization of two concepts SearchProductInfo and SearchRelatedProductInfo, so the capability viewProduct encapsulates these two specialized tasks. The semantic term SearchProductDetail also comprises two sub-tasks GetProductUpdate and GetProductPriceUpdate in a whole-part relation. Therefore capability granularity of viewProduct is 23 (17+3+3).

E. Service Granularity

The functional scope of a service is determined by all of its capabilities together with semantic annotation which would describe the scope of use of the service semantically. Service granularity Sw of a service w is computed by (7):

Sw = \sum_{i=1}^{k} Co_i + acw + apw  (7)

where k = the number of operations of the service w,

Co = capability granularity of an operation o,

acw = semantic granularity in the class-subclass relation of the service w, computed by (1), and

apw = semantic granularity in the whole-part property relation of the service w, computed by (1).


In Fig. 1, the online booking service is associated with the semantic term OrderManagement. Suppose the term OrderManagement has no subclasses but comprises eight concepts (i.e., parts) in a whole-part property relation. So its service granularity is the summation of capability granularity of the operation viewProduct (i.e., 23), capability granularity of all other operations, and semantic granularity in class-subclass and whole-part property relations (i.e., 1+9).

It is seen from the granularity measurement model that semantic annotation helps complement granularity measurement. For the case of the operation viewProduct, for example, the granularity of its capability can only be inferred from the granularity of its data if the operation has no semantic annotation. However, by annotating this operation with the generalized term SearchProductDetail, we gain knowledge about its broad scope such that its capability encapsulates both specialized SearchProductInfo and SearchRelatedProductInfo tasks. The additional information refines the measurement.
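To make the chain from (3) to (7) concrete, here is a minimal Python sketch that reproduces the viewProduct numbers from the text. Elements are modelled as small dictionaries, and the semantic granularities (ac, ap) are filled in by hand rather than derived from the WSDL and ontology, so it is an illustration of the formulas only.

# Sketch of (3)-(6) on the viewProduct example.
def dg(element):
    """Data granularity of an element: (3) for a leaf, (4) otherwise."""
    ac, ap, subs = element["ac"], element["ap"], element["subs"]
    if not subs:
        return ac + max(1, ap)                  # (3)
    return sum(dg(s) for s in subs) + ac + ap   # (4)

def data_granularity(operation):                # (5)
    return sum(dg(p) for p in operation["params"])

def capability_granularity(operation):          # (6)
    return data_granularity(operation) + operation["ac"] + operation["ap"]

leaf = {"ac": 0, "ap": 0, "subs": []}
view_product = {
    "ac": 3,   # SearchProductDetail plus its 2 subclasses
    "ap": 3,   # SearchProductDetail plus its 2 parts
    "params": [
        dict(leaf),                                                   # viewProductReq
        {"ac": 7, "ap": 5, "subs": [dict(leaf) for _ in range(4)]},   # viewProductRes
    ],
}
print(data_granularity(view_product))        # 17
print(capability_granularity(view_product))  # 23

# Service granularity (7) would sum the capability granularities of all
# operations and add ac_w + ap_w of the service itself (1 + 9 for the
# OrderManagement annotation discussed in the text).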

V. REUSABILITY AND COMPOSABILITY MEASUREMENT MODELS

As mentioned in Section I, reusability is the ability to express agnostic logic and be positioned as a reusable enterprise resource, whereas composability is the ability to participate in multiple service composition. We see that reusability is concerned with putting a service as a whole to use in different contexts. Composability is seen as a mechanism for reuse but it focuses on assembly of functions, i.e., it touches reuse at the operation level, rather than the service level. We follow the method in [7] to first identify the impact the granularity has on reusability and composability attributes and then derive measurement models for them. Table I presents impact of granularity.

For reusability, a coarse-grained service with a broad functional context providing several functionalities should reuse well, as it can do many tasks serving many purposes. Coarse-grained data exchanged by an operation could be a sign that the operation has a large scope of work and should be good for reuse as well. So we define a positive impact on reusability for coarse-grained data, capabilities, and services. For composability, we focus on the service operation level, and service granularity is not considered. A small operation doing a small task and exchanging small data should be easier to include in a composition, since it does not do too much work or exchange excessive data beyond what different contexts of use may require or can provide. So we define a negative impact on composability for coarse-grained capabilities and data. For constraints on data elements, a bigger number of constraints means finer-grained restrictions are put on the data; they make the data more specific and not easy to reuse, hence a negative impact on both attributes.

TABLE I. IMPACT OF GRANULARITY ON REUSE

Granularity Type         Reusability   Composability
Service Granularity      ↑             -
Capability Granularity   ↑             ↓
Data Granularity         ↑             ↓
Constraint Granularity   ↓             ↓

A. Reusability Model

Reusability measurement is derived from the impact of granularity. It can be seen that different types of granularity measurement relate to each other. That is, service granularity is built on capability granularity which in turn is built on data granularity, and they all have a positive impact. So we consider only service granularity in the model since the effects of data granularity and capability granularity are already part of service granularity. The negative impact of constraint granularity is incorporated in the model (8):

Reusability = Sw − \sum_{i=1}^{k} Ro_i  (8)

where Sw = service granularity of the service w,

Ro = constraint granularity of the operation o, and

k = the number of operations of the service w.

A coarse-grained service with small data constraints has high reusability.

B. Composability Model

In a similar manner, we consider only capability granularity and constraint granularity in the composability model because the effects of data granularity are already part of capability granularity. Since they all have a negative impact, we represent the composability measure with the opposite meaning. We define the term "uncomposability" to represent the inability of a service operation to be composed in a service assembly (9):

Uncomposability = Co + Ro  (9)

where Co = capability granularity of the operation o, and

Ro = constraint granularity of the operation o.

A fine-grained capability with small data constraints has low uncomposability, i.e. high composability.
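Both measures reduce to simple arithmetic once the granularities are known; the sketch below plugs in values that appear later in Tables III and IV, purely as an illustration of (8) and (9).

# Sketch of (8) and (9), using figures reported in Tables III and IV.
def reusability(s_w, r_total):
    # (8): service granularity minus the total constraint granularity of its operations
    return s_w - r_total

def uncomposability(c_o, r_o):
    # (9): capability granularity plus constraint granularity of the operation
    return c_o + r_o

print(reusability(194, 48))    # 146 for OnlineBookingWSService (Table III)
print(uncomposability(18, 4))  # 22 for addProductToCart (Table IV)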

VI. EVALUATION

We apply the measurement models to two Web services. The first one is the online booking Web service which we have used to demonstrate the idea. It is a general service including a large number of small data and operations. Its scope covers viewing, managing, and booking products. Another Web service is an online order service which has only a booking-related function. The two Web services are annotated with semantic terms from the online booking ontology which describes detail about processes and data in the online booking domain. Table II shows details of some operations of the two services including their capabilities, data, and semantic annotation.

For the evaluation, a granularity measurement tool is developed to automatically measure granularity of Web services. It is implemented using Java and Jena [10] which helps with ontology processing and inference of relations.


Table III presents granularity measurements and reusability scores. The online booking service is coarser and has higher reusability. It is a bigger service with wider range of functions, exchanging more data, and having a number of data constraints. It is likely that the online booking service can be put to use in various contexts. On the other hand, the online order service is finer-grained focusing on order management. The two services are annotated with semantic terms of the same ontology, and additional semantic detail helps refine their measurements.

Table IV presents granularity measurements and uncomposability of the operations annotated with the semantic term UpdateOrder. The operation editOrderItem of the online order service has coarser data and capability compared to the three finer-grained operations of the online booking service, and therefore it is less composable.

VII. CONCLUSION

This paper explores the application of semantics-annotated WSDL to measuring design granularity of Web services. Four types of granularity are considered together with semantic granularity. The models for reusability and composability (represented by uncomposability) are also introduced.

As explained in the example, semantic annotation can help us derive the functional contexts and concepts that the service, capability, and data element encapsulate. Granularity measurement which is traditionally done by analyzing the size of capability and data described in standard WSDL and XML schema documents can be refined and better automated.

TABLE II. PART OF SERVICE DETAIL AND SEMANTIC ANNOTATION

Operation (Name / Annotation)                   Input Data Type (Name / Annotation)   Output Data Type (Name / Annotation)

Online booking web service
newCart / InsertOrder                           userId / ID                           orderId / ID
addProductToCart / UpdateOrder                  addProduct / OrderItem                processResult / Status
deleteProductFromCart / UpdateOrder             deleteProduct / OrderItem             processResult / Status
editProductQuantityInCart / UpdateOrder         editProductQuantity / OrderItem       processResult / Status
viewProductInCart / SearchOrderItemByOrderID    orderId / ID                          orderItemList / -
reservation / EditOrder                         reservedOrder / ID                    processResult / Status

Online order web service
createOrder / CreateOrder                       orderRequest / Order                  orderResponse / Status
editOrderItem / UpdateOrder                     editOrderItemInfo / Order             orderItemResponse / Status
submitOrder / EditOrder                         orderId / ID                          orderResponse / Status

TABLE III. GRANULARITY AND REUSABILITY

Service Name              ΣRo   ΣDo   ΣCo   Sw    Reusability (Sw − ΣRo)
OnlineBookingWSService    48    143   184   194   146
OnlineOrderWSService      10    47    62    72    62

TABLE IV. SERVICE GRANULARITY AND UNCOMPOSABILITY OF OPERATIONS ANNOTATED WITH UPDATEORDER

Service Name             Operation Name              Ro   Do   Co   Sw   Uncomposability (Co + Ro)
OnlineBookingWSService   addProductToCart            4    15   18   -    22
                         deleteProductFromCart       3    14   17   -    20
                         editProductQuantityInCart   4    15   18   -    22
OnlineOrderWSService     editOrderItem               3    19   22   -    25

For future work, we aim to refine the domain ontology and WSDL annotation. It would be interesting to see the effect of annotation on granularity, reusability, and composability when the WSDL contains a lot of annotations compared to when it is less annotated. Since annotation can be made to different parts of WSDL, the location of annotations can also affect granularity scores. Additionally we will try the models with Web services in business organizations and extend the models to apply to composite services.

REFERENCES

[1] T. Erl, SOA: Principles of Service Design, Prentice Hall, 2007.

[2] T. Senivongse, N. Phacharintanakul, C. Ngamnitiporn, and M. Tangtrongchit, “A capability granularity analysis on Web service invocations,” in Procs. of World Congress on Engineering and Computer Science 2010 (WCECS 2010), 2010, pp. 400-405.

[3] W3C (2004, February 10) OWL Web Ontology Language Overview [Online]. Available: http://www.w3.org/TR/2004/REC-owl-features-20040210/

[4] W3C (2007, August 28) Semantic Annotations for WSDL and XML Schema [Online]. Available: http://www.w3.org/TR/2007/REC-sawsdl-20070828/

[5] R. Haesen, M. Snoeck, W. Lemahieu and S. Poelmans, “On the definition of service granularity and its architectural impact,” in Procs. of 20th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2008), LNCS 5074, 2008, pp. 375-389.

[6] G. Feuerlicht, “Design of composable services,” in Procs. of 6th Int. Conf. on Service Oriented Computing (ICSOC 2008), LNCS 5472, 2008, pp. 15-27.

[7] B. Shim, S. Choue, S. Kim and S. Park, “A design quality model for service-oriented architecture,” in Procs. of 15th Asia-Pacific Software Engineering Conference (APSEC 2008), 2008, pp. 403-410.

[8] S. Alahmari, E. Zaluska, D. C. De Roure, “A metrics framework for evaluating SOA service granularity,” in Procs. of IEEE Int. Conf. on Service Computing (SCC 2011), 2011, pp. 512-519.

[9] A. Khoshkbarforoushha, P. Jamshidi, F. Shams, “A metric for composite service reusability analysis,” in Procs. of the 2010 ICSE Workshop on Emerging Trends in Software Metrics (WETSoM 2010), 2010, pp. 67-74.

[10] Apache Jena [online]. Available: http://incubator.apache.org/jena/, Last accessed: January 30, 2012.


Decomposing ontology in Description Logics by graph partitioning

Thi Anh Le PHAM
Faculty of Information Technology
Hanoi National University of Education
Hanoi, Vietnam
[email protected]

Nhan LE-THANH
Laboratory I3S
Nice Sophia-Antipolis University
Nice, France
[email protected]

Minh Quang NGUYEN
Faculty of Information Technology
Hanoi National University of Education
Hanoi, Vietnam
[email protected]

Abstract— In this paper, we investigate the problem of decomposing an ontology in Description Logics (DLs) based on graph partitioning algorithms. We focus on the syntactic features of the axioms in a given ontology. Our approach aims at decomposing the ontology into several sub-ontologies that are as distinct as possible. We analyze the algorithms and exploit parameters of the partitioning that influence the efficiency of computation and reasoning. These parameters are the number of concepts and roles shared by a pair of sub-ontologies, the size (the number of axioms) of each sub-ontology, and the topology of the decomposition. We provide two concrete approaches for automatically decomposing the ontology: one is minimal-separator-based partitioning, and the other is eigenvector- and eigenvalue-based segmenting. We have also tested our methods on parts of the TBoxes used in the systems FaCT, Vedaall, tambis, ..., and present estimated results.

Keywords- graph partitioning; ontology decomposition; image segmentation

I. INTRODUCTION

Previous studies on DL-based ontologies focus on tasks such as ontology design, ontology integration, and ontology deployment. Starting from the fact that one wants to work effectively with a large ontology, instead of ontology integration we examine ontology decomposition. There have been some investigations into the decomposition of DL ontologies, such as decomposition-based module extraction [3] or decomposition based on the syntactic structure of the ontology [1].

The previous paper [8] presented executions under the supposition that there exists an ontology (TBox) decomposition called overlap decomposition. This decomposition preserves the semantics and the inference results with respect to the original TBox. Our aim is to establish the theoretical foundations for decomposition methods that improve the efficiency of reasoning and guarantee the properties proposed in [7]. The automatic decomposition of a given ontology is an optimal step in ontology design, supported by graph theory. Graph theory provides the "good properties" that meet the necessary requirements of our decomposition.

Our computational analysis of reasoning algorithms guides us to suggest the parameters of such a decomposition: the number of concepts and roles included in the semantic mappings between partitions, the size of each component ontology (the number of axioms in each component), and the topology of the decomposition graph. There are two decomposition approaches based on two ways of presenting the ontology. One presents the ontology as a symbol graph and implements decomposition by minimal separators; the other uses an axiom graph, corresponding to the image segmentation method.

The rest of the paper is organized as follows. Section 2 proposes a definition of the G-decomposition methodology, which is based on graphs, and summarizes its principal steps. In this section, we also recall the criteria for a good decomposition. Sections 3 and 4 describe two ways for transforming an ontology into an undirected graph (a symbol graph or a weighted graph), as well as two partitioning algorithms for the obtained graph. Section 5 presents some evaluations of the effects of the decomposition algorithms and experimental results. Finally, we provide some conclusions and future work in Section 6.

II. G-DECOMPOSITION METHODOLOGY

In this paper, ontology decomposition will be considered only at the terminological level (TBox). We investigate methods that decompose a given TBox into several sub-TBoxes. For simplicity, a TBox is briefly presented by its single set of axioms A, so we will represent this set of axioms by a graph.

Our goal is to eliminate general concept inclusions (GCIs), a general type of axiom, as much as possible from a general ontology (presented by a TBox) by decomposing the set of GCIs of this ontology into several subsets of GCIs (presented by a distributed TBox). In this paper, we only consider the syntactic approach based on the structure of the GCIs. We recall some criteria of a good decomposition [8]:

- All the concepts, roles and axioms of the original ontology are kept through the decomposition.


- The numbers of axioms in the sub-TBoxes are equivalent.

As a result, we propose two techniques for decomposition based on graphs. G-decomposition is an ontology decomposition method that applies graph partitioning techniques. The graph decomposition is presented by an intersection graph (decomposition graph) in which each vertex is a sub-graph and the edges represent the connections between each pair of vertices. In [8] we defined an overlap decomposition of a TBox; it is presented by a distributed TBox (decomposed TBox) that consists of a set of sub-TBoxes and a set of links between these sub-TBoxes (semantic mappings). We assume that readers are familiar with basic concepts of graph theory.

Consequently, we propose an ontology decomposition method as a process containing three principal phases (illustrated in Table 1): transform a TBox into a graph, decompose the graph into sub-graphs, and transform these sub-graphs into a distributed TBox. We present a general algorithm of G-decomposition.

TABLE 1. A GENERAL ALGORITHM OF G-DECOMPOSITION

PROCEDURE DECOMP-TBOX (T = (C, R, A))
/* T = (C, R, A) is a TBox with the set of concepts C, the set of roles R, and the set of axioms A */

(1) TRANS-TBOX-GRAPH (T = (C, R, A))
    Build a graph G = (V, E) of this TBox, where each vertex v ∈ V is a concept in C or a role in R (or an axiom in A), and each edge e = (u, v) ∈ E exists if u and v appear in the same axiom (or if u and v have at least one common concept (role)).

(2) DECOMP-GRAPH (G = (V, E))
    Decompose the graph G = (V, E) obtained by TRANS-TBOX-GRAPH into an intersection graph G0 = (V0, E0), where each vertex v ∈ V0 is a sub-graph and each edge e = (u, v) ∈ E0 exists if u and v are linked.

(3) TRANS-GRAPH-TBOX (G0 = (V0, E0))
    Transform the graph G0 = (V0, E0) into a distributed TBox: each vertex (sub-graph) corresponds to a sub-TBox, and the edges of E0 correspond to semantic mappings.

In the next sections, we will introduce the detailed techniques for steps (1) and (2).

III. DECOMPOSITION BASED ON MINIMAL SEPARATOR

A. Symbol graph

Given a set of axioms A of a TBox T, let Ex(A) be the set of concepts and roles that appear in the expressions of A. For simplicity, we use the notation of symbol instead of concept (role), i.e., a symbol is a concept (role) in the TBox. A graph presenting the TBox is defined as follows:

Definition 1 (symbol graph): A graph G = (V, E), where V is a set of vertices and E is a set of edges, is called a symbol graph of T (A) if each vertex v ∈ V is a symbol of Ex(A) and each edge e = (u, v) ∈ E exists if u, v are in the same axiom of A.

So, given a set of axioms A, we can build a symbol graph G = (V, E) by taking each symbol in Ex(A) as a vertex and connecting two vertices by an edge if their symbols are in the same axiom of A. Following this method, each axiom is presented as a clique in the symbol graph.
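A minimal Python sketch of this construction follows. Axioms are given only by their symbol sets Ex(A), and the symbols used are a hypothetical fragment in the spirit of Tfam, since the actual axioms of Tfam appear only in Figure 1.

# Sketch: build a symbol graph; every pair of symbols occurring in the same
# axiom is joined by an edge, so each axiom forms a clique.
from itertools import combinations

def symbol_graph(axioms):
    vertices = set().union(*axioms.values())
    edges = set()
    for symbols in axioms.values():
        edges.update(frozenset(pair) for pair in combinations(sorted(symbols), 2))
    return vertices, edges

# Hypothetical axiom/symbol data (illustrative only).
axioms = {"A1": {"C1", "C2", "X"}, "A3": {"C4", "X", "Y"}, "A5": {"C6", "Y"}}
V, E = symbol_graph(axioms)
print(len(V), len(E))  # 6 vertices, 7 edges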

Example 1: Given a TBox as follows (Figure 1):

Figure 1. TBox Tfam

The set of primitive concepts and roles of Tfam is Ex(Tfam) = {C1, C2, C3, C4, C5, C6, X, Y, T, H}. Figure 2 presents the symbol graph for Tfam.

Figure 2. Symbol graph presenting TBox Tfam

The result of this decomposition is presented by a labeled graph (intersection graph or decomposition graph) Gp = (Vp, Ep). Assume that the graph G representing a TBox T is divided into n sub-graphs Gi, i ≤ n; then a decomposition graph is defined as follows:

Definition 2 (decomposition graph) [4]: A decomposition graph is a labeled graph Gp = (Vp, Ep) in which each vertex v ∈ Vp is a partition (sub-graph) Gi, and each edge eij = (vi, vj) ∈ Ep is marked by the set of symbols shared by Gi and Gj, where i ≠ j and i, j ≤ n.

Definition 3 ((a, b)-minimal vertex separators) [3]: A set of vertices S is called an (a, b)-vertex separator if a, b ∈ V\S and all paths connecting a and b in G pass through at least one vertex in S. If S is an (a, b)-vertex separator and does not contain another (a, b)-vertex separator, then S is an (a, b)-minimal vertex separator.


B. Algorithm

We present a recursive algorithm using Even's algorithm [4] to find the sets of vertices that divide a graph into partitions. It takes a symbol graph G = (V, E) (presenting a TBox) as input and returns a decomposition graph and the set of separate sub-graphs. The idea of the algorithm is to look for a cut of the graph, compute a minimal separator of the graph, and then carve the graph by this separator. Initially, the TBox T is considered as one large partition, and it is cut into two parts in each recursive iteration. The steps of the algorithm are summarized as follows:

TABLE 2. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO A GRAPH

Input: TBox T (A); M, the limit on the number of symbols in a part (a sub-TBox Ti).
Output: Gp = (Vp, Ep), and the Ti.

PROCEDURE DIVISION-TBOX (A, M)
(1) Transform A into a symbol graph G = (V, E) with V = Ex(A), E = {(l1, l2) | ∃A ∈ A, l1, l2 ∈ Ex(A)}.  /* A is an axiom in A */
(2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {V} and Ep = ∅.
(3) Call DIVISION-GRAPH(G, M, nil, nil).
(4) For each v ∈ Vp, let Tv = {A ∈ A | Ex(A) ⊂ v}. Return {Tv, v ∈ Vp} and Gp.

The procedure DIVISION-GRAPH takes as input a symbol graph G = (V, E) of T, a limit parameter M, and two vertices a, b that are initially assigned nil. This procedure updates the global variable Gp, which records the decomposition process. In each recursive call, it finds a minimal separator of vertices a, b in G. If one of a, b is nil, or both are nil, it finds the global minimal separator between all vertices and the non-nil vertex (or between all pairs of vertices). This separator cuts the graph G into two parts G1, G2, and the process continues recursively on these parts.

TABLE 3. AN ALGORITHM FOR PARTITIONING A SYMBOL GRAPH

Input: G = (V, E)
Output: connection graph Gp = (Vp, Ep)

PROCEDURE DIVISION-GRAPH (G, M, a, b)
(1) Find the set of minimal vertex separators of G:
    - Select an arbitrary pair of non-adjacent vertices (a, b) and compute the set of (a, b)-minimal separators.
    - Repeat this process on every pair of non-adjacent vertices x, y.
(2) Find a global minimal vertex separator S* among all vertices of G.
(3) Decompose G by S* into two sub-graphs G1, G2, where S* is included in both G1 and G2.
(4) Generate an undirected graph Gp = (Vp, Ep), where Vp = {G1, G2} and Ep = {S*}.

A method for listing all (a, b) - minimal vertex separators of a pair of non-adjacent vertices by a best-first search technique can be found in [6].
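The recursion of DIVISION-GRAPH can be sketched in Python as follows. This is only an approximation under stated assumptions: it uses networkx.minimum_node_cut as a stand-in for the minimal-separator enumeration of [6], keeps the separator in both halves as in step (3), and stops when a part has at most M symbols or is a complete graph.

import networkx as nx
from itertools import combinations

def divide_graph(G, M):
    """Recursively carve a symbol graph with small vertex separators."""
    if G.number_of_nodes() <= M:
        return [set(G.nodes)]
    if not nx.is_connected(G):
        return [p for c in nx.connected_components(G)
                for p in divide_graph(G.subgraph(c).copy(), M)]
    best = None
    for a, b in combinations(G.nodes, 2):      # every non-adjacent pair (a, b)
        if not G.has_edge(a, b):
            cut = nx.minimum_node_cut(G, a, b)
            if best is None or len(cut) < len(best):
                best = cut
    if not best:                               # complete graph: nothing to separate
        return [set(G.nodes)]
    rest = G.copy()
    rest.remove_nodes_from(best)               # removing the separator disconnects G
    comps = list(nx.connected_components(rest))
    halves = [comps[0] | best, set().union(*comps[1:]) | best]
    return [p for h in halves for p in divide_graph(G.subgraph(h).copy(), M)]

# e.g. parts = divide_graph(nx.Graph(list(E)), M=4), with E built as in the earlier sketch.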

Tfam (Figure 1) can be represented by an undirected graph (Figure 2), where the vertices correspond to the symbols and an edge connects two vertices whose symbols occur in the same axiom. Therefore each axiom is represented as a clique.

Figure 3. Decomposition result of the symbol graph of Tfam with minimal separators S* = {X} and S*' = {Y}

If the criterion is to balance the number of TBox axioms between components, then S* = {X} and S*' = {Y}. Using S* and S*' to decompose the symbol graph, we obtain three symbol groups {C1, C2, C3, X}, {C4, C5, X, Y} and {C6, H, Y, T}. So we get three corresponding TBoxes: T1 = {A1, A2, A7, A8}, T2 = {A3, A4, A9, A10} and T3 = {A5, A6}. The number of symbols of S* and S*' is 1 (|S*| = |{X}| = 1, |S*'| = |{Y}| = 1). The cardinalities of the three TBoxes are respectively N1 = 4, N2 = 4, N3 = 2. In this case, the number of symbols in each sub-TBox is also equal.

The symbol graph of Tfam after decomposition is shown in Figure 3. The resulting TBoxes T1, T2 and T3 preserve all the concepts, roles and axioms of the original Tfam. In addition, T1 and T2 satisfy the proposed decomposition criteria.

We have implemented the graph partitioning algorithm based on minimal separators. This method returns a result that satisfies the given properties: all concepts, roles, and axioms are preserved through the decomposition, and the relations between them are represented by the edges of the symbol intersection graph. The method minimizes the symbols shared between component TBoxes, ensuring the independence of the sub-TBoxes. However, to obtain the resulting TBoxes, the obtained graphs must be transferred back into sets of axioms for the corresponding TBoxes.

IV. DECOMPOSITION BASED ON NORMALIZED CUT

A. Axiom graph

In this section, we propose another decomposition technique based on an axiom graph, which is defined as follows:

Definition 4 (axiom graph): A weighted undirected graph G = (V, E), where V is a set of vertices and E is a set of edges with weight values, is called an axiom graph if each vertex v ∈ V is an axiom of the TBox T, each edge e = (u, v) ∈ E exists if u, v ∈ V and there is at least one shared symbol between u and v, and the weight of e(u, v) is a value representing the similarity between the vertices u and v.

By using only the common symbols between each pair of axioms, we can simply define a weight function p: V × V → R that sends a pair of vertices to a real number. In particular, each edge (i, j) is assigned a value wij describing the connection (association) between axioms Ai and Aj as wij = nij / (ni + nj), where i, j = 1, .., m, i ≠ j, m is the number of axioms in T (m = |A|), ni and nj are the numbers of symbols of Ai and Aj respectively, and nij is the number of symbols shared by Ai and Aj (nij = |Ai ∩ Aj|).
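A minimal Python sketch of this weighting, assuming each axiom is again given by its symbol set:

def axiom_graph(axioms):
    """Axiom graph: one vertex per axiom; edge (i, j) weighted by
    w_ij = n_ij / (n_i + n_j), the number of shared symbols over the
    total symbol counts of the two axioms (no shared symbol, no edge)."""
    weights = {}
    m = len(axioms)
    for i in range(m):
        for j in range(i + 1, m):
            shared = len(axioms[i] & axioms[j])
            if shared:
                weights[(i, j)] = shared / (len(axioms[i]) + len(axioms[j]))
    return weights

print(axiom_graph([{"C1", "C2", "X"}, {"C3", "X"}, {"C4", "C5", "X", "Y"}]))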

B. Normalized cut

The ontology decomposition algorithm based on image segmentation is a grouping method using eigenvectors. Assume that G = (V, E) is divided into two separate sets A and B (A ∪ B = V and A ∩ B = ∅) by removing the edges connecting the two parts from the original graph. The degree of association between these parts is the total weight of the removed edges; in the language of graph theory, this is called the cut:

cut(A, B) = Σ_{u ∈ A, v ∈ B} w(u, v)    (1)

i.e., the total weight of the connections between the nodes of A and the nodes of B. An optimal decomposition of the graph not only minimizes this disassociation but also maximizes the association within each partition. In addition, the normalized cut (NCut) is used to measure disassociation, defined as follows:

Ncut(A, B) = cut(A, B) / assoc(A, V) + cut(A, B) / assoc(B, V)    (2)

where assoc(A, V) = Σ_{u ∈ A, t ∈ V} w(u, t) is the total weight of the connections from the nodes of A to all nodes of V. Similarly, the normalized association Nassoc is defined by:

Nassoc(A, B) = assoc(A, A) / assoc(A, V) + assoc(B, B) / assoc(B, V)    (3)

where assoc(A, A) and assoc(B, B) are the total weights of the edges within A and within B respectively. The optimal division of the graph thus reduces to minimizing NCut while simultaneously maximizing Nassoc in the partitions.

It is easy to see that Ncut(A, B) = 2 − Nassoc(A, B). This is an important property of the decomposition: the two criteria obtained from (2) and (3), minimizing the disassociation between the parts and maximizing the association within each part, are in fact equivalent and can be satisfied simultaneously.

Unfortunately, minimizing the normalized cut exactly is NP-complete, even for the particular case of graphs on grids. However, the authors of [5] also showed that if the normalized cut problem is relaxed to the real-valued domain, an approximate solution can be found efficiently.

C. Algorithm

Given an N-dimensional indicator vector x, N = |V|, where xi = 1 if node i is in A and xi = −1 if node i is not in A. Let di = Σj w(i, j) be the total weight of the connections from node i to all other nodes. Let D be the N × N diagonal matrix with d on its main diagonal, and let W be the symmetric N × N matrix with W(i, j) = wij.

Ontology decomposition algorithm based on image segmentation consists of the following steps:

1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V}, where w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|.

2) Find the minimum value of NCut by solving (D − W)x = λDx for the eigenvectors corresponding to the smallest eigenvalues.

3) Use the eigenvector associated with the second smallest eigenvalue to decompose the graph into two parts. In the ideal case, this eigenvector takes only two distinct values, and the signs of its components indicate how to decompose the graph.

4) Apply the algorithm recursively to the two decomposed subgraphs.
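A single bisection step of this procedure can be sketched with numpy/scipy as below. This is a generic normalized-cut split in the spirit of Shi and Malik [5], not the authors' implementation; it assumes every axiom shares at least one symbol with some other axiom, so that D is nonsingular.

import numpy as np
from scipy.linalg import eigh

def ncut_split(W):
    """Solve (D - W) x = lambda * D x and split the nodes by the sign of the
    eigenvector belonging to the second smallest eigenvalue."""
    d = W.sum(axis=1)
    D = np.diag(d)                      # requires d > 0 everywhere
    vals, vecs = eigh(D - W, D)         # generalized symmetric eigenproblem, ascending order
    fiedler = vecs[:, 1]                # eigenvector of the second smallest eigenvalue
    return np.where(fiedler >= 0)[0], np.where(fiedler < 0)[0]

# Tiny example: two pairs of axioms that are only weakly connected to each other.
W = np.array([[0.0, 0.9, 0.1, 0.0],
              [0.9, 0.0, 0.0, 0.1],
              [0.1, 0.0, 0.0, 0.8],
              [0.0, 0.1, 0.8, 0.0]])
print(ncut_split(W))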

The decomposition algorithm of a TBox based on the normalized cut [5] is illustrated by the procedure DIVISION-TBOX-NC. This procedure takes a TBox T with its set of axioms A as input. It transforms A into an axiom graph G = (V, E), where each axiom Ai of A is a vertex i ∈ V and each edge (i, j) ∈ E is assigned the weight w(i, j) = |Ex(Ai) ∩ Ex(Aj)| / |Ex(Ai) ∪ Ex(Aj)|. Then, the process is performed as in the procedure DIVISION-TBOX.

DIVISION-TBOX-NC uses the procedure DIVISION-GRAPH-A to divide the axiom graph representing T. This procedure takes the axiom graph G as input and computes the matrices W and D. W is the N × N weight matrix with w(i, j) computed as above, and D is the N × N diagonal matrix with the values d(i) = Σj w(i, j) on its diagonal. Then we solve the equation (D − W)y = λDy with the constraints yT De = 0 and yi ∈ {2, −2b}, where e is an N × 1 vector of all ones, to find the smallest eigenvalues. The second smallest eigenvalue is chosen; it is the minimal value of NCut. We take the eigenvector corresponding to this eigenvalue to divide G into two parts G1, G2. Finally, DIVISION-GRAPH-A updates the variable Gp as in the method based on the minimal separator. This procedure can be performed recursively: in each recursive call on Gi, it finds the eigenvector with the second smallest eigenvalue and the process continues on Gi.

TABLE 4. AN ALGORITHM FOR TRANSFORMING THE TBOX INTO AN AXIOM GRAPH

Input: the TBox T with the set of axioms A
Output: the decomposition graph Gp = (Vp, Ep) and the Ti.
PROCEDURE DIVISION-TBOX-NC(A)
(1) Transform the set of axioms A into an axiom graph G = (V, E) with V = {v | v ∈ A} and E = {(u, v) | u, v ∈ V}, where w(u, v) = |Ex(u) ∩ Ex(v)| / |Ex(u) ∪ Ex(v)|.
(2) Let Gp = (Vp, Ep) be an undirected graph with Vp = {V} and Ep = ∅.
(3) Execute DIVISION-GRAPH-A(G = (V, E)).
(4) For each v ∈ Vp, take Tv = {A ∈ A | A ∈ v}. Return {Tv | v ∈ Vp} and Gp.

TABLE 5. AN ALGORITHM FOR DECOMPOSING THE AXIOM GRAPH

Input: the axiom graph G = (V, E)
Output: the decomposition graph Gp = (Vp, Ep)
PROCEDURE DIVISION-GRAPH-A(G(V, E))
(1) Find the minimal value of NCut by solving the equation (D − W)x = λDx for the eigenvectors with the smallest eigenvalues.
(2) Use the eigenvector with the second smallest eigenvalue to decompose the graph into two sub-graphs G1, G2.
(3) Let Vp ← Vp \ {V} ∪ {V1, V2} and Ep ← Ep ∪ {(V1, V2)}. The edges connecting to V are changed into links to one of V1, V2.
(4) After the graph is divided into two parts, the procedure can be applied recursively: DIVISION-GRAPH-A(G1), DIVISION-GRAPH-A(G2).

V. EXPERIMENT AND EVALUATION

Two graph decomposition algorithms, based on the minimal separator and on image segmentation of the ontology, have been implemented; both return results that satisfy the decomposition criteria.

The first algorithm minimizes the number of shared symbols (|S*| → minimum) and attempts to balance the number of axioms between parts. After the decomposition, the axioms in the obtained graph are identified as cliques. However, some cliques do not in fact correspond to any axiom, so we need a mechanism to determine the axioms.

The second algorithm has the advantage of preserving axioms. Through the value of NCut, we can measure the independence between parts and the dependence between the elements within every part. However, to implement an efficient algorithm, a weighting function for the edges connecting the nodes of the axiom graph must be provided.

The selection of the decomposition algorithm depends on the structure of the original ontology. For example, the second algorithm is better suited to an ontology expressed with many symbols, while the first algorithm is more suitable for an ontology consisting of many axioms.

We applied the two decomposition algorithms based on the minimal separator and on the normalized cut to divide a TBox. In this section, we summarize the principal modules implemented in our experiments. To illustrate our results, we take a TBox extracted from the file "tambis.xml" of the FaCT system. This TBox, called Tambis1, consists of 30 axioms.

- Transform the ontology into a symbol graph: This module reads a file representing a TBox in XML and transforms it into a symbol graph. Figure 4 shows the symbol graph of the Tambis1 TBox with vertices labeled by concept and role names.

- Transform the ontology into an axiom graph: This module performs the same function as the one above but produces an axiom graph. Figure 5 shows the axiom graph of the Tambis1 TBox with vertices labeled by the symbols Ai (i = 0, .., 29).

- Decompose the ontology based on the minimal separator: this module decomposes the symbol graph into a tree whose leaf nodes are axioms. Figure 6 presents this decomposition for Tambis1.

- Decompose the ontology based on the normalized cut: this module decomposes the axiom graph into a tree whose leaf nodes are axioms. Figure 7 presents this decomposition for Tambis1.

These two methods return results that satisfy the proposed properties of our decomposition. All concepts, roles and axioms are preserved after the decomposition. The axioms and their relationships are well expressed by the symbol graph and the axiom graph, and the set of axioms of the original TBox is distributed evenly among the sub-TBoxes. The decomposition techniques focus on finding a good decomposition: the method based on the minimal separator minimizes the number of symbols shared between the components and tries to equalize the number of axioms in these parts. We need to recover the axioms after decomposing; this is possible because the axioms are encoded as cliques in the symbol graph. In practice, however, the difficulty is that some cliques of the symbol graph and of the intersection graph do not correspond exactly to axioms.

The main advantage of the decomposition method based on the normalized cut is that it preserves the axioms: after decomposing, we can directly find the axioms in the components. Furthermore, the NCut measure is normalized; it expresses the disassociation between the different parts and the association within each part of the decomposition. However, the effectiveness of this method depends on the choice of appropriate parameters for computing the similarity between two axioms.

We tested the methods on TBoxes from the FaCT system, such as Vedaall, modkit, people, platt, and tambis. The results show that for axioms whose expressions are more complex, the normalized cut method is much more effective (e.g., Vedaall, modkit), while the minimal separator method performs better on simple axioms (e.g., platt, tambis).

VI. CONCLUSION

In this paper we have presented two techniques for decomposing ontologies in Description Logics (at the TBox level). Our decomposition methods aim to reduce the number of GCIs [8], one of the main factors contributing to the complexity of reasoning algorithms. The TBox separation method based on the minimal separator considers axioms only at the syntactic level. We examine the simplest case, in which concept and role atoms are treated as equivalent symbols in the axioms. In reality, however, they have different meanings. For example, the concept descriptions C ⊔ D and C ⊓ D are represented by the same symbol graph, although their meanings are different. Therefore, we will continue developing ontology separation methods that take into account the dependence between symbols, based on the linking elements and the semantics of the axioms. Besides, we will examine query processing on decomposed ontologies.

REFERENCES

[1] Boris Konev, Carsten Lutz, Denis Ponomaryov, Frank

Wolter, Decomposing Description Logic Ontologies, KR 2010.

[2] Chiara Del Vescovo, Damian D.G.Gessler, Pavel Klinov, Bijan Parsia, Decomposition and modular structure of BioPortal Ontologies, In proceedings of the 10th International Semantic Web Conference, Bonn, Germany, October 23-27, 2011.

[3] Dieter Jungnickel, Graphs, Networks and Algorithms. Springer, 1999.

[4] Eyal Amir and Sheila McIlraith, Partition-Based Logical Reasoning for First-Order and Propositional Theories. Artificial Intelligence, Volume 162, February 2005, pp.49-88.

[5] Jianbo Shi and Jitendra Malik, Normalized cuts and Image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888-905, August 2000.

[6] Kirill Shoikhet and Dan Geiger, Finding optimal triangulations via minimal vertex separators. In Proceedings of the 3rd International Conference, p. 270-281, Cambridge, MA, October 1992.

[7] Thi Anh Le PHAM and Nhan LE-THANH, Some approaches of ontology decomposing in Description Logics. In Proceedings of the 14th ISPE International Conference on Concurrent Engineering: Research and Applications, p.543-542, Brazil, July 2007.

[8] Thi Anh Le Pham, Nhan Le-Thanh, Peter Sander, Decomposition-based reasoning for large knowledge bases in description logics. Integrated Computer Aided Engineering (2008), Volume: 15, Issue: 1, Pages: 53-70.

[9] T. Kloks and D. Kratsch, Listing all minimal separators of a graph. In Proceedings of the 11th Annual Symposium on Theoretical Aspects of Computer Science, Springer, Lecture Notes in Computer Science, 775, pp. 759-768.

Figure 4. Symbol graph of Tambis1

Figure 5. Axiom graph of Tambis1

Figure 6. Decomposition graph based on minimal separator of Tambis1

Figure 7. Decomposition graph based on normalized cut of Tambis1


An Ontological Analysis of Common Research Interest for Researchers

Nawarat Kamsiang and Twittie Senivongse Computer Science Program, Department of Computer Engineering

Faculty of Engineering, Chulalongkorn University Bangkok, Thailand

[email protected], [email protected]

Abstract—This paper explores a methodology and develops a tool to analyze common research interest for researchers. The analysis can be useful for researchers in establishing further collaboration, as it can help to identify the areas and degree of interest that any two researchers share. Using keywords from the publications indexed by ISI Web of Knowledge, we build ontological research profiles for the researchers. Our methodology builds upon existing approaches to ontology building and ontology matching in which the comparison between research profiles is based on name similarity and linguistic similarity between the terms in the two profiles. In addition, we add the concept of depth weights to ontology matching. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. We argue that more attention should be paid to matched pairs located near the bottom of the ontologies than to matched pairs near the root, since they represent more specialized areas of interest. A comparison between our methodology and an existing ontology matching approach, using the OAEI 2011 benchmark, shows that the concept of depth weights gives better precision but lower recall.

Keywords-ontology building; ontology matching; profile matching

I. INTRODUCTION

Internet technology has become a major tool that enriches the way people interact, express ideas, and share knowledge. Through different means such as personal Web sites, social networking applications, blogging, and discussion boards, people express their opinions, interest, and knowledge in particular matters from which a connection or relationship can be drawn. A community of practice [1] can also be formed among a group of people who share common interest or a profession so that they can learn from and collaborate with each other. For academics and researchers, it is useful to know who do what as well as who share common interest for the purpose of potential research collaboration.

Different approaches have been taken to draw association between researchers. One is the use of bibliometrics to evaluate research activities, performance of researchers, and research specialization [2]. It is based on the enumeration and statistical analysis of scientific output such as publications, citations, and patents. Main bibliometric indicators are activity indicators and relational indicators. Activity indicators include the number of papers/patents, the number of citations, and the

number of co-signers indicating cooperation at national and international level. Relational indicators include, for example, co-publication which indicates cooperation between institutions, co-citation which indicates the impact of two papers that are cited together, and scientific links measured by citations which traces who cites whom and who is cited by whom in order to trace the influence between different communities. Another common approach is analysis of research profiles. Such profiles can be constructed by gathering or mining information from electronic sources such as Web sites, publications, blogs, personal and research project documents etc. It is followed by discovering researcher expertise as well as semantic correspondences between researcher profiles.

We are interested in the latter approach and use ontology as a means to describe research profiles. The idea that we explore is building ontological research profiles and using an ontology matching algorithm to compare similarity between profiles. To build an ontological research profile, we obtain keywords from the researcher’s publications that are indexed by ISI Web of Knowledge [3] and apply the Obtaining Domain Ontology (ODO) algorithm by An et al. [4] to build an ontology of terms that are related to the keywords. Terms in the profile are discovered by using WordNet lexical database [5]. To compare two research profiles, we adopt an algorithm called Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7]. The algorithm considers name similarity and linguistic similarity between terms in the profiles. In addition, we add the concept of depth weights to ontology matching. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. The motivation behind this is that we would pay more attention to a similar matched pair that are located near the bottom of the ontologies than to the matched pair that are near the root, since the terms at the bottom are considered more specialized areas of interest. A comparison between our methodology and MLMA+ is conducted using OAEI 2011 benchmark [8].

Section II of this paper discusses related work. Section III describes the algorithm for building ontological research profiles and a supporting tool. Section IV describes matching of the profiles. An evaluation of the methodology is presented in Section V, and the paper concludes in Section VI with a future outlook.

II. RELATED WORK

Many researches analyze vast pools of information to find people with particular expertise, connection between these people, and shared interest among people. Some of these apply to research and academia. Tang et al. [9] present ArnetMiner which can automatically extract researcher profiles from the Web and integrate the profiles with publication information crawled from several digital libraries. The schema of the profiles is an extension of Friend-of-a-Friend (FOAF) ontology. They model the academic network using an author-conference-topic model to support search for expertise authors, expertise papers, and expertise conferences for a given query. Zhang et al. [10] construct an expertise network from posting-replying threads in an online community. A user’s expertise level can be inferred by the number of replies the user has posted to help others and whom the user has helped. Punnarut and Sriharee [11] use publication and research project information from Thai conferences and research council to build ontological research profiles of researchers. They use ACM computing classification system as a basis for expertise scoring, matching, and ranking. Trigo [12] extracts researcher information from Web pages of research units and publications from the online database DBLP. Text mining is used to find terms that represent each publication and then similarity between researchers with regard to their publications is computed. For further visualization of data, clustering and social network analysis are applied. Yang et al. [13] analyze personal interest in the online profile of a researcher and metadata of publications such as keywords, conference themes, and co-authors of the papers. By measuring similarity between such researcher data, social network of a researcher is constructed.

It is seen that in the approaches above, various mining techniques are used in extracting information and discovering knowledge about researchers and their relationships, and the major source of researcher information is bibliographic information in online libraries. We are interested in trying a different and more lightweight approach to finding similar interest between researchers and their degree of similarity. We focus on using an ontology building algorithm to create research profiles followed by an ontology matching algorithm to find similarity between the profiles.

III. BUILDING RESEARCH PROFILES

In this section and the next, we describe our methodology together with a supporting tool that has been developed. The first part of the methodology is building research profiles for researchers. Like other related work, keywords from researchers’ publications are used to represent research interest.

A. Researcher Information

We retrieve researchers' publication information for a ten-year period (2002-2011), i.e., author names, keywords, subject area, and year published, from ISI Web of Knowledge [3] and store it in a MySQL database for processing by a Web-based tool developed in PHP. Using the tool (Fig. 1), we can specify a pair of authors, a subject area, and the year published, and the tool retrieves the corresponding keywords from the database. The tool lists the keywords by frequency of occurrence, and from the list we can select the ones that will be used for building the profiles. In the figure, we use an example of two authors named B. Kijsirikul and C. A. Ratanamahatana under the Computer Science area. The five top keywords are selected as starting terms for building their profiles.

B. Research Profile Building Algorithm

In this step, we build a research profile as an ontology. We follow the Obtaining Domain Ontology (ODO) algorithm proposed by An et al. [4], since it is intuitive and can automatically derive a domain-specific ontology from any items of descriptive information (i.e., keywords, in this case). The general idea is to augment the starting keywords with terms and hypernym (i.e., parent) relations from WordNet [5] to construct ontology fragments as directed acyclic graphs. The iterative process of weaving in WordNet terms and joining identical terms ties the ontology fragments into one ontology representing research interest. Fig. 2 and Fig. 3 are Kijsirikul's and Ratanamahatana's profiles built from their top five keywords. The steps of the algorithm, and the enhancements we make to tailor it to ISI keywords, are as follows.

1) Select starting keywords: Select keywords as starting terms. For Kijsirikul, they are Dimensionality reduction, Semi-supervised learning, Transductive learning, Spectral methods, and Manifold learning. For Ratanamahatana, they are Kolmogorov complexity, dynamic time warping, parameter-free data mining, anomaly detection, and clustering.

Figure 1. Specifying authors for profile building.


Figure 2. Kijsirikul’s ontological profile.

Figure 3. Ratanamahatana’s ontological profile.

2) Find hypernyms in WordNet: For each term, look it up in WordNet for its hypernyms. Since a term may have several hypernyms, for simplicity, we select one with the maximum tag count which denote that the hypernym of a particular sense (or meaning) is most frequently used and tagged in various semantic concordance texts. In Fig. 3, the starting term clustering has the term agglomeration as its hypernym.

If the term does not exist in WordNet but may be in a plural form (i.e., it ends with “ches”, “shes”, “sses”, “ies”, “ses”, “xes”, “zes”, or “s”), change to a singular form before looking up for hypernyms again. It is possible that one starting keyword may be found to be a hypernym of another. It is also possible that no hypernym is found for the term. If so, follow step 3).

3) Define hypernyms: If the term does not exist in WordNet, do any of the following.

a) Use subject area as hypernym: If the term is a single word or an acronym, use the subject area of the author as its hypernym. Some ISI subject areas contain “&”, so in this case the words before and after “&” become hypernyms. For example, if the subject area is Science & Technology, Science and Technology become two parents of the term.

b) Use the generalized form of the term as hypernym: This is in accordance with the lexico-syntactic pattern technique in [14] which considers syntactic patterns of

sentences to discover hypernym relations, but here we consider the pattern of the term. That is, if the term is a noun phrase consisting of a head noun and modifier(s), generalize the term by removing one modifier at a time and look up in WordNet. If found, use that generalized form as the hypernym. For example, in Fig. 2, the term Dimensionality reduction has reduction as the head noun and Dimensionality as the modifier, removing the modifier leaves us with the head noun reduction which can be found in WordNet, so reduction becomes the parent. In Fig. 3, the term parameter-free data mining has mining as the head noun, and parameter-free and data as modifiers. Removing parameter-free leaves us with the more generalized term data mining which can be found in WordNet and hence it becomes the parent. In the case that none of the generalized forms of the term are in WordNet, use the subject area as the hypernym.

Some ISI keywords comprise a main term and an acronym in different formats, e.g., finite element method (FEM) or PTPC (percutaneous transhepatic portal catheterization). We consider the main term and apply the technique above. Therefore, the parent of finite element method (FEM) is method and the parent of PTPC (percutaneous transhepatic portal catheterization) is catheterization.

4) Build up ontology: Several parent-child relations that result from finding hypernyms become ontology fragments. Repeat steps 2) and 3) to further interweave hypernym terms until no more hypernyms can be found.

5) Merge ontology fragments: The final step is to merge the ontology fragments. If a term is found in two ontology fragments, the fragments are joined. At a joined node, if there are several upward paths from the node to the roots (from different ontology fragments), we pick the shortest path for simplicity. In Fig. 2, five ontology fragments, each with a starting keyword as the terminal node, can merge at learning, knowledge, and psychological feature nodes respectively. Since merging at psychological feature results in one single ontology, the parents of psychological feature (i.e., abstraction -> entity) are dropped.
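Step 2 can be approximated with NLTK's WordNet interface, as in the hedged sketch below. It covers only the hypernym lookup with maximum tag count, not the fragment merging of step 5, and NLTK is our choice here, not necessarily the tool used by the authors.

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def hypernym_of(term):
    """Pick the sense with the largest tag count (sum of lemma counts) and
    return the name of its first hypernym, or None if WordNet does not know
    the term."""
    synsets = wn.synsets(term.replace(" ", "_"), pos=wn.NOUN)
    if not synsets:
        return None
    best = max(synsets, key=lambda s: sum(l.count() for l in s.lemmas()))
    parents = best.hypernyms()
    return parents[0].lemma_names()[0].replace("_", " ") if parents else None

print(hypernym_of("clustering"))   # agglomeration, as in Fig. 3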

Another example that will be discussed in the next section is the profile of an author named A. Sudsang under Robotics area (Fig. 4). Five starting keywords are Grasping, grasp heuristic, Caging, positive span, and capture regions.

Figure 4. Sudsang’s ontological profile.


IV. MATCHING RESEARCH PROFILES

In this step, we match two ontological profiles. We adopt an effective algorithm called Multi-level Matching Algorithm with the neighbor search algorithm (MLMA+) proposed by Alasoud et al. [6], [7] since it uses different similarity measures to determine similarity between terms in the ontologies and also considers matching n terms in one ontology with m terms in another at the same time.

A. MLMA+

The original MLMA+ algorithm for ontology matching is shown in Fig. 5. It has three phases.

1) Initialization phase: First, preliminary matching techniques are applied to determine similarity between terms in the two ontologies, S and T. Similarity measures that are used are name similarity (Levenshtein distance) and linguistic similarity (WordNet). Levenshtein distance determines the minimal number of insertions, deletions, and substitutions to make two strings equal [15]. For linguistic similarity, we determine semantic similarity between a pair of terms using a Perl module in WordNet::Similarity package [16]. Given Kijsirikul’s ontology S comprising n terms and Ratanamahatana’s ontology T comprising m terms, we compute a similarity matrix L(i, j) of size n x m which includes values in the range [0,1] called similarity coefficients, denoting the degree of similarity between the terms si in S and tj in T. A similarity coefficient is computed as an average of name similarity and linguistic similarity. For example, if Levenshtein distance between the terms s10 (change) and t23 (damage) is 0.2 and semantic similarity is 0.933, the similarity coefficient of these two terms is 0.567. The similarity matrix L for Kijsirikul and Ratanamahatana is shown in Fig. 6.

Then, a user-defined threshold th is applied to the matrix L to create a binary matrix Map0-1. The similarity coefficient that is less than the threshold becomes 0 in Map0-1, otherwise it is 1. In other words, the threshold determines which pairs of terms are considered similar or matched by the user. Fig. 6 also shows Map0-1 for Kijsirikul and Ratanamahatana with th = 0.5. It represents the state that s2 is matched to t14, s10 is matched to t14 and t23 etc. This Map0-1 becomes the initial state St0 for the neighbor search algorithm.
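The initialization phase can be sketched as follows. The linguistic measure is left as a parameter (ling_sim), standing in for the WordNet::Similarity score used in the paper; only the Levenshtein-based name similarity is spelled out, using a normalized edit distance as an assumption.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity_matrix(S, T, ling_sim, th=0.5):
    """L(i, j) = average of name similarity (1 - normalized edit distance) and
    a linguistic similarity in [0, 1]; Map0-1 holds a 1 wherever L(i, j) >= th."""
    L = [[0.0] * len(T) for _ in S]
    Map01 = [[0] * len(T) for _ in S]
    for i, s in enumerate(S):
        for j, t in enumerate(T):
            name = 1 - levenshtein(s, t) / max(len(s), len(t), 1)
            L[i][j] = (name + ling_sim(s, t)) / 2
            Map01[i][j] = int(L[i][j] >= th)
    return L, Map01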

2) Neighbor search and evaluation phases: In this step, we search in the neighborhood of the initial state St0. Each neighbor Stn is computed by toggling one bit of St0, so the total number of neighbor states is n*m. An example of a neighbor state is shown in Fig. 7. The initial state and all neighbor states are evaluated using the matching score function v (1) of [6], [7]:

v(Map0-1, L) = (1/k) Σ_{i=1..n} Σ_{j=1..m} Map0-1(i, j) · L(i, j)    (1)

where k is the number of matched pairs and Map0-1 is Stn. The state with the maximum score is the answer to the matching; it indicates which terms in S and T are matched, and the score represents the degree of similarity between S and T.
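The sketch below illustrates this step; it uses a simplified hill-climbing variant of the neighbor search rather than the exact loop of Fig. 5, and reads the score as the average similarity over matched pairs, following our reading of (1).

def score(Map01, L):
    """Matching score v: average similarity over the matched pairs."""
    matched = [(i, j) for i, row in enumerate(Map01) for j, b in enumerate(row) if b]
    return sum(L[i][j] for i, j in matched) / len(matched) if matched else 0.0

def neighbor_search(Map01, L, max_iter=1000):
    """Repeatedly move to the best neighbor state (one toggled bit) while it improves."""
    best, best_v = [row[:] for row in Map01], score(Map01, L)
    for _ in range(max_iter):
        improved = False
        for i in range(len(best)):
            for j in range(len(best[0])):
                cand = [row[:] for row in best]
                cand[i][j] ^= 1                  # toggle one bit to get a neighbor state
                v = score(cand, L)
                if v > best_v:
                    best, best_v, improved = cand, v, True
        if not improved:
            break
    return best, best_v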

Figure 5. Ontology matching algorithm MLMA+ [6], [7].

Figure 6. L and initial Map0-1 based on MLMA+.

Figure 7. Example of a neighbor state of initial Map0-1 in Fig. 6.

B. Modification to MLMA+

We make a change to the initialization phase of MLMA+ by adding the concept of depth weights, which is inspired by [17]. A depth weight of a pair of matched terms is determined by the distance of the terms from the root of their ontologies. The motivation behind this is that we would pay more attention to a similar matched pair located near the bottom of the ontologies than to a matched pair near the root, since the terms near the bottom are considered more specialized areas of interest. From Fig. 6, consider s2 = event and t14 = occurrence. The two terms have similarity coefficient = 0.51. They are relatively more generalized terms in the profiles compared to the pair s10 = change and t23 = damage with similarity coefficient = 0.567. But both pairs are equally considered as matched interest. We are in favor of the matched pairs that are relatively more specialized and are motivated to decrease the degree of similarity of generalized matched pairs by using a depth weight function w (2):

Algorithm Match(S, T)   /* Figure 5: the MLMA+ matching algorithm of [6], [7] */
begin
   /* Initialization phase */
   K ← 0;
   St0 ← preliminary_matching_techniques(S, T);
   Stf ← St0;
   /* Neighbor Search phase */
   St ← All_Neighbors(Stn);
   while (K++ < Max_iteration) do
      /* Evaluation phase */
      if score(Stn) > score(Stf) then Stf ← Stn; end if
      Pick the next neighbor Stn ∈ St;
      St ← St − {Stn};
      if St = ∅ then return Stf;
   end
   return Stf;
end


wij = (rdepth(si) + rdepth(tj)) / 2,   wij ∈ (0, 1]    (2)

where rdepth(t) is the relative distance of the term t from the root of its ontology, i.e., rdepth(t) = (depth of the term t in its ontology) / (height of the ontology).

For s2 = event and t14 = occurrence in Fig. 2 and Fig. 3, rdepth(s2) = 2/8 and rdepth(t14) = 5/10. Their depth weight w would be 0.375, and hence their weighted similarity coefficient would change from 0.51 to 0.191 (0.375*0.51). But for s10 = change and t23 = damage, rdepth(s10) = 5/8 and rdepth(t23) = 7/10. Their depth weight w would be 0.663, and hence their weighted similarity coefficient would change from 0.567 to 0.376 (0.663*0.567). It is seen that the more generalized the matched terms, the more they are "penalized" by the depth weight. Matched terms that are both terminal nodes of their ontologies are not penalized (i.e., w = 1). Fig. 8 shows the new similarity matrix L, with weighted similarity coefficients, and the new initial Map0-1 for Kijsirikul and Ratanamahatana where th = 0.35. Note that the pair s2 = event and t14 = occurrence, and the pair s10 = change and t14 = occurrence, are considered matched in Fig. 6 but are relatively too generalized and are considered unmatched in Fig. 8. The pair s10 = change and t23 = damage survives the penalty and is considered matched in both figures.
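The depth-weight computation itself is a one-liner; the Python sketch below reproduces the two numbers from the example above (depths and heights are taken from the text, everything else is illustrative):

def depth_weight(depth_s, height_s, depth_t, height_t):
    """w_ij = (rdepth(s_i) + rdepth(t_j)) / 2, eq. (2)."""
    return (depth_s / height_s + depth_t / height_t) / 2

print(round(depth_weight(2, 8, 5, 10) * 0.51, 3))    # (event, occurrence): 0.191
print(round(depth_weight(5, 8, 7, 10) * 0.567, 3))   # (change, damage):    0.376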

C. Matching Results of the Example

Table I shows the matching results of the example when the original MLMA+ and its modification are used. Both algorithms agree that Kijsirikul's profile (Machine Learning) is more similar to Ratanamahatana's (Data Mining) than to Sudsang's (Robotics). The matched pairs between Kijsirikul and Ratanamahatana are listed in Table II. MLMA+ gives a long list of matched pairs including very generalized terms, while depth weights filter some of these out, giving a more useful list.

Figure 8. L and initial Map0-1 based on MLMA+ with depth weights.

TABLE I. MATCHING SCORES

Algorithm                    Author 1     Author 2         Matching Score
MLMA+                        Kijsirikul   Ratanamahatana   0.627
MLMA+                        Kijsirikul   Sudsang          0.581
MLMA+ with depth weights     Kijsirikul   Ratanamahatana   0.411
MLMA+ with depth weights     Kijsirikul   Sudsang          0.372

TABLE II. MATCHED PAIRS

MLMA+: (psychological feature, psychological feature), (event, event), (event, occurrence), (event, change), (knowledge, process), (power, process), (power, event), (power, quality), (process, process), (process, processing), (act, process), (act, event), (act, change), (action, process), (action, change), (action, detection), (basic cognitive process, basic cognitive process), (change, event), (change, occurrence), (change, change), (change, damage), (change, deformation), (change of magnitude, change), (reduction, change), (knowledge, perception)

MLMA+ with depth weights: (basic cognitive process, basic cognitive process), (change, change), (change, damage), (change, deformation), (change, warping), (reduction, change), (reduction, detection), (reduction, damage), (reduction, deformation), (change of magnitude, deformation)

V. EVALUATION AND DISCUSSION

Our ontology matching algorithm is evaluated using the OAEI 2011 benchmark test sample suite [8]. The benchmark provides a number of test sets in a bibliographic domain, each comprising a test ontology in the OWL language and a reference alignment. Each test ontology is a modification of the reference ontology #101 and is to be aligned with the reference ontology. Each reference alignment lists the expected alignments. So in test set #101 the reference ontology is matched to itself, and in test set #n the test ontology #n is matched to the reference ontology. The quality indicators we use are precision (3), recall (4), and F-measure (5).

Precision = (no. of expected alignments found as matched by the algorithm) / (no. of matched pairs found by the algorithm)    (3)

Recall = (no. of expected alignments found as matched by the algorithm) / (no. of expected alignments)    (4)

F-measure = (2 × Precision × Recall) / (Precision + Recall)    (5)
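These indicators are straightforward to compute over sets of matched pairs, for example:

def prf(found, expected):
    """Precision, recall and F-measure, eqs. (3)-(5), over sets of matched pairs."""
    tp = len(found & expected)
    precision = tp / len(found) if found else 0.0
    recall = tp / len(expected) if expected else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

print(prf({("a", "a"), ("b", "c")}, {("a", "a"), ("b", "b")}))   # (0.5, 0.5, 0.5)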

Table III shows the evaluation results with th = 0.5. We group the test sets into four groups. Test set #101-104 contain test ontologies that are more generalized or restricted than the reference ontology by removing or replacing OWL constructs that make the concepts in the reference ontology generalized or restricted. Test set #221-247 contain test ontologies with structural change such as no specialization, flattened hierarchy, expanded hierarchy, no instance, no properties. The quality of both algorithms with respect to these two groups is quite similar since these modifications do not affect string-based and linguistic similarities which are the basis of both algorithms. Test set #201-210 contain test ontologies which relate to change of names in the reference ontology, such as by renaming with random strings, misspelling, synonyms, using certain naming convention, and translation into a foreign language. Both algorithms are more sensitive to this test set. Test set #301-304 contain test ontologies which are actual bibliographic ontologies.

According to the average F-measure, MLMA+ with depth weights is of about the same quality as MLMA+, as it gives better precision but lower recall. MLMA+ discovers a large number of matched pairs, whereas depth weights decrease this number and hence precision is higher. But at the same time recall is affected. This is because the reference alignments only list pairs of terms that are expected to match; for example, if the test ontology and the reference ontology contain the same term, the algorithm should be able to discover a match. But MLMA+ with depth weights considers not only the presence of the terms in the ontologies but also their location in the ontologies. So an expected alignment in a reference alignment may be considered unmatched if the terms are near the roots of the ontologies and are penalized by the algorithm.

The user-defined threshold th in the initialization phase of MLMA+ is a parameter that affects precision and recall. If th is too high, only identical terms from the two ontologies would be considered as matched pairs (e.g., (psychological feature, psychological feature)), and these identical pairs are mostly located near the root of the ontologies. Discovering only identical matched pairs is not very interesting, given that the benefit of using WordNet and linguistic similarity between non-identical terms is then absent from the matching result. On the contrary, if th is too low, there would be a proliferation of matched pairs because, even if a matched pair is penalized by the depth weight, its weighted similarity coefficient remains greater than the low th. The value of th that we use for the data set in the experiment trades off these two aspects; it is the highest threshold for which the matching result contains both identical and non-identical matched pairs.

The complexity of the ODO algorithm for building an ontology S depends on the number of terms in S and the size of the search space when joining any identical terms in S into single nodes, i.e., O(n^2), where the number of ontology terms n = number of starting keywords * depth of S, given that, in the worst case, all starting keywords have the same depth. For MLMA+ and MLMA+ with depth weights, the complexity depends on the size of the search space when matching two ontologies S and T, i.e., O((n*m)^2), where n and m are the sizes of S and T respectively.

VI. CONCLUSION

This work presents an ontology-based methodology and a supporting Web-based tool for (1) building research profiles from ISI keywords and WordNet terms by applying the ODO algorithm, and (2) finding similarity between the profiles using MLMA+ with depth weights. An evaluation using the OAEI 2011 benchmark shows that depth weights can give good precision but lower recall.

TABLE III. EVALUATION RESULTS

                 MLMA+                         MLMA+ with Depth Weights
Test Set     Prec.   Rec.   F-measure      Prec.   Rec.   F-measure
#101-104     0.74    1.0    0.85           0.93    0.84   0.88
#201-210     0.35    0.24   0.26           0.68    0.18   0.27
#221-247     0.71    0.99   0.82           0.94    0.66   0.75
#301-304     0.56    0.75   0.64           0.90    0.57   0.68
Average      0.59    0.74   0.64           0.86    0.56   0.64

For future work, further evaluation using a larger corpus and evaluation on performance of the algorithms are expected. An experience report on practical use of the methodology will be presented. It is also possible to adjust the ontology matching step so that the structure of the ontologies and the context of the terms are considered. In addition, we expect to explore if the methodology can be useful for discovering potential cross-field collaboration.

REFERENCES

[1] A. Cox, “What are communities of practice? A comparative review of

four seminal works,” J. of Information Science, vol. 31, no. 6, pp. 527-540, December 2005.

[2] Y. Okubo, Bibliometric Indicators and Analysis of Research Systems: Methods and Examples, Paris: OECD Publishing, 1997.

[3] ISI Web of Knowledge, http://www.isiknowledge.com, Last accessed: January 24, 2012.

[4] Y. J. An, J. Geller, Y. Wu, and S. A. Chun, “Automatic generation of ontology from the deep Web,” in Procs. of 18th Int. Workshop on Database and Expert Systems Applications (DEXA’07), 2007, pp. 470-474.

[5] WordNet, http://wordnet.princeton.edu/, Last accessed: January 24, 2012.

[6] A. Alasoud, V. Haarslev, and N. Shiri, “An empirical comparison of ontology matching techniques,” J. of Information Science, vol. 35, pp. 379-397, March 2009.

[7] A. Alasoud, V. Haarslev, and N. Shiri, “An effective ontology matching technique,” in Procs. of 17th Int. Conf. on Foundations of Intelligent Systems, 2008, pp. 585-590.

[8] Ontology Alignment Evaluation Initiative 2011 Campaign, http://oaei.ontologymatching.org/2011/, Last accessed: January 24, 2012.

[9] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “ArnetMiner: Extraction and mining of academic social networks,” In Procs. of 14th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD 2008), 2008, pp. 990-998.

[10] J. Zhang, M. Ackerman, and L. Adamic, “Expertise network in online communities: structure and algorithms,” In Procs. of 16th Int. World Wide Web Conf. (WWW 2007), 2007, pp. 221-230.

[11] R. Punnarut and G. Sriharee, “A researcher expertise search system using ontology-based data mining,” in Procs. of 7th Asia-Pacific Conference on Conceptual Modelling (APCCM 2010), 2010, pp. 71-78.

[12] L. Trigo, “Studying researcher communities using text mining on online bibliographic databases,” In Procs. of 15th Portuguese Conf. on Artificial Intelligence, 2011, pp. 845-857.

[13] Y. Yang, C. A. Yueng, M. J. Weal, and H. C. Davis, “The researcher social network: A social network based on metadata of scientific publications,” In Procs. of Web Science Conf. 2009 (WebSci 2009), 2009.

[14] M. A. Hearst, “Automated discovery of WordNet relation,” in Wordnet: An Electronic Lexical Database and Some of its Applications, Cambridge, MA: MIT Press, 1998, pp. 132-152.

[15] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol.33, pp. 31-88, March 2001.

[16] Wordnet::Similarity, http://sourceforge.net/projects/wn-similarity, Last accessed: January 24, 2012.

[17] H. Yang, S. Liu, P. Fu, H. Qin, and L. Gu, “A semantic distance measure for matching Web services,” in Procs. of Int. Conf. on Computational Intelligence and Software Engineering (CiSE), 2009, pp. 1-3.



Automated Software Development Methodology:

An agent oriented approach

Sudipta Acharya

Dept. of Information technology National Institute of Technology

Durgapur, India [email protected]

Prajna Devi Upadhyay

Dept. of Information technology

National Institute of Technology Durgapur, India

[email protected]

Animesh Dutta

Dept. of Information technology

National Institute of Technology

Durgapur, India

[email protected]

Abstract— In this paper, we propose an automated software development methodology. The methodology is conceptualized with the notion of agents, which are autonomous goal-driven software entities. They coordinate and cooperate with each other, like humans in a society, to achieve some goals by performing a set of tasks in the system. Initially, the requirements of the newly proposed system are captured from stakeholders and then analyzed in a goal-oriented model. Finally, the requirements are specified in the form of a goal graph, which is the input to the automated system. The automated system then generates the MAS (Multi Agent System) architecture and the coordination of the agent society to satisfy the set of requirements by consulting the domain ontology of the system.

Keywords-Agent; Multi Agent System; Agent Oriented Software Engineering; Domain Ontology; MAS Architecture; MAS Coordination; Goal Graph.

I. INTRODUCTION

A. Agent and Multi agent system

An agent[1, 2] is a computer system or software that can act autonomously in any environment, makes its own decisions about what activities to do, when to do, what type of information should be communicated and to whom, and how to assimilate the information received. Multi-agent systems (MAS) [1, 2] are computational systems in which two or more agents interact or work together to perform a set of tasks or to satisfy a set of goals.

B. Agent Oriented Software Engineering

The advancement from assembly level programming to procedures and functions and finally to objects has taken place to model computing in the way we interpret the world. But there are inherent limitations in an object that make it incapable of modeling a real world entity. It was for this reason that we moved to agents and multi agent systems, which model a real world entity in a better way. As agent technology has become more accepted, agent oriented software engineering (AOSE) has also become an important topic for software developers who wish to develop reliable and robust agent-based software systems [3, 4, 5]. Methodologies for AOSE attempt to provide a method for engineering practical multi agent systems. Recently, transformation systems based on formal models to support agent system synthesis have become an emerging field of research. There are currently few AOSE methodologies for multi agent systems, and many of those are still under development.

II. RELATED WORK

Recent work has focused on applying formal methods to develop transformation systems to support agent system synthesis. Formal transformation systems [6, 7, 8] provide automated support for system development, giving the designer increased confidence that the resulting system will operate correctly despite its complexity. In [9] the authors propose a goal-oriented language, GRL, and a scenario-oriented architectural notation, UCM, to help visualize the incremental refinement of an architecture from an initially abstract description. But the proposed methodology is informal, and because of this the architecture will vary from developer to developer. In [10, 11] a methodology for multi agent system development based on a goal model is proposed. Here, the MADE (Multi Agent Development Environment) tool has been developed to reduce the gap between design and implementation. The tool takes the agent design as input and generates the code for implementation. The agent design has to be provided manually; automation has not been shown for the generation of the design from user requirements. A procedure to map requirements to an agent architecture is proposed in [12]. The TROPOS methodology for building agent oriented software systems is introduced in [13]. But the methodologies proposed in both [12] and [13] are informal approaches.

III. SCOPE OF WORK

There are very few AOSE methodologies for automated design of a system from user requirements. Moreover, most of the work follows an informal approach, due to which the system design may not totally satisfy the user requirements, and the system design varies from developer to developer. There is a need to reduce the gap between the requirements specification and the agent design and to develop a standard methodology which can generate the design from user requirements irrespective of the developer. In this work we have concentrated on developing a standard methodology by which we can generate the design of software from user requirements in a developer-independent way.

In this paper, we develop an automated system which takes the user requirements as input and generates the MAS architecture and coordination with the help of domain knowledge. The basic requirements are analyzed in a goal-oriented fashion [14] and represented in the form of a goal graph, while the domain knowledge is represented with the help of an ontology [15]. The output of the developed system is the MAS architecture, which consists of a number of agents and their capabilities, and the MAS coordination, represented through Task Petri Nets. The Task Petri Nets tool can model the coordination among the agents so as to maintain the inherent dependencies between the tasks.

IV. PROPOSED METHODOLOGY

Fig. 1 represents the architecture of our proposed automated system. The basic requirements are taken from the user as input. Since Requirements Analysis is an informal process, the input requirements can be captured from the user in the form of a text file or any other desirable format. These requirements are further analyzed and represented in the form of a Goal Graph. The domain knowledge is also an input and is represented in the form of ontology. The automated system returns the MAS Architecture and MAS Coordination as output. The MAS Coordination is represented in the form of Task Petri Nets. Thus, the automated system takes the requirements and the domain knowledge as input and generates the MAS Architecture and MAS Coordination as output. So, we can say,

Figure 1. Architecture of the proposed automated system

MAS Architecture= f (Requirements, Domain ontology)

MAS Coordination=f (Requirements, Domain ontology, MAS architecture)

The architecture of the proposed automated system is described in the following sub-section.

A. Requirements represented by Goal Graph

Agents in MAS are goal oriented, i.e., all agents perform collaboratively to achieve some goal. The concept of goal has been used in many areas of Computer Science for quite some time. In AI, goals have been used in planning to describe desirable states of the world since the 60s. More recently, goals have been used in Software Engineering to model requirements and non-functional requirements for a software system. Formally, we can define a Goal Graph as G = (V, E), consisting of:

- A set of nodes V = {V1, V2, …, Vn}, where each Vi is a goal to be achieved in the system, 1 <= i <= n.
- A set of edges E. There are two types of edges: subgoal edges and happened-before edges.
- A function subgoal: (V × V) → Bool, with subgoal(Vi, Vj) = true if Vj is an immediate sub-goal of Vi.
- A function hb: (V × V) → Bool, with hb(Vi, Vj) = true if the user specifies that the goal represented by Vi should be satisfied before the goal represented by Vj is satisfied.
- A subgoal edge exists between two vertices Vi and Vj if subgoal(Vi, Vj) = true, Vi, Vj ∈ V.
- A happened-before edge exists between two vertices Vi and Vj if hb(Vi, Vj) = true, Vi, Vj ∈ V.
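A minimal Python rendering of this definition is sketched below; the goal names used in the example are hypothetical.

from dataclasses import dataclass, field

@dataclass
class GoalGraph:
    """Goal graph G = (V, E) with the two edge types of the definition:
    subgoal edges and happened-before edges, stored as sets of pairs."""
    goals: set = field(default_factory=set)
    subgoal: set = field(default_factory=set)          # pairs (Vi, Vj)
    happened_before: set = field(default_factory=set)  # pairs (Vi, Vj)

    def leaves(self):
        """Leaf sub-goals: goals that have no further sub-goals."""
        parents = {i for i, _ in self.subgoal}
        return self.goals - parents

g = GoalGraph({"G", "G1", "G2"}, {("G", "G1"), ("G", "G2")}, {("G1", "G2")})
print(g.leaves())   # {'G1', 'G2'}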

B. Domain Knowledge represented by Ontology

A domain ontology [15] defined on a domain M is a tuple O = (TCO, TRO, I, conf, ≤, B, IR), where we have extended the tuple definition by adding another function IR as per our requirements.

- TCO = {c1, c2, …, ca} is the set of concept types defined in the domain M. Here, TCO = {task, goal}.
- TRO = {consists_of, happened_before}. In the diagrams, concepts and relations are each drawn with their own symbol.
- I is the set of instances of TCO from the domain M.
- conf: I → TCO associates each instance with a unique concept type.
- ≤: (TC × TC) ∪ (TR × TR) → {true, false}; ≤(c, d) = true indicates that c is a subtype of d.
- B: TR → ℘(TC), where ∀ r ∈ TR, B(r) = {c1, .., cn}, with n a variable associated with r. The set {c1, .., cn} is an ordered set and may contain duplicate elements. Each ci is called an argument of r, and the number of elements of the set is called the arity (or valence) of the relation. B(consists_of) = {goal, …, goal, task, …, task}.
- IR: TRO → ℘(I), where ∀ ai ∈ ℘(I), if ai is the i-th element of ℘(I), then conf(ai) = ti, where ti is the i-th element of B(TRO).

C. Semantic Mapping from requirements to Ontology

Concepts

The process by which the basic keywords of the leaf node sub goal are mapped into concepts in the Ontology is called Semantic Mapping. In this paper, the aim of semantic mapping is to find out tasks from Domain Ontology, required to be performed to achieve a sub-goal given as input from the Goal Graph. Let there be a set of task concepts T=t1,t2...tn associated with a consists_of relation in an ontology. Let there

The Eighth International Conference on Computing and Information Technology IC2IT 2012

170

Page 189: Proceedings of IC2IT2012

be a set of goal concepts G = {g1, g2, …, gm} also associated with that consists_of relation. Now suppose requirements arrive from the user side and, after Requirements Analysis, the goal graph contains the set of leaf-node sub-goals G0 = {G1, G2, G3, …, Gp}. Let ky be a function that maps a sub-goal to its set of keywords; the set of keywords for a sub-goal Gi ∈ G0 can be represented by ky(Gi) = {ky1, ky2, …, kyd}. The set of tasks T will be performed to achieve sub-goal Gi iff either

f : ky(Gi) → G is a bijective mapping, where Gi ∈ G0, or

there exists a subset {Gi, Gj, …, Gk} ⊆ G0 such that

f : ky(Gi) ∪ ky(Gj) ∪ … ∪ ky(Gk) → G is a bijective mapping.
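Under the simplifying assumption that the extracted keywords have already been normalised to ontology concept names, the bijection test reduces to set equality between finite sets, as in the following illustrative sketch (names are ours). For the library example below, the keyword set {“Check”, “Validity”, “Member”} would return the two card-checking tasks.

```java
import java.util.*;

/** Illustrative semantic-mapping step: match a leaf sub-goal's keyword set against the goal
 *  concepts of each consists_of relation and return the associated tasks. This is our sketch,
 *  not the authors' matcher; keywords are assumed to equal concept names after normalisation. */
class SemanticMapperSketch {

    /** consistsOf maps the goal-concept set describing a sub-goal to its ordered task list. */
    static List<String> tasksFor(Set<String> keywords,
                                 Map<Set<String>, List<String>> consistsOf) {
        for (Map.Entry<Set<String>, List<String>> e : consistsOf.entrySet())
            if (e.getKey().equals(keywords)) return e.getValue();   // bijective match found
        return Collections.emptyList();                             // no consists_of relation matched
    }
}
```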

D. MAS Architecture

The MAS architecture consists of a set of agents with their capability sets, i.e., the sets of tasks that each agent can perform. Formally, we can define an agent's architecture as

< AgentID, capability set>

where AgentID is the unique identification number of the agent, and the capability set is the set of tasks {t1, t2, …, tn} that the corresponding agent is able to perform. The MAS architecture can then be defined as the set of agents with their corresponding architectures.
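A direct, if simplistic, encoding of the pair <AgentID, capability set> is shown below; the field names are our own.

```java
import java.util.*;

/** Minimal sketch of an agent architecture <AgentID, capability set>. */
class AgentArchitecture {
    final int agentId;                                            // unique identification number
    final Set<String> capabilitySet = new LinkedHashSet<>();      // tasks the agent can perform

    AgentArchitecture(int agentId) { this.agentId = agentId; }

    void addCapability(String task) { capabilitySet.add(task); }
}
```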

E. MAS Coordination represented by Task Petri Nets

A Task Petri Net is an extended Petri Net that can model the MAS coordination. It is a six-tuple, TPN = (P, TR, I, O, TOK, Fn), where

P is a finite set of places. There are 8 types of places, P= Pt ∪ Ph ∪ Pc ∪ Pe ∪ Pf ∪ Pr ∪ Pa∪ Pd. Places Ph, Pc, Pe, Pf exist for each task already identified by the interface agent. The description of the different types of places is:

1. Ph: A token in this place indicates that the task represented by this place can run, i.e. all the tasks that were required to be completed for this task to run are completed.

2. Pc: A token in this place indicates that an agent has been assigned for this task.

3. Pe: A token in this place indicates agent and resources have been allocated for the task represented by the place and the task is under execution by the allocated agent.

4. Pf: A token in this place indicates that the task represented by this place has finished execution.

5. Pr: such a place exists for each type of resource in the system: ∀ ri ∈ R, ∃ Pri, 1 ≤ i ≤ q.

6. Pa: such a place exists for each instance of an agent in the system: ∀ ai ∈ A, ∃ Pai, 1 ≤ i ≤ p.

7. Pt: it is the place where the tasks identified by the interface agent initially reside.

8. Pd: such a place is created dynamically after an agent has been assigned to the task and the agent decides to divide the task into subtasks. For each subtask, a new place is created.

TR is the set of transitions. There are 5 types of transitions, TR = th ∪ tc ∪ te ∪ tf ∪ td, where th, te, tf exist for every task identified by the interface agent.

1. th: This transition fires if the task it represents is enabled i.e. all the tasks which should be completed for the task to start are complete.

2. tc: This transition fires if the task it represents is assigned an agent which is capable of performing it.

3. te: This transition fires if all the resources required by the task it represents are allocated to it.

4. tf: This transition fires if the task represented by the transition is complete.

5. td: This transition is dynamically created when the agent assigned for the task it represents decides to split the task further into sub-tasks. The subnet that is formed dynamically consists of places and transitions all of which are categorized as Pd or td respectively.

I is the set of input arcs, which are of the following types

1. I1 = Pt × th: task checked for dependency

2. I2 = Pr × te: request for resources

3. I3 = Pe × tf: task completed

4. I4 = Pf × th: interrupt to successor task

5. I5 = (Pc × td) ∪ (Pa × td) ∪ (Pr × td) ∪ (Pd × tf): input arcs of the subnet formed dynamically

O is the set of output arcs, which are of the following types:

1. O1 = th × Ph: task not dependent on any other task

2. O2 = tc × Pc: agent assigned

3. O3 = te × Pe: resource allocated

4. O4 = tf × Pr: resource released

5. O5 = tf × Pf: task completed by agent

6. O6 = tf × Pa: agent released

7. O7 = td × Pd: output arcs of the subnet formed dynamically

TOK is the set of color tokens present in the system, TOK = {TOK1, TOK2, …, TOKx}, where each TOKi, 1 ≤ i ≤ x, is associated with a function assi_tok defined as:

assi_tok : TOK → Category × Type × N, where Category is the set of all categories of tokens in the system = {T, R, A}, Type is the set of all types of each categoryi ∈ Category, i.e., Type = T ∪ R ∪ A, and N is the set of natural numbers. Let assi_tok(TOKi) = (categoryi,


typei, ni). The function assi_tok satisfies the following constraints:

∀ TOKi: (categoryi = R) ⇒ (typei ∈ R) ∧ (1 ≤ ni ≤ inst_R(typei))

∀ TOKi: (categoryi = A) ⇒ (typei ∈ A) ∧ (1 ≤ ni ≤ inst_A(typei))

∀ TOKi: (categoryi = T) ⇒ (typei ∈ T) ∧ (ni = 1)

assi_tok defines the category, type and number of instances of each token.

Fn is a function associated with each place and token. It is defined as:

Fn : P × TOK → ℘(TIME × TIME). For a token TOKk ∈ TOK, 1 ≤ k ≤ x, and place Pl ∈ P, Fn(Pl, TOKk) = {(ai, aj)}, where ai is the entry time of TOKk into place Pl and aj is the exit time of TOKk from place Pl. For a token entering and exiting a place multiple times, |Fn(Pl, TOKk)| = the number of times TOKk entered place Pl.
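Since the definition above is dense, the following fragment simply names the eight place categories and five transition categories and spells out the enabling condition of th as a boolean check. It is a simplified reading of the definition, not an executable Petri-net engine.

```java
import java.util.*;

/** Skeleton of the Task Petri Net element types (simplified, illustrative only). */
class TaskPetriNetSketch {
    enum PlaceType { PT, PH, PC, PE, PF, PR, PA, PD }             // the 8 place categories
    enum TransitionType { TH, TC, TE, TF, TD }                    // the 5 transition categories

    /** th for a task fires only when every task it depends on has a token in its Pf place. */
    static boolean thEnabled(String task,
                             Map<String, Set<String>> predecessors,
                             Set<String> finishedTasks) {
        return finishedTasks.containsAll(
                predecessors.getOrDefault(task, Collections.emptySet()));
    }
}
```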

The process by which the MAS architecture and MAS coordination are generated from the requirements is shown as a flowchart in Fig. 2.

Figure 2. Flowchart of the proposed methodology


V. CASE STUDY

Let us start the case study by applying our proposed methodology. We take a “Library system” as our case study application. Fig. 3 shows the ontology of a Library System. The ontology consists of some concepts and relations. There is a TASK concept type in the ontology which describes a task that should be performed to achieve some goal. There are other concepts that collectively describe some sub-goal to be achieved in the library system. For example, the concept types “Check”, “Validity”, and “Member” collectively describe the sub-goal “Check the membership validity”. There are two types of relations, i) consists_of and ii) happened_before; here we denote the “happened_before” relationship by “H-B”. The “consists_of” relation exists between a set of concepts describing a sub-goal and a set of instances of the TASK concept. The “happened_before” relation exists between two instances of the TASK concept. In the figure, the “consists_of” relation has incoming arcs from the concept types “Check”, “Validity”, “Member” and outgoing arcs to the TASK concepts “Get library identity card of member” and “Check for validity of that ID card”. This means that the tasks “Get library identity card of member” and “Check for validity of that ID card” have to be performed to achieve the sub-goal “Check the membership validity”. The happened_before relationship exists between these two tasks, which means that the task “Check for validity of that ID card” cannot start until the task “Get library identity card of member” is completed. Now suppose a requirement comes from the user side: “Delete account of member with member id <i> and book with book id <j> from database.” This is the main goal.

Step 1: We perform goal-oriented Requirements Analysis of the main requirement. This is an informal process performed by the Requirements Analysts, and after Requirements Analysis the result is represented by the Goal Graph shown in Fig. 4.

Step 2: The leaf-node sub-goals are given as input to the automated system. By semantic mapping [16, 17] the system maps each basic keyword of the leaf sub-goals to the goal concepts of the ontology, and finds the set of tasks required to achieve those sub-goals. This is shown in Fig. 5.

Step 3: The tasks obtained from Step 2 are used to form the task graph. The dependencies between these tasks are known from the ontology. The task graph is shown in Fig. 6, where task A implies “Check that the requirement for book id <j> is below the threshold; if yes then continue, else stop.”

Figure 3. Ontology Diagram of Library System


Figure 4. Goal Graph representation of basic requirements

Figure 5. Procedure for Semantic Mapping

Task B implies “Check in both the book and member databases whether member id <i> has any unreturned book, and whether any copy of book <j> has not been returned by any member,” i.e., task B involves two checking operations.

Task C implies “Delete member id <i> account.”

Task D implies “Remove book id <j> from library.”

Task E implies “Delete entry of book id <j> from database.”

Figure 6. Task Graph for the set of tasks found from Semantic Mapping

Step 4: Using the Task Graph of Fig. 6, we find the number of agents and their capability sets following the methodology shown as a flowchart in Fig. 2. The maximum number of concurrent agents at any level is 2, so we create two agents, A1 and A2. Let C be assigned to A1’s capability set and D to A2’s capability set: <A1, {C}>, <A2, {D}>. Both C and D have a single predecessor, B, so B is added to the capability set of either A1 or A2; let it be added to the capability set of A1, giving <A1, {B, C}>. Now B has a single predecessor, A, so A is added to the capability set of A1, giving <A1, {A, B, C}>. There are no other predecessors at a level higher than A. D has a single successor, E, so E is added to the capability set of A2, giving <A2, {D, E}>. The total number of agents deployed is 2 and the MAS architecture is {<A1, {A, B, C}>, <A2, {D, E}>}. A simplified sketch of this allocation step is given below.
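The sketch below reproduces this allocation heuristic under two assumptions of ours: the task graph is supplied as levels of concurrent tasks together with predecessor and successor maps, and the choice of which agent absorbs a shared chain is arbitrary, exactly as in the example. With the levels and dependency maps of Fig. 6 it reproduces the architecture derived above.

```java
import java.util.*;

/** Simplified sketch of the Step-4 allocation: one agent per task at the widest level of the
 *  task graph, then single-predecessor and single-successor tasks are folded into the
 *  capability set of an agent that already owns a neighbouring task. */
class AgentAllocator {

    static Map<String, Set<String>> allocate(List<List<String>> levels,
                                             Map<String, Set<String>> predecessors,
                                             Map<String, Set<String>> successors) {
        // The widest level determines how many agents are deployed.
        List<String> widest = Collections.max(levels, Comparator.comparingInt(List::size));
        Map<String, Set<String>> agents = new LinkedHashMap<>();
        Set<String> assigned = new HashSet<>();
        int id = 1;
        for (String seed : widest) {
            agents.put("A" + id++, new LinkedHashSet<>(List.of(seed)));
            assigned.add(seed);
        }
        // Fold single-predecessor chains upward and single-successor chains downward.
        for (Set<String> capability : agents.values()) {
            absorbChain(capability, assigned, predecessors);
            absorbChain(capability, assigned, successors);
        }
        return agents;
    }

    private static void absorbChain(Set<String> capability, Set<String> assigned,
                                    Map<String, Set<String>> neighbours) {
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String task : new ArrayList<>(capability)) {
                Set<String> ns = neighbours.getOrDefault(task, Collections.emptySet());
                if (ns.size() == 1) {
                    String next = ns.iterator().next();
                    if (assigned.add(next)) { capability.add(next); changed = true; }
                }
            }
        }
    }
}
```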

Step 5: Using the Task Graph of Fig. 6 and the MAS architecture developed in Step 4, the MAS coordination is formed, i.e., how the set of required agents (A1, A2) will collaboratively perform the set of required tasks (A, B, C, D, E) to satisfy the user requirements. This coordination is represented by the Task Petri Net shown in Fig. 7.

Figure 7. Task Petri Nets representation of MAS coordination

VI. CONCLUSION

In this paper, we have developed an automated system to generate MAS architecture and coordination from user requirements and domain knowledge. It is a formal methodology that is developer independent, i.e., it produces the same MAS architecture and coordination for the same set of requirements and domain knowledge. Future work is to include a verification module to check whether the developed architecture satisfies the requirements. The module can work at two levels: first, after Requirements Analysis, it can check whether the analysis satisfies the main requirements; second, it can verify whether the MAS coordination satisfies the main requirements.


REFERENCES

[1] G. Weiss, Ed., Multiagent systems: a modern approach to distributed artificial intelligence, MIT Press (1999)

[2] M. J. Wooldridge, Introduction to Multiagent Systems, John Wiley & Sons, Inc(2001)

[3] N. Jennings, On Agent-based Software Engineering, Artificial Intelligence: 117 (2000) 277-296

[4] J. Lind, “Issues in Agent-Oriented Software Engineering”, In P. Ciancarini , M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop”, AOSE 2000. Lecture Notes in Artificial Intelligence, Vol. 1957. Springer-Verlag, Berlin

[5] M. Wooldridge, P. Ciancarini, “Agent-Oriented Software Engineering: the State of the Art”, In P. Ciancarini, M. Wooldridge (eds.), Agent-Oriented Software Engineering: First International Workshop, AOSE 2000. Lecture Notes in Artificial Intelligence, Vol. 1957. Springer-Verlag, Berlin Heidelberg (2001) 1-28

[6] C. Green, D. Luckham, R. Balzer, et al., “Report on a Knowledge-Based Software Assistant”. In C. Rich, R. C. Waters (eds.), Readings in Artificial Intelligence and Software Engineering. Morgan Kaufmann, San Mateo, California (1986) 377-428

[7] T. C. Hartrum, R. Graham, “The AFIT Wide Spectrum Object Modeling Environment: An AWESOME Beginning”, Proceedings of the National Aerospace and Electronics Conference. IEEE (2000) 35-42

[8] R. Balzer, T. E. Cheatham, Jr., and C. Green, “Software Technology in the 1990’s: Using a new Paradigm,” Computer, pp. 39-45, Nov 1983.

[9] L. Liu, E. Yu, “From Requirements to Architectural Design –Using Goals and Scenarios”.

[10] Z. Shen, C. Miao, R. Gay, D. Li, “Goal Oriented Methodology for Agent System Development”, IEICE Trans. Inf. & Syst., Vol. E89-D, No. 4, April 2006.

[11] S. Zhiqi, “Goal oriented Modelling for Intelligent Agents and their Applications”, Ph.D. Thesis, Nanyang Technological University, Singapore, 2003.

[12] Clint H. Sparkman, Scott A. DeLoach, Athie L. Self, “Automated Derivation of Complex Agent Architectures from Analysis Specifications”, Proceedings of the Second International Workshop On Agent-Oriented Software Engineering (AOSE-2001), Montreal, Canada, May 29th 2001.

[13] P. Bresciani, A. Perini, P. Giorgini, F. Giunchiglia, J. Mylopoulos, “Tropos: An Agent-Oriented Software Development Methodology”, Autonomous Agents and Multi-Agent Systems, 8, 203–236, 2004.

[14] Paolo Giorgini, John Mylopoulos, and Roberto Sebastiani, “Goal-Oriented Requirements Analysis and Reasoning in the Tropos Methodology”, Engineering Applications of Artificial Intelligence, Volume 18, 159-171, 2005.

[15] P.H.P. Nguyen, D. Corbett, “A basic mathematical framework for conceptual graphs”, In: IEEE Transactions on Knowledge and Data Engineering, Volume 18, Issue 2, 2005.

[16] Haruhiko Kaiya, Motoshi Saeki, “Using Domain Ontology as Domain Knowledge for Requirements Elicitation”, 14th IEEE International Requirements Engineering Conference (RE'06)

[17] Masayuki Shibaoka, Haruhiko Kaiya, and Motoshi Saeki, “GOORE : Goal-Oriented and Ontology Driven Requirements Elicitation Method”, J.-L. Hainaut et al. (Eds.): ER Workshops 2007, LNCS 4802, pp. 225–234, 2007.


Agent Based Computing Environment for Accessing Privileged Services

Navin Agarwal

Dept. of Information Technology

National Institute of Technology, Durgapur

[email protected]

Animesh Dutta

Dept. of Information Technology

National Institute of Technology, Durgapur

[email protected]

Abstract— In this paper we propose an application for accessing privileged services on the web, which is deployed on the JADE (Java Agent Development Framework) platform. Many Organizations/Institutes have subscribed to certain services inside their network, and these will not be accessible to people who are part of the Organization/Institute when they are outside its network (for example at home). We have therefore developed two software agents: the person requests the Client Agent (which resides outside the privileged network) for access to the privileged services, and the Client Agent interacts with the Server Agent (which resides inside the network subscribed to the privileged services), which processes the request and sends the desired result back to the Client Agent.

I. INTRODUCTION

Many Organizations/Institutes have subscriptions to certain services inside their network; for example, here at NIT Durgapur there are subscriptions to IEEE and ACM. When outside the network, these services cannot be accessed. We plan to address this problem and also automate the whole process so that human effort is reduced.

To solve the problem we will build an agent based system, where multiple agents will interact with each other to solve the problem. When we talk about multiple agents interacting, the system becomes a Multi-Agent system, descriptions of which are given below.

A. Agent

An agent is a computer system or software that can act autonomously in any environment. Agent autonomy relates to an agent’s ability to make its own decisions about what activities to do, when to do, what type of information should be communicated and to whom, and how to assimilate the information received. An agent in the system is considered a locus of problem-solving activity; it operates asynchronously with respect to other agents. Thus, an intelligent agent inhabits an environment and is capable of conducting autonomous actions in order to satisfy its design objective [1-5]. Generally speaking, the environment is the aggregate of surrounding things, conditions, or influences with which the agent is interacting. Data/information is “sensed” by the agent. This data/information is typically called “percepts”.

The agent operates on the percepts in some fashion and generates “actions” that could affect the environment. This general flow of activities, i.e., sensing the environment, processing the sensed data/ information and generating actions that can affect the environment, characterizes the general behavior of all agents.

B. MAS(Multi Agent System)

Multi-agent systems (MASs) [2, 5] are computational systems in which two or more agents interact or work together to perform a set of tasks or to achieve some common goals [5-8]. Agents of a multi-agent system (MAS) need to interact with others toward their common objective or their individual benefits. A multi-agent system can be studied as a computer system that is concurrent, asynchronous, stochastic and distributed. A multi-agent system permits the coordination of the behavior of agents, interacting and communicating in an environment, to perform some tasks or to solve some problems. It allows the decomposition of a complex task into simple sub-tasks, which facilitates its development, testing and updating.

The client agent outside the network that is subscribed to certain services needs to interact with an agent residing inside the network, which will do the work on behalf of the user and send back the result. In this paper we propose the whole architecture of the system and how the different agents interact with each other.

To develop the MAS, we will use JADE, which is a software framework fully implemented in the Java language. It simplifies the implementation of multi-agent systems through a middleware that claims to comply with the FIPA specifications and through a set of tools that supports the debugging and deployment phases. The agent platform can be distributed across machines (which need not even share the same OS) and the configuration can be controlled via a remote GUI. The configuration can even be changed at run-time by creating new agents and moving agents from one machine to another, as and when required. The only system requirement is the Java Runtime version 5 or later.

The communication architecture offers flexible and efficient messaging, where JADE creates and manages a queue of incoming ACL messages, private to each agent. Agents can access their queue via a combination of several


modes: blocking, polling, timeout and pattern-matching based. The full FIPA communication model has been implemented and its components are clearly distinguished and fully integrated: interaction protocols, envelope, ACL, content languages, encoding schemes, ontologies and, finally, transport protocols. The transport mechanism, in particular, is like a chameleon because it adapts to each situation by transparently choosing the best available protocol. Most of the interaction protocols defined by FIPA are already available and can be instantiated after defining the application-dependent behavior of each state of the protocol. SL and the agent management ontology have already been implemented, as well as the support for user-defined content languages and ontologies that can be implemented, registered with agents, and automatically used by the framework.

II. RELATED WORK

Agent-based models have been used since the mid-1990s to solve a variety of business and technology problems. Examples of applications include supply chain optimization [9] and logistics [10], distributed computing [11], workforce management [12], and portfolio management [13]. They have also been used to analyze traffic congestion [14]. In these and other applications, the system of interest is simulated by capturing the behavior of individual agents and their interconnections. In [15] a framework for constructing applications in a mobile computing environment has been proposed. In this framework an application is partitioned into two pieces, one running on a mobile computer and another on a stationary computer. They are constructed by composing small objects, in which the stationary computer does the task for the mobile computer. That system is based on a service proxy and is not autonomous, whereas our system is built on agents, which adds a lot of flexibility and is autonomous. A Multi-Agent system [16] for accessing remote energy meters from an electricity board is related to this work. There the server, said to be the host, is located at the electricity board, and all the customers are clients connected with the server. This MAS helps in automating the task and thus replacing human agents. It is similar to our scenario, where we automate the task of downloading papers from the IEEE/ACM sites, replacing human agents who could do the task from inside the privileged network. In [17] an architecture has been proposed for secure and simplified access to home appliances using iris recognition, adding an additional layer of security and preventing unauthorized access to the home appliances. This model is also based on a server and client approach, where the server resides inside the home and the client resides outside the home and sends requests to the server to perform tasks on behalf of the user. An advanced method for downloading web pages from the Internet has been proposed in [18]; we will use many concepts from it to improve the working of the server agent and better utilize bandwidth.

III. PROBLEM OVERVIEW

There are many networks which have privileged access to many sites and servers. For example, inside NIT Durgapur no authentication is required for downloading papers and other documents from the IEEE and ACM sites. A user who has the right to access that privileged network cannot do so when he is outside that network. There can be scenarios in which an Institute or Organization pays for some services to be accessed inside its network. In such situations the user has to be inside the network to enjoy those services, or he can access the network from outside by means of a Proxy Server (there are some more possibilities).

IV. SCOPE OF WORK

The aim of this work is to automate this whole process: to make the work of the user easy and to let him take advantage of the services or privileges that he is entitled to access as a part of the Institute or Organization. Developing the agent in JADE allows us to implement it for mobile devices as well. The only requirement for running any JADE agent is a Java Runtime Environment, which most systems and mobile devices have.

In this project, the user will just need to send the keyword, and all the related documents matching that keyword will be downloaded and sent to the user. The user need not wait for the whole process to finish. He just needs to send the request, and the Multi-Agent System will perform the task for the User.

The main purpose of technology is to ease human work, so that the effort can be put to do more useful work. This project targets that specific purpose, with some added benefits to the user.

V. MULTI-AGENT BASED ARCHITECTURE

The agent system is divided into two parts:

There will be one single agent called the Server Agent which serves the requests of multiple users.

Multiple Client Agents will send requests to the Server Agent in the form of a keyword.

A. Server Agent

This agent will run autonomously inside the network which has privileged access, or has been authorized to a service. It will always be ready to accept requests from clients. The received keyword will be searched in a search engine, and the source code of the result web page will be downloaded using Java. That web page will contain many links and also some documents. If a link is found while searching the source code of that web page, then its source code will also be downloaded. This can be visualized in the form of a graph, as shown in Figure 1. Building this graph helps us avoid searching the same links again. While parsing the source code of a web page, whenever a link is found the Java code for downloading the source will be called again and executed in


a different thread. Whenever a document is found, the Java code for downloading files from the web is called and executed in a different thread. This process could continue indefinitely, so we restrict the depth of the graph from the starting page: when the page being parsed is at the maximum depth from the starting page, only the documents on that page are downloaded.
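A stripped-down version of this page walk is sketched below using the standard java.net.http client; the regular expression, the depth handling and the single-threaded recursion are simplifications of ours, whereas the actual server agent spawns a separate thread for every link and every document download.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

/** Simplified sketch of the server agent's depth-limited page walk. */
class PageWalkerSketch {
    private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");
    private final HttpClient http = HttpClient.newHttpClient();
    private final Set<String> visited = new HashSet<>();          // the graph of links already seen
    private final List<String> documents = new ArrayList<>();     // document URLs for the downloader

    List<String> collect(String url, int depth, int maxDepth) throws Exception {
        if (depth > maxDepth || !visited.add(url)) return documents;
        String page = http.send(HttpRequest.newBuilder(URI.create(url)).build(),
                                HttpResponse.BodyHandlers.ofString()).body();
        Matcher m = HREF.matcher(page);
        while (m.find()) {
            String link = m.group(1);
            if (link.endsWith(".pdf")) documents.add(link);        // a document was found
            else if (depth < maxDepth) collect(link, depth + 1, maxDepth);
        }
        return documents;
    }
}
```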

Figure 1. Diagram showing the graph of the links and documents.

B. Client Agent

This will be a very simple agent, it will perform two tasks.

Authentication of the user: it sends a request to the server with the user's credentials. If the user is authenticated, then the user will be able to perform his task.

Providing a simple GUI for the user to send the keyword for which he requires documents (or papers). The user can also directly send the link of the document or the IEEE page; in this situation the keyword is not searched in a search engine, and the server agent performs the next step directly.

In Figure 2 the interactions between the Client Agent, the Server Agent and the Java code on the server are shown. The first step is the authentication process, in which the client sends the credentials to the server agent for verification. If the credentials are verified then the client is granted access. The client then sends the search keyword to the server agent, which verifies whether the keyword is valid. If it is valid, the server agent calls the Java code for downloading source code, which searches all the links starting from the main search page and proceeds as shown in figure 3. The source code downloader sends the list of all the documents found back to the server agent. The agent then sends this list to the document downloader code, which downloads all the documents and saves them in a zipped folder ready to be sent to the client. It then notifies the server agent that the

downloading has been done, and finally the server agent will send the zipped folder to the client.

Figure 2. Diagram showing the interaction between the Server Agent, Client Agent and Java code on the server.

VI. PROTOTYPE DESIGN

Figure 3 represents the main JADE prototype elements. An application based on JADE is made of a set of components called Agents each one having a unique name. Agents execute tasks and interact by exchanging messages. Agents live on top of a Platform that provides them with basic services such as message delivery. A platform is composed of one or more Containers. Containers can be executed on different hosts thus achieving a distributed platform. Each container can contain zero or more agents.

A special container called Main Container exists in the platform. The main container is itself a container and can therefore contain agents, but differs from other containers as

It must be the first container to start in the platform and all other containers register to it at bootstrap time.

It includes two special agents: the AMS that represents the authority in the platform and is the only agent able to perform platform management actions such as starting and killing agents or shutting down the whole platform (normal agents can request such actions to the AMS). The DF that provides the Yellow Pages service where agents can publish the services they provide and find other agents providing the services they need.

Agents can communicate transparently regardless of whether they live in the same container, in different containers (in the same or in different hosts) belonging to the same platform or in different platforms (e.g. A and B). Communication is based on an asynchronous message passing paradigm. Message format is defined by the ACL language defined by FIPA [19], an international organization that issued a set of specifications for agent interoperability. An ACL Message contains a number of fields including


The sender

The receiver(s).

The communicative act (also called performative) that represents the intention of the sender of the message. For instance when an agent sends an INFORM message it wishes the receiver(s) to become aware about a fact (e.g. (INFORM "today it's raining")). When an agent sends a REQUEST message it wishes the receiver(s) to perform an action. FIPA defined 22 communicative acts, each one with a well defined semantics, that ACL gurus assert can cover more than 95% of all possible situations. Fortunately in 99% of the cases we don't need to care about the formal semantics behind Communicative acts and we just use them for their intuitive meaning.

The content i.e. the actual information conveyed by the message (the fact the receiver should become aware of in case of an INFORM message, the action that the receiver is expected to perform in case of a REQUEST message).

In Figure 3, three agents are shown: two clients and one server. Every system that runs a JADE platform will have a main container where all the agents run. The two clients send a request to the server containing the search query or the link of the paper/document to be downloaded.


Host 3 is the server, where the JADE platform runs. There is only one container, called the Main Container, in which two more agents, the AMS and the DF, run alongside the server agent. The name of the server agent is B@Platform2, and its address is http://host3:7778/acc. When the client agents communicate with the server agent remotely, host3 must be a fully qualified domain name.

There are two clients, Host 1 and Host 2; both have a JADE platform with one container, called the Main Container, where two more agents, the AMS and the DF, run alongside the client agent. When a user wants to send a request to the server agent, the client agent sends a message to the server agent whose receiver address in this case is http://host3:7778/acc and whose receiver name is B@Platform2, along with other necessary details.
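A minimal client agent along these lines could be written with the standard JADE classes as shown below; the keyword content and the handling of the reply are placeholders of ours, while the agent name B@Platform2 and the address http://host3:7778/acc follow the example above.

```java
import jade.core.AID;
import jade.core.Agent;
import jade.core.behaviours.OneShotBehaviour;
import jade.lang.acl.ACLMessage;

/** Sketch of a client agent that sends a search keyword to the remote server agent. */
public class ClientAgentSketch extends Agent {
    @Override
    protected void setup() {
        addBehaviour(new OneShotBehaviour() {
            @Override
            public void action() {
                AID server = new AID("B@Platform2", AID.ISGUID);    // globally unique agent name
                server.addAddresses("http://host3:7778/acc");       // remote platform address
                ACLMessage request = new ACLMessage(ACLMessage.REQUEST);
                request.addReceiver(server);
                request.setContent("multi agent systems");          // the search keyword (placeholder)
                send(request);
                // Block until the server agent replies, e.g. with the location of the zipped result.
                ACLMessage reply = blockingReceive();
                System.out.println("Server replied: " + reply.getContent());
            }
        });
    }
}
```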

Figure 3. Diagram showing two Client Agents sending messages to the server agent.

VII. CONCLUSION

In this work, we have developed an agent-based system for remotely accessing privileged services in a network. The service in this scenario is the subscription to the IEEE and ACM sites, which do not require authentication from inside the network. We have also automated the process of downloading papers/documents from the web that match the search keyword. This application is a first implementation of its type, so there is a lot of scope for improvement. We plan to improve the search and give better results by considering the semantics of the search keyword. This work addresses one such privileged service; the model can be used as a base and expanded to include many more such services and even provide automation wherever possible.

REFERENCES

[1] Christopher A. Rouff, Michael Hinchey, James Rash, Walter

Truszkowski, and Diana Gordon-Spears (Eds), “Agent Technology from a formal perspective” (Springer-Verlag London Limited 2006).

[2] G. Weiss, Ed., “Multiagent systems: a modern approach to distributed

artificial intelligence”, (MIT Press, 1999).

[3] N. J. Nilsson, “Artificial intelligence: a new synthesis”, (Morgan Kaufmann Publishers Inc., 1998).

[4] S. J. Russell and P. Norvig,” Artificial Intelligence: A Modern Approach”, (Pearson Education, 2003).

[5] M. J. Wooldridge, “Introduction to Multiagent Systems”, (John Wiley

& Sons, Inc., 2001).

[6] A. Idani, “B/UML: Setting in Relation of B Specification and UML Description for Help of External Validation of Formal Development

in B”, Thesis of Doctorat, The Grenoble University, November 2005.

[7] G. W. Brams, “Petri Nets: Theory and Practica”l, Vol. 1-2, (MASSON, Paris, 1982).

[8] M-J. Yoo, “A Componential For Modeling of Cooperative Agents

and Its Validation”, Thesis of Doctorat, The Paris 6 University, 1999.


[9] Jiun-Yan Shiau , Xiangyang Li, “Modeling the supply chain based on

multi-agent conflicts”, Service Operations, Logistics and Informatics, 2009. SOLI '09. IEEE/INFORMS International Conference on,

Publication Year: 2009 , Page(s): 394 – 399

[10] Yan Wang, YinZhang Guo, JianChao Zeng, “A Study of Logistics

System Model Based on Multi-Agent” , Service Operations and Logistics, and Informatics, 2006. SOLI '06. IEEE International

Conference on, Publication Year: 2006 , Page(s): 829 – 832

[11] R. Al-Khannak, B. Bifzer, Hezron, “ Grid computing by using multi agent system technology in distributed power generator” ,

Universities Power Engineering Conference, 2007. UPEC 2007. 42nd International, Publication Year: 2007 , Page(s): 62 – 67

[12] [Online].

Available:http://en.wikipedia.org/wiki/Workforce_management

[13] V. Krishna, V. Ramesh, “Portfolio management using cyberagents”, Systems, Man, and Cybernetics, 1998. 1998 IEEE International

Conference on, Issue Date : 11-14 Oct 1998 Volume : 5 , On page(s): 4860 - 4865 vol.5

[14] Application of Agent Technology to Traffic Simulation. United States

Department of Transportation, May 15, 2007.

[15] A. Hokimoto, K. Kurihara, T. Nakajima, “An approach for constructing mobile applications using service proxies” , Distributed

Computing Systems, 1996., Proceedings of the 16th International Conference on , Issue Date : 27-30 May 1996 , On page(s): 726 - 733

[16] C. Suriyakala, P.E. Sankaranarayanan, “Smart Multiagent Architecture for Congestion Control to Access Remote Energy

Meters” , Issue Date : 13-15 Dec. 2007 , Volume : 4 , On page(s): 24 - 28

[17] A. Mondal, K. Roy, P. Bhattacharya, “Secure and Simplified Access

to Home Appliances using Iris Recognition” , Computational Intelligence in Biometrics: Theory, Algorithms, and Applications,

2009. CIB 2009. IEEE Workshop on, Issue Date : March 30 2009-April 2 2009 , On page(s): 22 – 29

[18] A. Kundu, A.R. Pal, Tanay Sarkar, M. Banerjee, S. Mandal, R.

Dattagupta, D. Mukhopadhyay, “An Alternate Downloading Methodology of Webpages” , Artificial Intelligence, 2008. MICAI

'08. Seventh Mexican International Conference on , Issue Date : 27-31 Oct. 2008 , On page(s): 393 – 398

[19] [Online]. Available : http://www.fipa.org


An Interactive Multi-touch Teaching Innovation

for Preschool Mathematical Skills

Suparawadee Trongtortam
Technopreneurship and Innovation Management

Program

Chulalongkorn University

Bangkok, Thailand

[email protected]

Peraphon Sophatsathit and Achara Chandrachai

Department of Mathematics and Computer Science,

Faculty of Science

Chulalongkorn University

Bangkok, Thailand

[email protected]/[email protected]

Abstract - The paper proposes a teaching medium that is suitable for preschool children and teachers to develop basic mathematical skills. The research applies the bases of Multi-touch and Multi-point media technologies to innovate an interactive teaching technique. By utilizing Multi-touch and the connectivity structure of Multi-point to create a technology that facilitates simultaneous interaction from child learners, the teacher can better adjust and adapt the lessons accordingly. The benefit of this innovation is the amalgamation of technology and new ideas to support teaching media development that permits teachers and students to interact with each other directly, as well as to learn by themselves.

Keywords- Multi-touch; Multi-point; preschool mathematical skills; interactive teaching technique.

I. INTRODUCTION

Preschool learning is the first step of education and supports child learners in all aspects, e.g., physical, intellectual, professional, and societal knowledge. One of the most urgent and important activities in building their learning is the teaching media, owing to its significant role in disseminating knowledge, experience, and other skills to children. There are numerous teaching media for the

preschool level, ranging from conventional paper based,

transparencies, audio, video, and computer based media.

The latter is the principal teaching vehicle which has

played an important role owing to its usefulness and

convenience. Children can learn by themselves [1, 2]

and be independent from classroom environment.

This research aims at using the connectivity of

Multi-point technique and Multi-touch approach as the

platform and underlying research process to develop

proper stimulating media for preschool children to learn

basic mathematics. The paper is organized as follows.

Section 2 and 3 briefly explain Multi-touch and Multi-

point technologies. Section 4 describes the proposed

approach, followed by the experiments in Section 5.

The results are summarized in Section 6. Section 7

concludes with the benefits and some final thoughts.

II. MULTI-TOUCH

Multi-touch [3] is a technology that supports several

inputs at the same time to create interaction between the

user and the computer. The system responds to finger

movement as commands issued by the user, e.g., select,

scroll, zoom or expand, etc. Fig. 1 shows multiple

fingers touching on several areas of the screen

simultaneously, thereby mimicking interactive reality of

learning that stimulates high alert attitude.

Figure 1. Multi-touch display and finger movement control

III. MULTI-POINT

Multi-point [4, 5] is a multiple computer connection

structure developed by Windows [6] for educational

institutes or learning centers. It uses one host to support

multi-user interface, permitting simultaneous user’s

responses. The underlying configuration is different

from conventional client-server (C-S) model in that

communication exchange in C-S takes place between

client and server in a pair-wise manner. Any exchange

among clients is implicitly routed through the server.

On the other hand, Multi-point is a simulcast among

peers, where everyone can see one another

simultaneously and interactively. This is shown in Fig.

2. The result of such a connectivity scheme is lower expenditure and power consumption and easier management, which is ideal for a classroom environment.

Figure 2. Connectivity of Multi-point scheme

The Eighth International Conference on Computing and Information Technology IC2IT 2012

181

Page 200: Proceedings of IC2IT2012

This research applies both technologies by

connecting several teaching aids with the help of

interactive teaching media. The media in turn facilitate

simultaneous teacher’s involvement and children’s

interaction. Teachers can teach and observe the

students, while the students can react to the lesson

promptly. Thus, lessons and practical exercises can be

explained, worked out, and corrected on-the-spot. As

such, the teacher can design the lesson and

accompanying exercises in an unobtrusive manner, unbounded by physical means. Conventional preschool

teaching employs Computer Assisted Instruction (CAI)

[7] which provides media in sentences, images,

graphics, charts, graphs, videos, movies, and audio to

present the contents, lessons and exercises in the form

of banal classroom learning. Teaching by CAI can only

create interaction between the learner and the computer.

On the other hand, the proposed approach instigates and

collects responses from several children. The children

collectively learn, collaborate, express individual’s

opinion, and react as they proceed. This in turn

stimulates their interest and thought process for better

understanding and knowledge acquisition.

Figure 3. Device connection

Fig. 3 shows the inter-connection of electronic

devices for basic preschool mathematics which consists

of a Web server controlled by the teacher to observe

individual child learning. The exercises are designed

and broadcasted via duplex wireless means that allow

the student-teacher to interact back and forth

collectively at the same time.

IV. INTERACTIVE MEDIA TEACHING

Numerous educational media to create learning

lessons are prevalent in this digital age. CAI perhaps is

a predominant technique being adopted in all levels of

teaching. Unfortunately, the state of the practice falls short of conveying “effective” teaching that inspires learning toward knowledge. The limitations of CAI technology preclude the teacher and students from interacting with one another simultaneously, so spontaneous thinking and feedback can never be motivated and developed systematically. We shall explore

the principal functionality of an interactive teaching

innovation.

Figure 4. Flow of interactive media teaching

Fig. 4 illustrates the flow of media set up for

interactive teaching. We exploit Multi-point principle to

attain higher children’s interaction through latest

electronic devices and Multi-point technology. By

strategically creating exercise in the form of interactive

game to sense the use of multiple fingers touching, their

thought process, while stimulating their interests

through game playing, the teacher can observe the

children’s behavior from their own screen to faster and

easier access and respond to the development of each

child. Thus, they can promptly monitor, instruct, or

sharpen the skill of individual child or the whole group,

without having to repeatedly recite the same instruction

to every child in the conventional classroom setting.

Some of the benefits precipitated from Multi-point

principle are:

1. Instant child and teacher interaction through easily understood media of instruction.
2. Flexibility of creating or enhancing teaching media to motivate children’s interests, thereby lessening learning boredom.
3. Strengthening of early childhood skills with the help of drawing and graphical illustrations.
4. Increased speed of cognitive learning in children so as to facilitate subsequent skill development evaluation.
We will elaborate how the proposed scheme works out

in the sections that follow.

The Eighth International Conference on Computing and Information Technology IC2IT 2012

182

Page 201: Proceedings of IC2IT2012

A. Teacher Preparation Configuration

Instructional aids are accomplished via our tool

which permits customized display format through

simple set up configurations. The teacher can prepare

her lessons and companion exercises off-line and

upload or post them to the system database. The

children will have access to all the materials once

upload or posted. Any un-posted instructions, lessons,

and exercises will not be accessible by the learners’

display device. The process flow is depicted in Fig. 5

B. Student Learning Process

The process begins with student’s sign-in to identify

himself. He then selects the lesson or exercise set to

work on. All the activities are monitored from the

teacher’s console where the results are made available

instantly. The process flow is depicted in Fig. 6

Figure 5. Preparation process of the teacher
Figure 6. Flow of student learning process

Fig. 5 illustrates the teacher preparation process that

proceeds as follows:

1. Select a topic to prepare the lesson.
2. Add or modify the exercises if the exercises were already prepared in an earlier session.
3. Upload/post the materials to the database.
In the meantime, the teacher can monitor the students’ behavior during the lesson as follows:

1. Select the child to be monitored from list.

2. Observe their work.

3. Assess the results to analyze their behavior and

development.

C. Skill test by Bloom’s Taxonomy

Learning evaluation is carried out based on Bloom’s

Taxonomy [8] in the following aspects:

Media skills test

Subject comprehension from doing exercise

Self practice

The evaluation will adopt three basic indicators

given in Table I, namely, Knowledge, Comprehension,

and Application to measure the effectiveness of the

proposed interactive teaching innovation. This is

accomplished via actual preschool class setting by

means of CIPP model to be described in the next

section.

TABLE I. THREE LEVELS OF EVALUATION BY BLOOM’S TAXONOMY.

Level | Evaluation
Knowledge | Able to tell the meaning of positive or negative signs, matching, and shapes
Comprehension | Knows how to complete arithmetic operations
Application | Does exercises by themselves

D. Learning evaluation by CIPP model

This research makes use of CIPP model [9] to

evaluate the class performance with respect to the

following criteria: score, learning time, degree of

satisfaction, and the ratio of learning per expense. The

evaluation is performed in accordance with the CIPP

capabilities as follows:

Context: all required class materials from course

syllabus are divided into individual topics and subtopics

successively. Each subtopic is further broken down into

stories so that subject contents can be presented. The

corresponding companion exercise are either embedded

or added to the end to furnish as many hands-on drills

as possible.

Input: the above multimedia lessons are measured to

test/monitor the children's skill development,

particularly multi-touch drills. The indirect benefits

precipitated from this design are duration of work and

satisfaction.

Process: a number of evaluations are applied

through Multi-point and Multi-touch technologies. For

example, the time spent on exercise creation and

modification, session evaluation, and cost ratio, etc. In

addition, interactive monitoring, collaboration, and

assistance, instant results display (upon their

availability), and information transfer to/from server,

etc. The savings so obtained are the utmost achievement

of this innovative approach.

Product: the instantaneous interaction between

children and teacher, and the rate of self-learning upon

score improvement, result in tremendous skill

improvement and experience in new technology. Thus,

both score and user's satisfaction improve considerably.

V. EXPERIMENTAL RESULTS

The experiment was run on a Windows-based server

that supports two iPad display devices (to be used by a

preschool class). The proposed approach focused on a

preschool mathematics class, where children learned

basic arithmetic operations through interactive visual

lesson and exercise. Students retrieved their lesson and

corresponding exercises from the Multi-point teaching

media system. As the learning progressed, they

collaboratively worked on the lessons, exercises, and

other activities via the multi-touch system. Their

responses were recorded interactively (including


corrections, reworks, etc). The results were instantly

processed and made available in the teaching archive.

The process is shown in Fig. 7.

Figure 7. Flow of preschool mathematics exercise

Figure 8. Flow of design, modification, and monitoring exercise

Figure 9. Sample math exercises

Fig. 8 shows the flow of lesson and exercise

creation, modification, and monitoring the students’

activity interactively through the teaching media

system. Each student’s screen can be selectively monitored, assisted in correcting errors or when help is needed, and their performance observed and reviewed via a summary of score, frequency of attempts, reworks, etc., all of which is supported by the Multi-point

technique. Fig. 9 illustrates sample mathematics

exercises. We conducted student’s performance and teacher’s

productivity evaluations to measure the

accomplishments of both parties under the proposed

system in comparison with conventional CAI system.

The evaluations measured two instructional media on

the same and different lessons. From the students’

standpoint, the lesson was designed to observe how

students would learn by drawing analogy from the same

lesson and accumulate their skills from different lesson.

From the teacher’s standpoint, this would gauge how

productive the teacher performed on the same and

different lessons.

Several measures were collected and categorized

according to student and teacher, namely, exercise score

(D), duration of work (E), and degree of satisfaction (F),

as shown in Table II, and time spent on creating

exercise (M), time spent on one session evaluation (N),

and ratio of learning per expense (P), as shown in Table

III. For example, the exercise score obtained from the

students learning the same lesson using CAI is 5 out of

10 as opposed to 8 out of 10 problems via Multi-point. In

learning different lessons, the exercise score drops to 1

out of 10 from CAI, but still remains decent at 4 out of

10 problems by Multi-point. Similarly, Multi-point

outperforms CAI by one hour for the time spent on

creating exercise by the teacher in both cases. The same

outcomes hold true for learning per expense where more

teachers agree on the effectiveness of Multi-point than

CAI approach. The corresponding plots are depicted in

Fig. 10-13, respectively.

TABLE II. STUDENT PERFORMANCE EVALUATION

Detail Same Lesson Different Lessons

CAI Multi-Point CAI Multi-Point

D 5/10 8/10 1/10 4/10

E 20 min 13 min 45 min 30 min

F 9/15 12/15 5/15 9/15

Figure 10. Students’ performance on the same lesson
Figure 11. Students’ performance on different lessons

Table III. TEACHER PRODUCTIVITY EVALUATION

Detail Same Lesson Different Lessons

CAI Multi-Point CAI Multi-Point

M 4 hr. 3 hr. 4 hr. 3 hr.

N 60 min 20 min 85 min 25 min

P 7/15 13/15 4/15 9/15

Figure 12. Teacher’s performance on the same lesson
Figure 13. Teacher’s performance on different lessons

From the overall comparative evaluation, it is

apparent that the use of Multi-point and Multi-touch

technologies is more effective than the conventional

CAI approach from both student and teacher’s


standpoint. The obvious initial investment is fully offset

by better score, less time, higher satisfaction on the

student’s part, and more productivity and cost effectiveness

on the teacher’s part. The percentage of agreeable

opinion on electronic media adoption is illustrated in

Fig. 14.

Figure 14 Percentage of electronic teaching media adoption

VI. CONCLUSION

We have proposed an interactive teaching

innovation for preschool children to improve their

mathematical skills. The contributions are two folds, (1)

the teacher can instruct and monitor preschool

children’s development in real-time, promptly obtaining

class evaluation, delivering lessons, and become more

economically productive over conventional CAI

approach; and (2) preschool children can improve their

mathematical skills, or knowledge in general, by

interactive means. They will become more enthusiastic

to explore new ideas, express themselves, and gain

confidence and self-esteem as they progress. The

proposed approach is simple and straightforward to

realize. The underlying configuration exploits Multi-

point to simultaneously connect students with the

teacher, while interactively furnishes spontaneous

communications among them. In the meantime, students

can collaboratively work on the exercise to enhance

their learning skill via Multi-touch technology. The

resulting amalgamation is an innovative scheme which

is subsequently implemented as a teaching tool.

We targeted the development of their mathematical skills

to gauge how the overall configuration will work out.

The comparative summaries with conventional CAI

turned out to be superior and satisfactory in many

regards.

We envision that the proposed system can be further

extended to operate on larger network scale, whereby

wider student audience can be reached.

ACKNOWLEDGMENTS

We would like to express our special appreciation to

teachers and students of Samsen Nok School and

Phacharard Kindergarten School for their courteous

cooperation and invaluable time for this research.

REFERENCES

[1] National Education Act, 2542 No 116 At 74a

Rajchakichanubaksa, 19 August 1999.

[2] Division of Academic and Education Standards,

Office of the Elementary Education Commission,

MoE. 2546 B.E. Handbook of Preschool Education

Age 3-5 years, Ministry of Education, 2546 B.E.

[3] Wisit Wongvilai Software Technology of The

Future. [online], 2008, http://www.nectec.or.th, [8

July 2010].

[4] Suphada Jaidee. Electronic Learning Media.

[Online], 2007,

http://www.microsoft.com/thailand/press/

nov07/partners-in-learning.aspx, [12 July 2010].

[5] Pedro González Villanueva, Ricardo Tesoriero,

Jose A. Gallud, ”Multi-pointer and collaborative

system for mobile devices” , Proceedings of the

12th international conference on Human computer

interaction with mobile devices and services, pp

435-438, 2010.

[6] Windows. MultiPoint Server 2011. [Online],

http://www.microsoft.com/thailand/windows/multi

point/default.aspx, [12 August 2011].

[7] Donald L. Kalmey, Marino J. Niccolai, “A Model

For A CAI Learning System” , ACM SIGCSE

Bulletin Proceedings of the 12th SIGCSE

symposium on Computer science education, vol.

13, Issue 1, pp. 74-77, February 1981.

[8] Bloom, B.S et al. Taxonomy of Education

Objectives Classification of Education Goals,

Handbook I: Cognitive Domain, New York: David

Macky, 1972.

[9] Stufflebeam, D. L., The CIPP Model for program

evaluation. In Maduas, G. F., Scriven, M., &

Stufflebeam, D.L. Evaluation Model: Viewpoints

on Human Services Evaluation. Boston: Kluwer-

Nijhoff Publications, 1989.


AUTHOR INDEX

Pages Agarwal, Navin 176 Aditya, Narayan Hati 116 Acharya, Sudipta 169 Bernard, Thibault 14 Bui, Alain 14 Chandrachai, Achara 181 Chaiwongsa, Punyaphat 145 Fung, Chun Che 24, 42 Chongstitwattana, Prabhas 133 Chen, Ting-Yu 30 Dutta, Animesh 169, 176, 138 Upadhyay, Prajna Devi 169, 138 Tran, Hung Dang 75 Smith, Derek H. 127 Hunt, Francis 127 Grachangpun, Rugpong 70 Getta, Janusz 121 Ghosh, Supriyo 138 Haruechaiyasak, Choochart 58, 70 Hiransakolwong, Nualsawat 81 Johannes, Fliege 98 Sil, Jaya 116 Jana, Nanda Dulal 116 Kamsiang, Nawarat 163 Kajornrit, Jesada 24, 42 Wong, Kok Wai 24 Keeratiwintakorn, Phongsak 19 Kongsakun, Kanokwan 42 Kubek, Mario 104 Leelawatcharamas, Tunyathorn 48 Le, Pham Thi Anh 157, 75 Li, Yuefeng 92 Lin, Yung-Chang 92 Minh, Quang Nguyen 75, 157 Muchalintamolee, Nuttida 151 Dewan, Mohammed 109 Quaddus, Mohammed 109 Mehta, Kinjal 87 Minh, Quang Nguyen 157, 75 Salani, Matteo 127 Mingkhwan, Anirach 64 Mandal, Sayantan 116 Meesad, Phayung 36 Montemanni, Roberto 127 Nitsuwat, Supot 54 Nakmaetee, Narisara 58


Pages Nhan, Le Thanh 157 Ouedraogo, Boukary 14 Paoin, Wansa 54 Pattaranantakul, Montida 8 Sangwongngam, Paramin 8 Sangsongfa, Adisak 36 Senivongse, Twittie 48, 151, 163 Sheth, Ravi 87 Sodanil, Maleerat 58, 70 Sophatsathit, Peraphon 187 Sripimanwat, Keattisak 8 Tansriwong, Kitipong 19 Trongtortam, Suparawadee 181 Upadhyay, Prajna Devi 138, 169 Unger, Herwig 104 Waijanya, Sajjaporn 64 Wolfgang, Benn 98 Wu, Ming-Che 30 Wu, Sheng-Tang 92 Yampaka, Tongjai 133 Yawai, Wiyada 81 Zimniak, Marcin 98


The 9th

International Conference on

Computing and Information Technology

10-11 May 2013

At Faculty of Information Technology

King Mongkut’s University of Technology

North Bangkok, Thailand

www.ic2it.org

Faculty of Information Technology

King Mongkut’s University of

Technology North Bangkok

www.it.kmutnb.ac.th
