Çukurova university institute of natural and applied ... · bu çalışmada, web sayfaları...

ÇUKUROVA UNIVERSITY INSTITUTE OF NATURAL AND APPLIED SCIENCES

MSc THESIS

Esra SARAÇ

WEB PAGE CLASSIFICATION USING ANT COLONY OPTIMIZATION

DEPARTMENT OF COMPUTER ENGINEERING

ADANA, 2011



Esra SARAÇ

MSc THESIS


We certify that the thesis titled above was reviewed and approved for the award of degree of the Master of Science by the board of jury on 18/01/2011. ……………….................... ………………………….. ……................................ Asst. Prof. Dr. Selma Ayşe ÖZEL Prof. Dr. Mehmet TÜMAY Assoc. Prof. Dr. Zekriya TÜFEKÇİ SUPERVISOR MEMBER MEMBER This MSc Thesis is written at the Department of Institute of Natural And Applied Sciences of Çukurova University. Registration Number:

Prof. Dr. İlhami YEĞİNGİL Director Institute of Natural and Applied Sciences

Not:The usage of the presented specific declerations, tables, figures, and photographs either in this thesis or in

any other reference without citiation is subject to "The law of Arts and Intellectual Products" number of 5846 of Turkish Republic

I

ABSTRACT

MSc THESIS


Esra SARAÇ



Supervisor : Asst. Prof. Dr. Selma Ayşe ÖZEL Year: 2011, Pages: 73 Jury : Asst. Prof. Dr. Selma Ayşe ÖZEL : Prof. Dr. Mehmet TÜMAY : Assoc. Prof. Dr. Zekeriya TÜFEKÇİ

In this study, Web pages are classified by selecting the best features using an Ant Colony Optimization algorithm and applying the C4.5 classifier. The proposed Ant Colony Optimization based algorithm was experimented on the WebKB and the Conference datasets. The aim of this study is to reduce the number of features to be used during the classification process to improve run-time performance and efficiency of the classifier. The experimental results of this study showed that, Ant Colony Optimization is an acceptable optimization algorithm for Web Page feature selection. Key Words: Feature Selection, Ant Colony Optimization, Web Page Classification

II

ÖZ

YÜKSEK LİSANS TEZİ

KARINCA KOLONİSİ OPTİMİZASYONU KULLANARAK WEB SAYFASI SINIFLANDIRMA

Esra SARAÇ

ÇUKUROVA ÜNİVERSİTESİ FEN BİLİMLERİ ENSTİTÜSÜ

BİLGİSAYAR MÜHENDİSLİĞİ ANABİLİM DALI

Danışman : Yrd. Doç. Dr. Selma Ayşe ÖZEL Yıl: 2011, Sayfa: 73 Jüri : Yrd. Doç. Dr. Selma Ayşe ÖZEL : Prof. Dr. Mehmet TÜMAY : Doç. Dr. Zekeriya TÜFEKÇİ

Bu çalışmada, Web sayfaları Karınca Kolonisi Eniyilemesi algoritması kullanılarak seçilen en iyi niteliklerle ve C4.5 sınıflandırıcısı uygulanarak sınıflandırılmıştır. Önerilen Karınca Kolonisi Eniyilemesine dayalı algoritma WebKB ve Konferans veri kümeleri üzerinde denenmiştir. Bu çalışmanın amacı çalışma zamanı performansını ve sınıflandırıcının etkinliğini iyileştirmek için sınıflandırma işlemi sırasında kullanılan nitelik sayısını azaltmaktır. Anahtar Kelimeler: Nitelik Seçimi, Karınca Kolonisi Optimizasyonu, Web Sayfası

Sınıflandırması

III

ACKNOWLEDGEMENTS

Foremost, I would like to express my sincere gratitude to my advisor Assist.

Prof. Dr. Selma Ayşe ÖZEL, for her supervision guidance, encouragements,

patience, motivation, useful suggestions and her valuable time for this work.

I would like to thank members of MSc thesis jury, Prof. Dr. Mehmet

TÜMAY and Assoc. Prof. Dr. Zekeriya TÜFEKÇİ, for their suggestions and

corrections.

My sincere thanks also goes to Nilgün Özgenç, Neslihan Gündoğdu and

Çiğdem İnan Acı, for their patience, motivation and useful suggestions.

Special thanks to my right hand Bengisu Özyeşildağ, for her endless support

and patience.

Last but not the least, I would like to thank my family: my parents Ayşe,

Şeref and my brothers Emre and Kemal Saraç, for their endless support and

encouragements for my life and career.

IV

CONTENTS PAGE

ABSTRACT ......................................................................................................... I

ÖZ ....................................................................................................................... II

ACKNOWLEDGEMENTS ............................................................................... III

CONTENTS……………………………………………………………………...IV

LIST OF TABLES............................................................................................. VI

LIST OF FIGURES .............................................................................................X

l. INTRODUCTION ............................................................................................ 1

2. PRELIMINARY WORK ................................................................................. 5

2.1. Preliminary Works in Web Page Classification ……………….………...5

2.2. Preliminary Works in Feature Selection…………………………..……...9

2.3. Preliminary Works in Nature inspired techniques in Web Page

Classification and Feature Selection.…………………………………....13

3. MATERIAL AND METHOD ........................................................................ 17

3.1. Material. ............................................................................................... 17

3.1.1. WebKB Dataset…………………………………………………..17

3.1.2. Conference Dataset………………………………………..…...... 18

3.1.3. Ant Colony Optimization ………………………….....……...…..18

3.1.4. Weka Data Mining Tool……………………….……………....…24

3.2. Method..………………..…………………………………………….....27

3.2.1. Construction of Dataset ……………………………………….....29

3.2.2. Feature Extraction …………………………………………….....30

3.2.3. Feature Selection ……………………………………………...…31

4. RESEARCH AND DISCUSSION .............................................................. …37

4.1. Classification Experiments With Only URL Addresses.……………….39

4.2. Classification Experiments With Only <title> Tags……..………...…...41

4.3. Classification Experiments With Bag of Terms Method..………...……43

4.3.1.Classification Experiments With Bag of Terms Method in 5%

Document Frequency Value…………………….………………...44

V

4.3.2. Classification Experiments With Bag of Terms Method in 10%

Document Frequency Value…………..…………………………..46


Document Frequency Value…………..…………………………..48

4.4. Classification Experiments With Tagged Terms Method……………....50

4.4.1. Classification Experiments With Tagged Terms Method in 5%

Document Frequency Value..…….……………………………….51


Document Frequency Value………......…………………………..54


Document Frequency Value…..………………………………......56

4.5. Comparison With C4.5………………………………………………….60

4.6. Comparison of the Proposed Method With Earlier Studies.……………61

5. CONCLUSION ............................................................................................. 63

REFERENCES……………………………………………………….…………..65

CURRICULUM VITAE….………………………………………….…………...73

VI

LIST OF TABLES PAGE

Table 3.1. Distribution of Each Class .................................................................... 17

Table 3.2. Distribution of Pages With Respect to Universities ............................... 18

Table 3.3. Distribution of Conference Dataset Pages ............................................. 18

Table 3.4. Train/Test Distribution of WebKB Dataset for Binary Class

Classification ........................................................................................ 29

Table 3.5. Train/Test Distribution of the Conference Dataset ................................ 30

Table 3.6. Number of Features for all Classes According to the Selected Tags ...... 31

Table 3.7. Number of Features for all Classes In Bag-of-Term Method with

Respect to Document Frequency Values ............................................... 32

Table 3.8. Number of Features for all Classes in Tagged Term Method with

Respect to Document Frequency Values ............................................... 32

Table 4.1. Classification Results of Student Class With Respect to 500 Epoch

Value .................................................................................................... 37

Table 4.2. Classification Results of Student Class With Respect to 250 Epoch

Value .................................................................................................... 38

Table 4.3. F-measures of NB, RBF and C4.5 Classifiers for the WebKB

Dataset ................................................................................................. 39

Table 4.4. Experimental Results Using URLs With 60 Features of All Classes ..... 40

Table 4.5. Experimental Results Using URL With 30 Features of All Classes ....... 40

Table 4.6. Experimental Results Using URLs With 10 Features of All Classes ..... 41

Table 4.7. Experimental Results Using <title> Tags With 60 Features of All

Classes ................................................................................................. 42


Classes ................................................................................................. 42


Classes ................................................................................................. 43

Table 4.10. Experimental Results Using Bag of Terms Method for Course

Class With 5% Document Frequency.................................................... 44

VII

Table 4.11. Experimental Results Using Bag of Terms Method for Project


Table 4.12. Experimental Results Using Bag of Terms Method for Student


Table 4.13. Experimental Results Using Bag of Terms Method for Faculty


Table 4.14. Experimental Results Using Bag of Terms Method for Conference



Class With 10% Document Frequency .................................................. 47



















Table 4.25. Number of Features For Each Tag With 5% Document Frequency

Value for Each Class ............................................................................ 51

VIII

Table 4.26. Experimental Results Using Tagged Terms Method for Course

Class With 5% Document Frequency ................................................... 52

Table 4.27. Experimental Results Using Tagged Terms Method for Project


Table 4.28. Experimental Results Using Tagged Terms Method for Student


Table 4.29. Experimental Results Using Tagged Terms Method for Faculty


Table 4.30. Experimental Results Using Tagged Terms Method for Conference



Value .................................................................................................... 54












Value .................................................................................................... 56






Class With 15% Document Frequency ................................................. 57

IX





Table 4.43. Distribution of Selected Features With Respect to Tags for Project

Classes When 15% Document Frequency is applied ............................. 59

Table 4.44. Distribution of the Selected Features With Respect to Tags for the

Best Cases ............................................................................................ 59

Table 4.45. Comparison of the Proposed ACO Feature Selection Algorithm

with C4.5 .............................................................................................. 60

X

LIST OF FIGURES PAGE

Figure 2.1. Binary classification ……………………………………………………..6

Figure 2.2. Multiclass classification ......................................................................... 7

Figure 3.1. Figure of Ant Behavior ......................................................................... 19

Figure 3.2. ACO Flow Chart .................................................................................. 24

Figure 3.3. The starting graphical user interface of Weka ....................................... 25

Figure 3.4. Explorer Environment of Weka ............................................................ 26

Figure 3.5. iris.arff File as an Example of arff File ................................................. 27

Figure 3.6. Architecture of The Proposed System ................................................... 28

Figure 3.7. Flow Chart of Proposed ACO Algorithm .............................................. 34

Figure 3.8. An Instance of an arff File .................................................................... 35

Figure 3.9. Pseudo Code of General C4.5 algorithm ............................................... 36

1. INTRODUCTION Esra SARAÇ

1

1. INTRODUCTION

The rapid growth of the Internet use and the developments in the

communication technologies, have caused rapid increase in the amount of online text

information. As a result of this it became difficult to manage the huge amount of

online information. To solve this problem, a lot of new techniques have been

developed and used by search engines. Several studies are done to give more

accurate and fast results to users. One of the most important studies in this area is

text classification. Text categorization or classification, which is widely used by

search engines, is one of the key techniques for handling and organizing text data.

The aim of text categorization is to classify documents into a certain number

of pre-defined categories by using document features. Text classification plays a

crucial role in many information retrieval and management tasks. These tasks are;

information retrieval, information extraction, document filtering and building

hierarchical Web directories (Qi and Davison, 2009). When text classification

focuses on Web pages it is named as Web Classification, or Web Page Classification.

However, Web pages are different from text, and they contain a lot of information,

such as URLs, links, HTML tags which are not supported by text documents.

Because of this distinction, Web classification is different from traditional text

classification (Qi and Davison, 2009).

On the Web, page classification is essential to topic-specific Web link

selection, to analysis of the topical structure of the Web, to development of Web

directories, and to focused crawling (Qi and Davison, 2009). People manually

constructed some Web directories such as the Yahoo! (http://www.yahoo.com/) and

the Open Directory Project (www.dmoz.org), and manually labeled documents for

classification. However, manual classification is time consuming and needs a lot of

human effort, which makes it un-scalable with respect to the high growing speed of

the Web. Therefore, there has ensued great need for automated Web page

classification systems. In addition to time benefits of automated Web page

classification, by this way classification results have become more clear and

http://www.yahoo.com/)

http://www.dmoz.org)


2

objective with respect to manual classification due to the subjectivity of human

experts.

The first examples of the text classification systems were automated text

indexing systems, and they were studied in 70’s (Salton, 1970). Then these systems

were supported with machine learning techniques to improve the classification

performance. A major problem of the text classification is the high dimensionality of

the feature space. We need to select proper subsets of features from the original

feature space to reduce the dimensionality of feature space and to improve the

efficiency and performance of the classifier (Shang et al., 2007). Several approaches

have been applied to select proper features (Yu and Liddy, 1999). Some of these

methods are Document Frequency, Information Gain, Mutual Information and x2

Test (Han and Kamber, 2006).

Feature extraction methods have become an important issue for classification

performance. Therefore, too many studies have been made in this area. In the past,

researchers focused on classification of the text files. But as a result of increased

online documents, in recent years interest in Web categorization has rapidly grown

(Qi and Davison, 2009). Classification of Web page content is essential to many

tasks in Web information retrieval such as maintaining Web directories, focused

crawling, question-answering systems and etc (Qi and Davison, 2009). Since a Web

page has more features than a text document, feature selection becomes more

important for Web Page Classification than for typical text classification.

To select best features, application of machine learning techniques is one of

the most popular techniques. Machine learning algorithms are separated into two

main categories (Mitchell, 1997). These are supervised learning and unsupervised

learning. In supervised learning; a global model that maps input objects to desired

outputs is generated (Mitchell, 1997). Support vector machines, k-Nearest Neighbors

and Naïve Bayes algorithms are the examples of the supervised learning. The task of

the supervised learner is to predict the value of the function for any valid input object

after having seen a number of training examples (Mitchell, 1997). To achieve this,

the learner has to generalize from the presented data to unseen situations in a

reasonable way. In unsupervised learning on the other hand, all the observations are


3

assumed to be caused by latent variables, that is, the observations are assumed to be

at the end of the casual chain (Mitchell, 1997). The Self Organizing Maps are

commonly used unsupervised learning algorithms (Haykin, 1999). Supervised

learning requires that the target variable is well defined and that a sufficient number

of its values are given. For unsupervised learning typically either the target variable

is unknown or has only been recorded for too small number of cases (Mitchell,

1997).

Up till today, several methods and solutions have been developed for feature

extraction problems. However, many of these methods are proposed for text

classification. For classifying Web pages which have a lot of features, there are only

a few studies for selecting features. With the contribution of the structural properties

of Web pages such as HTML tags and headers, an Ant Colony Optimization (ACO)

algorithm is thought to be a different method for extracting features. The purpose of

this study is to find the best features in HTML pages by using an optimization

algorithm to improve performance of the existing Web page classification systems.

In order to choose the best features, the Ant Colony Optimization technique, which

was developed to solve optimization problems, will be used.


4

2.PRELIMINARY WORK Esra SARAÇ

5

2. PRELIMINARY WORK

2.1. Preliminary Works in Web Page Classification

Classification is the process of assigning predefined appropriate class labels

to the available data. For this purpose, a set of labeled data is used to train a classifier

which is then used for labeling unseen data. This classification process is also

defined as a supervised learning (Mitchell, 1997).

The process is not different in Web page classification, there is one or more

predefined class labels and a classification model assigns Web pages to one or more

predefined class labels. Web page classification assigns a label from a predefined set

of labels to a Web page. Web pages, which are in fact hypertext, have many features

such as textual tokens, markup tags, URLs, host names in URLs and all these

features could be meaningful for classifiers. Therefore, Web page classification has

several differences from traditional text classification.

Web page classification has some subfields like subject classification and

functional classification (Qi and Davison, 2009). In subject classification, classifier is

concerned with the content of a Web page and tries to determine the “subject” of the

Web page. For example, categories of online newspapers like finance, sport,

technology, are instances of subject classification. Functional classification is

concerned with function or type of the Web page. For example, determining a page is

a “personal homepage” or a “course page” is an instance of a functional

classification. These two types of classification are most popular classification types

(Qi and Davison, 2009).

Classification can be divided into binary classification and multiclass

classification according to the number of classes (Qi and Davison, 2009). In binary

classification there is only one class label. Classifier looks an instance and assigns it

into the specific class or not. Instances of the specific class are called as relevant, and

the others are named as non-relevant instances. Binary class classification process is

presented in Figure 2.1.


6

Figure 2.1 Binary classification (Qi and Davison, 2009)

If there are more than one class, this type of classification is called as multiclass

classification (Qi and Davison, 2009). The classifier also assigns an instance to one

of the multiple classes. Multiclass classification can be viewed in Figure 2.2.

In our problem, we studied on Web content classification which is called as

subject classification, and we focused on binary classification. Binary classification

is the basis of the focused crawler (Chakrabarti et al., 1999) or topical crawler

(Menczer and Belew, 1998). The aim of a focused crawler is to increase the search

engine performance by crawling and indexing Web pages about a specific topic. To

achieve this goal a focused crawler needs to determine whether a Web page is on the

specific class or not.


7

Figure 2.2 Multiclass classification (Qi and Davison, 2009)

Web content classification has a lot of benefits for information retrieval tasks.

By using classification techniques, the performance of information retrieval

processes can be improved. As an example Web directories are the platforms which

include predefined set of categories that are available for browsing information. The

most popular examples of Web directories are Yahoo! (http://www.yahoo.com) and

Open Directory Project (http://www.dmoz.org). Constructing, maintaining or

expanding Web directories manually need extensive human effort. Huang et al.

(2004) proposed an approach which is named as Liveclassifier for the automatic

creation of classifiers from Web corpora based on user-defined hierarchies. They

have presented a system that can automatically extract corpora from the Web to train

classifiers. They used Vector Space Model to describe the features. They used tf*idf

method to describe similarities between features and classes. Based on the two basic

methods, they developed their own method which was named as LiveClassifier.

According to this study, the main merits of LiveClassifier are its wide adaptability

and its flexibility. The classifier can be created by defining a topic hierarchy. The

http://www.yahoo.com)



8

necessary corpora can be fetched and organized automatically, promptly, and

effectively. By making automatic classification; construction, maintenance and

expansion of Web directories become more effective, and automatic classification

techniques improve the performance of Web directories (Huang et al., 2004).

Improving search results quality is another example of the benefits of Web

content classification. Query terms are important for search results. If user can not

select the right term, the returned search results would be meaningless. If the selected

term has multiple meanings, query results would be irrelevant with the search term.

For example, query term “bank” could mean “the border of a body of a water” or

“financial establishment” (Qi and Davison, 2009). Various approaches have been

proposed to improve retrieval quality by disambiguating query terms. Chekuri et al.

(1997) have studied automatic Web page classification to increase the precision of

Web search. In this study, at query time, the user is asked to specify one or more

desired categories so that only the results in those categories are returned, or the

search engine returns a list of categories under which the pages would fall. This

approach works when the user is looking for a known item. In such a case, it is not

difficult to specify the preferred categories. However, there are situations in which

the user is less certain about which documents will match; in this case the above

approach does not help much.

Also as a solution for ranking problem, Page and Brin (1997) developed the

link-based ranking algorithm called the PageRank. The PageRank was developed

at Stanford University as a part of a research project about a new kind of search

engine (Page and Brin, 1997). They had the idea that information on the Web could

be ordered in a hierarchy by "link popularity": a page is ranked higher as there are

more links to it. PageRank calculates the authoritativeness of Web pages based on a

graph constructed by Web pages and their hyperlinks, without considering the topic

of each page. Since then, much research has been explored to differentiate authorities

of different topics. Haveliwala et al. (2003) have proposed Topic-sensitive

PageRank, which performs multiple PageRank calculations, one for each topic.

When computing the PageRank score for each category, the random surfer jumps to

a page in that category at random rather than just any Web page. This has the effect


9

of biasing the PageRank to that topic. This approach needs a set of pages that are

accurately classified.

If only domain-specific queries are expected, performing a full crawl is

usually inefficient. Focused crawling which was proposed by Chakrabarti et al.

(1999), is an approach that crawls documents only in a small part of the Web. A

focused crawler analyzes its crawl boundary to find the links that are likely to be the

most relevant for the crawl, and avoids irrelevant regions of the Web. It has

predefined set of topics and these sets construct crawling area. In this approach, a

classifier is used to evaluate the relevance of a Web page to the given topics so as to

provide evidence for the crawl boundary. The proposed algorithm consists of two

parts: a classifier that evaluates the relevance of a hypertext document with respect to

the focus topics, and a distiller that identifies hypertext nodes that are great access

points to many relevant pages within a few links (Chakrabarti et al., 1999).

Good quality document summarization can accurately represent the major

topic of a Web page. Shen et al. (2004), proposed an approach to classify Web pages

through summarization. They showed that classifying Web pages on their summaries

is able to improve the accuracy by around 10% as compared with content based

classifiers.

2.2.Preliminary Works in Feature Selection

Feature selection is the one of the most important steps in classification

systems. Web pages are generally in HTML format. This means that Web pages are

not in a fully structured format. They are semi-structured since they contain HTML

tags and hyperlinks in addition to pure text. Because of this, feature selection in Web

page classification is different than traditional text classification. Feature selection is

commonly used to reduce dimensionality of datasets with tens or hundreds of

thousands of features which would be impossible to process further. A major

problem of Web page classification is the high dimensionality of the feature space.

The best feature subset contains the least number of features that most contribute to

accuracy and efficiency.


10

To improve the performance of Web page classification, several approaches

that are imported from feature selection for text classification have been applied to

the problem of feature selection for Web page classification. In addition to traditional

feature selection methods, machine learning techniques are also popular algorithms

to be used for feature selection.

A lot of feature scoring measures have been proposed. Information Gain

(Mitchell, 1997), Mutual Information (Shannon, 1948), Document Frequency (Yang

and Pedersen, 1997), and Term Strength (Chakrabarti, 2002) techniques are the most

popular traditional techniques. Information gain (IG) measures the amount of

information in bits about the class prediction, if the only information available is the

presence of a feature and the corresponding class distribution. Concretely, it

measures the expected reduction in entropy (Mitchell, 1997). Mutual information

(MI) was first introduced by Shannon (1948) in the context of digital

communications between discrete random variables and was generalized to

continuous random variables. Mutual information is considered as an acceptable

meter of relevance between two random variables (Cover and Thomas, 1991).

Mutual Information method is a probabilistic method which measures how much

information the presence/absence of a term contributes to making the correct

classification decision on a class (Guiasu and Silviu, 1977). Document frequency

(DF) is the number of documents in which a term occurs in a dataset. It is the

simplest criterion for term selection and easily scales to a large dataset with linear

computation complexity. It is a simple but effective feature selection method for text

categorization (Yang and Pedersen, 1997). Term strength (TS) was proposed and

evaluated by Wilbur and Sirotkin (1992) for vocabulary reduction in text retrieval.

Term strength is also used in text categorization (Yang, 1995: Yang and Wilbur,

1996). This method predicts term importance based on how commonly a term is

likely to appear in “closely-related” documents. TS uses a training set of documents

to derive document pairs whose similarity measured using the cosine value of the

two document vectors is above a threshold. Then, “Term Strength” is computed

based on the predicted conditional probability that a term occurs in the second half of

a pair of related documents given that it occurs in the first half. The above methods


11

namely the IG, the DF, the MI, and the TS are compared by Yang and Pedersen

(1997). They used kNN classifier on the Reuters

(http://archive.ics.uci.edu/ml/databases/reuters) corpus. According to this study

(Yang and Pedersen, 1997), IG is the most effective method with 98% feature

reduction, DF is the simplest method with the lowest cost in computation and it can

be credibly used instead of IG if computation of these measures is too expensive.

Kwon and Lee (2000, 2003), proposed classifying Web pages using a

modified k-Nearest Neighbor algorithm, in which terms within different tags are

given different weights. At the k-Nearest Neighbor algorithm a constant k value is

selected. They divided all the HTML tags into three groups and assigned each group

a random weight. Thus, utilizing tags can take advantage of the structural

information embedded in the HTML files, which is generally ignored by plain text

approaches. However, since most HTML tags are oriented toward presentation rather

than semantics, Web page authors may generate different but conceptually equivalent

tag structures. Therefore, using HTML tagging information in Web classification

may suffer from the inconsistent formation of HTML documents. In Kwon and Lee’s

(2000) modified k-nearest neighbor algorithm features are selected using two well-

known metrics: expected mutual information and mutual information. They also

weighted terms according to the HTML tags that the term appears. This means that

terms within different tags bear different importance. k-Nearest Neighbor (kNN)

classifiers require a document dissimilarity measure to quantify the distance between

a test document and each training document. They replace the traditional cosine

measure by their own similarity measure which is called as “matching factor”. They

called the number of the matching terms as the "matching factor". The proposed

similarity measure was modified from the traditional similarity measure to use the

matching factor. The intuition behind their similarity measure is that frequently co-

occurring terms constrain the semantic concept of each other. The more co-occurred

terms any two documents have in common, the stronger the relationship between the

two documents. According to the experimental results, with only cosine similarity

measure, micro averaging breakeven point was reported as 18.23% and with only

inner product method micro averaging breakeven point was reported as 19.74%. If

http://archive.ics.uci.edu/ml/databases/reuters)


12

matching factor was used with these two similarity methods, results were reported as

19.23% and 20.02%, respectively.

Chakrabarti et al. (1998), proposed a term-based classifier which uses a score-

based function for feature selection. The proposed algorithm provides new

techniques for automatically classifying hypertext into a given topic hierarchy, using

information latent in hyperlinks. There is much information in the hyperlink

neighborhood of a document. The iterative relaxation technique bootstraps off a text-

based classifier and then uses both local text in a document, as well as the

distribution of the estimated classes of other documents in its neighborhood, to refine

the class distribution of the document being classified. They applied this approach to

Yahoo! directory. They converted documents into bag-of-words format. According

to this study, the proposed algorithm is able to improve the accuracy from 32% to

75%. Using even a small neighborhood around the test document significantly

boosts classification accuracy, reducing error up to 62% from text-based classifiers.

Rather than deriving information from the page content, Kan and Thi (2005),

demonstrated that a Web page can be classified based on its URL. They inspired

from Kan (2004)’s study. Although accuracy is not high, this approach eliminates the

necessity of downloading the page. Therefore, it is especially useful when the page

content is not available or time/space efficiency is strictly emphasized. Performance

of the proposed study was measured both by accuracy and macro F1. According to the

experimental results, accuracy value was reported as 76.18% and macro F1 value was

reported as 0.525. In addition to these results, their URL-only method achieved about

95% of the performance of the full text methods. A similar URL based study is

proposed by Baykan et al. (2009), and it is about binary classification of Web pages.

In this study, features are extracted from URL addresses of Web pages. They have

used a support vector machine (SVM) and an n-gram technique to classify Open

Directory Project (ODP) (www.dmoz.org) Web pages. According to the

experimental results, F-measure values are between 80% and 85%.

Most supervised learning approaches only learn from training examples. Co-

training, which is introduced by Blum and Mitchell (1998), is an approach that

makes use of both labeled and unlabeled data to achieve better accuracy. In a binary



13

classification scenario, two classifiers that are trained on different sets of features are

used to classify the unlabeled instances. The prediction of each classifier is used to

train the other. Compared with the approach which only uses the labeled data, this

co-training approach is able to cut the error rate by half. Ghani (2001: 2002) adopted

this approach to multi-class problems. The results showed that co-training does not

improve accuracy when there is a large number of categories. Classification usually

requires manually labeled positive and negative examples. Yu et al. (2004) devised

an SVM-based approach to eliminate the need for manual collection of negative

examples while still retaining similar classification accuracy. Given positive data and

unlabeled data, their algorithm is able to identify the most important positive

features. Using these positive features, it filters out possible positive examples from

the unlabeled data, which leaves only negative examples. An SVM classifier could

then be trained on the labeled positive examples and the filtered negative examples.

2.3.Preliminary Works in Nature inspired techniques in Web Page Classification

and Feature Selection

Nature inspired techniques including genetic algorithm (GA), ant colony

optimization (ACO) and particle swarm optimization (PSO) algorithms have also

been proposed for text and Web classification problems.

Gordon (1988) has used GAs to find best document illustrator for each user

specified document according to the queries used and the relevance judgments made

during the retrieval process. This is one of the earliest studies on application of GAs

to information retrieval domain. Chen and Kim (1995) have proposed a hybrid GA

and neural network based system, called GANNET. They used a GA to choose the

best keywords which describe user-selected documents, and with a neural network,

weights for the keywords are determined. Additionally, Boughanem et al. (1999)

have applied a GA-based technique to optimize document descriptions and to

improve query formulations. Ribeiro et al. (2003) proposed a Web page classifier

which is based on rule extraction. They labeled Web pages as ‘‘Not Relevance”,

‘‘Medium Relevance”, and ‘‘Extreme Relevance” with fuzzy membership function.


14

They used both Navie Bayes and a GA for classification. In their study, fuzzy

membership function performed better with a Naive Bayes classifier than with a GA-

based classifier. Liu and Huang (2003) have proposed a semi-supervised fuzzy

clustering algorithm based on a GA. Both labeled and unlabeled documents are taken

together to derive a classifier. Each document is represented as tf-idf weighted word

frequency vector, and stemming and stopword removal are not used. HTML tags are

also not considered. Liu and Huang (2003) compared their classifier with Naive

Bayes, and observed gain in classification accuracy. Özel (2010) has proposed a Web

page classification which is based on a GA. In the proposed approach both HTML

tags and terms are used as features and optimal weights of features are learned with a

GA. The proposed method was compared with the Navie Bayes and the KNN

algorithms. The accuracy of the proposed approach is higher than the compared

algorithms.

AntMiner (Parpinell et al., 2002) is the first method that uses the ACO in the

classification domain. Holden and Freitas (2004) inspired from AntMiner (Parpinell

et al., 2002). Holden and Freitas (2004) make use of the Ant Colony paradigm to find

a set of rules that classify the Web pages into several categories. They made no prior

assumptions which words in the Web pages to be classified were to be used as

potential discriminators. To reduce data rarity, they used stemming; a technique with

different grammatical forms of a root word can be considered equivalent such as

help, helping, helped. Holden and Freitas (2004) also gathered sets of words if they

were closely related in the WordNet electronic thesaurus. Holden and Freitas (2004)

compared their Ant_Miner with the rule inference algorithms C4.5 and CN2. They

found that Ant_Miner was comparable in accuracy, and formed simpler rules. The

best result of Holden and Freitas’s study is 81.0% of accuracy when using WordNet

generalization with Title features. Aghdam et al. (2009) have proposed an ACO

based Feature selection algorithm for text classification. The performance of the

proposed algorithm is compared with the performance of a genetic algorithm,

information gain and CHI on the task of feature selection in Reuters-21578 dataset

(http://archive.ics.uci.edu/ml/databases/reuters21578). Their simulation results on

Reuters-21578 dataset showed the superiority of the proposed algorithm.

http://archive.ics.uci.edu/ml/databases/reuters21578)


15

Another nature inspired algorithm which is used in Web page classification

problem is the PSO. Wang et al. (2007) have used the PSO in Web page

classification task as a classifier method. They have used entropy weighting for

feature selection and PSO for classification on the Reuter

(http://archive.ics.uci.edu/ml/databases/reuters21578) and the TREC

(http://trec.nist.gov/data.html) data sets. The proposed algorithm yields much better

performance than other conventional algorithms. Liangtu and Xiaoming (2007) have

used the PSO in Web text feature extraction problem. They have used Vector Space

Model (VSM) as description of Web text. The algorithm is based on the PSO with

reverse thinking particles and structure of the particles is also improved. According

to their experimental results, their study’s accuracy value changes between 88.9%

and 96.1% depending upon the classified text size.

http://archive.ics.uci.edu/ml/databases/reuters21578)

http://trec.nist.gov/data.html)


16

3.MATERIAL AND METHOD Esra SARAÇ

17

3. MATERIAL AND METHOD

3. 1. Material

This section includes explanations about the datasets namely the WebKB

and the Conference datasets; the Weka classification environment and the ACO

method which were used in the proposed algorithm.

3.1.1. WebKB Dataset

The WebKB dataset is a set of Web pages collected by the World Wide

Knowledge Base (Web->Kb) project of the CMU (http://www.cs.cmu.edu) text

learning group, and were downloaded from The 4 Universities Dataset Homepage

(http://www.cs.cmu.edu/~webkb/). These pages were collected from computer

science departments of various universities in 1997, and manually classified into

seven different classes namely: student, faculty, staff, department, course, project,

and other. For each class, the collection contains pages from four universities

which are Cornell, Texas, Washington, Wisconsin universities, and other

miscellaneous pages collected from other universities.

The 8,282 pages were manually classified into the seven categories such

that the student category has 1641 pages, faculty has 1124, staff has 137,

department has 182, course has 930, project has 504, and other contains 3764

pages (Table 3.1). The class other is a collection of pages that are not deemed as

the “main page” and are not representing an instance of the previous six classes.

Table 3.1 Distribution of Each Class Class Student Faculty Staff Department Course Project Other # of

Pages 1641 1124 137 182 930 504 3764

The WebKB dataset includes 867 Web pages from Cornell University, 827

pages from Texas University, 1205 pages from Washington University, 1263

http://www.cs.cmu.edu)

http://www.cs.cmu.edu/~webkb/)


18

pages from Wisconsin University and finally 4,120 miscellaneous pages from

other universities (Table 3.2.).

Table 3.2. Distribution of Pages With Respect to Universities University Cornell Texas Washington Wisconsin Other

# of Pages 867 827 1205 1263 4, 120

3.1.2. Conference Dataset

The Conference dataset consists of the Computer Science related

conference homepages that were obtained from the DBLP web site

(http://www.informatik.uni-trier.de/~ley/db/). The conference Web pages were

labeled as positive documents in the dataset. To complete the dataset, the short

names of the conferences were queried using the Google search engine manually

(http://www.google.com), and the irrelevant pages in the result set were taken as

negative documents. The dataset consists of 824 relevant pages and 1545

irrelevant pages which are approximately 3 times of the relevant pages.

Distribution of Conference dataset pages is seen in Table 3.3.

Table 3.3. Distribution of Conference Dataset Pages

Relevant Pages

Non-relevant Pages

Conference 824 1545

3.1.3. Ant Colony Optimization

Ant Colony Optimization (ACO) studies artificial systems that take

inspiration from the behavior of real ant colonies and it is used to solve discrete

optimization problems. In 1999, the Ant Colony Optimization metaheuristic was

defined by Dorigo et al. (Dorigo et al., 1999). The first ACO system was

introduced by Marco Dorigo in his Ph.D. thesis (1992), and was called as Ant

System (AS). The AS is the result of a research on computational intelligence

http://www.informatik.uni-trier.de/~ley/db/)

http://www.google.com)


19

approaches to combinatorial optimization (Dorigo et al., 1992). The AS was

initially applied to the travelling salesman problem, and to the quadratic

assignment problem.

The original AS was motivated by the natural phenomenon that ants deposit

pheromone on the ground in order to mark some favorable path that should be

followed by other members of the colony.

Natural behaviors of ants are shown in Figure 3.1. The aim of the colony is

to find the shortest path between a food source and the nest. The behaviors of ants

can be listed as follows:

1. The first ant finds the food source (Food), via any way (a), then returns to

the nest (Nest), leaving behind a trail pheromone (b)

2. Ants indiscriminately follow four possible ways, but the strengthening of

the runway makes it more attractive as the shortest tour.

3. Ants take the shortest tour; long portions of other ways lose their trail

pheromones.

Figure 3.1. Figure of Ant Behavior

(http://en.wikipedia.org/wiki/Ant_colony_optimization)

Ants use the environment as a medium of communication. They exchange

information indirectly by depositing pheromones, all detailing the status of their

work. The information exchanged has a local scope, only an ant located where the

http://en.wikipedia.org/wiki/Ant_colony_optimization)


20

pheromones were left has a notion of them. The mechanism is a good example of

a self-organized system. This system is based on positive feedback (the deposit of

pheromone attracts other ants that will strengthen it themselves) and negative

feedback (dissipation of the route by evaporation prevents the system from

thrashing). Theoretically, if the quantity of pheromone remained the same over

time on all edges, no route would be chosen. However, because of feedback, a

slight variation on an edge will be amplified and thus allow the choice of an edge.

The algorithm will move from an unstable state in which no edge is stronger than

another, to a stable state where the route is composed of the strongest edges.

The basic philosophy of the algorithm involves the movement of a colony of

ants through the different states of the problem influenced by two local decision

policies, viz., trails and attractiveness. Thereby, each such ant incrementally

constructs a solution to the problem. When an ant completes a solution, or during

the construction phase, the ant evaluates the solution and modifies the trail value

on the components used in its solution. This pheromone information will direct

the search of the future ants. Furthermore, the algorithm also includes two more

mechanisms, trail evaporation and daemon actions. Trail evaporation reduces all

trail values over time thereby avoiding any possibilities of getting stuck in local

optima. The daemon actions are used to bias the search process from a non-local

perspective.

ACO algorithms can be applied to any optimization problem, for which the

following problem-dependent aspects can be defined (Bonabeau et al., 1999:

Dorigo et al., 1996):

• An appropriate graph representation to represent the discrete search space.

A graph should accurately represent all states and transitions between

states. A solution representation scheme also has to be defined.

• Positive feedback process; that is, a mechanism to update pheromone

concentrations such that current successes positively influence feature

solution construction.

• A constraint-satisfaction method to ensure that only feasible solutions are

constructed.


21

• A solution construction method which defines the way in which solutions

are built and a state transition probability.

The first ACO algorithm was applied to the travelling salesman problem

(TSP). This is the problem of finding the shortest tour visiting all the nodes of a

fully-connected graph, the nodes of which represent locations, and the arcs

represent a path with an associated cost (normally assumed to be distance). This

problem has a clear analogy with the shortest path finding ability of real ants, and

is also a widely studied NP-hard combinatorial optimization problem. Dorigo

applied an ACO to the TSP with his ‘Ant System’ (AS) approach (Dorigo, 1992).

In AS each (artificial) ant is placed on a randomly chosen city, and has a memory

which stores information about its route so far (a partial solution), initially this is

only the starting point. Setting off from its starting city an ant builds a complete

tour by probabilistically selecting cities to move next until all cities have been

visited. While at city i, an ant k picks an unvisited city j with a probability given

by equation 3.1.

( , ) = [ ( , )] ∗[ŋ( , )] ∑ [ ( , )] ∈ ∗[ŋ( , )] (3.1)

In equation 3.1, ŋ(i,j)=1/d(i,j), where d(i,j) is the distance between cities i

and j, and represents heuristic information available to the ants. is the

‘feasible’ neighborhood of ant k, that is all cities as yet unvisited by ant k. τ(i,j) is

the pheromone trail value between cities i and j. α and β are parameters which

determine the relative influence of the heuristic and pheromone information, if α

is 0 the ants will effectively be performing a stochastic greedy search using the

‘nearest-neighbor’ heuristic, if β is 0 then the ants use only pheromone

information to build their tours. After all the ants have built a complete tour the

pheromone trail is updated according to the global update rule defined in equation

3.2.

( , ) = ∗ ( , ) + ∑ ∆ ( , ) (3.2)


22

where ρ denotes a pheromone evaporation parameter which decays the pheromone

trail (and thus implements a means of ‘forgetting’ solutions which are not

reinforced often), and m is the number of ants. The specific amount of pheromone,

∆τk(i,j), that each ant k deposits on the trail is given by equation 3.3.

∆ ( , ) = 1 ⁄0 ( , ) (3.3)

In equation 3.3 is the length of ant k’s tour. This means that the shorter

the ant’s tour, the more pheromone will be deposited on the arcs used in the tour,

and these arcs will thus be more likely to be selected in the next iteration.

The algorithm iterates through each of these stages until the termination

criteria are met. The results from the AS were encouraging, but it did not improve

on state of the art approaches. This original system has since been adapted in

various ways, including using an elitist strategy which only allows the iteration of

global best ant to leave pheromone (known as Max-Min AS (Stützle and Hoos,

2000)), and allowing the ants to leave pheromone as they build the solution

(Dorigo and Stützle, 2002). Since the AS and its adaptation the ACO approach

have been modified and applied to many different problems, the implementation

details have moved some way from the original biological analogy. Dorigo and

Stützle (2002) provide a useful distillation of the ideas of various approaches to

implementing ACO algorithms, and describe the main features that any ACO

approach must define for a new problem. The five main requirements are

identified below (Dorigo and Stützle, 2002):

• A heuristic function ŋ( ), which will guide the ants’ search with problem

specific information.

• A pheromone trail definition, which states what information, is to be

stored in the pheromone trail. This allows the ants to share information

about good solutions.

• The pheromone update rule, this defines the way in which good solutions

are reinforced in the pheromone trail.


23

• A fitness function which determines the quality of a particular ant’s

solution.

• A construction procedure that the ants follow as they build their solutions

(this also tends to be problem specific).

The overall process of ACO can be seen in Figure 3.2. The process begins

by generating a number of ants which are then placed randomly on the graph.

Alternatively, the number of ants to place on the graph may be set equal to the

number of nodes within the data; each ant starts path construction at a different

node. From these initial positions, they traverse nodes probabilistically until a

traversal stopping criterion is satisfied. The resulting subsets are gathered and then

evaluated. If an optimal subset has been found or the algorithm has executed a

certain number of times, then the process halts and outputs the best solution

encountered. If none of these conditions hold, then the pheromone is updated, a

new set of ants are created and the process iterates once more.

The solution construction process is stochastic and is biased by a pheromone

model, that is, a set of parameters associated with graph components (either nodes

or edges) whose values are modified at runtime by the ants. Thus, ACO algorithm

can be applied to any problem which can be represented as a graph or a node

format. Basic flow of the algorithm is illustrated in Figure 3.2. Diversity of

algorithm comes from the different stochastic selection methods which are used in

subset evaluation section in the flow chart. Roulette wheel and ranking method

are the examples of these stochastic selection methods (Bäck and Thomas, 1996).

In roulette wheel selection algorithm, which is a stochastic algorithm, the

individuals are mapped to contiguous segments of a line, such that each

individual's segment is equal in size to its fitness. A random number is generated

and the individual whose segment spans the random number is selected. The

process is repeated until the desired number of individuals is obtained. This

technique is similar to a roulette wheel with each slice proportional in size to the

fitness. In rank-based selection algorithms, the population is sorted according to

the objective values. The fitness assigned to each individual depends only on its

position in the individuals rank and not on the actual objective value. After


24

sorting, a random number is generated and the individual whose order spans the

random number is selected.

Figure 3.2. ACO Flow Chart

3.1.4. Weka Data Mining Tool

Weka (Waikato Environment for Knowledge Analysis)

(http://www.cs.waikato.ac.nz/ml/weka) is a popular suite of machine learning

software written in Java, developed at the University of Waikato, New Zealand.

Weka is a free software available under the GNU General Public License. The

starting graphical user interface of Weka is shown in Figure 3.3. Weka supports

http://www.cs.waikato.ac.nz/ml/weka)


25

several standard data mining tasks, more specifically, data preprocessing,

clustering, classification, regression, visualization, and feature selection.

Figure 3.3. The starting graphical user interface of Weka

Weka has 4 different working modes namely Explorer, Experimenter,

KnowledgeFlow and Simple CLI (Witten and Frank, 2005). Simple CLI

provides a simple command-line interface that allows direct execution of Weka

commands for operating systems that do not provide their own command line

interface. Experimenter is an environment for performing experiments and

conducting statistical tests between learning schemes. Knowledge Flow supports

essentially the same functions as the Explorer but with a drag-and-drop interface.

One advantage is that it supports incremental learning. Explorer is an

environment for exploring data with Weka (Witten and Frank, 2005). Explorer is

the most frequently used environment of Weka and it is shown in Figure 3.4.


26

Figure 3.4. Explorer Environment of Weka

Explorer environment has 7 different tabs namely Preprocess, Classify,

Cluster, Associate, Select attributes and Visualize. In the Preprocess tab, data to

be analyzed is chosen and modified. The 71 algorithms available in Classify tab of

Weka are grouped into 6 categories, namely, Bayes (Bayesian algorithms),

Functions (function algorithms such as logistic regression and SVMs), Lazy (lazy

algorithms or instance based learners), Meta (algorithms that combine several

models and in some cases models from different algorithms), Trees

(classification/regression tree algorithms) and Rules (rule based algorithms). For

clustering, Cluster tab can be used to learn clusters for the data. Association rules

are learned in Associate tab. Select attributes tab includes attribute selection

methods. Finally, with Visualize tab, 2D plot of the data can be viewed.

In order to maintain format independence, data is converted to an

intermediate representation called ARFF (Attribute Relation File Format). ARFF

files contain blocks describing relations and their attributes, together with all the

instances of the relation and there are often very many of these. They are stored as

plain text for ease of manipulation. Relations are simply a single word or string


27

naming of the concept to be learned. Each attribute has a name, a data type (which

must be one of enumerated, real or integer) and a value range (enumerations for

nominal data, intervals for numeric data). The instances of the relation are

provided in comma-separated form to simplify interaction with spreadsheets and

databases. Missing or unknown values are specified by the ‘?’ character. An

example arff file is shown in Figure 3.5. In this study, Weka was used in the

classification phase.

Figure 3.5. iris.arff File as an Example of arff File

(http://www.cs.waikato.ac.nz/ml/weka)

3.2. Method

This section includes explanations about the proposed algorithm. Detailed

information about the datasets used, the ACO algorithm applied for feature

selection and the classification algorithm used for fitness function evaluation were

given in this section. General steps of the proposed system are shown in Figure

3.6.

@RELATION iris @ATTRIBUTE sepallength REAL @ATTRIBUTE sepalwidth REAL @ATTRIBUTE petallength REAL @ATTRIBUTE petalwidth REAL @ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica} @DATA 5.1,3.5,1.4,0.2,Iris-setosa 4.9,3.0,1.4,0.2,Iris-setosa 4.7,3.2,1.3,0.2,Iris-setosa 4.6,3.1,1.5,0.2,Iris-setosa 6.3,2.5,5.0,1.9,Iris-virginica 6.5,3.0,5.2,2.0,Iris-virginica 6.2,3.4,5.4,2.3,Iris-virginica 5.9,3.0,5.1,1.8,Iris-virginica

http://www.cs.waikato.ac.nz/ml/weka)


28

Figure 3.6. Architecture of The Proposed System

The proposed method includes four main steps namely construction of

datasets, feature extraction, feature selection and classification. In dataset

construction phase, datasets are prepared according to binary class classification

problem. After preparation step, features are extracted from these datasets. In

feature selection phase, a subset of features is selected among the extracted

features. And finally selected features are sent to Weka for classification. Feature

selection and classification phases are repeated until the best feature set is

selected. These steps are explained in detail in the following sections.


29

3.2.1. Construction of Dataset

In this work, two different datasets namely the WebKB and the Conference

datasets were used. The first of these is the WebKB which was described in

section 3.1.1. From the WebKB dataset Project, Faculty, Student and Course

classes were used in this study. As Staff and Department classes have insufficient

number of positive examples, they were not considered in this thesis. Training and

test datasets were constructed as described in the WebKB project Web site

(http://www.cs.cmu.edu). For each class, training set includes relevant pages

which belong to randomly chosen three universities, and other class of the dataset.

The fourth university’s pages were used in the test phase. Approximately 75%

of the irrelevant pages were added into the training set and 25% of the

irrelevant pages were included into the test set. The detailed information about the

dataset, which was used in this study, is given in Table 3.4. For

example, the Course class includes 846 relevant and 2822 irrelevant pages for the

training phase, in the test phase of Course class, 86 pages were relevant

and 942 pages were irrelevant.

Table 3.4. Train/Test Distribution of WebKB Dataset for Binary Class Classification

Train Relevant/Non-relevant

Test Relevant/Non-relevant

Course 846 / 2822 86 / 942 Project 840 / 2822 26 / 942 Student 1485 / 2822 43 / 942 Faculty 1084 / 2822 42 / 942

In Conference dataset, approximately 75% of the both relevant and

irrelevant pages were added into the training set and 25% of the relevant and

irrelevant pages were included into the test set. The detailed information about the

Conference dataset, which was used in this study, is given in Table 3.5. So that

618 relevant and 1159 irrelevant pages were used in the training phase and in the

test phase, 206 pages were relevant and 386 pages were irrelevant.

http://www.cs.cmu.edu)


30

Table 3.5. Train/Test Distribution of the Conference Dataset

Train Relevant/Non-relevant

Test Relevant/Non-relevant

Conference 618 / 1159 206 / 386

3.2.2. Feature Extraction

In the feature extraction phase all <title>, <h1>, <a>, <b>, <p>, and <li>

tags which denote title, header at level 1, anchor, bold, paragraph, list item; and

URL addresses of pages were used. According to the experimental results of the

earlier studies (Kim and Zhang, 2003: Ribeiro et al., 2003: Trotman, 2005), these

tags are meaningful for feature extraction. To extract features, all the terms from

each of the above cited tags and URL addresses of the relevant pages in the train

set were taken. After term extraction, stopword removal and Porter’s stemming

algorithm (Porter, 1980) were applied. Each stemmed term was gathered for the

feature set.

The number of features varies based on the dataset, selected tags (i.e. only

URL addresses, only <title> tags, all tags or all terms) and selected document

frequency value. Number of features for each class of all datasets with respect to

the selected tags are shown in Table 3.6. As an example, 33998 features were

extracted when all of the above mentioned tags and URL are used for the Course

class. When only the <title> tags were considered, the number of features reduced

to 305 for this class. As shown in Table 3.6., the number of features were too

large for the Weka software to be used since Weka processes all data in-memory.

So we applied document frequency feature selection method (Salton and Buckley,

1988) to reduce the number of features so that Weka can handle them. Document

frequency of a feature is defined as the number of positive documents in the

training dataset that contains the feature (Baeza-Yates and Ribeiro-Neto, 1999). In

this study, features whose document frequencies are at least approximately 5%,

10% and %15 were chosen since according to the Salton and Buckley (1988)

these features are good. After that the ACO algorithm was used for feature

selection from this feature pool.


31

Table 3.6. Number of Features for all Classes According to the Selected Tags

Tagged Terms

Bag of Terms

Title Tag

URL Course 33998 16421 305 476 Project 31542 15466 596 686 Student 51009 22417 1987 1557 Faculty 48584 24756 1502 1208

Conference 3667 18788 890 1115

3.2.3. Feature Selection

ACO algorithm was used for feature selection. According to the

experimental results of the preliminary studies on ACO algorithm (Dorigo and

Stützle, 2002), the optimum number of ants is defined as 5. Therefore, in this

thesis, 5 ants were used. ACO algorithm which was described in section 3.1.2,

was adopted for Web page feature selection problem. Before selection of features,

number of features was reduced by their document frequency. It means that more

frequent features were selected before the ACO algorithm was applied.

In this study, features were selected from four different feature groups for

each class. In the first group, features were extracted from only the URL addresses

of Web pages. In URL address features, document frequency elimination was not

used because the number of features extracted was not too large. The detailed

information about the number of features for all classes according to the URL

addresses was given in Table 3.6. Secondly, only <title> tags were used for

feature extraction. Features that were extracted from the <title> tags were not too

large. So, document frequency elimination was not used, too. The number of

features for all classes by using the <title> tags was given in Table 3.6. In the third

feature extraction method, all terms were used as features without their tag

properties. In other words, a term which appeared in the document regardless of

its position was taken as a feature. This feature extraction approach is called as

‘‘bag-of-terms” method. In bag-of-terms method, number of features was very

large, so document frequency factor was used as a reducer. In this case, the

number of features that were extracted for all classes according to the their

document frequency values are seen in Table 3.7.

class method


32

Table 3.7. Number of Features for all Classes In Bag-of-Terms Method with Respect to Document Frequency Values

Class

5% Document

Frequency

10% Document

Frequency

15% Document

Frequency Course 459 217 130 Project 241 89 54 Student 292 121 70 Faculty 386 194 107

Conference 492 245 141

Finally, all terms that appeared in all tags were used as features. In other words, a

term which appeared in different tags was taken as different features (i.e. tagged

terms). Number of features for this case is also too large, so document frequency

values were used to reduce the number of features. The same document frequency

values which were used in the previous method were used. The detailed

information about the number of features for all classes in tagged terms form with

respect to document frequency values is given in Table 3.8.

Table 3.8. Number of Features for all Classes in Tagged Term Method with Respect to Document Frequency Values

Class

5% Document

Frequency

10% Document

Frequency

15% Document

Frequency Course 757 326 193 Project 324 115 66 Student 450 169 98 Faculty 603 259 140

Conference 831 370 201

After the feature extraction step, the optimum subset of features was tried to

be selected with an ACO algorithm. In the proposed method, each feature

represents a node, and all nodes are independent. Nodes (i.e. features) were

selected according to their selection probability Pk(i) which is given in equation

3.4.

( ) = [ ( )] ∗[ŋ( )] ∑ [ ( )] ∈ ∗[ŋ( )] (3.4)


33

In equation 3.4, ŋ(i)=df(i) , where df(i) is the document frequency of feature

i, and represents heuristic information available to the ants. is the ‘feasible’

neighborhood of ant k, that is all features as yet unvisited by ant k. τ(i) is the

pheromone trail value of feature i. The initial pheromone values are selected as

10. α and β are parameters which determine the relative influence of the heuristic

and pheromone information, and both of them are selected as 1.

Previous studies have shown that 1 is the most appropriate value for α and β and

10 is a suitable value for parameter initial pheromone trail value (Dorigo and

Stützle, 2002). After all the ants have built a complete tour, the pheromone trail is

updated according to the global update rule which is defined in equation 3.5.

( ) = ∗ ( ) + ∑ ∆ ( ) (3.5)

where ρ denotes a pheromone evaporation parameter which decays the pheromone

trail and m is the number of ants. ρ value is selected as 0.2 (Dorigo and Stützle,

2002).The specific amount of pheromone, ∆τk(i), that each ant k deposits on the

trail is given by equation 3.6.

∆ ( ) = ∗ 2 ∗ if node ( ) is used by ant and is the elithist ant ∗ if node ( ) is used by ant 0 otherwise (3.6)

In equation 3.6, is the F-measure value of ant k’s subset, and Bk is the

unit pheromone values. This means that the higher the F-measure of the ant’s

subset, the more pheromone will be deposited on the arcs used in the subset, and

these arcs will thus be more likely to be selected in the next iteration.

Until each ant chooses a predefined number of features, selection

probability of each unselected node was evaluated by equation 3.4. After the

probability evaluation, a roulette wheel selection algorithm was used for selecting

the next feature (Bäck and Thomas, 1996). The flow chart of the proposed feature

subset selection algorithm is shown in Figure 3.7.


34

Figure 3.7. Flow Chart of Proposed ACO Algorithm

When all ants complete their subset selection process, two arff files were

generated for each ant (train and test phase). In arff files of @data section, each

row represents a Web page. And each value in each row shows frequency of that

feature. If Web pages relevant, rows end with R, else if Web pages irrelevant rows

end with N. By this way, arff files were generated. Obtained arff files were

classified with WEKA. An example for an arff file is shown in Figure 3.8. In

Figure 3.8. in row @attribute 14 real, 14 denotes the number of the feature (i.e.

index of the feature).


35

Figure 3.8. An Instance of an arff File

A well known algorithm C4.5 was used for classification. C4.5 is a well

known classification algorithm which is based on decision trees. C4.5 is an

extension of Quinlan's earlier ID3 algorithm (Quinlan, 1993). The decision trees

generated by C4.5 can be used for classification, and for this reason, C4.5 is often

referred to as a statistical classifier. C4.5 builds decision trees from a set of

training data in the same way as ID3, using the concept of information entropy.

C4.5 is tried to find small decision trees by pruning. Pseudo code of general C4.5

algorithm is given in Figure 3.9.

J48 is an open source Java implementation of the C4.5 algorithm in

the Weka data mining tool. And results were compared with respect to F-measure

(Van Rijsbergen, 1979) values. Formulation of Weka F-measure value is given in

Equation 3.7.


36

Figure 3.9. Pseudo Code of General C4.5 algorithm (Quinlan, 1993).

− = ∗ ∗ (3.7)

In Equation 3.7, recall is the ratio of relevant documents found in the search result

to the total number of all relevant documents and precision is the proportion of

relevant documents in the results returned. In earlier studies, researchers were

measured their studies performance with respect to F-measure. To comply with

the standards on this issue, F-measure value was chosen as a performance metric

in this study. As a result of classification, an ant has chosen as elitist ant which

gives the best result. After that process, pheromone values were updated based on

equations 3.5 and 3.6. The process is repeated the predetermined epoch number of

times. Finally, the best feature subset has chosen as optimum subset which has the

best F-measure value.

1. Check for base cases

2. For each attribute a

2.1. Find the normalized information gain from splitting an a

3. Let a_best be the attribute with highest normalized information gain

4. Create a decision node that splits on a_best

5. Recur on the sublists obtained by splitting on a_best, and add those

nodes as children of node

4. RESEARCH AND DISCUSSION Esra SARAÇ

37

4. RESEARCH AND DISCUSSION

In this section experiments performed and their results are presented. Perl

programming language (http://www.perl.org/) was used for whole feature

extraction phase and document frequency part of feature selection phase. ACO

feature selection algorithm was programmed with Java programming language

under Eclipse environment (http://www.oracle.com/technetwork/developer-

tools/eclipse/downloads/index.html). The proposed method was tested under

Microsoft Windows XP SP3 operating system. The hardware used in the

experiments had 1 GB of RAM and Intel Core2Duo 1.60 GHz processor. The

proposed method consists of two main parts. The first part is extracting features

explained in the previous section and the second one is selecting optimal feature

subset from these features with an ACO algorithm. Suggested method was tested

on the Conference and the WebKB (http://www.cs.cmu.edu/~webkb/) datasets.

Detailed information about the datasets were given in the previous section. The

proposed selection method has been run for 250 times for each class. In other

words, according to experimental results, epoch number of selection method was

defined as 250. After 250 epoch, there was no improvement on classification

results. So 250 has been the optimum epoch value for the proposed method.

Classification results of an experiment up to 500 epoch is shown in Table 4.1.

Table 4.1. Classification Results of Student Class with Respect to 500 Epoch Value

# of Epoch Max/Min F-Measure

1 0.979/0.890

125 0.978/0.969

250 0.977/0.972

375 0.933/0.9277

500 0.933/0.9267

Run Time in Seconds 30.08

Classification results of an experiment on 250 epoch is shown in Table 4.2.

http://www.perl.org/)

http://www.oracle.com/technetwork/developer

http://www.cs.cmu.edu/~webkb/)


38

Table 4.2. Classification Results of Student Class with Respect to 250 Epoch Value

# of Epoch Max/Min F-Measure

1 0.979/0.890

125 0.978/0.969

250 0.977/0.972

Run Time in Seconds 15.04

Given results in Table 4.1. are belong to the Student class, bag of terms method

under %15 document frequency value. Number of selected feature is defined as 70

in these results. According to the experimental results, after 250 epoch, although,

there was no improvement on F-Measure value, run time of method was doubled.

So, epoch number was defined as 250.

In this study, each ant chooses predefined number of features. These

number of features were determined by total number of features of each class. In

bag-of terms and tagged terms methods, after document frequency selection

phase, half of minimum feature number was taken as a limit of selected features

for all classes. For example, as seen in Table 3.8, the Conference dataset has 492,

245 and 141 features based on 5%, 10% and 15% document frequency values

respectively, on bag of terms method. Approximately half of 141 which was the

minimum feature number of Conference dataset in this case (i.e. 70), was taken as

upper limit of selected number of features. In URL and <title> tags methods,

document frequency reduction technique was not applied to datasets, for this

reason, 20% of the minimum feature number was taken as a upper limit of

selected features for all classes. For example, as seen in Table 3.6, the Course

dataset has 305 features on title tags method. Approximately 20% of 305 which

was the minimum feature number of Course dataset in this case (i.e. 60), was

taken as upper limit of selected number of features. The purpose of this study is to

minimize the number of features; therefore, after this determination, predefined

upper limits have reduced orderly. Detailed information about selected number of

features are given specifically in each section of each experiment.


39

In the previous study (Saraç and Özel, 2010), selected features were

classified with Navie Bayes, RBF (Poggio and Girosi, 1990), and C4.5

classification algorithms of Weka data mining tool, and results of this comparison

was given in Table 4.3.

Table 4.3. F-measures of NB, RBF, and C4.5 Classifiers for the WebKB Dataset

Course Project Faculty Student NB 0.149 0.947 0.097 0.1 RBF 0.871 0.959 0.926 0.775 C4.5 0.877 0.962 0.947 0.793

According to the Table 4.3., C4.5 classification algorithm was chosen in this study

for classification of Web pages, since C4.5 classifier had the highest classification

F-measure.

4.1. Classification Experiments With Only URL Addresses

Performance of the proposed method with only URL addresses of Web

pages was considered on a preferential basis. For all classes, m value which has

been shown in Figure 3.7. was defined according to the total number of features.

Total number of features have been shown in Table 3.6. To make a comparison

between URL and <title> tags methods, same number of features are used in both

cases. So, to define m value, number of features extracted from <title> tag and

URL are considered. According to the Table 3.6., the Course class at title tag

method has the minimum feature number as 305. So, limit m value was defines as

60. Since 60 is approximately 20% of 305. After this predefinition, the proposed

algorithm was tested under 30 and 10 number of features. In the first experiment,

each ant has selected 60 features from whole features for all classes, and Web

pages are classified with respect to these selected 60 features. Classification

results in F-Measure value for 60 features are given in Table 4.4.


40

Table 4.4. Experimental Results Using URLs With 60 Features of All Classes # of

epoch Course Project Student Faculty Conference

1 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.817/0.745

125 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.835/0.817

250 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.835/0.826

Run Time

(min/sc) 12:45 06:11 04:23 07:15 12:40

Run time of the experiments are also can be seen in Table 4.4. In the

second experiment, each ant has selected 30 features from whole features for all

classes and Web pages are classified with respect to these selected 30 features.

Classification results in F-Measure value for 30 features are given in Table 4.5.



1 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.798/ 0.606

125 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.824/ 0.812

250 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.824/ 0.792

Run Time

(min/sc) 06:24 06:11 02:16 03:47 06:52

And finally, each ant has selected 10 features from whole features for all




41



1 1.0/0.98654 1.0/1.0 1.0/0.82201 1.0/1.0 0.767/0.545

125 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.780/0.746

250 1.0/1.0 1.0/1.0 1.0/1.0 1.0/1.0 0.780/ 0.751

Run Time

(min/sc) 02:31 01:12 01:05 01:32 02:22

In the given experimental results, tables include maximum and minimum

F-Measure values at defined iteration number. So, 0.767/0.545 value in Table 4.6.

specifies the maximum and the minimum F-Measure values of the Conference

dataset at the first iteration.

According to the obtained experimental results, WebKB dataset includes

meaningful URL addresses. But the Conference dataset has not meaningful URL

addresses. There were no noticeable change between different number of features

for WebKB dataset. Reduced number of features only change the run time of the

algorithm. But in the Conference dataset, with reducing the number of features F-

Measure values were also decreased. Limited number of features, although

advantageous in terms of time, a disadvantage in terms of classification

performance for the Conference dataset.

4.2. Classification Experiments With Only <title> Tags

Performance of the proposed method with only <title> tags of Web pages

was considered in this section. Same as the URL address tests, in the first

experiment, each ant has selected 60 features from whole features for all classes,

and Web pages are classified with respect to these selected 60 features.

Classification results in F-Measure value for 60 features are given in Table 4.7. In

Table 4.7., # E, denotes the epoch number and RT, denotes the run time of the

algorithm in terms of minutes and seconds.


42

Table 4.7. Experimental Results Using <title> Tags With 60 Features of All Classes

# E Course Project Student Faculty Conference

1 0.880/0.871 0.983/0.983 0.917/0.913 0.940/0.932 0.741/0.698

125 0.880/0.874 0.983/0.983 0.911/0.911 0.935/0.922 0.721/0.718

250 0.874/0.874 0.983/0.983 0.913/0.911 0.935/0.926 0.724/0.717

R T 29:05 11:21 21:41 25:17 11:38

Run time of the experiments can be seen in Table 4.7. In the second

experiment, each ant has selected 30 features from whole features for all classes

and Web pages are classified with respect to these selected 30 features.




1 0.876/0.869 0.983/0.976 0.920/0.917 0.939/0.932 0.710/0.687

125 0.882/0.875 0.983/0.983 0.911/0.911 0.927/0.922 0.730/0.721

250 0.880/0.877 0.983/0.983 0.917/0.911 0.927/0.926 0.732/0.718

R T 13:58 05:54 11:53 12:56 05:40

And finally, each ant has selected 10 features from whole features for all



According to the obtained experimental results, the WebKB dataset

includes meaningful title declarations. But the Conference dataset’s title

declarations are not meaningful as the WebKB dataset’s. In this experiment,

reduced number of features were affected F-Measure values negatively. In the

previous experiment, short and meaningful information about Web pages are

extracted from URL addresses. So, favorite features are selected in all cases. But

in <title> tag features, number of features are more than URL addresses features.


43

When m value reduced, selection probability of meaningful features are reduced.

Limited number of features, although advantageous in terms of time, a

disadvantage in terms of classification performance for the Conference dataset.

The number of pages also important for run time of the algorithm, the Conference

and the Project classes have a minimum number of pages this means minimum

number of cycle so, run time of these classes are fewer than the others.



1 0.882/0.866 0.983/0.948 0.917/0.886 0.946/0.930 0.698/0.687

125 0.877/0.875 0.983/0.983 0.920/0.917 0.940/0.936 0.736/0.710

250 0.877/0.875 0.983/0.983 0.917/0.917 0.940/0.935 0.736/0.699

RT 02:20 02:19 05:06 04:00 02:13

4.3. Classification Experiments With Bag of Terms Method

Performance of the proposed algorithm with bag of terms method of Web

pages was considered in this section. Number of features of all classes can be seen

in Table 3.6. In bag of terms method, number of features are very high, because

of this, document frequency feature elimination method was applied in this

experiment. The m value of each class was determined by their number of features

which was shown in Table 3.7. According to the these number of features, the

upper limit of the m value was defined as the approximately half of minimum

feature number. Such as, for Conference class, minimum feature number is 141

so, upper limit of m value was defined as 70. For Course class upper limit of m

value was defined as 60, for Project and Student classes upper limit of m value

was defined as 30, and finally for Faculty class upper limit of m value was defined

as 50. After determination of upper limits of m values, number of features were


44

reduced according to the upper limits. Experimental results were given in three

parts, with respect to document frequencies.


Document Frequency Value

Specified number of features and classification results for 5% document

frequency value are discussed in this section. Each ants were run with three

different number of features. For the Course class in the first experiment, each ant

selected 60 features. In the second experiment, each ant selected 40 features. In

the final experiment, each ant selected 10 features. We have defined a fixed

number of features to compare the classification performance of all classes under

a fixed number of features. So, the aim of the final experiment is to provide

comparison between different classes. Because of this, there was an experimental

result with 10 features for all classes. Specified number of features and

classification results of the Course class for 5% document frequency value are

shown in Table 4.10.

Table 4.10. Experimental Results Using Bag of Terms Method for Course Class With 5% Document Frequency

# of Features for Course Class # E 60 40 10 1 0.980/0.851 0.985/0.974 0.977/0.799

125 0.975/0.958 0.964/0.959 0.982/0.915 250 0.975/0.958 0.981/0.964 0.982/0.914 R T 08:50 06:08 03:40

For the Project class in the first experiment each ant selected 30 features.

In the second experiment, each ant selected 20 features. In the final experiment,

each ant selected 10 features. Specified number of features and classification

results of Project class for 5% document frequency value are shown in Table 4.11.


45

Table 4.11. Experimental Results Using Bag of Terms Method for Project Class With 5% Document Frequency

# of Features for Project Class # E 30 20 10 1 0.981/0.887 0.994/0.958 0.951/0.792

125 0.973/0.977 0.988/0.963 0.994/0.953 250 0.976/0.973 0.992/0.963 0.994/0.956 R T 05:31 03:05 02:50

For the Student class in the first experiment each ant selected 30 features.



results of the Student class for 5% document frequency value are shown in Table

4.12.

Table 4.12. Experimental Results Using Bag of Terms Method for Student Class With 5% Document Frequency

# of Features for Student Class # E 30 20 10 1 0.982/0.688 0.983/0.920 0.865/0.820

125 0.968/0.950 0.985/0.949 0.984/0.854 250 0.968/0.962 0.983/0.891 0.988/0.979 R T 07:41 07:04 05:02

For the Faculty class in the first experiment each ant selected 50 features.



results of the Faculty class for 5% document frequency value are shown in Table

4.13.


46

Table 4.13. Experimental Results Using Bag of Terms Method for Faculty Class With 5% Document Frequency

# of Features for Faculty Class # E 50 30 10 1 0.978/0.835 0.958/0.930 0.990/0.887

125 0.981/0.983 0.989/0.877 0.990/0.959 250 0.978/0.972 0.991/0.983 0.993/0.981 R T 11:00 07:26 04:15

For the Conference class in the first experiment each ant selected 70

features. In the second experiment, each ant selected 50 features. In the final

experiment, each ant selected 10 features. Specified number of features and

classification results of the Conference class for 5% document frequency value are


Table 4.14. Experimental Results Using Bag of Terms Method for Conference Class With 5% Document Frequency

# of Features for Conference Class # E 70 50 10 1 0.992 / 0.952 0.991/0.910 0.994/0.911

125 0.987/0.984 0.992/0.973 0.994/0.970 250 0.987/0.984 0.992/0.985 0.992/0.978 R T 08:02 05:31 03:35



Classification results for 10% document frequency value are discussed in

this section. Same number of features were used in all document frequency

values. So, each ant run with three different number of features in this case.

Classification results of the Course class for 10% document frequency value with

respect to number of features are shown in Table 4.15.


47



125 0.986/0.964 0.986/0.884 0.984/0.896 250 0.986/0.964 0.964/0.964 0.983/0.888 R T 08:30 05:54 03:39

Classification results of the Project class for 10% document frequency

value with respect to number of features are shown in Table 4.16.



125 0.993/0.963 0.992/0.963 0.993/0.955 250 0.977/0.973 0.995/0.973 0.992/0.987 R T 05:30 03:55 02:46

Classification results of the Student class for 10% document frequency




125 0.826/0.798 0.791/0.763 0.897/0.833 250 0.826/0.776 0.793/0.761 0.887/0.840 R T 20:25 15:19 07:05

Classification results of the Faculty class for 10% document frequency



48



125 0.980/0.974 0.991/0.974 0.994/0.986 250 0.978/0.971 0.989/0.972 0.993/0.946 R T 10:51 07:19 03:48

Classification results of the Conference class for 10% document frequency




125 0.985/0.984 0.992/0.985 0.992/0.977 250 0.987/0.985 0.987/0.987 0.994/0.991 R T 07:23 05:24 03:41




this section. Each ants were run with three different number of features in this

case. Classification results of the Course class for 15% document frequency value

with respect to number of features are shown in Table 4.20.



125 0.975/0.958 0.964/0.964 0.985/0.917 250 0.986/0.958 0.964/0.964 0.984/0.883 R T 07:58 04:30 03:24


49





125 0.976/0.963 0.992/0.974 0.994/0.725 250 0.988/0.962 0.993/0.977 0.991/0.934 R T 05:16 04:06 02:43





125 0.963/0.948 0.988/0.968 0.988/0.858 250 0.988/0.949 0.981/0.949 0.987/0.858 R T 07:18 05:33 02:27





125 0.980/0.970 0.977/0.973 0.989/0.984 250 0.982/0.972 0.993/0.975 0.988/0.915 R T 11:05 06:43 03:50




50



125 0.992/0.984 0.992/0.984 0.992/0.991 250 0.987/0.984 0.992/0.987 0.992/0.977 R T 07:11 05:20 03:13

According to the obtained experimental results, we can say that text in

Web pages are more meaningful than URL addresses and titles of Web pages for

the Conference dataset. But in WebKB dataset, URL addresses are more

meaningful than page contents and titles of Web pages. As in previous

experiments, F-Measure values (i.e. classification performance) were changed

with number of features. 15% document frequency value yielded the best

classification performance with the maximum feature number. With document

frequency method, meaningless features were eliminated before the ACO was

applied, and this method enforced ants to select meaningful features. Number of

features also affected the run time of the algorithm. These are inversely

proportional quantities.

4.4. Classification Experiments With Tagged Terms Method

Performance of the proposed algorithm with tagged terms method of Web

pages was considered in this section. Number of features for all classes can be

seen in Table 3.6. In tagged terms method, number of features are very large,

because of this, document frequency feature elimination method was applied in

this method. m values which are previously defined for bag of terms method, are

used for tagged terms method, too. The same values are used to make a healthy

comparison between the two methods (i.e. bag of terms method and tagged terms

method). Experimental results were given in three parts, with respect to document

frequencies.


51



Specified number of features and classification results for 5% document

frequency value are discussed in this section. Each ant was run with three

different number of features. Number of features for each tag are presented in

Table 4.25.

Table 4.25. Number of Features For Each Tag With 5% Document Frequency Value for Each Class

Course Project Student Faculty Conference

URL 19 7 16 11 10

Title 10 8 4 3 8

Header 42 14 22 30 21

Anchor 77 29 57 59 122

Bold 33 9 19 31 45

Text 457 240 291 384 488

List item 119 17 41 85 137

Total 757 324 450 603 831

For the Course class in the first experiment each ant selected 60 features.



results of the Course class for 5% document frequency value are shown in Table

4.26.

Tag Class


52

Table 4.26. Experimental Results Using Tagged Terms Method for Course Class With 5% Document Frequency


125 1.0/1.0 1.0/1.0 1.0/0.746 250 1.0/1.0 1.0/1.0 1.0/1.0 R T 10:24 08:12 02:44

For the Project class in the first experiment each ant selected 30 features.



results of the Project class for 5% document frequency value are shown in Table

4.27.

Table 4.27. Experimental Results Using Tagged Terms Method for Project Class With 5% Document Frequency


125 1.0/1.0 1.0/1.0 1.0/0.84466 250 1.0/1.0 1.0/1.0 1.0/0.83577 R T 06:42 04:05 02:20

For the Student class in the first experiment each ant selected 30 features.



results of the Student class for 5% document frequency value are shown in Table

4.28.


53

Table 4.28. Experimental Results Using Tagged Terms Method for Student Class With 5% Document Frequency


125 1.0/0.81951 1.0/1.0 1.0/0.31080 250 1.0/1.0 1.0/1.0 1.0/1.0 R T 07:41 07:04 05:02

For the Faculty class in the first experiment each ant selected 50 features.



results of the Faculty class for 5% document frequency value are shown in Table

4.29.

Table 4.29. Experimental Results Using Tagged Terms Method for Faculty Class With 5% Document Frequency


125 1.0/1.0 1.0/1.0 1.0/1.0 250 1.0/1.0 1.0/1.0 1.0/1.0 R T 09:47 06:39 03:11

For the Conference class in the first experiment each ant selected 70

features. In the second experiment, each ant selected 50 features. In the final

experiment, each ant selected 10 features. Specified number of features and

classification results of the Conference class for 5% document frequency value are


Table 4.30. Experimental Results Using Tagged of Terms Method for Conference Class With 5% Document Frequency


125 0.998/0.998 0.998/0.998 0.998/0.998 250 0.998/0.998 0.998/0.998 0.998/0.998 R T 08:26 06:32 04:49


54




this section. Same number of features were used in all document frequency value.

So, each ant was run with three different number of features in this case, too.

Number of features for each tag are presented in Table 4.31.

Table 4.31. Number of Features For Each Tag With 10% Document Frequency Value


URL 9 7 8 7 6

Title 5 3 3 3 3

Header 18 4 9 9 8

Anchor 34 12 19 19 50

Bold 8 0 3 7 13

Text 215 87 118 192 243

List item 37 2 9 22 47

Total 326 115 169 259 370

Classification results of the Course class for 10% document frequency

value with respect to specified number of features are shown in Table 4.32.



125 1.0/1.0 1.0/1.0 1.0/1.0 250 1.0/1.0 1.0/1.0 1.0/1.0 R T 10:38 08:08 02:36




55



125 1.0/1.0 1.0/1.0 1.0/0.702 250 1.0/1.0 1.0/1.0 1.0/0.729 R T 05:46 04:13 03:45



Table 4.34. Experimental Results Using tagged Terms Method for Student Class With 10% Document Frequency


125 1.0/0.785 1.0/1.0 1.0/1.0 250 1.0/1.0 1.0/0.682 1.0/0.831 R T 06:41 04:28 03:33





125 1.0/1.0 1.0/0.916 1.0/1.0 250 1.0/1.0 1.0/0.932 1.0/0.942 R T 09:29 06:08 02:57




56

Table 4.36. Experimental Results Using Tagged Terms Method for Conference Class With 10% Document Frequency

# of Features for Conference Class # E 70 50 10 1 0.998/ 0.998 0.998/ 0.998 0.998/ 0.905

125 0.998/ 0.998 0.998/ 0.998 0.998/ 0.998 250 0.998/ 0.998 0.998/ 0.998 0.998/ 0.998 R T 07:53 06:05 04:57




this section. Each ant was run with three different number of features in this case,

too. Number of features for each tag are presented in Table 4.37.

Table 4.37. Number of Features For Each Tag With 15% Document Frequency Value


URL 7 6 6 8 4

Title 4 0 3 105 2

Header 8 2 5 2 3

Anchor 21 7 11 12 29

Bold 4 0 2 4 2

Text 129 51 67 3 137

List item 20 0 4 6 24

Total 193 66 98 140 201

Classification results of the Course class for 15% document frequency

value with respect to specified number of features are shown in Table 4.38.


57



125 1.0/1.0 1.0/0.883 1.0/1.0 250 1.0/1.0 1.0/1.0 1.0/0.753 R T 11:02 08:47 02:38





125 1.0/1.0 1.0/1.0 1.0/0.845 250 1.0/1.0 1.0/1.0 1.0/0.939 R T 05:42 04:03 03:17



Table 4.40. Experimental Results Using Tagged Terms Method for Student Class With 15% Document Frequency


125 1.0/0.746 1.0/0.746 1.0/1.0 250 1.0/0.748 1.0/0.748 1.0/1.0 R T 06:21 04:10 02:32




58



125 1.0/1.0 1.0/1.0 1.0/1.0 250 1.0/1.0 1.0/1.0 1.0/0.853 R T 10:33 06:00 03:11




Class With 15% Document Frequency # of Features for Conference Class

# E 70 50 10 1 0.998/ 0.998 0.998/0.998 0.998/0.964

125 0.998/ 0.951 0.998/ 0.998 0.998/ 0.998 250 0.998/ 0.998 0.998/ 0.998 0.998/ 0.998 R T 07:49 06:02 05:00

According to the obtained experimental results, 15% document frequency

value yielded the best classification performance with the maximum feature

number. Number of features also affected the run time of the algorithm. These are

inversely proportional quantities. Tagged terms method were better than the other

three methods. There were no noticeable change between the WebKB dataset

classes. Classification performance of these classes were similar. But Conference

dataset was different from the others. The Conference dataset’s classification

performance was lower than the others, because of the content of the Web pages.

The Conference dataset does not include clear information about the class of the

Web pages. When we analyzed the best arff files (i.e. which has the maximum F-

measure value) which belong to the tagged terms method, most popular tag is

observed as the <p> tag (i.e. text tag). For example, in the Faculty class, 38 of the

selected 50 features belong to the <p> tag, 5 of them belong to <a> tag, 2 of them

belong to the <h1> tag and 5 of them belong to the URL. Distribution of the

selected features with respect to tags are shown in Table 4.43 for the Faculty class


59

with 15% document frequency values. We can say that text tag is more

meaningful than the other tags.

Table 4.43. Distribution of Selected Features With Respect to Tags for Faculty Classes When 15% Document Frequency is applied

# of Features URL <title> <h1> <a> <b> <p> <li>

10 2 0 0 0 0 8 0

30 5 0 0 1 0 23 1

50 5 0 2 5 0 38 0

According to our arff file analysis for 15% document frequency value with

maximum number of selected features, in the Conference class, 52 of the selected

70 features belong to the <p> tag, 12 of them belong to <a> tag, 1 of them belong

to the <h1> tag, 1 of them belong to the <title> tag, 2 of them belong to the <b>

tag and 2 of them belong to the URL. In the Project class, 20 of the selected 30

features belong to the <p> tag, 4 of them belong to <a> tag and 6 of them belong

to the URL. In the Student class, 20 of the selected 30 features belong to the text

tag, 5 of them are belongs to <a> tag, 2 of them belong to the <title> tag and 3 of

them belong to the URL. In the Course class, 38 of the selected 60 features belong

to the text tag, 8 of them belong to <a> tag, 4 of them belong to the <h1> tag, 1 of

them belong to the <title> tag, 3 of them belong to the <b> tag and 6 of them

belong to the URL. Distribution of the selected features with respect to tags for

the best cases can be seen in Table 4.44. We can say that text contend of Web

pages is more meaningful than other tags.

Table 4.44. Distribution of the Selected Features With Respect to Tags for the Best Cases

Class # of Features URL <title> <h1> <a> <b> <p>

Conference 70 2 1 1 12 2 52

Project 30 6 - - 4 - 20

Student 30 3 2 - 5 - 20

Course 60 6 1 4 8 3 38


60

4.5. Comparison With C4.5

In this section, performance of the proposed ACO feature selection

algorithm is compared with pure C4.5 classifier. For this purpose, F-measure

values of the C4.5 classifier with and without the proposed ACO based feature

selection are computed and compared. This comparison is made for the

Conference dataset only because in our previous experiments lower F-measure

values were obtained with this dataset. As future work we plan to repeat this

experiment for the WebKB dataset.

The results of this experiment can be seen in Table 4.45. In this

experiment for the URL and <title> tag methods whole feature set of the

Conference dataset are taken, and for the tagged terms and bag of terms methods

5% document frequency feature selection method is used to reduce the features to

make Weka to work with them. As seen in Table 4.45., the proposed ACO feature

selection algorithm improves classification performance for Tagged Terms, Bag-

of Terms and <title> tags methods. Run time of classification reduced by ACO

feature selection algorithm. When number of features was reduced, run time of

classification was reduced with respect to number of selected features.

Table 4.45. Comparison of the Proposed ACO Feature Selection Algorithm with C4.5

Method With ACO

Feature Selection

Without ACO

Feature Selection

Tagged Terms 5%

Document Frequency

F-measure 0.998 0.998

Run Time 0.22 sec 1.42 sec

Bag-of Terms 5%

Document Frequency

F-measure 0.994 0.991


URL F-measure 0.835 0.857


<title> F-measure 0.741 0.715



61

4.6. Comparison of the Proposed Method With Earlier Studies

In this section, the proposed method is compared with the earlier studies.

URL tag features are used in Kan and Thi (2005)’s studies. They used

sequential n-grams to derive features from the URL, and selected features

classified with Maximum Entropy and Support Vector Machine classification

algorithms separately. Their average F-measure value is reported as 0.525. This

result belongs to multi-class classification. Our average F-measure value was 1.0

for WebKB dataset with URL-only method in binary class classification. Based

upon these results, we can say that, binary class classification is more suitable

than multi-class classification for WebKB data set, our proposed ACO-based

algorithm has better classification performance than Kan and Thi (2005)’s

method.

In Özel (2010)’s study, tagged terms features are used with a GA based

classifier. URL addresses are not used in feature extraction step of Özel (2010)’s

study. Average F-measure value is reported as 0.9 for the Course class of the

WebKB dataset, and average F-measure value is reported as 0.7 for the Student

class of the WebKB dataset. In our proposed method, average F-measure value is

measured as 1.0 for the Course class, and average F-measure value is measured as

1.0 for the Student class. This comparison shows that, URL tags effects

classification performance positively.

Jiang (2010) proposed a text classification algorithm that combines a k-

means clustering scheme with an Expectation Maximization (EM) variation, and

it can learn from a very small number of labeled samples and a large quantity of

unlabeled data. Jiang (2010)’s experimental results show that, the average F-

measure value is 0.7 for WebKB dataset in multi-class classification. This results

shows that, our ACO-based algorithm performs better than k-means clustering

algorithm.

Joachims (1999) used Transductive Support Vector Machines on WebKB

dataset with binary class classification. Bag-of terms method is used in Joachims

(1999)’s study. According to the experimental results of this study, average F-

measure values are reported as 93.8, 53.7, 18.4 and 83.8 for the Course, the


62

Faculty, the Project and the Student classes respectively. Also, these results show

that, the proposed ACO-based algorithm is better than the SVM algorithm.

5. CONCLUSION Esra SARAÇ

63

5. CONCLUSION

In this thesis we have developed an ACO-based Web page classification

system which uses HTML tags and terms pairs as classification features. In our

system, ants learns the optimal features by the ACO and experimental evaluation

shows that, using tagged-terms as features increases classification performance

with respect to using bag-of terms or URL alone or <title> tag alone. In addition

to tag of features, document frequency value is important for classification

performance. Experimental evaluation shows that, 15% document frequency value

is acceptable. The proposed system is effective on reducing the number of features

so, it is suitable for classification of any number of features.

As future work, we plan to study the weight of tags on the accuracy of our

ACO-based classifier system in more detail. Tags can be weighted with respect to

their importance, and this method can improve the performance of the classifier.

5. CONCLUSION Esra SARAÇ

64

65

REFERENCES

AGHDAM, M. H., GHASEM-AGHAEE, N., and BASIRI, M. E., 2009. Text

Feature Selection Using Ant Colony Optimization. Expert Systems with

Applications 36: 6843-6853.

BÄCK, T., 1996. Evolutionary Algorithms in Theory and Practice. Oxford

University Press. New York, USA. 328 p.

BAEZA-YATES, R., and RIBEIRO-NETO, B., 1999. Modern Information

Retrieval. Addison-Wesley ACM Press. Harlow, England. 513 p.

BAYKAN, E., HENZINGER, M., MARIAN, L., and WEBER, I., 2009. Purely

URL-based Topic Classification. International World Wide Web

Conference . Madrid, Spain. 1109-1110.

BONABEAU, E., DORIGO, M., and THERAULAZ, G., 1999. Swarm Intelligence:

From Natural To Artificial Systems. First Edition. Oxford University

Press. New York, USA. 320 p.

BOUGHANEM, M., CHRISMENT, C., and TAMINE, L., 1999. Genetic Approach

to Query Space Exploration. Journal of Information Retrieval 1(3): 175-

192.

BLUM, A., and MITCHELL, T., 1998. Combining Labeled and Unlabeled Data

With Co-training. In COLT’ 98: Proceedings of the 11th Annual

Conference on Computational Learning Theory, New York, USA. 92-

100.

CHAKRABARTI, S., 2002. Mining the Web: Discovering Knowledge from

hypertext data. First Edition. Morgan Kaufmann Press. San Francisco,

USA. 344 p.

CHAKRABARTI, S., VAN DEN BERG, M., and DOM, B., 1999. Focused

Crawling: A New Approach to Topic-specific Web Resource Discovery.

Computer Networks, 31(11-16): 1623-1640.

66

CHEKURİ, C., GOLDWASSER, M., RAGHAVAN, P., and UPFAL, E., 1997. Web

Search Using Automated Classification. In Proceedings of the Sixth

International World Wide Web Conference, Santa Clara, CA. Poster

POS725.

CHEN, H., and KIM, J., 1995. GANNET: A Machine Learning Approach to

Document Retrieval. Journal of Management Information Systems -

Special section: Information technology and IT organizational impact.

New York, USA. 11(3): 7-41.

COVER, T. M., and THOMAS, J. A., 1991. Elements of Information Theory. 99th

Edition. Wiley-Interscience Press. 542 p.

DBLP Web Site, http://www.informatik.uni-trier.de/~ley/db/

DORIGO, M., 1992. Optimization, Learning and Natural Algorithms. Ph.D.Thesis,

Politecnico di Milano, Italy.

DORIGO, M., DI CARO, G., and GAMBARDELLA, L. M., 1999. Ant Algorithms

for Discrete Optimization. Artificial Life, 5(2):137-172.

DORIGO, M., MANIEZZO, V., and COLORNI, A., 1991. Positive Feedback as a

Search Strategy. Technical Report No. 91-016. Politecnico di Milano,

Italy.

DORIGO, M., MANIEZZO, V., and COLORNI, A., 1996. Ant System:

Optimization by A Colony Of Cooperating Agents. IEEE Transaction on

Systems, Man, and Cybernetics-Part B, 26(1): 29-41.

GHANI, R., 2001. Combining Labeled And Unlabeled Data For Text Classification

With A Large Number of Categories. In First IEEE International

Conference on Data Mining (ICDM), Los Alamitos, CA. 597.

GHANI, R., 2002. Combining Labeled And Unlabeled Data For Multiclass Text

Categorization. In ICML ’02: Proceedings of the 19th International

Conference on Machine Learning, San Francisco, CA. 187-194.

GOOGLE, www.google.com.

GORDON, M., 1988. Probabilistic and Genetic Algorithms in Document Retrieval.

Communications of the ACM. 31(10): 1208-1218.

http://www.informatik.uni-trier.de/~ley/db/

http://www.google.com

67

GUIASU, S., 1977. Information Theory with Applications. First Edition. McGraw-

Hill Press. New York, USA. 439 p.

HAN, J.,AND KAMBER, M., 2006. Data Mining: Concepts and Techniques. Second

Edition. Morgan Kaufmann Publishers. 550 p.

HAVELIWALA, T., KAMVAR, S., and JEH, G., 2003. An analytical comparison of

approaches to personalizing PageRank. Stanford University technical

report. Available at:

http://infolab.stanford.edu/~taherh/papers/comparison.pdf.

HAYKIN, S., 1999. Neural networks - A Comprehensive Foundation. Second

Edition. Prentice Hall. 842 p.

HOLDEN, N., and FREITAS, A. A., 2004. Web Page Classification With An Ant

Colony Algorithm. Parallel Problem Solving from Nature, 8 (LNCS

3242): 1092-1102.

HUANG, C. C., CHUANG, S. L., and CHIEN, L. F., 2004. Liveclassifier: Creating

Hierarchical Text Classifiers Through Web Corpora. In WWW ’04:

Proceedings of the 13th International Conference on World Wide Web,

New York, USA. 184-192.

JIANG, E. P., 2010. Learning to Integrate Unlabeled Data in Text Classification.

Computer Science and Information Technology (ICCSIT), 3rd IEEE

International Conference on. 9: 82-86.

JOACHIMS, T., 1999. Transductive Inference for Text Classification using Support

Vector Machines. Proceedings of the 16th International Conference on

Machine Learning. 200-209.

KAN, M.-Y., 2004. Web Page Classification Without The Web Page. In WWW Alt.

’04: Proceedings of the 13th International World Wide Web Conference

Alternate Track Papers & Posters, New York, USA. 262-263.

KAN, M.-Y., and THI, H. O. N., 2005. Fast Webpage Classification Using URL

Features. In Proceedings of the 14th ACM International Conference on

Information and Knowledge Management (CIKM ’05). New York, USA.

325-326.

http://infolab.stanford.edu/~taherh/papers/comparison.pdf

68

KIM, S., and ZHANG, B.T., 2003. Genetic Mining Of HTML Structures For

Effective Web Document Retrieval. Applied Intelligence. 18: 243-256.

KWON, O.-W., and LEE, J.-H., 2000. Web Page Classification Based on k-Nearest

Neighbor Approach. In IRAL ’00: Proceedings of the 5th International

Workshop on Information Retrieval with Asian languages, New York,

USA. 9-15.

KWON, O.-W., and LEE, J.-H., 2003. Text Categorization Based on k-Nearest

Neighbor Approach for Web Site Classification. Information Processing

and Management. 29(1): 25-44.

LIANGTU, S., and XIAOMING, Z., 2007. Web Text Feature Extraction with

Particle Swarm Optimization. International Journal of Computer Science

and Network Security. 7(6): 132-136.

LIU, H., and HUANG, S., 2003. A Genetic Semi-Supervised Fuzzy Clustering

Approach to Text Classification. Lecture Notes in Computer Science

2762.173-180.

MENCZER, F., and BELEW, R. K., 1998. Adaptive Information Agents in

Distributed Textual Environments. In Proc. 2nd International Conference

on Autonomous Agents, Minneapolis.

MITCHELL, T. M., 1997. Machine Learning. First Edition. McGraw-Hill. New

York. 432 p.

MLADENIC, D., BRANK, J., GROBELNIK M., and MILIC-FRAYLING, N., 2004.

Feature Selection Using Support Vector Machines. The 27th Annual

International ACM SIGIR Conference. 234-241.

ODP, Open Directory Project. Available at: http://www.dmoz.org.

ÖZEL, S. A., 2010. A Web Page Classification System Based On A Genetic

Algorithm Using Tagged-Terms As Features. Expert Systems with

Applications. doi:10.1016/j.eswa.2010.08.126.

ÖZEL, S. A., and SARAÇ, E., 2008. Focused Crawler for Finding Professional

Events Based On User Interests. In: Proceedings of the 23rd of the

International Symposium on Computer and Information Sciences ISCIS.

Istanbul, Turkey. 441-444.

http://www.dmoz.org

69

PAGE, L., and BRIN, S., 1997. PageRank: Bringing Order to the Web. Available at:

http://web.archive.org/web/20020506051802/www-

diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1997-0072?1

PAREPINELLI, R. S., LOPES H. S., and FREITAS A., 2002. An Ant Colony

Algorithm for Classification Rule Discovery. IEEE Transactions on

Evolutionary Computation. 6(4): 321-332.

PERL Programming Language, http://www.perl.org/

POGGIO, T. and GIROSI, F., 1990. Networks For Approximation And Learning.

Proc. IEEE 78(9): 1484-1487.

PORTER, M.F. ,1980. An Algorithm For Suffix Stripping. Program. 14(3):130-137.

REUTERS Dataset, http://archive.ics.uci.edu/ml/databases/reuters

RIBEIRO, A., FRESNO, V., GARCIA-ALEGRE, M.C., and GUINEA, D., 2003.

Web Page Classification: A Soft Computing Approach. Lecture Notes in

Artificial Intelligence 2663. 103-112.

QI, X., and DAVISON, B. D., 2009. Web Page Classification: Features and

Algorithms, ACM Computing Surveys 41(2): Article 12.

QUINLAN, J. R., 1993. C4.5: Programs for Machine Learning. First Edition.

Morgan Kaufmann Publishers. San Mateo, California. 302 p.

SALTON, G., 1970. Automatic Text Analysis. Science. 168: 335-343.

SALTON, G., and BUCKLEY, C., 1988. Term-weighting Approaches In Automatic

Text Retrieval. Inform. Process. Man. 24(5): 513-523.

SARAÇ, E., and ÖZEL, S. A., 2010. URL Tabanlı Web Sayfası Sınıflandırma. Akıllı

Sistemlerde Yenilikler Ve Uygulamaları Sempozyumu. 1: 13-18

SHANG, W. , HUANG, H., ZHU, H., LİN, Y., QU, T., WANG, Z., 2007. A Novel

Feature Selection Algorithm For Text Categorization. Expert Systems

with Applications: An International Journal. 33(1): 1-5

SHANNON, C. E., 1948. A Mathematical Theory of Communication. Bell System

Technical Journel. 27: 379-423, 27: 623-656.

SHEN, D., CHEN, Z., YANG, Q., ZENG, H. J., ZHANG, B., LU, Y., and MA, W.

Y., 2004. Web-Page Classification Through Summarization. In SIGIR

’04: Proceedings of the 27th Annual International ACM SIGIR

http://web.archive.org/web/20020506051802/www

http://www.perl.org/

http://archive.ics.uci.edu/ml/databases/reuters

70

Conference on Research and Development in Information Retrieval. New

York, USA. 242-249.

STÜTZLE, T., and HOOS, H., 2000. Max-Min Ant System. Journal of Future

Generation Computer Systems. 16: 889-914.

TREC Dataset, http://trec.nist.gov/data.html

TROTMAN, A., 2005. Choosing Document Structure Weights. Information

Processing & Management 41(2): 243-264.

VAN RIJSBERGEN, C. J., 1979. Information Retrieval. Second Edition.

Butterworth-Heinemann Publishers. London, UK. 224 p.

WANG, Z., ZHANG, Q., and ZHANG, D., 2007. A PSO-based Web Document

Classification Algorithm. In Proc. the Eighth ACIS International

Conference on Software Engineering, Artificial Intelligence, Networking,

and Parallel/Distributed Computing. 659-664.

WebKB, CMU World Wide Knowledge Base (Web->KB) project. Available at:

http://www.cs.cmu.edu/~webkb/

WEKA, Data Mining Software in Java. Available at:

http://www.cs.waikato.ac.nz/~ml/weka/

WILBUR, W. J., SIROTKIN, K, 1992. The Automatic İdentification of Stop Words.

Journal of Information Science, 18(1): 45

WIKIPEDIA, http://en.wikipedia.org/wiki/Ant_colony_optimization

WITTEN, I. H., FRANK, E., 2005. Data Mining: Practical Machine Learning Tools

and Techniques. 2nd Edition. Morgan Kaufmann, San Francisco.

YAHOO!, http://www.yahoo.com

YANG, Y., 1995. Noise Reduction in a Statistical Approach to Text Categorization.

In Proceedings of the 18th Ann Int ACM SIGIR Conference on Research

and Development in Information Retrieval. 256-263.

YANG, Y., and PEDERSEN, J. O., 1997. A Comparative Study On Feature

Selection In Text Categorization. Proc. of ICML. 412-420.

YANG, Y., WILBUR, W. J., 1996. Using Corpus Statistics to Remove Redundant

Words in Text Categorization. In J Amer Soc Inf Sci.

http://trec.nist.gov/data.html

http://www.cs.cmu.edu/~webkb/

http://www.cs.waikato.ac.nz/~ml/weka/

http://en.wikipedia.org/wiki/Ant_colony_optimization

http://www.yahoo.com

71

YU, E. S., and LIDDY, E. D., 1999. Feature Selection in Text Categorization Using

the Baldwin Effect. Proceedings of International Joint Conference on

Neural Networks. Washington DC.

YU, H., HAN, J,. and CHANG, K. C.-C., 2004. PEBL: Web Page Classification

Without Negative Examples. IEEE Transactions on Knowledge and Data

Engineering. 16 (1): 70-81.

73

CURRICULUM VITAE

Esra Saraç was born in İskenderun, in 1986. She has completed her

elementary education at İskenderun Demirçelik Primary Education School. She went

to high school at İskenderun Demirçelik Anatolian High School. Then she deserved

to educate in Niğde Science School. She has completed university education at

department of Computer Engineering of Çukurova University in 2008. Since 2008,

she has been working as a research assistant at Computer Engineering Department of

Çukurova University in Adana.

Çukurova university institute of natural and applied ... · bu çalışmada, web sayfaları...

Documents