
A SURVEY- LINK ALGORITHM FOR WEB MINING

Gurpreet Kaur
M.Tech, Research Scholar, Dept. of CSE,
S.G.G.S.W.U., Fatehgarh Sahib (Punjab), India

Shruti Aggarwal
M.Tech, Assistant Professor, Dept. of CSE,
S.G.G.S.W.U., Fatehgarh Sahib (Punjab), India

Abstract- Web mining is an active and rapidly growing research area. It integrates information gathered by traditional data mining methodologies and techniques with information gathered over the World Wide Web (WWW). Based on the kind of data mined, web mining is divided into three categories: web content mining, web structure mining and web usage mining. Search engines are a prominent application of web mining: most of them rank their search results in response to users' queries to make navigation easier. In this paper we survey page ranking algorithms and describe Weighted Page Content Rank (WPCR), which is based on both web content mining and web structure mining and is reported to determine the relevancy of pages to a given query better than the Page Rank and Weighted Page Rank algorithms.

Keywords- Web mining, web content, Page Rank, Weighted Page Rank, Weighted Page Content Rank, web structure.

I INTRODUCTION

The World Wide Web is the collection of information resources on the Internet that are accessed using the Hypertext Transfer Protocol. It is a repository of many interlinked hypertext documents, which may contain text, images, video and other multimedia data. To analyze such data, various web applications and tools use techniques known as web mining techniques. Web mining describes the use of data mining techniques to automatically discover Web documents and services, to extract information from Web resources and to uncover general patterns on the Web. Over the years, Web mining research has been extended to cover the use of data mining and similar techniques to discover resources, patterns and knowledge from Web-related data (such as Web usage data or Web server logs). It is used to understand customer behavior, evaluate the effectiveness of a particular Web site and help quantify the success of a marketing campaign. It is a rapidly growing research area.

II Web Mining

The term web mining was first coined by Etzioni in 1996. Etzioni starts from the hypothesis that the information on the web is sufficiently structured and outlines the subtasks of web mining [5]. Web mining refers to the overall process of discovering potentially useful and previously unknown information from web documents and services. It can be viewed as an extension of standard data mining to web data.

Fig.4. Taxonomy of web mining [8].

2.1. Web Mining Process

The complete process of extracting

knowledge from Web data is as follows:

Gurpreet Kaur et al , International Journal of Computer Science & Communication Networks,Vol 3(2), 105-110


ISSN:2249-5789

Web mining can be decomposed into the following subtasks:

Resource finding:- the task of retrieving the intended Web documents. By resource finding we mean the process of retrieving data, either online or offline, from the text sources available on the web, such as electronic newsletters, electronic newswires and the text contents of HTML documents obtained by removing HTML tags, as well as the manual selection of Web resources.

Information selection and pre-processing:- automatically selecting and pre-processing specific information from the retrieved Web resources. This is a transformation of the original data retrieved in the information retrieval (IR) process. The transformation can be a kind of pre-processing aimed at obtaining the desired representation, such as finding phrases in the training corpus or transforming the representation to relational or first-order logic form.

Generalization:- automatically discovering general patterns at individual Web sites as well as across multiple sites. Machine learning or data mining techniques are typically used in this step. Humans play an important role in the information or knowledge discovery process on the Web, since the Web is an interactive medium.

Analysis:- validation and/or interpretation of the mined patterns.
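The first two subtasks can be illustrated with a minimal Python sketch. It assumes the page is already available as an HTML string; the names `TextExtractor`, `html_to_text` and `tokenize` and the sample page are hypothetical and only serve the illustration. It is not a complete pre-processing pipeline (for example, it does not skip script or style content).

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and drops the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.chunks.append(data)


def html_to_text(html):
    """Resource finding: obtain the text content by removing HTML tags."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text runs and collapse repeated whitespace.
    return " ".join(" ".join(parser.chunks).split())


def tokenize(text):
    """Pre-processing: lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


# Hypothetical sample page.
page = "<html><body><h1>Web Mining</h1><p>Mining the <b>Web</b> graph.</p></body></html>"
text = html_to_text(page)
tokens = tokenize(text)
print(text)
print(tokens)
```

The token list produced here is the kind of representation on which the generalization step (machine learning or data mining) would then operate.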

A. Web Content Mining

Web content mining [9] deals with discovering useful information or knowledge from the contents of web pages. Content data is the collection of facts that a web page contains. It consists of unstructured data such as free text, images, audio and video, semi-structured data such as HTML documents, and more structured data such as data in tables or database-generated HTML pages. The primary web resources mined in web content mining are individual pages. The results can be used to group, categorize, analyze and retrieve documents.

B. Web Structure Mining

Web structure mining [10] is the process of discovering structure information from the web. The structure of a typical web graph consists of web pages as nodes and hyperlinks as edges connecting related pages.

Fig.5. Web Graph Structure.[12].

a.) Hyperlinks: A hyperlink is a

structural unit that connects a location in a

web page to a different location, either

within the same web page or on a different

web page. A hyperlink that connects to a

different part of the same page is called an

intra-document hyperlink, and a hyperlink

that connects two different pages is called an

inter-document hyperlink.

b.) Document structure: In addition,

the content within a web page can also be

organized in a tree structured format, based

on the various HTML and XML tags within

the page. Mining efforts here have focused

on automatically extracting document object



model structures out of documents.
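The distinction between intra-document and inter-document hyperlinks can be sketched in Python with the standard `urllib.parse` module. The function name `classify_hyperlink` and the example URLs are hypothetical; the sketch assumes that two links point to the same document when their URLs are identical after the fragment part is removed.

```python
from urllib.parse import urljoin, urldefrag


def classify_hyperlink(page_url, href):
    """Return 'intra-document' if href points into page_url itself,
    otherwise 'inter-document'."""
    # Resolve a relative href against the page that contains it,
    # then strip the '#fragment' part from both URLs.
    target, _fragment = urldefrag(urljoin(page_url, href))
    source, _ = urldefrag(page_url)
    return "intra-document" if target == source else "inter-document"


print(classify_hyperlink("http://example.org/a.html", "#section2"))
print(classify_hyperlink("http://example.org/a.html", "b.html"))
```

Only inter-document hyperlinks contribute edges between different nodes of the web graph.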

C. Web Usage Mining

Web usage mining [11] is the process of finding out what users are looking for on the Internet. It focuses on techniques that can predict the behavior of users while they interact with the WWW. It collects data from web log records to discover user access patterns of web pages. Usage data captures the identity or origin of web users along with their browsing behavior at a website. There are two main tendencies in web usage mining, driven by the applications of the discoveries: general access pattern tracking and customized usage tracking. General access pattern tracking analyzes the web logs to understand access patterns and trends. Customized usage tracking analyzes the behavior of individual users; its purpose is to customize websites to their users.
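As a rough illustration of the two tendencies, the following Python sketch works on a simplified, hypothetical log in which every record is a (user, URL) pair; real web server logs carry more fields and would need parsing first. The function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Hypothetical, already parsed log records: (user or origin, requested URL).
log_records = [
    ("10.0.0.1", "/index.html"),
    ("10.0.0.2", "/index.html"),
    ("10.0.0.1", "/products.html"),
    ("10.0.0.1", "/cart.html"),
    ("10.0.0.2", "/about.html"),
]


def general_access_pattern(records):
    """General access pattern tracking: how often each page is requested."""
    return Counter(url for _user, url in records)


def customized_usage(records):
    """Customized usage tracking: the sequence of pages visited per user."""
    paths = defaultdict(list)
    for user, url in records:
        paths[user].append(url)
    return dict(paths)


print(general_access_pattern(log_records).most_common(1))
print(customized_usage(log_records)["10.0.0.1"])
```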

III Link Analysis Algorithms

Web mining techniques obtain additional information from the hyperlinks that connect different documents. The web can be viewed as a directed labeled graph whose nodes are the documents or pages and whose edges are the hyperlinks between them. This directed graph structure is known as the web graph. A number of algorithms based on link analysis have been proposed. Three important algorithms, Page Rank, Weighted Page Rank and Weighted Page Content Rank, are discussed below.

A. Page Rank

This algorithm was developed by Brin and Page at Stanford University and extends the idea of citation analysis. In citation analysis the incoming links of a page are treated as citations, but simply counting them gives only a rough approximation of the importance of a page. Page Rank refines this approach: it computes the importance of a web page from the pages that link to it, taking their importance into account. These links are called backlinks. A backlink that comes from an important page is given a higher weight than one that comes from an unimportant page. A link from one page to another is considered a vote. Not only the number of votes that a page receives matters, but also the importance of the pages that cast those votes. Page and Brin proposed the following formula to calculate the page rank of a page A:

$$PR(A) = (1-d) + d\left(\frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)}\right)$$

Here PR(Ti) is the page rank of a page Ti that links to page A, C(Ti) is the number of outlinks on page Ti and d is the damping factor, which prevents other pages from having too much influence: the total vote is "damped down" by multiplying it by d, which is usually set to 0.85.

Brin and Page describe the page ranks as forming a probability distribution over the web pages, so that the page ranks of all web pages sum to one; with the formula as written above this holds only if the (1-d) term is divided by the total number of pages. The page rank of a page can be calculated without knowing the final values of the page ranks of the other pages. It is an iterative algorithm whose result corresponds to the principal eigenvector of the normalized link matrix of the web. The page rank of a page depends on the number of pages pointing to it and on the page ranks of those pages.
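The iterative computation can be sketched in Python as follows. This is a minimal illustration of the formula above, not a production implementation: the function name `page_rank`, the fixed number of iterations and the three-page sample graph are assumptions, and the sketch assumes that every page has at least one outlink (pages without outlinks are not treated specially).

```python
def page_rank(out_links, d=0.85, iterations=50):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all backlinks T of A.

    out_links maps each page to the list of pages it links to.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    # Invert the link table to obtain the backlinks of every page.
    back_links = {p: [] for p in pages}
    for source, targets in out_links.items():
        for target in targets:
            back_links[target].append(source)

    pr = {p: 1.0 for p in pages}  # arbitrary starting values
    for _ in range(iterations):
        # All new values are computed from the previous iteration's ranks.
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(out_links[t]) for t in back_links[p])
            for p in pages
        }
    return pr


# Hypothetical web graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = page_rank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this sample graph page C ranks highest because it receives votes from both A and B, and A ranks above B because its only backlink comes from the highly ranked page C. With this form of the formula the ranks sum to the number of pages rather than to one.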

B. Weighted Page Rank

The more popular a web page is, the more other web pages tend to link to it or to be linked to by it. The Weighted Page Rank algorithm, an extension of the Page Rank algorithm, assigns larger rank values to more important pages instead of dividing the rank value of a page evenly among its outlink pages. Each outlink page gets a value proportional to its popularity. The popularity derived from the number of inlinks and outlinks is recorded as Win(v,u) and Wout(v,u), respectively. Win(v,u) is the weight of link (v,u), calculated from the number of inlinks of page u and the number of inlinks of all reference pages of page v:



$$W^{in}(v,u) = \frac{I_u}{\sum_{p \in R(v)} I_p}$$

where Iu and Ip represent the number of inlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v (the pages that v links to). Wout(v,u) is the weight of link (v,u), calculated from the number of outlinks of page u and the number of outlinks of all reference pages of page v:

$$W^{out}(v,u) = \frac{O_u}{\sum_{p \in R(v)} O_p}$$

where Ou and Op represent the number of outlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v.
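The two link weights can be computed directly from a link table, as the following Python sketch shows. The helper names `in_out_counts`, `w_in` and `w_out` and the sample graph are hypothetical, and R(v) is taken to be the list of pages that v links to.

```python
def in_out_counts(out_links):
    """Return (inlink count, outlink count) dictionaries for every page."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    in_count = {p: 0 for p in pages}
    out_count = {p: len(out_links.get(p, [])) for p in pages}
    for targets in out_links.values():
        for target in targets:
            in_count[target] += 1
    return in_count, out_count


def w_in(out_links, v, u):
    """Win(v,u) = I_u / sum of I_p over all reference pages p of v."""
    in_count, _ = in_out_counts(out_links)
    total = sum(in_count[p] for p in out_links[v])
    return in_count[u] / total


def w_out(out_links, v, u):
    """Wout(v,u) = O_u / sum of O_p over all reference pages p of v."""
    _, out_count = in_out_counts(out_links)
    total = sum(out_count[p] for p in out_links[v])
    # If no reference page of v has outlinks, the weight is taken as 0 (assumption).
    return out_count[u] / total if total else 0.0


# Hypothetical web graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
print(w_in(graph, "A", "B"), w_in(graph, "A", "C"))
print(w_out(graph, "A", "B"), w_out(graph, "A", "C"))
```

In this graph C has two inlinks and B has one, so the link from A to C receives two thirds of the inlink weight and the link from A to B one third, while the outlink weight is split evenly because B and C each have one outlink.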

C. Weighted Page Content Rank

Weighted Page Content Rank (WPCR) is a proposed page ranking algorithm that gives a sorted order to the web pages returned by a search engine in response to a user query. The WPCR of a page is a numerical value on the basis of which the web pages are ordered. The algorithm employs web structure mining as well as web content mining techniques: web structure mining is used to calculate the importance of a page, and web content mining is used to determine how relevant the page is. Importance here means the popularity of the page, i.e. how many pages point to it or are referred to by it; it is calculated from the number of inlinks and outlinks of the page. Relevancy means how well the page matches the submitted query: the more closely a page matches the query, the more relevant it is.

Algorithm of Weighted Page Content Rank

Input: Page P, inlink and outlink weights of all backlinks of P, query Q, d (damping factor).

Output: Rank score

Step 1: Relevance calculation:
a) Find all meaningful word strings of Q (say N strings).
b) Find whether the N strings occur in P. Z = sum of the frequencies of all N strings.
c) S = set of the maximum possible strings occurring in P.
d) X = sum of the frequencies of the strings in S.
e) Content Weight (CW) = X/Z
f) C = number of query terms in P.
g) D = number of all query terms of Q, ignoring stop words.
h) Probability Weight (PW) = C/D

Step 2: Rank calculation:
a) Find all backlinks of P (say set B).
b) $$PR(P) = (1-d) + d\left[\sum_{V \in B} PR(V)\, W^{in}(V,P)\, W^{out}(V,P)\right](CW + PW)$$
c) Output PR(P), i.e. the rank score.
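Parts of the algorithm above can be sketched in Python. The sketch covers only the probability weight (steps f to h) and the rank formula of Step 2; the content weight CW is passed in as a given number, because the string-matching rules of steps a to e are not specified in enough detail here. The stop-word list, the function names and all sample values are assumptions made for illustration.

```python
# Illustrative stop-word list (assumption, not taken from the paper).
STOP_WORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to"}


def probability_weight(query, page_text):
    """PW = C / D: share of non-stop-word query terms that occur in the page."""
    terms = [w for w in query.lower().split() if w not in STOP_WORDS]  # D terms
    page_words = set(page_text.lower().split())
    present = sum(1 for w in terms if w in page_words)  # C
    return present / len(terms) if terms else 0.0


def wpcr_score(backlinks, cw, pw, d=0.85):
    """Step 2 b): (1-d) + d * [sum of PR(V) * Win * Wout over backlinks] * (CW + PW).

    backlinks is a list of (PR(V), Win, Wout) triples, one per page V in B.
    """
    link_part = sum(pr_v * w_in * w_out for pr_v, w_in, w_out in backlinks)
    return (1 - d) + d * link_part * (cw + pw)


# Hypothetical query, page text, backlink data and content weight.
pw = probability_weight("page rank algorithm for the web", "the page rank of a web page")
score = wpcr_score([(1.0, 0.5, 0.5), (2.0, 0.25, 1.0)], cw=0.5, pw=pw)
print(pw, score)
```

Because the link-based sum is multiplied by (CW + PW), a page with many strong backlinks but little overlap with the query is pushed down relative to a page that matches the query well.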

Comparison of Algorithms

The table shows the differences between the above three algorithms:

Table: Comparison of Page Rank, Weighted Page Rank and Weighted Page Content Rank



IV Conclusion

Web mining is the data mining technique that automatically discovers or extracts information from web documents. The Page Rank and Weighted Page Rank algorithms are used in web structure mining to rank the relevant pages. In this paper we focused on a comparative study of page ranking algorithms. With Page Rank and Weighted Page Rank, users may not easily get the relevant documents they require, whereas with the newer Weighted Page Content Rank algorithm users can get relevant and important pages more easily, as it employs both web structure mining and web content mining. As part of our future work, we plan to implement the Weighted Page Content Rank algorithm and integrate it with a clustering algorithm, with the aim of finding the required relevant and important pages more easily.

REFERENCES

[1] Tipawan Silwattananusarn and Kulthida Tuamsuk, “Data Mining and its Applications for Knowledge Management: A Literature Review from 2007 to 2012,” International Journal of Data Mining and Knowledge Management Process, Vol. 2, No. 5, Sept. 2012.

[2] Neelamadhab Padhy, Pragnyaban Mishra and Rasmita Panigrahi, “The Survey of Data Mining Applications and Feature Scope,” Vol. 2, No. 3, June 2012.

[3] Venkatadri M., “A Review on Data Mining from Past to the Future,” Vol. 15, No. 7, Feb. 2011.

[4] Osmar R. Zaiane, “Introduction to Data Mining,” 1999.

[5] Chintandeep Kaur and Rinkle Rani Aggarwal, “Web Mining Tasks and Types: A Survey,” Vol. 2, Issue 2, Feb. 2012.

[6] N. Senthil Kumar and P. M. Durai Raj Vincent, “Web Mining An Integrated Approach,” Vol. 2, Issue 3, March 2012.

[7] G T Rajul and P S Satyanarayana, “Knowledge Discovery from Web Usage Data: Complete Preprocessing Methodology,” IJCSNS International Journal of Computer Science and Network Security, Vol. 8, No. 1, January 2008.

[8] Shakti Kundu, “An Intelligent Approach of Web Data Mining,” International Journal of Computer Science and Engg (IJCSE), Vol. 4, No. 5, May 2012.

[9] Boley D, Gini ML, Gross R, Han EH, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J, “Document categorization and query generation on the World Wide Web using WebACE,” J Artif Intell Rev, 13(5-6):365-91, 1999.

[10] Pirolli P, Pitkow J, Rao R, “Silk from a sow’s ear: extracting usable structures from the web,” in Proceedings of the Conference on Human Factors in Computing Systems, Vancouver, British Columbia, Canada, 1996, pp. 118-25.

[11] Masseglia F, Poncelet P, Cicchetti R, “An efficient algorithm for web usage mining,” J Networking Inf Syst, 2(5-6):571-603, 1999.

[12] Taher H. Haveliwala, “Topic-Sensitive Page Rank: A Context-Sensitive Ranking Algorithm for Web Search,” IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 4, July/August 2003.

[13] Tamana Bhatia, “Link Analysis Algorithms for Web Mining,” IJCST, Vol. 2, Issue 2, June 2011.

[14] Rekha Jain and G. N. Purohit, “Page Ranking Algorithms for Web Mining,” International Journal of Computer Applications, Vol. 13, No. 5, Jan. 2011.

[15] Neelam Tyagi and Simple Sharma, “Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page,” International Journal of Soft Computing and Engineering, Vol. 2, Issue 3, July 2012.



[16] N. Duhan, A. K. Sharma and K. K. Bhatia, “Page Ranking Algorithms: A Survey,” Proceedings of the IEEE International Conference on Advance Computing, 2009, 978-1-4244-1888-6.

[17] Pooja Sharma, Deepak Tyagi and Pawan Bhadana, “Weighted Page Content Rank for Ordering Web Search Result,” International Journal of Engineering Science and Technology, Vol. 2(12), 2010, 7301-7310.

[18] S. Chakrabarti et al., “Mining the Web’s Link Structure,” Computer, 32(8):60-67, 1999.

[19] Raymond Kosala and Hendrik Blockeel, “Web Mining Research: A Survey,” ACM SIGKDD Explorations Newsletter, Volume 2, June 2000.

[20] Cooley R., Mobasher B., Srivastava J., “Web Mining: Information and Pattern Discovery on the World Wide Web,” in Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach, CA, 1997.

[21] Companion slides for the text by M. H. Dunham, “Data Mining: Introductory and Advanced Topics,” Prentice Hall, 2002.

[22] Wenpu Xing and Ali Ghorbani, “Weighted Page Rank Algorithm,” Proceedings of the Second Annual Conference on Communication Networks and Services Research (CNSR’04), IEEE, 2004.

