[ieee 2013 ieee recent advances in intelligent computational systems (raics) - trivandrum, india...

2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS), 978-1-4799-2178-2/13/$31.00 ©2013 IEEE


Page 1: [IEEE 2013 IEEE Recent Advances in Intelligent Computational Systems (RAICS) - Trivandrum, India (2013.12.19-2013.12.21)] 2013 IEEE Recent Advances in Intelligent Computational Systems

Clustering of Web Sessions by FOGSAA

Angana Chakraborty
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: angana [email protected]

Sanghamitra Bandyopadhyay
Machine Intelligence Unit, Indian Statistical Institute, Kolkata, India
Email: [email protected]

Abstract: Clustering web sessions to identify visitors' choices while they browse web pages is an important problem in web mining. The sequence of pages viewed by a user in a particular time frame, i.e., the session, captures his/her interest in a specific topic. Clustering these sessions is therefore needed to provide customized services to users with similar interests. In this article, we propose a novel and accurate similarity measure, Psim, between two web pages and a method of clustering web sessions using a recently developed Fast Optimal Global Sequence Alignment Algorithm (FOGSAA). FOGSAA is an optimal global alignment algorithm which is used to align pairs of sessions. It yields the pairwise distances, which are used to cluster the sessions into similar groups. FOGSAA aligns the sessions in much less time, with an average time gain of 35.84% over the conventional dynamic programming based Needleman-Wunsch method, while both generate the same optimal alignment. Therefore, applying FOGSAA to align the sessions makes the procedure faster while maintaining the quality.

I. INTRODUCTION

In the web mining literature, a web session is defined as the sequence of web pages viewed by a particular user within a specific time span. The interest or intention of the user is reflected in the sequence of pages that he/she visits in a web site. Therefore, users with a common interest will have a similar ordering pattern of accessed pages in the corresponding sessions. In today's era of technological advancement, people are showing more and more interest in web-based communication, marketing and entertainment. Attractive web sites and the advantage of getting everything through the web while sitting at home have already captured the attention of customers for online shopping, e-business, e-learning, etc. Web masters analyze the previous logs of user access and extract information to provide personalized services and suggestions [1]. Web mining techniques, such as those reported in [2], [3], [4], [5], have been developed to predict and suggest similar options that the user may like, after investigating his/her navigation pattern in the site. Clustering of web sessions is one such technique, which groups sessions based on the access pattern and discovers the pages of particular interest.

Although significant research, as mentioned in [6], [7], has already been done to extract the user activity pattern from the web log, the problem is not totally resolved. The huge size of the log files still demands a time-efficient as well as accurate algorithm to predict user interactions. The well known clustering methods are especially suitable for numeric data, while the user access data obtained from the web log file are categorical in nature. Common distance measures like the Euclidean distance and the vector angle similarity metric will not work here. In this article, we propose a novel similarity measure Psim which computes the page level similarity efficiently. Most of the existing session clustering techniques, like [8], treat sessions as unordered sets of accessed pages, where the similarity is usually measured by the cosine factor or the Jaccard coefficient. However, these methods fail to capture the important ordering information of the click-stream. A user who visits page A for a significant amount of time after browsing page B surely has a different interest pattern from one who visits page B after page A. This problem was addressed in [9], [10], and [11], where sessions were viewed as sequences of hyperlinks. They used conventional dynamic programming based sequence alignment techniques like Needleman-Wunsch (NW) [12] and GAP3 [13] to compute the session similarity for clustering. However, these methods are time expensive, as they take O(m ∗ n) time to compute the alignment, where m and n are the respective lengths of the two sequences. The LCS method proposed in [14] has the same time requirement. As user activity in online applications keeps increasing, developing time-efficient methods to analyze these data has become a challenge.

In this article, we use a very fast alignment technique, FOGSAA [15], for clustering web sessions. FOGSAA is a global alignment method that was originally developed for the alignment of gene and/or protein sequences. In this article, we suitably modify FOGSAA to show its application in web mining. It achieves a time gain of 35.84% over NW, while both methods output the same optimal alignment. The proposed method of web session clustering using FOGSAA can be summarized as follows:

1) First, the web log data is cleaned by removing the image, media and script file records.

2) Sessions are identified from the web log data by setting a specific time window. The time spent by the user on each page is also recorded.

3) The similarity between two web pages is computed using the proposed novel metric Psim.

4) Then, FOGSAA [15] is used to align the session pairs, where the page level similarity is defined by Psim.

5) Finally, the sessions are clustered using a hierarchical clustering method based on the FOGSAA distance score.

The rest of the article is structured as follows: Section II describes the methods and the similarity measure in detail. Section III presents the results, indicating the time efficiency and the accuracy of the proposed method. Finally, Section IV concludes the article and outlines the future scope.

II. PROPOSED METHOD

The sequence of pages browsed by a particular user in a specific time window is defined as a session. If a user views $n$ pages in a session $s_i$, then it can be formally defined as:

Definition 1. Session: Let $s_i = \{(P_1, S_1, \tau_1), (P_2, S_2, \tau_2), \ldots, (P_n, S_n, \tau_n)\}$, where each tuple $(P_j, S_j, \tau_j)$ gives the $j$th page $P_j$, the size of that page $S_j$, and the amount of time $\tau_j$ spent by the user on this page, respectively.

The purpose of clustering these sessions is to provide personalized services to users with similar interests. In this article, we use a recently proposed sequence alignment based method, FOGSAA [15], to compute the distance between pairs of sessions. A sequence is defined as a stream of characters over a finite alphabet. An alignment is a way of arranging these sequences to identify how close or far apart they are and where they differ. The optimal alignment is the one that yields the maximum number of matches and the minimum number of mismatches and gaps. Formally, sequence alignment can be defined as follows [16]:

Definition 2. Sequence Alignment: Let $A : (a_1 a_2 \ldots a_m)$ and $B : (b_1 b_2 \ldots b_n)$ be two sequences of length $m$ and $n$ respectively. An alignment $A\ell(A,B)$ is a set of pairs $\{(x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k)\}$, such that:

1) $k \le m + n$.

2) $x_i = a_j$ for some $1 \le j \le m$, or $x_i$ = "−", i.e., the gap symbol.

3) $y_i = b_j$ for some $1 \le j \le n$, or $y_i$ = "−", i.e., the gap symbol.

4) For a pair $(x_i, y_i)$, $x_i$ and $y_i$ cannot both be "−".

5) If all the "−" characters are removed from the sequence $x_1 \ldots x_k$, we get $A$.

6) If all the "−" characters are removed from the sequence $y_1 \ldots y_k$, we get $B$.

Definition 3. Score Function: Given an alignment $A\ell(A,B) = \{(x_1, y_1), \ldots, (x_k, y_k)\}$, the score of the alignment, denoted by $SC(A\ell(A,B))$, is defined as:

$$SC(A\ell(A,B)) = \sum_{i=1}^{k} SC_{x_i,y_i}, \quad \text{where} \quad SC_{x_i,y_i} = \begin{cases} M & \text{if } x_i = y_i \\ Ms & \text{if } x_i \neq y_i \\ G & \text{if } x_i = \text{gap} \lor y_i = \text{gap} \end{cases} \tag{1}$$

Here, M = Match Score, Ms = Mismatch Score and G = Gap Penalty. If M = 0, Ms = −1 and G = −1, then the negated score of an alignment equals its number of edit operations, so maximizing the score is equivalent to minimizing the Levenshtein distance between the two sequences.
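As a small illustration of Definition 3, the following Python sketch scores an alignment that has already been built. The function name `alignment_score`, the use of "-" as the gap symbol, the page labels and the default score values are assumptions made for this example only.

```python
GAP = "-"  # assumed gap symbol for this sketch


def alignment_score(pairs, match=0, mismatch=-1, gap=-1):
    """Sum the per-column scores of Eq. (1) over an alignment.

    pairs is a list of (x_i, y_i) columns; the defaults are the
    Levenshtein-style scores M = 0, Ms = -1, G = -1.
    """
    total = 0
    for x, y in pairs:
        if x == GAP and y == GAP:
            raise ValueError("a column cannot contain two gaps")
        if x == GAP or y == GAP:
            total += gap          # gap column
        elif x == y:
            total += match        # matching pages
        else:
            total += mismatch     # two different pages
    return total


# Hypothetical sessions over page labels: A B C D versus A C D E
alignment = [("A", "A"), ("B", "-"), ("C", "C"), ("D", "D"), ("-", "E")]
print(alignment_score(alignment))
print(alignment_score(alignment, match=1, mismatch=-1, gap=-2))
```

With the default scores the result is the negated number of edit operations in the alignment (here two gaps), which is the Levenshtein reading of the score function.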

Definition 4. Optimal Sequence Alignment: The optimal sequence alignment is an alignment $A\ell^*(A,B)$ whose score $SC(A\ell^*(A,B))$ is greater than or equal to the scores of all other possible alignments. For standard scoring schemes, it is the alignment with the maximum number of matches and the minimum number of mismatches and gaps. $A\ell^*(A,B)$ can be formally defined as:

$$A\ell^*(A,B) = \arg\max_{\forall A\ell} SC(A\ell(A,B)) \tag{2}$$
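For reference, the optimum of Eq. (2) can be computed with the classical Needleman-Wunsch dynamic program, which is the baseline that FOGSAA is compared against in this article. The sketch below is that textbook O(m ∗ n) recurrence, not FOGSAA itself; the function name `nw_score` and the default score values are assumptions for illustration.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b (Needleman-Wunsch).

    Only the previous row of the dynamic programming table is kept,
    so memory is O(n) while time stays O(m * n).
    """
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]        # row 0: b aligned against gaps
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n                # column 0: a aligned against gaps
        for j in range(1, n + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair,     # align a[i-1] with b[j-1]
                          prev[j] + gap,          # gap in b
                          curr[j - 1] + gap)      # gap in a
        prev = curr
    return prev[n]


# Sequences can be strings or lists of page identifiers.
print(nw_score(["A", "B", "C", "D"], ["A", "C", "D", "E"]))
print(-nw_score("kitten", "sitting", match=0, mismatch=-1, gap=-1))  # Levenshtein distance
```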

FOGSAA [15] is an optimal global sequence alignment algorithm which is used to align the sessions. The normalized score obtained from FOGSAA is used to perform the hierarchical clustering. The entire procedure is described sequentially in the following subsections.

A. Session Identification

A server log file contains the list of web pages that have been accessed by different users. Each entry of this web log file has the format: (user-IP, date and time, the URL of the requested server page, protocol, server return code in response, size of the page). An example entry from the NASA August 95 web log [17] is: uplherc.upl.com - - [01/Aug/1995:00:00:08 -0400] “GET /images/ksclogo-medium.gif HTTP/1.0” 200 130. A session is the sequence of pages that have been accessed by a particular user within a specific time. The web log file is first sorted according to the user-IP field. This file is then processed to identify the different sessions. A new session is created when an entry from a new IP address is found. In addition, a session stays alive only for a specific time frame even if the user IP remains the same. The reason is that a user has a certain motivation or intention in mind when he/she clicks through the pages within a session. If the same user starts accessing pages again after a long gap, then he/she is presumably browsing with a different interest. Therefore, setting the time limit for the sessions is important, as it is responsible for capturing user intentions. In our experiment, an existing session expires if the time gap between two successive page accesses is more than 30 minutes.
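The cleaning and session identification steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the log is in Common Log Format, the list of skipped file extensions is a guess at what "image, media and script" records means, the time spent on a page is approximated by the gap to the next request of the same session (and is left unknown for the last page), and the names `parse` and `sessionize` are hypothetical. Parsing the month abbreviation with `%b` also assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Common Log Format: host ident user [timestamp] "METHOD url protocol" status size
LOG_RE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" (\d{3}) (\S+)')
SKIPPED = (".gif", ".jpg", ".jpeg", ".xbm", ".mpg", ".wav", ".js", ".css")  # assumed list
TIMEOUT = timedelta(minutes=30)  # session expiry used in the experiment


def parse(line):
    """Return (host, time, url, size), or None for a malformed line."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    host, stamp, url, _status, size = m.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, url, int(size) if size.isdigit() else 0


def sessionize(lines):
    """Group cleaned hits by host and cut a session at gaps above TIMEOUT."""
    hits = {}
    for line in lines:
        rec = parse(line)
        if rec and not rec[2].lower().endswith(SKIPPED):
            hits.setdefault(rec[0], []).append(rec[1:])
    sessions = []
    for host in sorted(hits):
        ordered = sorted(hits[host], key=lambda h: h[0])
        current = [ordered[0]]
        for prev, hit in zip(ordered, ordered[1:]):
            if hit[0] - prev[0] > TIMEOUT:
                sessions.append(current)
                current = []
            current.append(hit)
        sessions.append(current)
    # Emit (page, size, seconds until the next request); None for the last page.
    return [[(u, s, (nxt[0] - t).total_seconds() if nxt else None)
             for (t, u, s), nxt in zip(sess, sess[1:] + [None])]
            for sess in sessions]
```

Each returned session is a list of (page, size, time) tuples in the spirit of Definition 1; how the time on the last page of a session should be estimated is not specified in the text, so it is left as `None` here.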

B. Similarity Measure for Web Pages

In the alignment of sessions, the pages are arranged so as to obtain the maximum number of matches while inserting the minimum number of gaps. Hence, it is first necessary to define a measure of the similarity between two web pages. In this article, we define a novel page similarity measure Psim, which is used by FOGSAA in the alignment phase. Psim has two components: the similarity between the URLs of the two pages, i.e., the U-factor, and the relation between the times spent on the pages, i.e., the τ-factor. Before computing the similarity value Psim, the page URLs from the server log are generalized by two levels while the minimum depth is kept at 3. Note that the total number of levels in the URL is termed the depth of the URL. Let two pages $P_i$ and $P_j$ have the generalized URLs $\{/h^{P_i}_1/h^{P_i}_2/\ldots/h^{P_i}_{l_1}\}$ and $\{/h^{P_j}_1/h^{P_j}_2/\ldots/h^{P_j}_{l_2}\}$ respectively. The depths of the corresponding URLs are $l_1$ and $l_2$ respectively. Let the times spent by the user on these two pages be $\tau_1$ and $\tau_2$. Then $P_{sim}(P_i, P_j)$ and its two components, the U-factor and the τ-factor, are defined as:

Definition 5. U-factor: The U-factor of two pages $P_i$ and $P_j$ is the sum of the weights of the matched URL levels before the first mismatch, starting from the top level, normalized by the sum of all level weights. The root has the highest weight, which is equal to the maximum depth of the two URLs. The weight decreases by 1 for each subsequent level. The U-factor can be formally defined as:

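$$U_{score}(P_i, P_j) = \frac{\sum_{k=1}^{l'} SC_u(h^{P_i}_k, h^{P_j}_k)}{\sum_{k=1}^{l} k} \tag{3}$$

where $l' = \min(l_1, l_2, \text{position before the first mismatch})$ and $l = \max(l_1, l_2)$. $SC_u(h^{P_i}_k, h^{P_j}_k)$ is defined as:

$$SC_u(h^{P_i}_k, h^{P_j}_k) = \begin{cases} l - k + 1 & \text{if } h^{P_i}_k = h^{P_j}_k \\ 0 & \text{otherwise} \end{cases} \tag{4}$$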
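Eqs. (3) and (4) can be sketched in a few lines of Python. The function name `u_score` and the example URLs are hypothetical, and the sketch assumes that the URLs passed in have already been generalized as described above.

```python
def u_score(url_a, url_b):
    """U-factor of Eq. (3): weighted length of the common URL prefix.

    Level k (1-based, root first) carries weight l - k + 1, where l is the
    larger of the two depths; the sum is normalized by 1 + 2 + ... + l.
    """
    a = [part for part in url_a.split("/") if part]
    b = [part for part in url_b.split("/") if part]
    l = max(len(a), len(b))
    if l == 0:
        return 0.0
    matched = 0
    for k, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:               # stop at the first mismatching level
            break
        matched += l - k + 1     # SC_u of Eq. (4)
    return matched / (l * (l + 1) / 2)


# Two hypothetical pages sharing the first two of three levels: (3 + 2) / 6
print(u_score("/shuttle/missions/sts-71", "/shuttle/missions/sts-70"))
```

Because the weights decrease with depth, a mismatch near the root costs far more than one at a leaf: two URLs that differ at the top level score 0 regardless of what follows.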

Definition 6. τ-factor: The τ-factor measures the significance of a page based on the time spent by the user on this page. Let a user spend time $\tau_i$ on the page $P_i$, which has size $S_i$. The amount of time spent per unit size, $\tau'_i$, is defined as $\tau'_i = \tau_i / S_i$. If the session has $n$ pages $(P_1, P_2, \ldots, P_n)$, then the total time per unit size, denoted by $T$, is defined as $T = \sum_{i=1}^{n} \tau'_i$. So, $\tau'_i / T$ denotes the fraction of the total time that is spent per unit size of the page $P_i$. Hence, it lies within the range (0, 1]. The term $\tau'_i / T$ determines the weight or relevance of the page to the user: the higher the value of the fraction, the greater the importance of the page to the user. However, it may happen that the user has kept the page open for a long time while doing some other work. To take care of this problem, $\tau_{score}$ is defined as:

$$\tau_{score}(P_i) = \frac{\tau'_i / T}{1 + \tau'_i / T} = \frac{1}{1 + T / \tau'_i} \tag{5}$$

Fig. 1. Plot of τscore in y-axis and τ ′i/T in x-axis

As can be seen from Fig. 1, when τ′i/T is small, say in the range (0, 0.4], a small change in it causes a relatively large change in τscore. However, when the value of τ′i/T increases beyond 0.4, the rate of change of τscore is reduced. Therefore, τscore does not give much weight to idle pages that are kept open for a long time. As τ′i/T ranges over (0, 1], the τscore function has the range (0, 0.5].
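The computation of τscore for every page of a session can be sketched as below. The function name `tau_scores` and the sample session are hypothetical; the sketch assumes every page has a positive size and a positive, known time, which the definition implicitly requires.

```python
def tau_scores(session):
    """Return tau_score (Eq. 5) for each (page, size, time) tuple of a session."""
    per_unit = [time / size for _page, size, time in session]   # tau'_i
    total = sum(per_unit)                                       # T
    # x / (1 + x) with x = tau'_i / T saturates as x grows, damping idle pages
    return [(t / total) / (1 + t / total) for t in per_unit]


# Hypothetical session: two pages of equal size viewed for 10 s and 30 s
print(tau_scores([("/a.html", 100, 10), ("/b.html", 100, 30)]))
```

In this example the second page receives three times the viewing time of the first, but its score (3/7) is only about twice that of the first (1/5), which is the damping effect discussed above.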

Definition 7. Psim: Psim gives the similarity between two pages by combining the URL similarity score with the access time similarity pattern. Psim is defined as:

$$P_{sim}(P_i, P_j) = U_{score}(P_i, P_j) \cdot \min(\tau_{score}(P_i), \tau_{score}(P_j)) \tag{6}$$
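A short worked example may help to fix the scale of Eq. (6). The numbers are hypothetical: suppose two pages share the first two of three URL levels, so that $U_{score} = (3 + 2)/(1 + 2 + 3) = 5/6$, and suppose their $\tau_{score}$ values are 0.2 and 3/7.

```latex
\begin{aligned}
P_{sim}(P_i, P_j) &= U_{score}(P_i, P_j) \cdot \min\bigl(\tau_{score}(P_i), \tau_{score}(P_j)\bigr) \\
                  &= \tfrac{5}{6} \cdot \min\bigl(0.2, \tfrac{3}{7}\bigr) \\
                  &= \tfrac{1}{6} \approx 0.167
\end{aligned}
```

Since $U_{score}$ lies in [0, 1] and $\tau_{score}$ lies in (0, 0.5], $P_{sim}$ always lies in [0, 0.5], and the page with the smaller time weight limits the similarity of the pair.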

C. Session Alignment by FOGSAA

[Flowchart: Start; expand the current node by selecting the best child according to alignment score; save the next better child in the priority queue, ordered according to fitness score; while the child is still promising, set current_node = child_node and continue expanding the current branch; at the end of the current branch, set it as the new optimal if it is better than the previous optimal; pick the most promising node from the front of the queue; if it is better than the last expanded node and can give more than 30% similarity, start a new branch from this node; when none is better, output the optimal alignment; End.]

Fig. 2. FOGSAA workflow [15]

FOGSAA [15] is an optimal global alignment method that aligns a pair of sequences defined over a finite alphabet using a branch and bound search tree. Each complete path of the FOGSAA tree corresponds to one possible alignment of the sessions. Starting from the root, FOGSAA expands the node that has the highest Fitness Score (see Definition 10). From each node, there are three options:

• Align the next pages of both sequences.

• Align the current page of the first session with the next page of the second session, i.e., insert a gap in the first line.

• Align the next page of the first session with the current page of the second one, thereby inserting a gap in the second line.

The best option is selected at each step and the other nodes are saved in a priority queue, ordered by the Fitness Score value. After completing each path, FOGSAA checks the top node of the priority queue. If it shows a higher score, then a new branch is started from it. Otherwise, FOGSAA has reached the optimal alignment, as there is no better node left. The workflow of FOGSAA is shown in Fig. 2. For details, readers may refer to [15].

The entire procedure of FOGSAA is based on the computation of the Fitness Score value. It is a combination of two other scores, the Present Score and the Future Score. The Present Score is the sum of all match/mismatch/gap scores that have been encountered so far, while the Future Score is an estimate of the score that may result when the remaining parts of the sequences are aligned. These scores are defined below:

Definition 8. Present Score [15]: Let the given pair of sessions be $s_1 : (P^1_1 P^1_2 \ldots P^1_m)$ and $s_2 : (P^2_1 P^2_2 \ldots P^2_n)$, where $|s_1| = m$ and $|s_2| = n$. If the current node is at position $(p_1, p_2)$, i.e., $p_1$ symbols from $s_1$ and $p_2$ symbols from $s_2$ have been checked, then the Present Score, denoted by $PrS$, is defined as:

$$PrS = \sum_{1 \le i \le p_1,\; 1 \le j \le p_2} SC_{ij} \tag{7}$$

The sum of the scores of all nodes from the root to the current node of the current branch gives the Present Score. Here,

$$SC_{ij} = \begin{cases} M & \text{if } P^1_i = P^2_j \\ Ms & \text{if } P^1_i \neq P^2_j \\ G & \text{if } P^1_i = \text{gap} \lor P^2_j = \text{gap} \end{cases} \tag{8}$$

where M = Match Score, Ms = Mismatch Score and G = Gap Penalty. Note that these score values can be positive or negative depending upon the requirement. As our objective is to maximize the number of matches while incurring the minimum number of mismatches and gaps, M usually takes a positive value, whereas Ms and G are negative in general.

Definition 9. Future Score [15]: If the two sequences to be aligned are $P^1_1 \ldots P^1_m$ and $P^2_1 \ldots P^2_n$ and the present node is at position $(p_1, p_2)$, then the two components $F_{min}$ and $F_{max}$ of the Future Score, for the subsequences $P^1_{p_1+1} \ldots P^1_m$ and $P^2_{p_2+1} \ldots P^2_n$, are defined as:

$$F_{min} = \begin{cases} -x_2 \cdot Ms + G \cdot (x_1 - x_2), & x_2 < x_1 \\ -x_1 \cdot Ms + G \cdot (x_2 - x_1), & \text{otherwise} \end{cases} \tag{9}$$

$$F_{max} = \begin{cases} x_2 \cdot M + G \cdot (x_1 - x_2), & x_2 < x_1 \\ x_1 \cdot M + G \cdot (x_2 - x_1), & \text{otherwise} \end{cases} \tag{10}$$

where $x_1 = (n - p_2)$ and $x_2 = (m - p_1)$.

Definition 10. Fitness Score [15]: The Fitness Score of a node, based on which the potential of a branch is evaluated, is the sum of the Present Score ($PrS$) and the Future Score. The Fitness Score, which has two components denoted by $T_{min}$ and $T_{max}$, is defined as follows:

$$T_{min} = PrS + F_{min} \tag{11}$$

$$T_{max} = PrS + F_{max} \tag{12}$$
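To make the bound concrete, the sketch below evaluates the optimistic half of the Fitness Score, i.e., $F_{max}$ of Eq. (10) and $T_{max}$ of Eq. (12). It deliberately covers only the upper bound; $F_{min}$ and $T_{min}$ are left out. The function names `future_max` and `fitness_max` and the default score values are assumptions for illustration, and the full branch and bound search of [15] is not reproduced here.

```python
def future_max(m, n, p1, p2, match=1, gap=-2):
    """F_max of Eq. (10) for a node at position (p1, p2).

    The best possible future is that every remaining page of the shorter
    remainder matches, and the length difference is filled with gaps.
    """
    x1 = n - p2   # pages left in the second session
    x2 = m - p1   # pages left in the first session
    if x2 < x1:
        return x2 * match + gap * (x1 - x2)
    return x1 * match + gap * (x2 - x1)


def fitness_max(present_score, m, n, p1, p2, match=1, gap=-2):
    """T_max of Eq. (12): Present Score plus the optimistic Future Score."""
    return present_score + future_max(m, n, p1, p2, match, gap)


# Hypothetical node: sessions of length 5 and 7, positions (2, 3), PrS = 2
print(fitness_max(2, m=5, n=7, p1=2, p2=3))
```

A node whose $T_{max}$ is not better than the score of the best complete alignment found so far cannot lead to a better alignment, which is what allows the search to discard it.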

III. RESULT

Identifying sessions and extracting the similarity between them is a classical problem in web mining. As sessions are viewed as sequences of web pages accessed over a specific time window, aligning these sequences is a natural way to measure the similarity between them. Here, a newly proposed fast global alignment tool, FOGSAA [15], is used to perform the alignment of these sessions. In this article, we show the results of our experiment in two parts:

1) The time gain obtained by using FOGSAA for session alignment over standard methods.

2) Visualization of the clustered sessions based on the similarity measure obtained by FOGSAA, where the page level similarity is computed using the proposed measure Psim.

A. Run Time Efficiency Achieved by FOGSAA

The state of the art methods based on the dynamic programming approach to align a pair of sequences are expensive with respect to time. The widely used method of Needleman-Wunsch [12] takes O(m ∗ n) time in all cases, best, worst or average, where m and n are the lengths of the two sequences. The existing methods for sequence alignment based web session clustering, like [10], [11], use the traditional NW [12] method to align the sessions. Therefore, the procedure is very slow and takes a huge amount of time for large log files. To overcome this problem, we use the recently proposed faster alignment technique, FOGSAA [15], to align the sessions. Though the worst case time complexity of FOGSAA is bounded by O(m ∗ n), the average case time requirement is much lower [18]. In the best case, FOGSAA takes O(m + n) time, which corresponds to the length of one complete path in the FOGSAA tree. The percentage time gain of FOGSAA over NW in aligning the sessions is given in Table I. In our study, we use the standard available datasets of NASA [17] and ClarkNet [19]. The experiment is conducted on an Intel Xeon(R) CPU X5650 @ 2.67GHz × 12, 64-bit machine with 48GB RAM.

Dataset                  No. of web pages   Gap penalty   Mean time in µsec     % Time gain
                                                          FOGSAA      NW
NASA access log Jul95    5268               -1            1520        2590      41.3
                                            -2            1182        2377      50.27
NASA access log Aug95    4533               -1            1290        2294      43.76
                                            -2            1080        1897      43.06
ClarkNet Aug28 log       19670              -1            2038        2674      23.78
                                            -2            1459        1830      20.27
ClarkNet Sep4 log        19768              -1            2880        4089      29.56
                                            -2            2559        3670      30.27

TABLE I. TIME GAIN OF FOGSAA OVER NW ON 4 DIFFERENT WEB LOG DATASETS OF NASA AND CLARKNET, WITH VARYING GAP PENALTY

Table I shows the computational time requirement of FOGSAA in comparison with NW, where the times are measured for a 100 session window. The average time gain over all 4 datasets is 35.84%. Table I shows that in all cases, across both gap penalties, FOGSAA outperforms NW. This establishes FOGSAA as a time-efficient solution for aligning sessions that preserves the quality of the alignment as well. Fig. 3 depicts the histogram of the time gain of FOGSAA over NW for 100 session pairs, with the time gain on the x-axis and the frequency on the y-axis.

Fig. 3. Histogram plot of time gain of FOGSAA over NW
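The "% Time gain" column of Table I is consistent with the relative saving (t_NW − t_FOGSAA) / t_NW expressed as a percentage. This reading is inferred from the tabulated values rather than stated explicitly in the text, and the names `time_gain` and `TABLE_I` below are hypothetical. The sketch recomputes the gain for each row from the two mean times.

```python
def time_gain(t_fogsaa, t_nw):
    """Percentage of the NW running time saved by FOGSAA."""
    return 100.0 * (t_nw - t_fogsaa) / t_nw


# (dataset, gap penalty, mean FOGSAA time, mean NW time, reported % gain) from Table I
TABLE_I = [
    ("NASA Jul95", -1, 1520, 2590, 41.3),
    ("NASA Jul95", -2, 1182, 2377, 50.27),
    ("NASA Aug95", -1, 1290, 2294, 43.76),
    ("NASA Aug95", -2, 1080, 1897, 43.06),
    ("ClarkNet Aug28", -1, 2038, 2674, 23.78),
    ("ClarkNet Aug28", -2, 1459, 1830, 20.27),
    ("ClarkNet Sep4", -1, 2880, 4089, 29.56),
    ("ClarkNet Sep4", -2, 2559, 3670, 30.27),
]

for name, gap, fogsaa, nw, reported in TABLE_I:
    print(f"{name:15s} gap={gap:2d} computed={time_gain(fogsaa, nw):6.2f} reported={reported}")
```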

B. Clustering of the Sessions and Visualization

We conducted our experiment on 4 datasets, two from NASA [17] and two from ClarkNet [19]. The NASA access log Jul95 dataset has 1891714 entries in total. After the initial preprocessing, it is reduced to 650899 entries covering 5268 different web pages. The sessions are identified using a 30 minute time frame, as this was found suitable for capturing user intent in our web log datasets. Fig. 4 shows the result of hierarchical clustering (single linkage) of the sessions in a two day window (which consists of 30 sessions), based on the distance obtained from FOGSAA. Note that FOGSAA outputs the same optimal distance score as NW.

Fig. 4. Dendrogram of Hierarchical Clustering on the Sessions ofNASA access log Jul95 in a 2 days window.
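Single linkage hierarchical clustering only needs the matrix of pairwise distances, which is why it fits alignment-based distances well. The following is a naive pure-Python sketch of the technique on a toy matrix; the function name `single_linkage` and the distances are hypothetical and do not come from the experiment, and a library routine would normally be preferred for real data.

```python
def single_linkage(dist):
    """Agglomerative clustering with single linkage on a distance matrix.

    Returns the merge history as (distance, sorted member list) tuples,
    which is the information a dendrogram displays.
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        keys = sorted(clusters)
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                # single linkage: distance between the closest pair of members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
        merges.append((d, sorted(clusters[a])))
    return merges


# Toy distances between four sessions; 0 and 1 are close, as are 2 and 3
toy = [[0, 1, 5, 6],
       [1, 0, 4, 5],
       [5, 4, 0, 1],
       [6, 5, 1, 0]]
for distance, members in single_linkage(toy):
    print(distance, members)
```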

Though the sessions are clustered based on the pairwise dissimilarity or distance values given by FOGSAA, the exact coordinates of the sessions in the hyperspace remain unknown. To visualize the cluster tendency of this data, we use the Visual Assessment of (cluster) Tendency (VAT) tool [20]. VAT is a tool which maps the initial values of the pairwise dissimilarity matrix into the intensities of an unordered image, as shown in Fig. 5.

Fig. 5. VAT plot of distance matrix of unordered NASA access log Jul95 data in 2 days window.

Then, VAT performs the necessary reordering and the image of Fig. 6 is produced, where the dark blocks along the diagonal indicate the clusters.

Fig. 6. Reordered image of VAT plot for NASA access log Jul95 data in 2days window.
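The reordering step of VAT can be sketched as follows, following the ordering procedure described in [20]: start from an object involved in the largest dissimilarity, then repeatedly append the unselected object that is closest to any already selected one. The function names `vat_order` and `reorder` and the toy matrix are hypothetical, and ties are broken arbitrarily in this sketch.

```python
def vat_order(dist):
    """Return the VAT ordering of the objects of a dissimilarity matrix."""
    n = len(dist)
    # Start from one end of the largest dissimilarity in the matrix.
    _, start = max((dist[i][j], i) for i in range(n) for j in range(n))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Next object: the unselected one nearest to the selected set.
        _, nxt = min((dist[i][j], j) for i in order for j in remaining)
        order.append(nxt)
        remaining.remove(nxt)
    return order


def reorder(dist, order):
    """Permute rows and columns of dist according to order."""
    return [[dist[i][j] for j in order] for i in order]


# Toy example: objects 0 and 2 are close, as are objects 1 and 3
toy = [[0, 10, 1, 11],
       [10, 0, 9, 1],
       [1, 9, 0, 10],
       [11, 1, 10, 0]]
order = vat_order(toy)
print(order)
for row in reorder(toy, order):
    print(row)
```

After reordering, small dissimilarities gather in square blocks along the diagonal; rendered as an image with low values dark, these blocks are the dark regions that indicate clusters.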

IV. CONCLUSION

Session clustering is a fundamental step in analyzing web log data, given today's advances in web technology. Web masters try to attract users and gain profit by suggesting additional pages that match their interests. Customers usually prefer web sites where everything is available quickly and easily. Therefore, a real time solution is needed to analyze the web log data and provide customized services. Our proposed method of session clustering using FOGSAA is a time-efficient approach in this regard, where the page level similarity is accurately captured by the novel metric Psim. Moreover, if the two sequences being aligned are less than 30% similar, FOGSAA [15] can detect this before computing the total alignment. In that case, FOGSAA terminates the procedure with the best score obtained so far. This technique is very useful in session clustering, as it avoids wasted effort when aligning two highly dissimilar sequences. Web log data files are huge, with the number of entries in the order of lakhs (hundreds of thousands). Clustering sessions from these large files using traditional algorithms is a tedious job. A real time method specialized for session clustering is yet to come. Moreover, the entire procedure can be made intelligent so that it provides customized services taking into account the personal information of the customers, like age, economic status, community, etc.

REFERENCES

[1] T. Hussain, D. S. Asghar, and D. N. Masood, “Web usage mining: A survey on preprocessing of web log file,” in Information and Emerging Technologies (ICIET), 2010.

[2] M. Awad and L. Khan, “Web navigation prediction using multiple evidence combination and domain knowledge,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 37, pp. 1054–1062, 2007.

[3] M. A. Awad and I. Khalil, “Prediction of users web-browsing behavior: Application of Markov model,” IEEE Trans. Syst., Man, Cybern. A, Syst., Humans, vol. 42, 2012.

[4] S. Alam, G. Dobbie, and P. Riddle, “Particle swarm optimization based clustering of web usage data,” in 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, 2008.

[5] Y. Z. Guo, K. Ramamohanarao, and L. A. F. Park, “Personalized PageRank for web page prediction based on access time-length and frequency,” in 2007 IEEE/WIC/ACM International Conference on Web Intelligence, 2007.

[6] S. K. Madria, S. S. Bhowmick, W. K. Ng, and E. P. Lim, “Research issues in web data mining,” in Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK, 1999.

[7] R. Kosala and H. Blockeel, “Web mining research: A survey,” ACM SIGKDD Explorations Newsletter, vol. 2, June 2000.

[8] Y. Fu, K. Sandhu, and M.-Y. Shih, “Clustering of web users based on access patterns,” in Proceedings of the 1999 KDD Workshop on Web Mining, 1999.

[9] G. W. B. Hay and K. Vanhoof, “Mining navigation patterns using a sequence alignment method,” Journal of Knowledge and Information Systems, Springer-Verlag, pp. 150–163, 2004.

[10] W. Wang and O. R. Zaïane, “Clustering web sessions by sequence alignment,” in Database and Expert Systems Applications, 2002. Proceedings. 13th International Workshop on. IEEE, 2002, pp. 394–398.

[11] C. Li and Y. Lu, “Similarity measurement of web sessions by sequence alignment,” in 2007 IFIP International Conference on Network and Parallel Computing - Workshops, 2007.

[12] S. B. Needleman and C. D. Wunsch, “A general method applicable to the search for similarities in the amino acid sequence of two proteins,” J. Mol. Biol., vol. 48, pp. 443–453, 1970.

[13] X. Huang and K. Chao, “A generalized global alignment algorithm,” Bioinformatics, vol. 19, pp. 228–233, 2003.

[14] A. Banerjee and J. Ghosh, “Clickstream clustering using weighted longest common subsequences,” in Proceedings of the Workshop on Web Mining, SIAM Conference on Data Mining, 2001, pp. 33–40.

[15] A. Chakraborty and S. Bandyopadhyay, “FOGSAA: Fast optimal global sequence alignment algorithm,” Scientific Reports, vol. 3, p. 1746, 2013.

[16] A. Dekhtyar, “Bioinformatics algorithms,” http://users.csc.calpoly.edu/~dekhtyar/448-Spring2013/lectures/lec07.448.pdf.

[17] “NASA-HTTP server log,” http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html.

[18] W. Zhang and R. E. Korf, “An average case analysis of branch and bound with applications: summary of results,” AAAI-92 Proceedings, 1992.

[19] “ClarkNet WWW server log,” http://ita.ee.lbl.gov/html/contrib/ClarkNet-HTTP.html.

[20] J. C. Bezdek and R. J. Hathaway, “VAT: A tool for visual assessment of (cluster) tendency,” in Neural Networks, 2002. IJCNN’02. Proceedings of the 2002 International Joint Conference on, vol. 3. IEEE, 2002, pp. 2225–2230.
