web usage mining using association rule mining on clustered data for pattern discovery

Vol 02, Issue 01, June 2013 International Journal of Data Mining Techniques and Applications http://iirpublications.com ISSN: 2278-2419

Integrated Intelligent Research (IIR) 141

Web Usage Mining Using Association Rule Mining on Clustered Data for Pattern Discovery

Shaily G. Langhnoja#1, Mehul P. Barot*2, Darshak B. Mehta#3 #Computer Department, M.E. (Pursuing), Gujarat Technical University, Gujarat, India #I.T. Department, Lecturer, Govt. Polytechnic College, Gandhinagar, Gujarat, India

[email protected] [email protected]

*Associate Professor, L .D.R.P. Institute of Technology & Research, Gandhinagar, Gujarat, India [email protected]

Abstract: Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Analyzing data through web usage mining supports effective Web site management, adaptive Web sites, business and support services, personalization, network traffic flow analysis and so on. Much research has been done in this field; this paper focuses on finding the patterns in which users access a website, using web log records. The aim is to find user access patterns based on users' sessions and behaviour. Web usage mining includes three phases: pre-processing, pattern discovery and pattern analysis. In this paper, clustering and association rule mining are combined for pattern discovery, which helps in finding effective usage patterns.

Keywords: Web Mining, Web Usage Mining, Clustering, Association Rule Mining.

I. INTRODUCTION

Web mining is the application of data mining techniques to extract knowledge from Web data, in which at least one of structure or usage (Web log) data is used in the mining process. Researchers have identified three broad categories of Web mining.

A. Web content mining

Web content mining is the process of discovering useful information from text, image, audio or video data on the web. It is sometimes called web text mining, because text content is the most widely researched area. The technologies normally used in web content mining are NLP (natural language processing) and IR (information retrieval).

B. Web structure mining

Web structure mining operates on the Web's hyperlink structure. It uses graph theory to analyze the node and connection structure of a web site. This graph structure can provide information about the ranking or authoritativeness of a page and can enhance search results through filtering. According to the type of web structural data, web structure mining can be divided into two kinds. The first kind extracts patterns from hyperlinks in the web; a hyperlink is a structural component that connects a web page to a different location. The second kind mines the document structure, using the tree-like structure to analyze and describe the HTML (HyperText Markup Language) or XML (eXtensible Markup Language) tags within a web page.

C. Web usage mining

Web usage mining, also known as web log mining, aims to discover interesting and frequent user access patterns from web browsing data stored in web server logs, proxy server logs or browser logs. It applies data mining to analyze and discover interesting patterns in users' usage data on the web. The usage data records the user's behavior when the user browses or makes transactions on the web site.

It is an activity that involves the automatic discovery of patterns from one or more Web servers. Web usage data includes data from Web server access logs, proxy server logs, browser logs, user profiles, registration data, user sessions or transactions, cookies, user queries, bookmark data, mouse clicks and scrolls, and any other data resulting from interactions. Usage mining tools discover and predict user behavior, in order to help designers improve the web site, attract visitors, or give regular users a personalized and adaptive service.

II. WEB LOG FILES

Web log files are files that contain information about website visitor activity. Log files are created by web servers automatically. Each time a visitor requests any file (page, image, etc.) from the site, information about the request is appended to the current log file. Most log files are in text format, and each log entry (hit) is saved as a line of text. Log file sizes range from 1 KB to 100 MB.

A. Location of web log file

A web log file can be located in three different places.

Web server logs: Web server log files provide the most accurate and complete usage data for the web server. However, the log file does not record visits to cached pages. Log file data is sensitive, personal information, so the web server keeps it closed.

Web proxy server: A web proxy server takes the HTTP request from the user, passes it to the web server, and then returns the web server's result to the user. The client thus sends its request to the web server via the proxy server. The two disadvantages are: (1) proxy-server construction is a difficult task, requiring advanced network programming such as TCP/IP; and (2) the request interception is limited.

Client browser: The log file can reside in the client's browser itself. HTTP cookies are used for this purpose. These HTTP cookies are pieces of information generated by a web server and stored on the user's computer, ready for future access.

B. Type of web log file

There are four types of server logs.

Access log file: Data on all incoming requests and information about the clients of the server. The access log records all requests that are processed by the server.

Error log file: A list of internal errors. Whenever an error occurs while a page is being requested by a client from the web server, an entry is made in the error log. Access and error logs are the most commonly used; agent and referrer logs may or may not be enabled at the server.

Agent log file: Information about the user's browser and browser version.

Referrer log file: Information about the link that redirected the visitor to the site.

C. Web log file format

A web log file is a simple plain text file that records information about each user. Log file data is displayed in three different formats:

W3C Extended log file format
NCSA common log file format
IIS log file format

In the NCSA and IIS log file formats, the data logged for each request is fixed. The W3C format allows the user to choose the properties to log for each request. Normally a web log file contains data such as:

remotehost - domain name or IP address

rfc931 - the remote logname of the user

authuser - user identification used in a successful SSL request

[date] - the date and time of a request (e.g. day, month, year, hour, minute, second, zone)

"request" - the request line exactly as it came from the client

status - three-digit HTTP status code returned to the client (such as 404 for Page not found, or 200 for Request fulfilled)

bytes - number of bytes returned to the client browser for the requested object

ECLF has two additional elements:

referrer - URL of the referring server and the requested file from a site

agent - browser and operating system name and version
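As an illustration of the fields listed above, the following is a minimal sketch of parsing one log entry in Python. The regular expression, the function name parse_log_line and the sample line are assumptions made for this example; they are not taken from the paper's implementation, and real server logs may need a more tolerant pattern.

```python
import re
from datetime import datetime

# Seven common fields, optionally followed by the two extended fields
# (referrer and agent). The exact layout is an assumption for this sketch.
LOG_PATTERN = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # Assumed timestamp layout: day/month/year:hour:minute:second zone
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample entry, not taken from the JACPC log.
sample = ('192.0.2.10 - - [05/Jan/2013:10:15:32 +0530] '
          '"GET /index.html HTTP/1.1" 200 5120 '
          '"http://example.com/start" "Mozilla/5.0 (Windows NT 6.1)"')
entry = parse_log_line(sample)
print(entry["remotehost"], entry["status"], entry["request"])
```

Named groups keep the parsed record aligned with the field names used in the list, so later pre-processing steps can refer to status, request or agent directly.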

III. PHASES OF WEB USAGE MINING PROCESS

The web usage mining process is divided into three phases: pre-processing, pattern discovery and pattern analysis, as shown in the figure below.

Fig. 1 Phases of Web Usage Mining Process

A. Pre-processing Phase

The purpose of data pre-processing is to transform raw web log data into reliable data for mining. The normal procedure of data pre-processing includes five steps: data cleaning, user identification, user session identification, path completion and user transaction identification.

1) Data Cleansing: The task of data cleaning is to remove the log entries that are irrelevant or redundant for the mining process. There are two kinds of irrelevant or redundant data to be removed:

Additional Requests: A user's request to view a particular page often results in several log entries. Graphics and scripts are downloaded in addition to the HTML file, because of the connectionless nature of the HTTP protocol. Since the main intention of web usage mining is to get a picture of the user's behavior, it does not make sense to include file requests that the user did not explicitly make. The suffix of each URL is checked, and entries with suffixes like gif, jpg, GIF, JPEG, css, map etc. are eliminated.

Entries with error: The status code shows the success or failure of a request. Entries with a status code less than 200 or greater than 299 are failure entries, which are to be removed.
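The two removal criteria above can be sketched as a simple filter. This is an illustrative Python sketch, not the authors' tool: the record layout (dicts with url and status keys) and the function names are assumptions, and the suffix list follows the examples given in the text.

```python
from urllib.parse import urlparse

# Suffixes of additional (non-page) requests, following the examples in the text.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".css", ".map")

def is_relevant(url, status):
    """Keep an entry only if it is a successful request for a page."""
    path = urlparse(url).path.lower()   # lower() also covers GIF, JPEG, ...
    if path.endswith(IGNORED_SUFFIXES):
        return False
    return 200 <= status <= 299

def clean_log(entries):
    """entries: iterable of dicts with 'url' and 'status' keys (assumed layout)."""
    return [e for e in entries if is_relevant(e["url"], e["status"])]

# Hypothetical records for illustration only.
raw = [
    {"url": "/index.html", "status": 200},
    {"url": "/images/logo.GIF", "status": 200},
    {"url": "/missing.html", "status": 404},
    {"url": "/style.css?v=2", "status": 200},
]
cleaned = clean_log(raw)
print(cleaned)
```

Checking the suffix on the parsed path, rather than on the raw URL, keeps query strings such as ?v=2 from hiding a stylesheet or image request.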

Only necessary fields such as date, time, IP address, user agent, URL requested, URL referred and time taken are considered for further experiments, to reduce the processing time. Attribute subset selection is thus performed. The following algorithm shows the data cleansing step in the pre-processing phase of the web usage mining process.

Algorithm Name: Data Cleansing of Web Log File
Input: Web Server Log file
Output: Log Database
Step 1: Read a log record from the web log file.
Step 2: If (LogRecord.url contains gif, jpeg, jpg or css) OR (an error such as HTTP 404 is found) then remove the record from the web log file. End of If condition.
Step 3: Repeat the above two steps until EOF (Web Log File).
Step 4: Stop the process.

2) User Identification: User identification can be done by IP address, cookies or user registration. The following algorithm is used for identifying the number of distinct users from the cleaned web log file.

Algorithm Name: User Identification
Input: Processed Web Log File
Output: Number of Distinct Users
Step 1: Read records from the web log file.
Step 2: Compare the IP addresses of two consecutive entries.
Step 3: If (IP address is same) then
            check the user's browser and operating system
            if both are same then consider same user
            else consider new user
            end if
        else consider new user
        end if
Step 4: Repeat the above two steps until EOF (Web Log File).
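The user identification rule (same IP address and same browser and operating system means same user) can be sketched as follows. Note the assumption: instead of comparing only consecutive entries, this sketch keeps a dictionary of every (IP, agent) pair seen so far, which gives the same answer when a user's entries are adjacent and also handles interleaved entries. The field names ip and agent are hypothetical.

```python
def identify_users(entries):
    """Assign a user id to each entry: same IP and same agent means same user.

    entries: list of dicts with 'ip' and 'agent' keys (assumed layout).
    Returns the labelled entries and the number of distinct users.
    """
    user_ids = {}      # (ip, agent) -> user id
    labelled = []
    for e in entries:
        key = (e["ip"], e["agent"])
        if key not in user_ids:
            user_ids[key] = len(user_ids) + 1   # first time this pair is seen
        labelled.append({**e, "user_id": user_ids[key]})
    return labelled, len(user_ids)

# Hypothetical cleaned entries for illustration only.
log = [
    {"ip": "192.0.2.10", "agent": "Mozilla/5.0 (Windows)", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "Mozilla/5.0 (Windows)", "url": "/courses.html"},
    {"ip": "192.0.2.10", "agent": "Opera/9.8 (Linux)", "url": "/index.html"},
    {"ip": "192.0.2.77", "agent": "Mozilla/5.0 (Windows)", "url": "/index.html"},
]
labelled, user_count = identify_users(log)
print(user_count)
```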



3) User Session Identification: This step finds all page references made by a user. We divide the entries into user sessions using a session timeout: if the time between page requests exceeds a certain limit, it is assumed that a new user session has started. We have used a 30-minute timeout as the session timeout value. The following algorithm is used for identifying the number of sessions from the web log file.

Algorithm Name: Session Identification from Web Log File
Input: Web Server Log file
Output: Number of Sessions
Steps:
SessionSet = {}
UserSet = {}
K = 0
While not EOF(LogFile) Do
    LogRecord = Read(LogFile)
    If (LogRecord.TimeTaken > 30) OR (LogRecord.UserId not in UserSet) then
        K = K + 1
        SK = LogRecord.URL
        SessionSet = SessionSet U {SK}
        UserSet = UserSet U {LogRecord.UserId}
        Write(SessionFile, SessionSet)
    End If
End While

Thus, after completion of Phase 1 (pre-processing), we have a cleaned web server log file whose data is prepared to be loaded into a relational database. The activities of this phase are shown in the figure below:

Fig. 2 Pre-Processing Phase of WUM
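The 30-minute session timeout described in the session identification step can be sketched in Python as follows. This is a minimal sketch under stated assumptions: entries are already labelled with a user_id, carry a time field holding a datetime, and are sorted by time; the timeout is measured as the gap between consecutive requests of the same user. All names are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def identify_sessions(entries):
    """Split time-sorted entries into sessions, returned as (user_id, [urls])."""
    sessions = []
    open_session = {}   # user_id -> index of that user's current session
    last_seen = {}      # user_id -> time of that user's previous request
    for e in entries:
        uid = e["user_id"]
        expired = uid in last_seen and e["time"] - last_seen[uid] > SESSION_TIMEOUT
        if uid not in open_session or expired:
            sessions.append((uid, []))          # start a new session
            open_session[uid] = len(sessions) - 1
        sessions[open_session[uid]][1].append(e["url"])
        last_seen[uid] = e["time"]
    return sessions

# Hypothetical entries for illustration only.
entries = [
    {"user_id": 1, "time": datetime(2013, 1, 5, 10, 0), "url": "/a"},
    {"user_id": 2, "time": datetime(2013, 1, 5, 10, 5), "url": "/x"},
    {"user_id": 1, "time": datetime(2013, 1, 5, 10, 10), "url": "/b"},
    {"user_id": 1, "time": datetime(2013, 1, 5, 11, 0), "url": "/c"},
]
sessions = identify_sessions(entries)
print(len(sessions), sessions)
```

Each resulting session is a list of URLs, which is the transaction form that the pattern discovery phase works on.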

B. Pattern Discovery

Pattern discovery finds patterns using techniques such as the following.

Path Analysis: Many different types of graphs can be formed from path analysis. The most obvious is a graph representing the physical layout of a website, where web pages are nodes and hypertext links between pages are directed edges. Other graphs can be formed based on the types of web pages visited. For example, the edges might represent the similarity between pages, or the edges could give the number of users traversing from one page to another. Most of the work to date involves determining frequent traversal patterns or large reference sequences from the physical graph layout.

Association rule: Association rules are used for prediction of the next event or discovery of associated events. In the web data set, a transaction consists of the URLs visited by a client on the web site. By applying different association rule mining algorithms, we can predict which web pages are frequently accessed together by users of the website. The discovery of such rules from the access log can be of tremendous help in reorganizing the structure of the web site. The frequently accessed web pages should be organized in their order of importance and be easily accessible to the users.

Classification: Classification is the technique of mapping a data item into one of several predefined classes. In the Web domain, a webmaster or marketer will use this technique to establish a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbour classifiers, Support Vector Machines etc.

Clustering: Clustering analysis is a technique for grouping together users or data items (pages) with similar characteristics. Clustering of user information or pages can facilitate the development and execution of future marketing strategies. Clustering of users helps to discover groups of users who have similar navigation patterns. It is very useful for inferring user demographics to perform market segmentation in e-commerce applications or to provide personalized Web content to individual users. The clustering of pages is useful for Internet search engines and Web service providers, since it can be used to discover groups of pages having related content.

C. Pattern Analysis

This phase uses techniques such as:

o OLAP/Visualization Tools: for multidimensional analysis and decision making.

o Knowledge Query Management

o Intelligent Agents

IV. PATTERN DISCOVERY

A. Association Rule Mining

The association rule mining problem was first formally stated by Agrawal et al. [Agrawal et al. 1993]. Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of $m$ distinct attributes (items), let $T$ be a transaction that contains a set of items such that $T \subseteq I$, and let $D$ be a database of transaction records. An association rule is an implication of the form $X \Rightarrow Y$, where $X, Y \subset I$ are sets of items called item sets, and $X \cap Y = \emptyset$. $X$ is called the antecedent and $Y$ the consequent; the rule means that $X$ implies $Y$. There are two important basic measures for association rules: support (s) and confidence (c). Since the database is large and users are concerned only with frequently purchased items, thresholds of support and confidence are usually predefined by users to drop rules that are not interesting or useful. The two thresholds are called minimal support and minimal confidence, respectively; users can also specify additional constraints on interesting rules.

Support (s) of an association rule is defined as the fraction of records that contain $X \cup Y$ out of the total number of records in the database. The count for each item is increased by one every time the item is encountered in a different transaction $T$ in database $D$ during the scanning process. The support count therefore does not take the quantity of the item into account. For example, if a customer buys three bottles of beer in one transaction, we increase the support count of {beer} by only one; in other words, if a transaction contains an item, the support count of that item is increased by one. Support (s) is calculated by the following formula:

$$\mathrm{Support}(X \cup Y) = \frac{\text{support count of } X \cup Y}{\text{total number of transactions in } D}$$

From the definition we can see that the support of an item set is a measure of the statistical significance of an association rule. Suppose the support of an item is 0.1%; this means that only 0.1 percent of the transactions contain a purchase of this item. The retailer will not pay much attention to items that are bought so infrequently, so a high support is desired for more interesting association rules. Before the mining process, users can specify the minimum support as a threshold, which means they are only interested in association rules generated from item sets whose support exceeds that threshold. However, sometimes the association rules generated from item sets that are less frequent than the threshold are still important. For example, some supermarket items are very expensive and are consequently not purchased as often as the threshold requires, but association rules between those expensive items are as important to the retailer as rules between frequently bought items.

Confidence of an association rule is defined as the fraction of the transactions containing $X$ that also contain $X \cup Y$. If this fraction exceeds the confidence threshold, an interesting association rule $X \Rightarrow Y$ can be generated.

$$\mathrm{Confidence}(X \Rightarrow Y) = \frac{\mathrm{Support}(X \cup Y)}{\mathrm{Support}(X)}$$

Confidence is a measure of the strength of an association rule. Suppose the confidence of the association rule $X \Rightarrow Y$ is 80%; this means that 80% of the transactions that contain $X$ also contain $Y$. Similarly, to ensure the interestingness of the rules, a minimum confidence is also predefined by users.

The best known algorithms for mining association rules are Apriori, AprioriTID, SETM, DIC, Partition, Eclat, FP-growth, etc. In web usage mining, association rules are used to discover pages that are often visited together. Knowledge of these associations can be used either in marketing and business or as guidelines to web designers for (re)structuring Web sites. Transactions for mining association rules differ from those in market basket analysis (MBA), as they cannot be represented as easily as items bought together. Association rules are mined from user sessions containing remotehost, userid, and a set of URLs. As a result of mining for association rules we can get, for example, the rule $X, Y \Rightarrow Z$ (c = 85%, s = 1%). This means that visitors who viewed pages X and Y also viewed page Z in 85% (confidence) of cases, and that this combination makes up 1% of all transactions in the preprocessed logs. In (Cooley et al., 1999) a distinction is made between association rules based on the type of pages appearing in them. They identify Auxiliary-Content transactions and Content-only transactions. The second kind is far more meaningful, as association rules are found only among pages that contain data important to visitors.

Limitation of Association Rule Mining Algorithm

One of the major drawbacks of association rule mining is that too many rules are generated, with no guarantee that all generated rules are relevant. Minimum support and minimum confidence parameters are set in such a way as to eliminate false discoveries. When minimum support is too small, every rule gets a chance to be true, leading to wrong recommendations; when minimum support is too large for a small data set, wrong predictions may occur.

Solution to Limitation of ARM

Clustering is the process of grouping objects with similar behaviour into clusters. Clustering reduces the input data set for association rule mining; consequently, the number of rules is reduced and the extracted rules are highly relevant and meaningful.
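The support and confidence measures defined in this section can be illustrated with a short Python sketch. The session data and page names below are hypothetical, and the functions are a direct reading of the two definitions rather than an implementation of any particular mining algorithm.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support(X u Y) / Support(X); 0.0 if X never occurs."""
    s_x = support(antecedent, transactions)
    if s_x == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / s_x

# Hypothetical sessions: each is the set of pages visited in one session.
sessions = [
    frozenset({"/home", "/courses", "/apply"}),
    frozenset({"/home", "/courses"}),
    frozenset({"/home", "/courses", "/apply"}),
    frozenset({"/home", "/contact"}),
    frozenset({"/courses", "/apply"}),
]

s = support({"/home", "/courses", "/apply"}, sessions)
c = confidence({"/home", "/courses"}, {"/apply"}, sessions)
print(f"support = {s:.2f}, confidence = {c:.2f}")
```

Because each session is a set, a page visited several times in one session is counted once, which matches the remark that the support count ignores quantity.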

B. Clustering Algorithm

A cluster is a collection of data objects. Objects that are similar belong to the same cluster; objects that are dissimilar belong to different clusters. Clustering is unsupervised classification, as there are no predefined classes. A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. The quality of a clustering result depends on both the similarity measure used by the method and its implementation. Similarity is expressed in terms of a distance function, which is typically a metric: d(i, j). Commonly used distance measures are the Euclidean, Minkowski and Manhattan distances. In general, as shown in the figure, raw data is taken as input, a clustering algorithm is applied to it, and clusters of data are obtained.

Fig 3 Working of Clustering Algorithm
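The distance measures named above are closely related: the Manhattan and Euclidean distances are the Minkowski distance with order p = 1 and p = 2, respectively. A minimal Python sketch (function names chosen for this example) is:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def manhattan(a, b):
    return minkowski(a, b, 1)   # sum of absolute coordinate differences

def euclidean(a, b):
    return minkowski(a, b, 2)   # straight-line distance

i, j = (0.0, 0.0), (3.0, 4.0)
print(manhattan(i, j), euclidean(i, j))
```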

DBSCAN Clustering Algorithm

In this paper we have used the DBSCAN clustering algorithm for pattern discovery. The DBSCAN algorithm can identify clusters in large spatial data sets by looking at the local density of database elements, using only a small number of input parameters. Furthermore, the user gets a suggestion on which parameter value would be suitable, so minimal knowledge of the domain is required. DBSCAN can also determine what information should be classified as noise or outliers. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. Despite this, it runs quickly and scales very well with the size of the database. By using the density distribution of nodes in the database, DBSCAN can categorize these nodes into separate clusters that define the different classes. DBSCAN can find clusters of arbitrary shape; however, clusters that lie close to each other tend to be merged into the same class.

DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a cluster (minPts). It starts with an arbitrary starting point that has not been visited. This point's ε-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized ε-neighborhood of a different point and hence be made part of a cluster. If a point is found to be a dense part of a cluster, its ε-neighborhood is also part of that cluster. Hence, all points that are found within the ε-neighborhood are added, as is their own ε-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.

Advantages of DBSCAN Algorithm

DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means.

DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the MinPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced.

DBSCAN has a notion of noise.

DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database.
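The DBSCAN procedure described in this section can be sketched in a few lines of Python. This is a minimal, unoptimized illustration (each neighborhood query scans all points), assuming Euclidean distance and two-dimensional points; the function names and sample data are hypothetical and this is not the implementation used for the experiments.

```python
from math import dist   # Euclidean distance, available from Python 3.8

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster number (0, 1, ...) or NOISE."""
    labels = [None] * len(points)        # None means not visited yet
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster += 1                     # i is dense: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is dense: keep expanding
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.1, 8.2), (7.9, 8.0),
          (4.0, 15.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)
```

The number of clusters is never passed in; it falls out of eps and min_pts, and the isolated point is reported as noise rather than forced into a cluster.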

V. PROPOSED APPROACH FOR PATTERN DISCOVERY

The proposed approach for web usage mining is shown in the figure below. It is a three-step process.

1. The first step is to collect the web log file and perform the pre-processing operations on it. After pre-processing, the processed web log file is stored in a database.

2. Any data mining technique, such as association rule mining, classification or clustering, can be applied for pattern discovery. Here a combined approach of clustering and association rule mining is used. Thus, in step 2 the density-based clustering algorithm DBSCAN is used to find users having common behavior and access patterns.

3. Finally, in step 3 association rule mining is used to find users' access patterns from these clustered groups of data.

Fig. 4 Proposed Approach for Pattern Discovery in WUM

With this approach, clustering is performed first and every user is assigned to a specific cluster according to behaviour and access patterns. Applying association rule mining to this clustered data helps to obtain results with less computing time and better accuracy.
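The combination of steps 2 and 3 can be sketched as follows: sessions are grouped by the cluster label assigned to them, and rules are then mined inside each group separately. This is a simplified illustration, not the authors' implementation. It assumes that cluster labels are already available (with -1 marking noise) and, to stay short, mines only rules between single pages, which corresponds to the first two levels of an Apriori-style search. All names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def pair_rules(transactions, min_support, min_confidence):
    """Mine rules x -> y between single pages within one group of sessions."""
    n = len(transactions)
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for t in transactions:
        for item in t:
            item_count[item] += 1
        for a, b in combinations(sorted(t), 2):
            pair_count[(a, b)] += 1
    rules = []
    for (a, b), count in pair_count.items():
        s = count / n                      # support of the pair {a, b}
        if s < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            c = count / item_count[x]      # confidence of x -> y
            if c >= min_confidence:
                rules.append((x, y, s, c))
    return rules

def rules_per_cluster(sessions, labels, min_support, min_confidence):
    """Group sessions by cluster label (skipping noise) and mine each group."""
    groups = defaultdict(list)
    for session, label in zip(sessions, labels):
        if label != -1:
            groups[label].append(set(session))
    return {label: pair_rules(group, min_support, min_confidence)
            for label, group in groups.items()}

# Hypothetical sessions and cluster labels for illustration only.
sessions = [{"/home", "/apply"}, {"/home", "/apply"}, {"/home", "/results"},
            {"/results", "/merit"}, {"/results", "/merit"}, {"/contact"}]
labels = [0, 0, 0, 1, 1, -1]
result = rules_per_cluster(sessions, labels, min_support=0.5, min_confidence=0.6)
for label, rules in result.items():
    print(label, rules)
```

Because support is computed relative to the sessions of one cluster, a pattern that is common within a small group of similar users can pass the threshold even if it would be rare in the whole log.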

VI. IMPLEMENTATION

Web usage mining is implemented on the web log file of the website of the Joint Admission Committee for Professional Courses (JACPC), www.jacpcldce.ac.in. Two months of this website's web log file are taken as the input dataset. The data pre-processing and pattern discovery steps discussed above are applied to this web log file. Based on the results generated by simple association rule mining and by clustered association rule mining, we can say that the combined approach produces rules that are more precise and meaningful. For data cleansing, select the Data Cleansing tab and then the web log file to which cleansing is to be applied.



Fig. 5 Selecting Data Cleansing tab

Fig. 6 Selecting raw web log file for Cleansing

The following screenshot shows the cleaned data, in which all irrelevant data not necessary for pattern discovery has been eliminated.

Fig. 7 Raw web log file after Cleansing

After that, the unique users and sessions are identified from the cleansed web log file, as shown below.

Fig. 8 Unique user & session identified

Next, the user specifies the parameters on which the clustering process is to be carried out. The data points are plotted for clustering accordingly.

Fig. 9 Data Points for clustering process

To run the DBSCAN clustering algorithm, we need to specify two parameters, eps and minPts, as shown in the figure below.

Fig. 10 Specifying eps & MinPts for DBSCAN

Different clusters are formed according to the values of eps and minPts, as shown in the figure. We can also find outliers or noise points that do not fit into any cluster.



Fig. 11 Clusters produced using DBSCAN

The figure below shows the summary details of the clustering process, with clusters formed on the basis of user-id and the links accessed by that user. We can now apply association rule mining to this clustered data to obtain more relevant and meaningful association rules.

Fig. 12 Clustering process summary details

After applying the Apriori association rule mining algorithm to the clustered data, we get the result shown below.

Fig. 13 Applying ARM on clustered data

VII. PERFORMANCE AND RESULT ANALYSIS

The following graph shows the number of entries removed by the data cleansing process. All irrelevant entries are removed during cleansing, so the time required for pattern discovery is reduced.

Fig. 14 Comparison of raw & processed weblog file data

The following table shows a month-wise summary of the number of rows before and after cleansing, and the percentage reduction in the size of the data for pattern discovery due to data cleansing.

TABLE 1. SUMMARIZED DETAIL OF WEBLOG FILE IN DATA PRE-PROCESSING PHASE

ACPC WEBSITE

| Duration | Number of rows before Cleansing | Number of rows after Cleansing | % Reduction in size |
|---|---|---|---|
| 1-JAN-2013 to 28-JAN-2013 | 2,27,771 | 1,75,723 | 52.04% |

Fig. 15 Comparison of Simple ARM vs Clustered ARM

The figure shows the final comparison of the number of rules generated by simple ARM and by the combined clustering and ARM technique for discovering users' access patterns.



VIII. CONCLUSIONS

Web usage mining techniques are an active area of research these days. Providing users with what they are looking for on websites is the ultimate aim of web usage mining. In this approach, this aim is addressed by applying association rule mining to clustered data: clustering is applied to the data first, and association rule mining is then applied to find frequently accessed sets of links. Basic association rule mining has the drawbacks of generating irrelevant rules and generating too many rules, which lead to contradictory predictions and reduced accuracy. Minimum support and minimum confidence parameters can be set in such a way as to eliminate false discoveries. But when minimum support is too small, every rule gets a chance to be true, leading to wrong results, and when minimum support is too large for a small data set, wrong predictions may occur. Clustering frequent access patterns reduces the data set for association rule mining, improves the accuracy of the results and makes the pattern discovery step of the web usage mining process more effective.

REFERENCES

[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques".

[2] [Cooley 97] R. Cooley, B. Mobasher and J. Srivastava (1997). Web Mining: Information and Pattern Discovery on the World Wide Web.

[3] [Cooley 99-1] R. Cooley, B. Mobasher and J. Srivastava (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems.

[4] [Cooley 99-2] R. Cooley, P. Tan and J. Srivastava (1999). Discovery of Interesting Usage Patterns from Web Data. Advances in Web Usage Analysis and User Profiling.

[5] Li Chaofeng, "Research and Development of Data Preprocessing in Web Usage Mining", International Conference on Management Science and Engineering, 2006.

[6] R. Cooley, Bamshad Mobasher and Jaideep Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web", in International Conference on Tools with Artificial Intelligence.

[7] R. Agrawal, T. Imielinski and A. N. Swami. 1993. Mining Association Rules between Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.

[8] K. R. Suneetha and Dr. R. Krishnamoorthi. 2009. Identifying User Behavior by Analyzing Web Server Access Log File. IJCSNS.

[9] R. Cooley, B. Mobasher and J. Srivastava, "Web Mining: Information and Pattern Discovery on the World Wide Web; WEBMINER: a survey", Tech Report, Department of Computer Science, University of Minnesota, Minneapolis, 1997.

[10] Wikipedia for DBSCAN clustering algorithm details.
