    Web Usage Mining and Pattern Discovery: A Survey Paper


    Naresh Barsagade

    CSE 8331

    December 8, 2003

    1. Introduction

    Web technology is not evolving in comfortable and incremental steps, but it is turbulent,

    erratic, and often rather uncomfortable. It is estimated that the Internet, arguably the

    most important part of the new technological environment, has expanded by about

    2000% and is doubling in size every six to ten months. In recent years, advances in

    computer and web technologies and the decrease in their cost have expanded the

    means available to collect and store data. As an immediate consequence, the amount

    of information (meaningful data) stored has been increasing at a very fast pace.

    Traditional information analysis techniques are useful to create informative reports from

    data and to confirm predefined hypotheses about the data. However, huge volumes of

    data being collected create new challenges for such techniques as organizations look for

    ways to make use of the stored information to gain an edge over competitors. It is

    reasonable to believe that data collected over an extended period contains hidden

    knowledge about the business or patterns characterizing customer profile and behavior.

    With the rapid growth of the World Wide Web, the study of knowledge discovery on the web,

    including modeling and predicting users' access to a web site, has become very important.


    From the administration, business and application point of view, knowledge obtained

    from the Web usage patterns could be directly applied to efficiently manage activities

    related to e-Business, e-CRM, e-Services, e-Education, e-Newspapers, e-Government,

    Digital Libraries, and so on [AR2003]. The Web is becoming a necessity for businesses

    and organizations because of client demand. Since the web technology

    largely feeds on ideas and knowledge rather than being dependent on fixed assets, it

    gave birth to new companies such as Yahoo, Google, Netscape, e-Bay, e-Trade,

    Expedia, Amazon and so on. With the large number of companies using the Internet to

    distribute and collect information, knowledge discovery on the web has become an

    important research area [JTP2002]. With the explosive growth of information sources

    available on the World Wide Web, it has become necessary for organizations to discover

    the usage patterns and analyze the discovered patterns to gain an edge over competitors.


    Jespersen et al [JTB2002] proposed a hybrid approach for analyzing visitor click

    stream sequences. A combination of hypertext probabilistic grammar and click fact table

    approaches is used to mine Web logs, and it can also be used for general sequence

    mining tasks. Mobasher et al [MCS1999] proposed a web personalization system,

    which consists of offline tasks related to the mining of usage data and an online process

    of automatic Web page customization based on the knowledge discovered. LOGSOM,

    proposed by Smith et al [SN2003], uses Kohonen's self-organizing map (SOM) to organize

    web pages into a two-dimensional map based solely on users' navigation behavior, rather

    than the content of the web pages. LumberJack, proposed by Chi et al [CRHL2002], builds

    up user profiles by combining clustering of user sessions, using the k-means algorithm,

    with traditional statistical traffic analysis. Joshi et al [JJYK1999] used a relational

    online analytical processing approach for creating a Web log warehouse using access logs

    and mined logs. A comprehensive overview of web usage mining research is found in

    [SCDT2000, CMS1997, CMS1999, RWC2000].

    Web mining can be divided into three areas, namely web content mining, web structure

    mining and web usage mining [SCDT2000]. Web Content mining focuses on discovery of

    information stored on the Internet. Web Structure mining focuses on improvement in

    structural design of a website. Web Usage mining, the main topic of this paper, focuses

    on knowledge discovery from the usage of individual web sites.

    Global Internet average usage figures [NN2003] show the current usage around the

    globe and in the United States.

    Month of September 2003, Panel Type: Home

                                          September      August         % Change
    Number of Sessions per Month          22             22             1.65
    Number of Unique Domains Visited      55             54             0.89
    Page Views per Month                  901            899            0.3
    Page Views per Surfing Session        41             41             0
    Time Spent per Month                  11:59:20       11:50:30       1.24
    Time Spent During Surfing Session     0:32:29        0:32:37        -0.4
    Duration of a Page Viewed             0:00:48        0:00:47        0.94
    Active Internet Universe              252,672,070    253,054,814    -0.15
    Current Internet Universe Estimate    419,054,724    416,339,888    0.65

    United States: Average Web Usage

    Month of October 2003, Panel Type: Home

    Sessions/Visits Per Person                 71
    Domains Visited Per Person                 103
    PC Time Per Person                         80:46:37
    Duration of a Web Page Viewed              0:01:00
    Active Digital Media Universe              47,003,165
    Current Digital Media Universe Estimate    51,012,930

    The remainder of the paper is organized as follows: Section 2 presents applications of

    web usage mining; Section 3 covers basic web mining terminology, the taxonomy of web

    mining, the architecture of web usage mining, and an explanation of the individual

    components of that architecture; Section 4 summarizes the paper and identifies several

    future research directions; and Section 5 contains the bibliography.

    2. Applications of Web Usage Mining

    Each of the applications can benefit from patterns that are ranked by subjective


    Web usage mining is used in the following areas:

    - Web usage mining offers users the ability to analyze massive volumes of
      clickstream or click flow data, integrate the data seamlessly with transaction and
      demographic data from offline sources, and apply sophisticated analytics for web
      personalization, e-CRM, and other interactive marketing programs.

    - Personalization for a user can be achieved by keeping track of previously
      accessed pages. These pages can be used to identify the typical browsing
      behavior of a user and subsequently to predict desired pages.

    - By determining frequent access behavior for users, needed links can be identified
      to improve the overall performance of future accesses.

    - Information concerning frequently accessed pages can be used for caching.

    - In addition to modifications to the linkage structure, identifying common access
      behaviors can be used to improve the actual design of Web pages and to make
      other modifications to the site.

    - Web usage patterns can be used to gather business intelligence to improve
      customer attraction, customer retention, sales, marketing and advertisement,
      and cross sales.

    - Mining of web usage patterns can help in the study of how browsers are used
      and of the users' interaction with a browser interface.

    - Usage characterization can also look into navigational strategy when browsing a
      particular site.

    - Web usage mining focuses on techniques that could predict user behavior while
      the user interacts with the Web.

    - Web usage mining helps in improving the attractiveness of a Web site, in terms
      of content and structure.

    - Performance and other service quality attributes are crucial to user satisfaction,
      and high quality performance of a web application is expected.

    - Web usage mining of patterns provides a key to understanding Web traffic
      behavior, which can be used to deal with policies on web caching, network
      transmission, load balancing, or data distribution.

    - Web usage and data mining is also useful for detecting intrusion, fraud, and
      attempted break-ins to the system.

    - Web usage mining can be used in e-Learning, e-Business, e-Commerce, e-CRM,
      e-Services, e-Education, e-Newspapers, e-Government, and Digital Libraries.

    - Web usage mining can be used in Customer Relationship Management,
      Manufacturing and Planning, Telecommunications, and Financial Planning.

    - Web usage mining can be used in Physical Sciences, Social Sciences, Engineering,
      Medicine, and Biotechnology.

    - Web usage mining can be used in Counter Terrorism and Fraud Detection, and
      detection of unusual accesses to secure data.

    - Web usage mining can be used in determination of common behaviors or traits
      of users who perform certain actions, such as purchasing merchandise.

    - Web usage mining can be used in usability studies to determine the interface

    - Web usage mining can be used in network traffic analysis for determining
      equipment requirements and data distribution in order to efficiently handle site

    3. Web Usage Mining and Pattern Discovery

    Web usage mining is the application of data mining techniques to discover usage

    patterns from Web data, in order to understand and better serve the needs of

    Web-based applications [CMS1997]. Web usage mining consists of three phases, namely

    preprocessing, pattern discovery, and pattern analysis. A high-level Web usage mining

    process is presented in Figure 1 [SCDT2000]. Mobasher et al. [CMS1997] propose that

    the web mining process can be divided into two main parts. The first part includes the

    domain-dependent processes of transforming the Web data into a suitable transaction

    form. This includes preprocessing, transaction identification, and data integration

    components. The second part includes data mining and pattern matching

    techniques such as association rules and sequential patterns. In the absence of cookies

    or dynamically embedded session IDs in the URIs, the combination of IP address and

    user agent can be used as a first-pass estimate of unique users. This estimate can be

    refined using the referrer field as described in [CMS1999]. Some authors have proposed

    global architectures to handle the web usage mining process. Cooley et al [CTS1999]

    proposed a site information filter, named WebSIFT, that establishes a framework for web

    usage mining as shown in Figure 2. WebSIFT performs the mining in distinct tasks and

    divides the Web usage mining process into three main parts, as shown in Figure 1. For a

    particular Web site, the three server logs (access, referrer, and agent, often

    combined into a single log), the HTML files, template files, script files, or databases that

    make up the site content, and any optional data such as registration data or remote

    agent logs provide the information to construct the different information abstractions.

    The preprocessing phase uses the input data to construct a server session file based on

    the methods and heuristics discussed in [CMS1999]. In order to preprocess a server

    log, the log must first be cleaned, which consists of removing unsuccessful requests,

    parsing relevant CGI name/value pairs, and rolling up file accesses into page views. Once

    the log is converted into a list of page views, users must be identified.

    The preprocessing phase allows the option of converting the server sessions

    into episodes prior to performing knowledge discovery.

    Figure 2: A General Architecture for Web Usage Mining

    In this case, episodes are either all of the page views in a server session that the user

    spent a significant amount of time viewing, or all of the navigation page views leading

    up to each content page view. The details of how a cutoff time is determined for

    classifying a page view as content or navigation are also contained in [CMS1999]. The

    click-stream or click-flow for each user is divided into sessions based on a simple

    thirty-minute timeout. The notion of what makes discovered knowledge interesting has been

    addressed in [PT1998]. A survey of methods that have been used to characterize the

    interestingness of discovered patterns is given in [HH1999]. Four dimensions used by

    [HH1999] to classify interestingness measures are pattern-form, representation, scope,

    and class. Pattern-form defines what type of patterns a measure is applicable to, such

    as association rules or classification rules. The representation dimension defines the

    nature of the framework, such as probabilistic or logical. Scope is a binary dimension

    that indicates whether the measure applies to a single pattern or to the entire discovered

    set. The final dimension, class, is also a binary dimension that can be labeled as

    subjective or objective.

    Preprocessing for the content and structure of a site involves assembling each page

    view for parsing and/or analysis. Page views are accessed through HTTP requests by a

    site crawler to assemble the components of the page view. This handles both static

    and dynamic content. In addition to being used to derive a site topology, the site files

    are used to classify the pages of a site. Both the site topology and page classification can

    then be fed into the information filter. The knowledge discovery phase uses existing

    data mining techniques to generate rules and patterns. Included in this phase is the

    generation of general usage statistics, such as number of hits per page, page most

    frequently accessed, most common starting page, and average time spent on each page.


    WebSIFT performs the mining in distinct stages. The first stage is preprocessing, in

    which user sessions are inferred from log data. The second searches for patterns in the

    data by making use of standard data mining techniques, such as association rules or

    mining for sequential patterns. In the third stage, an information filter based on domain

    knowledge and the web site structure is applied to the mined patterns in search of

    the interesting patterns. Links between pages and the similarity between contents of

    pages provide evidence that the pages are related. This information is used to identify

    interesting patterns; for example, itemsets that contain pages not directly connected are

    declared interesting. Mobasher et al [MCS1999] propose to group the

    itemsets obtained by the mining stage into clusters of URL references. These clusters are

    aimed at real-time web page personalization. A hypergraph is inferred from the mined

    itemsets, where the nodes correspond to pages and the hyperedges connect pages in an

    itemset. The weight of a hyperedge is given by the confidence of the rules involved. The

    graph is subsequently partitioned into clusters, and an active user session is matched

    against these clusters. For each URL in the matching clusters a recommendation score is

    computed, and the recommendation set is composed of all the URLs whose

    recommendation score is above a specified threshold.

    In Buchner et al. [BBAMH1999] a new approach, in the form of a process, is proposed to

    find marketing intelligence from Internet data. An n-dimensional web log data cube is

    created to store the collected data. Domain knowledge is incorporated into the data

    cube in order to reduce the pattern search space. They proposed an algorithm to extract

    navigation patterns from the data cube. The patterns conform to pre-specified

    navigation templates whose use enables the analyst to express his knowledge about the

    field and to guide the mining process. This model does not store the log data in compact

    form, and that can be a major drawback when handling very large daily log files.

    Information on how customers are using a Web site is critical for marketers of electronic

    commerce businesses. Buchner et al [BM1998] have presented a knowledge discovery

    process in order to discover marketing intelligence from Web data. They define a Web

    log data hypercube that consolidates Web usage data along with marketing data for

    electronic commerce applications. Four distinct steps are identified in the customer

    relationship life cycle that can be supported by their knowledge discovery techniques:

    customer attraction, customer retention, cross sales, and customer departure.

    Masseglia et al [MPC1999] proposed an integrated tool for mining access patterns

    and association rules from log files. The techniques implemented pay particular attention

    to the handling of time constraints, such as the minimum and maximum time gap

    between adjacent requests in a pattern. The system provides a real-time generator of

    dynamic links, which aims at automatically modifying the hypertext organization when

    user navigation matches a previously mined rule.

    Fundamental methods of data cleaning and preparation have been well studied by

    Srivastava et al [SCDT2000]. The main techniques traditionally used for modeling usage

    patterns in a Web site are collaborative filtering (CF), clustering of pages or user sessions,

    association rule generation, sequential pattern generation, and Markov models. The

    prediction step is the real-time processing of the model, which considers the active user

    session and makes recommendations based on the discovered patterns. The time spent

    on a page is a good measure of the user's interest in that page, providing an implicit

    rating for it [GO2003]. If a user is interested in the content of a page, she will likely

    spend more time there compared to the other pages in her session. The authors of [GO2003]

    presented a new model that uses both the sequence of visited pages and the time spent

    on those pages, which reflects the structural information of a user session and handles

    two-dimensional information.
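
    The idea of using viewing time as an implicit rating can be illustrated with a
    minimal sketch. It is not the model of [GO2003]; the normalization (dividing by the
    longest dwell time in the session) and the function name implicit_ratings are
    assumptions made only for illustration.

```python
def implicit_ratings(session):
    """Map each page in one session to a rating in [0, 1].

    `session` is a list of (page, seconds_spent) pairs. Repeated visits to
    the same page are summed. The rating is the page's dwell time divided by
    the longest dwell time in the session (an assumed normalization).
    """
    totals = {}
    for page, seconds in session:
        totals[page] = totals.get(page, 0.0) + seconds
    longest = max(totals.values(), default=0.0)
    if longest <= 0:
        return {page: 0.0 for page in totals}
    return {page: seconds / longest for page, seconds in totals.items()}


# Toy session: the user lingers on the CD product page.
session = [("/home", 10), ("/products", 40), ("/products/cd", 80), ("/home", 10)]
ratings = implicit_ratings(session)
```

    Ratings are relative to the session, so a slow reader and a fast reader who
    distribute their attention in the same way receive the same ratings.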

    Data preprocessing consists of data filtering, user identification, session/transaction

    identification, and topology extraction. Data filtering removes noise, i.e.,

    unsuccessful requests, automatically downloaded graphics, or requests from robots, to

    get more compact training data. Users are commonly identified with heuristic rules

    based on IP addresses, cookies, and similar information. Preprocessing consists of

    converting the usage,

    content, and structure information contained in the various available data sources into

    the data abstractions necessary for pattern discovery.
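
    The filtering, user identification, and session identification steps can be
    sketched as follows. This is an illustration under stated assumptions, not the
    procedure of any cited system: requests are already parsed into dictionaries, a
    status of 400 or above counts as unsuccessful, the suffix and robot lists are
    invented, users are approximated by the (IP address, agent) pair, and a session
    ends when two consecutive requests are more than thirty minutes apart.

```python
from datetime import datetime, timedelta

IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")  # assumed list
ROBOT_HINTS = ("bot", "crawler", "spider")                  # assumed list
TIMEOUT = timedelta(minutes=30)


def clean(requests):
    """Drop unsuccessful requests, embedded graphics, and robot traffic."""
    kept = []
    for r in requests:
        if r["status"] >= 400:
            continue
        if r["url"].lower().endswith(IGNORED_SUFFIXES):
            continue
        if any(hint in r["agent"].lower() for hint in ROBOT_HINTS):
            continue
        kept.append(r)
    return kept


def sessionize(requests):
    """Group requests by (ip, agent) and split on gaps longer than TIMEOUT."""
    by_user = {}
    for r in sorted(requests, key=lambda r: r["time"]):
        by_user.setdefault((r["ip"], r["agent"]), []).append(r)
    sessions = []
    for user, reqs in by_user.items():
        current = [reqs[0]]
        for prev, nxt in zip(reqs, reqs[1:]):
            if nxt["time"] - prev["time"] > TIMEOUT:
                sessions.append((user, [x["url"] for x in current]))
                current = []
            current.append(nxt)
        sessions.append((user, [x["url"] for x in current]))
    return sessions


t0 = datetime(2003, 9, 1, 12, 0)
log = [
    {"ip": "10.0.0.1", "agent": "Mozilla", "time": t0,
     "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "Mozilla", "time": t0 + timedelta(seconds=5),
     "url": "/logo.gif", "status": 200},
    {"ip": "10.0.0.1", "agent": "Mozilla", "time": t0 + timedelta(minutes=2),
     "url": "/a.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "Mozilla", "time": t0 + timedelta(minutes=50),
     "url": "/b.html", "status": 200},
    {"ip": "10.0.0.2", "agent": "ExampleBot", "time": t0,
     "url": "/index.html", "status": 200},
    {"ip": "10.0.0.1", "agent": "Mozilla", "time": t0 + timedelta(minutes=51),
     "url": "/missing.html", "status": 404},
]
sessions = sessionize(clean(log))
```

    In the toy log, the graphic, the robot request, and the failed request are
    filtered out, and the remaining three page views of the single user fall into two
    sessions because of the 48-minute gap.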

    Usage Preprocessing: Usage preprocessing works on data that describe the usage of Web

    pages, such as IP addresses, page references, and the date and time of accesses

    [SCDT2000]. Typically, the usage data comes from an Extended Common Log Format

    (ECLF) server log.


    Content Preprocessing: Content preprocessing consists of converting the text,

    images, scripts, and multimedia data into forms that are useful for the web usage

    mining process. Often this consists of performing content mining such as classification

    or clustering. In the context of web usage mining, the content of Web sites can be used

    to filter the input to the pattern discovery algorithms [SCDT2000].

    Structure Preprocessing: Web structure mining analyses the link structure of the

    web in order to identify relevant documents [SCDT2000]. The structure of a site is

    created by the hypertext links between page views. Intra-page structure information

    includes the arrangement of various HTML or XML tags within a given page. The

    principal kind of inter-page structure information is hyper-links connecting one page to

    another. The Google Search engine [GOO] makes use of the web link structure in the

    process of determining the relevance of a page. The Google search engine achieves

    good results because, while the keyword similarity analysis ensures high precision, the

    use of a probability measure ensures high quality of the pages returned.

    The information provided by the data sources listed above can be used to construct a

    data model consisting of several data abstractions, notably users, page views, click-

    streams, server sessions, and episodes [RWC2000]. A page view is defined as all of the

    files that contribute to the client-side presentation seen as the result of a single mouse

    click of a user. A click-stream is then the sequence of page views that are accessed by

    a user. A server session is the click-stream for a single visit of a user to a

    Web site. Finally, an episode is a subset of page views from a server session. Data can

    be collected at the server level, client level, or proxy level, or obtained from an

    organization's database. Each type of data collection differs not only in terms of the

    location of the data source, but also in the kinds of data available, the segment of

    population from which the data was collected, and its method of implementation.

    The usage data collected at the different sources, such as the server level, client level, and

    proxy level, represent the navigation patterns of different segments of the overall Web

    traffic [SCDT2000].

    Server-Level Collection: A Web server log records the browsing behavior of site

    visitors [SCDT2000]. The data recorded in server logs reflect the concurrent and

    interleaved access of a Web site by multiple users. These log files can be stored in

    various formats such as Common Log Format (CLF) or Extended Common Log Format

    (ECLF). ECLF contains client IP address, user ID, time/date, request, status, bytes,

    referrer, and agent. Tracking individual users is not an easy task due to the stateless

    connection model of the HTTP protocol. In order to handle this problem, Web servers

    can also store other kinds of usage information, such as cookies, in separate logs or

    appended to the CLF or ECLF logs. Cookies are tokens generated by the Web server for

    individual client browsers in order to automatically track the site visitors. Packet sniffing

    technology (also referred to as network monitors) is an alternative to

    collecting usage data through server logs. Packet sniffers monitor network traffic coming

    to a Web server and extract usage data directly from TCP/IP packets. Besides usage

    data, the server side also provides access to the site files, e.g. content data,

    structure information, local databases, and Web page meta-information such as the size

    of a file and its last modified time.
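
    As an illustration of the ECLF fields listed above, the following sketch parses one
    log line into a record. The exact field layout is an assumption: it follows the
    widely used "combined" layout (which also carries an identity field before the
    user ID), and the sample line, host names, and the name parse_eclf are invented
    for the example.

```python
import re
from datetime import datetime

# Assumed layout: ip ident user [time] "request" status bytes "referrer" "agent"
ECLF_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_eclf(line):
    """Return a dict of ECLF fields, or None if the line does not match."""
    m = ECLF_PATTERN.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    rec["bytes"] = 0 if rec["bytes"] == "-" else int(rec["bytes"])
    return rec


line = ('192.0.2.7 - alice [08/Dec/2003:10:15:32 -0600] '
        '"GET /index.html HTTP/1.0" 200 2326 '
        '"http://example.com/start.html" "Mozilla/4.08"')
record = parse_eclf(line)
```

    Lines that do not match the assumed layout yield None, so malformed entries can be
    counted or discarded during data cleaning rather than aborting the run.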

    Client-Level Collection: Client-side collection can be implemented by using a remote

    agent (such as JavaScript or Java applets) or by modifying the source code of an

    existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities

    [SCDT2000].

    Proxy-Level Collection: The Internet Service Provider (ISP) machine that

    users connect to through a modem is a common form of proxy server. A web proxy acts

    as an intermediary between client browsers and Web servers. Proxy-level caching can

    be used to reduce the loading time of a Web page experienced by users as well as

    the network traffic load at the server and client sides.

    Pattern Discovery: Pattern discovery uses methods and algorithms developed from

    several fields such as statistics, data mining, machine learning and pattern recognition

    [SCDT2000]. Zaiane et al. [ZXH1998] proposed the use of On-Line Analytical Processing

    (OLAP) technology in web usage mining. OLAP and the data cube structure offer a

    highly interactive and powerful data retrieval and analysis environment. The knowledge

    that can be discovered is represented in the form of rules, tables, charts, graphs, and

    other visual presentation forms for characterizing, comparing, predicting, or classifying

    data from the web access log. Visualization can also be used in web usage mining, and

    it presents the data in a way that can be understood by users more easily.

    Statistical Analysis: Statistical techniques are the most common method to extract

    knowledge about visitors to a web site. By analyzing the session file, one can perform

    different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on

    variables such as page views, viewing time, and length of a navigational path. For

    example, e-Trade developed a German-language website for Germany and scrapped it

    because German visitors were using the English site rather than the German site. Many

    web traffic analysis tools produce a periodic report containing statistical information

    such as the most frequently accessed pages, average view time of a page, or average

    length of a path through a site. This type of knowledge can be potentially useful for

    improving the system performance, enhancing the security of the system, facilitating the

    site modification task, and providing support for marketing decisions. Many

    commercial tools are available for statistical analysis.
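
    The descriptive statistics named above can be computed directly from a session
    file. The sketch below uses only the Python standard library; the session data
    and the keys of the summary dictionary are invented for illustration.

```python
from collections import Counter
from statistics import mean, median

# Each session is a list of (page, seconds_viewed) pairs; toy data.
sessions = [
    [("/home", 12), ("/products", 45), ("/cart", 30)],
    [("/home", 8), ("/products", 60)],
    [("/home", 20)],
]

page_hits = Counter(page for s in sessions for page, _ in s)
path_lengths = [len(s) for s in sessions]
view_times = [secs for s in sessions for _, secs in s]
start_pages = Counter(s[0][0] for s in sessions)

most_frequent_page, hits = page_hits.most_common(1)[0]
summary = {
    "most_frequent_page": most_frequent_page,
    "hits": hits,
    "mean_path_length": mean(path_lengths),
    "median_view_time": median(view_times),
    "most_common_start": start_pages.most_common(1)[0][0],
}
```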

    Association Rules: Association rule generation can be used to relate pages that are

    most often referenced together in a single server session [SCDT2000]. In the context

    of web usage mining, association rules refer to sets of pages that are accessed together

    with a support value exceeding some specified threshold. Association rule mining has

    been well studied in data mining, especially for basket transaction data analysis. Many

    association rule algorithms have been used, such as Apriori and Partition [MHD2003]. Aside

    from being applicable to e-Commerce, business intelligence, and marketing applications,

    association rules can help web designers restructure their web sites. Results on the

    usefulness of such rules in supermarket transactions or in web applications have not been

    reported. Constraints can also be placed on the mining process to prune the extracted

    rules. Association rules may also serve as a heuristic for prefetching documents in

    order to reduce user-perceived latency when loading a page from a remote site. In

    electronic CRM, an existing customer can be retained by dynamically creating web offers

    based on associations with threshold support and/or confidence values [BM1998].
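
    Support and confidence for page associations can be sketched as follows. The code
    is not Apriori or Partition: it is restricted to pairs of pages and counts them by
    brute force, which is adequate only for small examples. The thresholds, the
    function name pair_rules, and the sessions are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5, min_confidence=0.6):
    """Return rules (a, b, support, confidence), each read as 'a => b'.

    support    = fraction of sessions containing both a and b
    confidence = sessions containing both / sessions containing a
    """
    n = len(sessions)
    page_sets = [set(s) for s in sessions]
    single = Counter(p for ps in page_sets for p in ps)
    pairs = Counter(
        pair for ps in page_sets for pair in combinations(sorted(ps), 2)
    )
    rules = []
    for (a, b), count in pairs.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / single[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return sorted(rules)


sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = pair_rules(sessions)
```

    In this toy data every session that contains /cart also contains /products, so the
    rule /cart => /products has confidence 1.0, while the reverse rule has confidence
    2/3 because /products also appears without /cart.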

    Clustering: Clustering is a technique to group together a set of items having similar

    characteristics [SCDT2000]. Clustering can be performed on either the users or the page

    views. Clustering analysis in web usage mining aims to find clusters of users, pages,

    or sessions from the web log file, where each cluster represents a group of objects with

    common interests or characteristics. User clustering is designed to find user groups

    that have common interests based on their behaviors, and it is critical for user

    community construction. Page clustering is the process of clustering pages according to

    the users' access to them. Knowledge of user clusters is especially useful for inferring user

    demographics in order to perform market segmentation in e-Commerce applications or

    provide personalized web content to the users. On the other hand, clustering of pages

    will discover groups of pages having related content. This information is useful for

    Internet search engines and Web assistance providers. In both applications, permanent

    or dynamic HTML pages can be created that suggest related hyperlinks to the user

    according to the user's query or past history of information needs. The intuition is that if

    the probability of visiting one page, given that another page has also been visited, is high,

    then the two pages can be grouped into one cluster. For session clustering, all the sessions

    are processed to find interesting session clusters. Each session cluster may correspond to

    one interesting topic within the web site. Mobasher et al [MCS1999] generated

    recommendations from URL clusters to build an adaptive web site by using ARHP

    (Association Rule Hypergraph Partitioning).
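
    The conditional-probability intuition for page clustering can be made concrete
    with a small sketch. This is not ARHP or any other cited algorithm: it estimates
    the probability of visiting one page given another from session co-occurrence and
    greedily merges pages for which that probability is high in both directions. The
    threshold, function names, and sessions are assumptions.

```python
def conditional_visit_probability(sessions, page, given):
    """Estimate P(page visited | given visited) from session page sets."""
    with_given = [set(s) for s in sessions if given in s]
    if not with_given:
        return 0.0
    return sum(page in s for s in with_given) / len(with_given)


def group_pages(sessions, threshold=0.6):
    """Greedy grouping: merge two pages when the conditional visit
    probability is at least `threshold` in both directions."""
    pages = sorted({p for s in sessions for p in s})
    cluster_of = {p: {p} for p in pages}
    for i, a in enumerate(pages):
        for b in pages[i + 1:]:
            if (conditional_visit_probability(sessions, a, b) >= threshold
                    and conditional_visit_probability(sessions, b, a) >= threshold):
                merged = cluster_of[a] | cluster_of[b]
                for p in merged:
                    cluster_of[p] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


sessions = [
    ["/cd", "/dvd"],
    ["/cd", "/dvd", "/help"],
    ["/help", "/faq"],
    ["/faq", "/help"],
    ["/cd"],
]
clusters = group_pages(sessions)
```

    Requiring a high probability in both directions is a design choice of this sketch;
    it keeps a very popular page from absorbing every page that happens to co-occur
    with it.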

    Abraham et al [AR2003] proposed an ant-clustering algorithm to discover web usage

    patterns and a linear genetic programming approach to analyze the visitor trends. They

    proposed a hybrid framework, which uses an ant colony optimization algorithm to cluster

    Web usage patterns. The raw data from the log files are cleaned and preprocessed and

    the ACLUSTER algorithm is used to identify the usage patterns. The developed clusters

    of data are fed to a linear genetic programming model to analyze the usage trends.

    WebCANVAS (Web Clustering Analysis and VisuAlization of Sequences) [CHMSW2003]

    presented a new methodology for exploring and analyzing navigation patterns on a

    web site. The patterns that can be analyzed consist of sequences of URL categories

    traversed by users. In their approach, they first partitioned site users into clusters such

    that users with similar navigation paths through the site are placed into the same cluster.

    The clustering approach they employed was model-based (as opposed to

    distance-based) and partitioned users according to the order in which they request web pages.

    Another feature of their use of model-based clustering is that learning time scales

    linearly with sample size. In contrast, agglomerative distance-based methods scale

    quadratically with sample size.

    The purpose of knowledge discovery from user profiles is to find clusters of similar

    interests among the users [SZAS1997]. If the site is well designed, there will be a strong

    correlation between the similarity of the navigation paths and the similarity of the users'

    interests. Therefore, clustering of the former could be used to cluster the latter. The

    definition of the similarity is application dependent. The authors provide an overview of a

    powerful path clustering method called path mining. This approach is suitable for

    knowledge discovery in databases with partial ordering in their data. In this method,

    first a general path feature space is characterized. Then a similarity measure among the

    paths over the feature space is introduced. Finally, this similarity measure is used for

    clustering. They implemented the path-mining algorithm to cluster the

    navigation paths detected by the profiler. This algorithm computes a scalar

    similarity between paths. These similarity values can be fed to standard

    data-mining algorithms to cluster the user interests.

    C l as s i f i c a t i on : Classification is the task of mapping a data item into one of severalpredefined classes [SCDT2000]. In the internet marketing, a customer can be classified

    as no customer, visitor once and visitor regular based on their browsing patters and

    discovered rules for attracting the customers by displaying special offers [BM98].

    In the web domain, one is interested in developing a profile of users belonging to a

    particular class or category. This requires extraction and selection of features that best

    describe the properties of a given class or category. Classification can be done by using

    supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian

    classifiers, k-nearest neighbor classifiers, and Support Vector Machines. For example,

    classification on server logs may lead to the discovery of interesting rules such as: 30%

    of users who placed an online order in /Product/Music are in the 18-25 age group and

    live on the west coast. Classification algorithms such as C4.5, CART, BAYES, and

    RIPPER can be used to predict whether a page is of interest to the user.
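
    To make the idea of supervised classification concrete, the sketch below implements a
    small k-nearest neighbor classifier, one of the algorithm families named above. The two
    features (visits per month, pages per visit) and the training points are made up for
    illustration; only the three class labels come from the text, and the function name
    knn_classify is hypothetical.

```python
from collections import Counter
from math import dist


def knn_classify(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train: list of (feature_tuple, label); query: feature_tuple.
    """
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy, made-up training data: (visits per month, pages per visit) -> class.
train = [
    ((0, 1), "no customer"),
    ((1, 2), "no customer"),
    ((1, 8), "visitor once"),
    ((2, 10), "visitor once"),
    ((12, 9), "visitor regular"),
    ((15, 12), "visitor regular"),
    ((10, 7), "visitor regular"),
]

label = knn_classify(train, (11, 8))
```

    A real deployment would first extract and select such features from server logs and
    registration data, which is the harder part of the task described above.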

    Sequential Patterns: The technique of sequential pattern discovery attempts to find

    inter-session patterns such that the presence of a set of items is followed by another

    item in a time-ordered set of sessions or episodes [SCDT2000]. A new algorithm, MiDAS

    (Mining Internet data for Associative Sequences) for discovering sequential patterns

    from web log files has been proposed that provides behavioral marketing intelligence for

    e-commerce scenarios [BBAMH1999]. MiDAS consists of three phases: (1) the a priori phase

    prepares the input data through data reduction and data type substitution; (2) the

    discovery phase discovers the sequences of hits and generates the pattern tree; and (3) the

    a posteriori phase filters out all sequences that do not fulfill the criteria laid down in

    the specified navigation templates and topology network, and also performs pruning.

    By using this approach, Web marketers can predict future visit patterns, which

    will be helpful in placing advertisements aimed at certain user groups. Other types of

    temporal analysis that can be performed on sequential patterns include trend analysis,

    change point detection, or similarity analysis.
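
    The basic notion of a sequential pattern can be sketched as follows. This is not MiDAS;
    it is a minimal support count, under the assumption that each user is represented by one
    time-ordered list of pages and that a pattern is an ordered pair (a, b) in which a occurs
    somewhere before b. The function name, the toy sequences, and the support threshold are
    hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_ordered_pairs(sequences, min_support):
    """Count, per sequence, each ordered pair (a, b) where a occurs before b,
    and keep pairs whose support (fraction of sequences) >= min_support."""
    counts = Counter()
    for seq in sequences:
        # combinations() preserves input order, so (a, b) means a came first.
        pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
        counts.update(pairs)  # a set, so each sequence counts a pair once
    n = len(sequences)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Toy, made-up visit sequences (one time-ordered list of pages per user).
sequences = [
    ["home", "catalog", "cart", "checkout"],
    ["home", "catalog", "checkout"],
    ["home", "search", "catalog"],
    ["catalog", "cart"],
]
patterns = frequent_ordered_pairs(sequences, min_support=0.5)
```

    Here the pair (home, catalog) is supported by three of the four sequences. Longer
    patterns and time constraints would require a full sequence mining algorithm.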

    Oyanagi et al. [OKN2002] explore issues in sequence mining of WWW access logs.

    The Apriori algorithm is well known as a typical algorithm for

    sequence pattern mining. However, it suffers from inherent difficulties in finding long

    sequential patterns and in finding interesting patterns among a huge amount of results.

    Their paper proposes a new method for finding sequence patterns by matrix clustering.

    This method decomposes a sequence into a set of sequence elements, each of which

    corresponds to an ordered pair of items. Then matrix clustering is applied to extract a

    cluster of similar sequences. The resulting sequence elements are composed into a

    graph. The Web Utilization Miner, WUM [SS1998], uses an efficient data structure called an

    Aggregated Tree to store the user sessions, and it also provides a query language to

    extract interesting patterns from the aggregated session data. WUM employs an

    innovative technique for the discovery of navigation patterns over an aggregated

    materialized view of the web log. After performing the classical preparation steps (i.e.,

    user and session identification), the user sessions are merged into an Aggregated Tree. An

    Aggregated Tree is a tree constructed by merging trails with the same prefix. WUM

    provides a query language called MINT that lets users specify queries concerning

    the content, structure, and statistics of navigation patterns. MINT supports the

    specification of criteria of a statistical, structural, and textual nature. The WEBMINER tool

    [CMS1997] provides a query language on top of external mining software for association

    rules and for sequential patterns.
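
    The idea of merging trails that share a prefix can be illustrated with a simple prefix
    tree. The sketch below is an assumption-laden simplification, not WUM's actual data
    structure or the MINT language: the class TrailNode and the functions
    build_aggregated_tree and support are hypothetical names, and each node stores only the
    number of sessions passing through it.

```python
class TrailNode:
    """One node of a prefix tree over sessions; `count` is the number of
    sessions whose trail passes through this node."""

    def __init__(self, page):
        self.page = page
        self.count = 0
        self.children = {}


def build_aggregated_tree(sessions):
    """Merge sessions that share a prefix into common branches."""
    root = TrailNode(None)
    for session in sessions:
        root.count += 1
        node = root
        for page in session:
            if page not in node.children:
                node.children[page] = TrailNode(page)
            node = node.children[page]
            node.count += 1
    return root


def support(root, trail):
    """Number of sessions that start with the given trail (0 if none)."""
    node = root
    for page in trail:
        node = node.children.get(page)
        if node is None:
            return 0
    return node.count


# Toy, made-up sessions.
sessions = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e"],
    ["b", "c"],
]
tree = build_aggregated_tree(sessions)
```

    Because shared prefixes are stored once, a question such as how many sessions begin with
    a followed by b is answered by walking two edges and reading one counter.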

    Dependency modeling: Dependency modeling is another useful pattern discovery

    task in web mining [SCDT2000]. The goal here is to develop a model capable of

    representing significant dependencies among the various variables in the web domain.

    As an example, one may be interested in building a model representing the different stages

    a visitor undergoes while shopping in an online store based on the actions chosen (i.e.,

    from a casual visitor to a serious potential buyer). There are several probabilistic learning

    techniques that can be employed to model the browsing behavior of users. Such

    techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of

    Web usage patterns will not only provide a theoretical framework for analyzing the

    behavior of users but is potentially useful for predicting future Web resource

    consumption. Such information may help develop strategies to increase the sales of

    products offered by the Web site or improve the navigational convenience of users.

    Borges and Levene [BL1999] proposed a formal data mining model, the Hypertext Probabilistic

    Grammar (HPG), to capture user web navigation patterns. User sessions are represented

    as an HPG whose higher-probability strings correspond to the navigation trails preferred by

    the user. An HPG is a Markov model, which assumes that

    the probability of a link being chosen depends more on the contents of the page being

    viewed than on all the previous history of sessions [LL1999]. Note that this assumption

    can be weakened by making use of the N-gram concept or dynamic Markov chain

    techniques. There are situations in which a Markov assumption is realistic, such as, for

    example, an online newspaper where a user chooses which article to read in the sports

    section independently of the contents of the front page. However, there are also cases

    in which such an assumption is not very realistic, such as, for example, an online tutorial

    providing a sequence of pages explaining how to perform a given task.
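
    The Markov assumption discussed above can be sketched with a first-order Markov chain
    estimated from sessions. This is only the first-order assumption, not a full HPG, which
    has additional grammar components not modeled here; the function names and the toy
    sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page pairs."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {
        cur: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for cur, nexts in counts.items()
    }


def trail_probability(probs, trail):
    """Probability of following `trail` link by link, given its first page."""
    p = 1.0
    for cur, nxt in zip(trail, trail[1:]):
        p *= probs.get(cur, {}).get(nxt, 0.0)
    return p


# Toy, made-up sessions from an online newspaper.
sessions = [
    ["front", "sports", "article1"],
    ["front", "sports", "article2"],
    ["front", "world", "article3"],
    ["front", "sports", "article1"],
]
probs = transition_probabilities(sessions)
```

    Under this model the probability of a trail is the product of its link probabilities, so
    trails with higher products correspond to the preferred navigation paths. An N-gram
    variant would condition each link on the previous N-1 pages instead of only the current
    one.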

    Deviation/Outlier Detection: This task comprises techniques aimed at detecting unusual

    changes in the data relative to the expected values. Such techniques are useful, for

    example, in fraud detection, where the inconsistent use of credit cards can identify

    situations where a card is stolen. The inconsistent use of credit card could be noted if

    there were transactions performed in different geographic locations within a given time


    Pattern analysis: Pattern analysis is the last step in the overall Web usage mining

    process as described in Figure 2. The motivation behind pattern analysis is to filter out

    uninteresting rules or patterns from the set found in the pattern discovery phase. The

    exact analysis methodology is usually governed by the application for which Web mining

    is done. The most common form of pattern analysis consists of a knowledge query

    mechanism such as SQL. Another method is to load usage data into a data cube in order

    to perform OLAP operations. Visualization techniques, such as graphing patterns or

    assigning colors to different values, can often highlight overall patterns or trends in the

    data. Content and structure information can be used to filter out patterns containing

    pages of a certain usage type, content type, or pages that match a certain hyperlink


    4. Summary and Future Research Directions

    This paper has attempted to provide an up-to-date survey of the rapidly growing area of

    Web usage mining, which is in demand given current technology. A general overview of

    Web usage mining is presented in the introduction section. Web usage mining is

    used in many areas such as e-Business, e-CRM, e-Services, e-Education, e-Newspapers,

    e-Government, Digital Libraries, advertising, marketing, bioinformatics and so on. The

    major classes of recommendation services are based on the discovery of navigational

    patterns of users. The main techniques for pattern discovery are sequential patterns,

    association rules, classification, clustering, and path analysis. The basic components

    of Web usage mining, the taxonomy of web mining, the architecture of web usage mining, its

    individual components, and detailed research in this area by researchers such as

    Jaideep Srivastava, Bamshad Mobasher, Robert Cooley, Cyrus Shahabi, Ming-Syan Chen,

    and A.G. Büchner are described in the detailed section.

    With the growth of Web-based applications, specifically e-commerce, there is significant

    interest in analyzing Web usage data. As the web mining area is growing fast, there is

    considerable demand for web usage mining, and there is a need to develop a common

    framework comparable to J2EE and .NET. The Cross Industry Standard Process for Data

    Mining (CRISP-DM) project has developed an industry- and tool-neutral data mining process

    model [CRISP-DM]. A similar process model or framework needs to be

    developed to create interest among new researchers, business strategists,

    and developers. We need a systematic web-site design methodology to create new web

    pages, or modify existing web pages, such that different users' navigation patterns could

    be better mapped to answers to a set of specific questions. There is a need to develop

    tools that incorporate statistical methods, visualization, and human factors to help

    better understand the mined knowledge. Since the output of knowledge mining

    algorithms is often not in a form suitable for direct human consumption, there is a need

    to develop techniques and tools for helping an analyst better assimilate it. One of the

    open issues in data mining in general, and Web mining in particular, is the creation of

    intelligent tools that can assist in the interpretation of mined knowledge. Clearly, these

    tools need to have specific knowledge about the particular problem domain to do any

    more than filtering based on statistical attributes of the discovered rules or patterns.

    More research needs to be done on applying Web usage mining in e-Commerce,

    bioinformatics, computer security, Web intelligence, intelligent learning, database

    systems, finance, marketing, healthcare, and telecommunications.

    5. Bibliography

    [AR2003]. Ajith Abraham, Vitorino Ramos, Web Usage Mining Using Artificial Ant

    Colony Clustering and Linear Genetic Programming, to appear in CEC03

    - Congress on Evolutionary Computation, IEEE Press, Canberra, Australia,

    8-12 Dec. 2003.

    [BBAMH1999]. A.G. Büchner, M. Baumgarten, S.S. Anand, M.D. Mulvenna, J. G. Hughes,

    Navigation Pattern Discovery from Internet Data, in WEBKDD, San Diego,

    CA 1999.

    [BL1999]. José Borges, Mark Levene, Data Mining of User Navigation Patterns,


    [BM1998]. A.G. Büchner, M.D. Mulvenna, Discovering Internet Marketing Intelligence

    through Online Analytical Web Usage Mining, ACM SIGMOD Record, Vol. 27, No.

    4, pp. 54-61, 1998.

    [CHMSW2003]. I. V. Cadez, D. Heckerman, C. Meek, P. Smyth, S. White,

    Model-based clustering and visualization of navigation patterns on a Web

    Site, Journal of Data Mining and Knowledge Discovery, 7(4), 2003.

    (extended version of ACM SIGKDD 2000 conference paper).

    [CMS1997]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Web Mining:

    Information and Pattern Discovery on the World Wide Web (A Survey

    Paper) (1997), in Proceedings of the 9th IEEE International Conference

    on Tools with Artificial Intelligence (ICTAI'97), November 1997

    [CMS1999]. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava, Data

    Preparation for mining world wide web browsing patterns, Knowledge

    and information Systems 1(1),1999.

    [CRHL2002]. Chi E.H., Rosien A. and Heer J., LumberJack: Intelligent Discovery and

    Analysis of Web User Traffic Composition. In Proceedings of ACM-SIGKDD

    Workshop on Web Mining for Usage Patterns and User Profiles, Canada,

    ACM press, 2002.

    [CRISP-DM]. .

    [CTS1999]. Robert Cooley, Pang-Ning Tan, Jaideep Srivastava, WebSIFT: The Web

    Site Information Filter System (1999). Proceedings of the Web Usage

    Analysis and User Profiling Workshop, August 1999.

    [GO2003]. Sule Gunduz, M. Tamer Ozsu, A Web Page Prediction Model Based on

    Click-Stream Tree Representation of User Behavior, The Ninth ACM

    SIGKDD International Conference on Knowledge Discovery and Data

    Mining. Washington, DC, USA, August 24 - 27, 2003.

    [GOO]. Google Search Engine

    [HH1999]. Robert J. Hilderman and Howard J. Hamilton, Knowledge discovery and

    interestingness measures, A survey, Technical Report, University of

    Regina, 1999.

    [JJYK1999]. Joshi K. P., Joshi A., Yesha Y., Krishnapuram, R., Warehousing and

    Mining Web Logs, Proceedings of the 2nd ACM CIKM Workshop on Web

    Information and Data Management, pp. 63-68, 1999.

    [JTB2002]. Jespersen S.E., Thorhauge J., and Pedersen T.B., A Hybrid Approach to Web

    Usage Mining, Data Warehousing and Knowledge Discovery (DaWaK02),

    LNCS 2454, Springer Verlag, Germany, pp. 73-82, 2002.

    [JTP2002]. Soren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen, A Hybrid

    Approach to Web Usage Mining, Technical Report 02-5002, Department

    of Computer Science Aalborg University, July 2002.

    [MHD2003]. Margaret H. Dunham, Data Mining Introductory and Advanced Topics,

    Prentice Hall, 2003.

    [LL1999]. Levene, M. and Loizou, G. Computing the entropy of user navigation in

    the web, Department of Computer Science, University College London,



    [MCS1999]. Bamshad Mobasher, Robert Cooley, Jaideep Srivastava, Creating

    Adaptive Web Sites Through Usage-Based Clustering of URLs, in

    Proceedings of the 1999 IEEE Knowledge and Data Engineering Exchange

    Workshop (KDEX'99), November 1999

    [MPC1999]. Masseglia, F., Poncelet, P., and Cicchetti, R., 1999a, Webtool: An

    Integrated framework for data mining, In proceedings of the Ninth

    International Conference on Database and Expert System Application

    (DEXA99), Florence, Italy, August, 1999.

    [OKN2002]. Shigeru Oyanagi, Kazuto Kubota, Akihiko Nakase, Mining WWW Access

    Sequence by Matrix Clustering, SIGKDD Explorations, Volume 4, Issue 2,

    page 125.

    [RWC2000]. Robert W. Cooley, Web Usage Mining: Discovery and Application of

    Interesting Patterns from Web Data, Ph.D. Thesis, May 2000.

    [SCDT2000]. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pang-Ning Tan,

    Web Usage Mining: Discovery and Applications of Usage Patterns from

    Web Data, SIGKDD Explorations, Vol. 1, Issue 2, 2000.

    [SN2003]. Smith K.A. and Ng A., Web page clustering using a self-organizing map of

    user navigation patterns, Decision Support Systems, Volume 35 , Issue 2

    (May 2003) Special issue: Web data mining, Pages: 245-256.

    [SZAS1997]. Cyrus Shahabi, Amir M. Zarkesh, Jafar Adibi, and Vishal Shah, Knowledge

    Discovery from Users Web-page Navigation, IEEE RIDE 1997.

    [SS1998]. Myra Spiliopoulou and Lukas C. Faulstich, WUM: A Web Utilization Miner, in

    International Workshop on the Web and Databases (WebDB98), Valencia,

    Spain, March 1998.

    [ZA1997]. A. Zarkesh and J. Adibi, Pathmining: Knowledge discovery in partially

    ordered databases. Submitted to KDD-1997.

    [ZXH1998]. O. R. Zaiane, M. Xin, and J. Han, Discovering Web access patterns and

    trends by applying OLAP and data mining technology on Web logs, in Proc.

    Advances in Digital Libraries Conference (ADL'98), Santa Barbara, CA, April,


