DISCOVERING AND MINING USER WEB-PAGE TRAVERSAL
PATTERNS
by
Behzad Mortazavi-Asl
B.Sc., Simon Fraser University, 1999
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in the School
of
Computing Science
© Behzad Mortazavi-Asl 2001
SIMON FRASER UNIVERSITY
April 2001
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy or other means, without permission of the author.
Approval
Name: Behzad Mortazavi-Asl
Degree: Master of Science
Title of thesis: Discovering and Mining User Web-Page Traversal
Patterns
Examining Committee:
Dr. Lou Hafer
Chair
Dr. Jiawei Han, Senior Supervisor
Dr. Tsunehiko (Tiko) Kameda, Supervisor
Dr. Ke Wang, External Examiner
Date Approved: ______________________________
Abstract
As the popularity of the WWW explodes, a massive amount of data is
gathered by Web servers in the form of Web access logs. This is a rich
source of information for understanding Web user surfing behavior. Web
Usage Mining, also known as Web Log Mining, is an application of data
mining algorithms to Web access logs to find trends and regularities in
Web users’ traversal patterns. The results of Web Usage Mining have
been used in improving Web site design, business and marketing
decision support, user profiling, and Web server system performance.
In this thesis we study the application of assisted exploration of
OLAP data cubes and scalable sequential pattern mining algorithms to
Web log analysis. In multidimensional OLAP analysis, standard
statistical measures are applied to assist the user at each step to explore
the interesting parts of the cube. In addition, a scalable sequential
pattern mining algorithm is developed to discover commonly traversed
paths in large data sets. Our experimental and performance studies
have demonstrated the effectiveness and efficiency of the algorithm in
comparison to previously developed sequential pattern mining
algorithms. Finally, some further research avenues in Web usage
mining are identified.
Dedication
To my parents
Acknowledgments
I would like to thank my supervisor Dr. Jiawei Han for his support,
sharing of his knowledge and the opportunities that he gave me. His
dedication and perseverance has always been exemplary to me. I am
also grateful to TeleLearning for getting me started in Web Log Analysis.
I owe a debt of gratitude to Dr. Jiawei Han, Dr. Tiko Kameda and
Dr. Wo-shun Luk for supporting my decision to continue my graduate
studies. I am also grateful to Dr. Tiko Kameda for accepting to be my
supervisory committee member. He has been most generous and
understanding with his time to read this thesis carefully and make
insightful comments and suggestions. I would like to thank Dr. Ke Wang
for being my external examiner. It was with his sharing of knowledge
and experience that I was able to improve the performance of the
algorithms to the level they are at now.
Many people have helped and contributed their time to the
research of this thesis. Many thanks go to Jian Pei, who significantly
influenced the direction of my research. I am grateful to him for the
countless hours of constructive discussions, the opportunities for
research collaboration, and his informative reviews of my presentation. I
offer my deepest thanks to Kum Hoe Tung who unselfishly helped me
over the years by sharing his knowledge.
I would like to especially thank all who have made my study at
School of Computing Science at Simon Fraser University possible and
have inspired me over the years and have made this experience a
memorable one.
Table of Contents
Approval .......................................................................................... ii
Abstract ........................................................................................... iii
Dedication ....................................................................................... iv
Acknowledgments ............................................................................ v
1. Introduction................................................................................. 1
1.1. Knowledge Discovery and Data Mining............................. 1
1.2. Motivation ....................................................................... 2
1.3. Thesis Outline ................................................................. 5
2. Related Work ............................................................................... 6
2.1. Data Sources ................................................................... 7
2.2. Web Usage Terms ............................................................ 9
2.3. Data Preprocessing .......................................................... 11
2.3.1. User Identification ............................................... 13
2.3.2. Session Identification .......................................... 13
2.3.3. Episode Identification .......................................... 14
2.4. Web Usage Mining ........................................................... 15
3. Discovering Web Access Patterns Using OLAP .............................. 18
3.1. Introduction .................................................................... 18
3.1.1. Web Log Dimensions ........................................... 18
3.1.2. Motivation ........................................................... 20
3.1.3. Contribution........................................................ 20
3.1.4. Data set .............................................................. 21
3.2. Partially Automated Exploration of Web Log Data Cubes.. 21
3.2.1. Architecture ........................................................ 22
3.2.2. Implementation ................................................... 23
3.2.3. Illustrative Example ............................................ 24
3.3. Summary......................................................................... 26
4. Sequential Analysis of Web Traversal Patterns.............................. 27
4.1. Introduction .................................................................... 27
4.2. Problem Statement .......................................................... 29
4.3. Related Work ................................................................... 34
4.3.1. GSP Algorithm..................................................... 35
4.3.2. PSP Algorithm ..................................................... 39
4.3.3. Memory Management .......................................... 40
4.3.4. SPADE Algorithm ................................................ 41
4.4. FreeSpan: Pattern Growth via frequent item lattice .......... 43
4.4.1. Basic Idea ........................................................... 44
4.4.2. Level by Level FreeSpan Algorithm....................... 45
4.4.3. Alternative-Level FreeSpan Algorithm .................. 49
4.5. PrefixSpan: Pattern Growth via frequent sequence lattice . 53
4.5.1. Basic Idea ........................................................... 53
4.5.2. Level-by-Level PrefixSpan Algorithm .................... 57
4.5.3. Pseudo Projection ................................................ 60
4.5.4. Scaling up ........................................................... 63
4.6. Summary......................................................................... 64
5. Performance Analysis and Discussions......................................... 65
5.1. Synthetic Datasets........................................................... 65
5.2. Berkeley Web Log Dataset................................................ 66
5.3. Comparison of PrefixSpan with PSP+ and FreeSpan ......... 67
5.4. Scale up .......................................................................... 72
6. Conclusions and Future Work...................................................... 74
6.1. Conclusions..................................................................... 74
6.2. Future Works .................................................................. 75
6.2.1. Real-time Multidimensional Analysis ................... 75
6.2.2. Scalability ........................................................... 75
6.2.3. Incremental Algorithm......................................... 76
6.2.4. Sequential Pattern Mining with Constraints......... 76
Bibliography .................................................................................... 78
List of Figures
1.1 The core steps of knowledge discovery process ........................ 1
2.1 Web Usage Data Sources ........................................................ 7
3.1 Starnet query model of a Web log data cube............................ 19
3.2 Architecture for assisted exploration of Web log data cube ...... 22
3.3 Partially automated assisted data exploration ......................... 23
3.4 Hit counts over two weeks period............................................ 24
3.5 Top drill paths using maximum standard deviation................. 25
3.6 Request hits over 24 hours of day 7 ........................................ 26
4.1 Web Access Sequence Database.............................................. 33
4.2 The GSP algorithm.................................................................. 35
4.3 PSP prefix-tree vs. GSP hash-tree structures........................... 39
4.4 SPADE Id-list intersection....................................................... 42
4.5 Transaction database of sequence database............................ 44
4.6 Frequent item lattice of running example................................ 45
4.7 Section of projection databases in FreeSpan-1 ........................ 46
4.8 The FreeSpan-1 mining algorithm........................................... 48
4.9 Frequent item matrix of running example after second pass.... 49
4.10 Projection databases in FreeSpan-2............................... 50
4.11 The alternative-level FreeSpan mining algorithm ........... 51
4.12 Prefix-based traversal of frequent sequence lattice......... 54
4.13 Projected databases of frequent 1-sequences ................. 56
4.14 The PrefixSpan-1 mining algorithm ............................... 58
4.15 Projection databases in PrefixSpan-1............................. 59
4.16 Pseudo Projection with virtual-memory window ............. 60
5.1 Distribution of Web log sequence length ................................. 67
5.2 Performance Comparison: Synthetic Dataset........................... 68
5.3 Performance Comparison: Synthetic Datasets ......................... 69
5.4 Performance Study: Long Synthetic Datasets .......................... 70
5.5 Performance Study: Web Log Dataset...................................... 72
5.6 Scale up: Number of Customers.............................................. 72
List of Tables
2.1 Web Log Field Description....................................................... 8
2.2 Web Usage Terms and Definitions........................................... 10
2.3 Sample ECLF Web log file ....................................................... 12
3.1 First order statistics by column .............................................. 25
5.1 Synthetic data generation program’s parameters..................... 65
5.2 Synthetic data sets ................................................................. 66
Chapter 1
Introduction
Web Usage Mining is the automatic discovery of user access patterns
from Web servers. Organizations collect large volumes of data in their
daily operations, generated automatically by Web servers and collected
in Web access log files. Analysis of these access data can
provide useful information for server performance enhancements,
restructuring a Web site, and direct marketing in e-commerce. In this
thesis we study the multidimensional analysis of Web logs and automatic
discovery of frequently traversed paths in Web sites.
1.1 Knowledge Discovery and Data Mining
Figure 1.1: The core steps of knowledge discovery process
[Figure 1.1 depicts the core steps: data → data integration → data
warehouse → selection of task-relevant data → data mining → pattern
evaluation.]
Data Mining and Knowledge Discovery in Databases (KDD) is defined as
the process of automatic extraction of implicit, novel, useful, and
understandable patterns in large databases. There are many steps
involved in the Data Mining process, which include data cleaning and
preprocessing, data integration, data selection, data transformation and
reduction, data-mining task and algorithm selection, and lastly post-
processing and interpretation of discovered knowledge [FPS 96a; FPS
96b]. This process tends to be highly interactive, incremental and
iterative. Figure 1.1 illustrates the core steps of the knowledge
discovery process.
In general there are two levels of data mining. The descriptive level
is more interactive, ad-hoc and query driven [FPS 96a]. It involves the
traditional multidimensional analysis and reporting operations that are
provided by OLAP techniques, such as drilling down, rolling up, slicing
and dicing, over the existing data in data warehouses. Predictive-level
data mining, however, is more automatic. It involves the application
of different data mining algorithms to task-relevant data to discover new
implicit patterns. The mining tasks performed by such algorithms
include: association rule mining, sequential rule mining, classification,
clustering and similarity search. In this thesis we look at the application
of both types of techniques to Web Usage Mining.
1.2 Motivation
The use of the World Wide Web as a means for marketing and selling has
increased dramatically in recent years. Almost every major company has
its own Web site that acts at least as a promotional tool to increase
awareness of the company and its products. As e-commerce activities
become more important, organizations must spend
more time to provide the right level of information to their customers.
How can one tell what content is being read, whether a Web site is
effective, or how users read the information?
Web Usage Mining is the application of established data mining
techniques to analyze Web site usage. For an e-commerce company this
means detecting future customers likely to make a large number of
purchases, or predicting which online visitors will click on what ads or
banners based on observation of prior visitors who have behaved both
positively and negatively to the advertisement banners.
The sources of data involved in such an analysis may include
the content and structure of the Web site, demographic data about the users,
and Web site usage data gathered in Web logs. In this thesis we mainly
focus on the available information in Web log files. Web log files,
however, are transaction oriented and were not designed with these
questions in mind. Therefore, Web log files need to be cleaned,
summarized and transformed before processing. In Chapter 2 we look at
the available sources of Web logs and the preprocessing tasks required
for these types of data sources. We will also look at some of the recent
and interesting work done in this field.
Most server analysis packages lack the ability to provide any true
business insights about visitors’ online behavior. Current traffic analysis
tools, like Accrue, Andromedia, HitList, NetIntellect, NetTracker, and
WebTrends provide high-level predefined reports about domain names, IP
addresses, browsers, cookies, and other server activities. These types of
reports aim at providing information on the activity of the server rather
than the user. Because of the time-variant and multidimensional
nature of Web logs, we can leverage existing OLAP technology and
provide a multidimensional browsing capability [ZXH98].
In Chapter 3 we explore the multidimensional analysis of Web log files,
which could also provide insight into the user behavior.
In addition, detecting user navigation paths and analyzing them
may result in a better understanding of how users visit a site, help identify
users with similar information needs, or even improve the quality of
information delivery on the WWW using dynamic or personalized Web pages.
Unfortunately, it is difficult to perform such user-oriented data mining
directly on raw user access logs, as these logs tend to be error-prone
(missing fields or values), incomplete and ambiguous. They are
incomplete due to the presence of cache hits, proxy servers, and the
stateless nature of the HTTP transport protocol, which make the task of
identifying users and their visits less precise. They are ambiguous because
they are transaction oriented and were designed simply to record the access
of each server resource, whereas we need to extract abstractions that are
suitable for Web user characterization analysis. They also lack the business
information required as a frame of context. Therefore, a careful
preprocessing stage is required to clean and prepare the data for Web
traversal analysis. We will expand on these shortcomings more in
Chapter 2, and in Chapter 4 we will develop a sequential pattern mining
algorithm suited for the long sequences and large data sets that can be
generated from the Web log files.
In those Chapters we will explain in more detail our motivations for
each part of the work.
1.3 Thesis Outline
This thesis is organized into 6 chapters. Chapter 2 introduces Web
Usage Mining and the most recent research related to our work. In
Chapter 3 we present our implementation of the multidimensional
analysis of Web log files. In Chapter 4 we analyze the existing sequential
pattern mining algorithms and develop a fast and efficient algorithm for
path traversal analysis of Web logs. In Chapter 5 we present our
experimental results and discuss the limitations and strengths of our
approach. In Chapter 6 we summarize the technical contributions of
this thesis and present possible directions for future research.
Chapter 2
Related Work
Web Mining is the application of data mining techniques to content or
activities related to WWW. Zaïane in his PhD thesis [Z99] classifies Web
Mining into three domains: Web Content Mining, Web Structure Mining
and Web Usage Mining. Web Content Mining is the process of extracting
knowledge from the content of Web sites, for example the contents of
documents or their descriptions. Web Structure Mining, on the other
hand, uses the links and references in Web pages to infer interesting
knowledge, such as identifying authoritative pages and hub pages. Authoritative
or high-quality pages are inferred from the number of hyperlinks pointing
to the page from different page authors on the Web. A hyperlink in this
case is considered the author's endorsement of the other page. The
collective endorsement of a given page by different authors may indicate
the importance of the page. A hub, on the other hand, is a Web page or
set of Web pages that provides a collection of links to authoritative pages.
Web Usage Mining, also known as Web Log Mining, mines Web access
logs for interesting patterns in WWW traffic. In this chapter, we will
review the recent advances in the field of Web Usage Mining. In Section
2.1 we introduce the available data sources for such a study. Section 2.2
identifies the published Web usage terms. The required preprocessing
steps are presented in Section 2.3. Finally, we study some of the
interesting works done in Web Usage Mining in Section 2.4.
2.1 Data Sources
Any type of Web usage mining requires having an accurate picture of the
WWW traffic. This section explores the available data sources and their
properties. As shown in Figure 2.1 the data sets commonly used for Web
usage mining are collected at the server-level, proxy-level or client-level.
Each data source differs in terms of format, accuracy, scope and method
of implementation.
Figure 2.1: Web Usage Data Sources
Client-level logs hold the most accurate account of user behavior
over the WWW. They can be implemented using applets, JavaScript,
cookies and modified browsers, and are by far the most controversial
methods in terms of user privacy issues [SZA+97, EV97]. If a client
connection is through an Internet Service Provider (ISP) or is located
behind a firewall, its activities may be logged at this level. The primary
function of proxy servers and firewalls is to serve either as a measure of
security to block unwanted users or as a cache resource to reduce
network traffic by reusing their most recently fetched files. Their log files
may include many clients accessing many Web servers. In the log files,
their client request records are interleaved in their received order. The
process of logging is automatic and requires less intervention compared
to client-level logging. Its format is dependent on the logging software.
Its accuracy is however diminished by the client-level cache as some
requests are not received by the proxy and are served from the most
recently fetched files stored at the client computer. Server-level logs are
the most commonly used source for Web usage mining. Most Web
servers provide the option of storing log files in the Common Log Format
(CLF), Extended Common Log Format (ECLF) or proprietary format
[LOU95]. Table 2.1 describes the fields in both formats.
Table 2.1: Web Log Field Description

Term         Description
Remote host  remote host name or IP address
Rfc931       remote login name of the client
Auth. user   server-authenticated client name
Date         date and time of the request
Offset       local time offset from Greenwich time
Method       method of the request (GET, POST, HEAD, etc.)
URI          full page address or request as it came from the client
Protocol     HTTP communication protocol used by the client
Status       HTTP server status sent to the client
Bytes        number of bytes transferred
Referrer     URI that the request originated from
Agent        OS and browser software at the client

Every CLF log entry has the following format:
HostID rfc931 authuser [date offset] “method URI protocol” status bytes
ECLF logs additionally contain the referrer and agent fields. The following
is a sample ECLF log entry; note that the rfc931 and authuser fields are
empty (indicated by the value ‘-’):
142.107.30.180 - - [25/Mar/1999:23:01:41 –800] “GET my.html HTTP/1.1” 200
4219 /Index.html Mozilla(IE4.2, Win95)
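In practice, such entries are parsed field by field before any analysis. The
following is a minimal, illustrative Python sketch (not part of the original
thesis) of parsing an ECLF entry with a regular expression; the field names
follow Table 2.1, and the pattern assumes well-formed entries without
embedded spaces in the URI:

import re

# ECLF entry: host rfc931 authuser [date offset] "method URI protocol"
#             status bytes referrer agent
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: (?P<referrer>\S+) (?P<agent>.+))?$'
)

def parse_eclf_line(line):
    """Return a dict of ECLF fields, or None if the line does not match."""
    match = ECLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None

entry = parse_eclf_line(
    '142.107.30.180 - - [25/Mar/1999:23:01:41 -800] '
    '"GET my.html HTTP/1.1" 200 4219 /Index.html Mozilla(IE4.2, Win95)'
)
print(entry["host"], entry["uri"], entry["status"])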
However, there exist some drawbacks to using server-level Web logs:
• Due to client and proxy-level caches, not all hits are captured in
  the server-level Web logs.
• Page view time duration may be inaccurate. If a hit is answered by
  a client or proxy cache (a cache hit), then the view time of the
  previous page will be interpreted as longer than it actually was.
• When the user id is not available and clients are behind a proxy,
  many hits will be recorded with the same host name (the proxy’s host
  name). As a result, page views may appear erratic, with very short
  viewing times.
One method to reduce cache hits is to ensure that the page view name
(URL) or the request line is unique across different visits. This can be
accomplished by dynamically generating each Web page and appending a
unique session id to each hyperlinked Web page name, as sketched after
this paragraph. For a complete
discussion of the shortcomings of the current log standard and potential
solutions the interested reader can refer to [P97].
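As an illustration only (not from the thesis), a short Python sketch of
appending a session id to hyperlinks so that every visit produces a distinct
request line; the parameter name 'sid' is a hypothetical choice:

from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def tag_link(href, session_id):
    """Append a session id to a hyperlink so repeated visits generate
    distinct request lines, reducing undetected cache hits."""
    parts = urlparse(href)
    query = dict(parse_qsl(parts.query))
    query["sid"] = session_id          # 'sid' is an illustrative name
    return urlunparse(parts._replace(query=urlencode(query)))

print(tag_link("/catalog/books.html", "a91f3c"))
# /catalog/books.html?sid=a91f3c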
2.2 Web Usage Terms
In order to create some consistency in the discussions that follow, we
adopt the Web term definitions published by the World Wide Web Consortium's
Web Characterization Activity (W3C WCA) [WCA99] that are relevant
to Web usage mining. They are listed in Table 2.2.
A user is identified as an individual or an automated application,
such as a Web Crawler, that is accessing files from different servers over
WWW. As mentioned in the previous section, due to proxy servers many
users may have the same host name and in most cases the rfc931 and
authuser fields in Web logs may be empty. In such cases, only the host
name, in combination with the agent information if available, is used to
identify users.
Term            Description
Server          A role adopted by an application supplying resources.
Proxy           An intermediary application acting both as a server and a client.
Client          A role assumed by an application when retrieving resources from the server.
User            A person using a client application to interact with and retrieve resources from the server.
User session    Set of user clicks across one or more servers.
Server session  Set of user clicks as recorded by a single server (also known as a visit).
Episode         Subset of related user clicks in a user session.
Web page        Collection of resources identified by a single URI.
Page view       The rendered Web page in a specific client application.
Click-stream    A sequential series of page views by a user.

Table 2.2: Web Usage Terms and Definitions
A page view consists of a set of resources, such as one or more
html files, graphics, etc., used to render a requested URI in the user's
browser. Each resource is logged separately in the Web log as it is
delivered. A page view is usually the result of a user's single mouse
click on a hyperlink. Most page views contain multiple frames, html and
graphics files, some of which may be used in multiple page views. This
makes the job of identifying a page view more difficult. In addition, more
and more Web pages are generated dynamically using, for example, a CGI script
with different parameters. In these cases, we must also take into
consideration the parameters of the call in order to map each request to
a different page view.
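A hedged illustration (not from the thesis) of mapping a requested URI to
a page-view identifier, where dynamically generated pages are distinguished
by their query parameters; the example parameter names are hypothetical:

from urllib.parse import urlparse, parse_qsl

def page_view_id(uri):
    """Map a requested URI to a page-view identifier; CGI calls with
    different parameters map to different page views."""
    parts = urlparse(uri)
    params = tuple(sorted(parse_qsl(parts.query)))
    return (parts.path, params)

print(page_view_id("/cgi-bin/2.cgi?course=354&week=3"))
print(page_view_id("/cgi-bin/2.cgi?course=354&week=4"))  # a different page view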
A click-stream is defined as a time-ordered list of page views.
A user's click-stream over the entire Web is called the user session,
whereas the server session is defined as the subset of clicks made by a
user on a particular server, also known as a visit. As mentioned
in the previous section, in the presence of cache hits it is difficult to
reconstruct an accurate picture of the user’s click-stream.
Within the click-stream of a visit or a user session, an episode is
defined as a set of sequentially or semantically related clicks. This
relation depends on the goals of the study.
2.3 Data Preprocessing
One of the important core steps of knowledge discovery is data
preprocessing. The main goal of this step is to create minable objects for
knowledge discovery despite the presence of ambiguities and
incompleteness in data. This step is highly data-source dependent. The
techniques used to overcome these shortcomings may vary greatly from
one data source to the other. Therefore, in this section we focus on
techniques used to preprocess server-level Web access log files, namely
CLF and ECLF. A sample ECLF Web log is shown in Table 2.3.
In the previous section we mentioned some of the shortcomings of
using Web access log files with respect to W3C WCA Web usage terms.
In general, there are three tasks to be performed in the preprocessing
stage: User, session and, optionally, episode identification.
Table 2.3: Sample ECLF Web log file
(fields: host name, rfc931, auth. user, date, “method URI protocol”,
status, bytes, referrer, agent)

 1  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
 2  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET A.gif HTTP/1.1” 200 2890 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
 3  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET B.html HTTP/1.1” 200 1300 http://www.cs.sfu.ca/I.html Mozilla/3.0 (IE5.0; WinNT)
 4  142.59.243.146 - - [08/Feb/2001:19:51:06 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.04 (Win95,I)
 5  142.59.243.146 - - [08/Feb/2001:19:51:07 –800] “GET A.gif HTTP/1.1” 304 - http://www.cs.sfu.ca/index.html Mozilla/3.04 (Win95,I)
 6  142.59.243.146 - - [08/Feb/2001:19:51:07 –800] “GET D.html HTTP/1.1” 200 1814 http://www.cs.sfu.ca/I.html Mozilla/3.04 (Win95,I)
 7  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET C.html HTTP/1.1” 200 1210 http://www.cs.sfu.ca/D.html Mozilla/3.04 (Win95,I)
 8  142.59.243.146 - - [08/Feb/2001:19:52:24 –800] “GET E.html HTTP/1.1” 404 - - Mozilla/3.0 (IE5.0; WinNT)
 9  142.59.243.146 - - [08/Feb/2001:19:52:25 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
10  142.59.243.146 - - [08/Feb/2001:19:52:34 –800] “GET A.gif HTTP/1.1” 304 - http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
11  142.59.243.146 - - [08/Feb/2001:19:52:34 –800] “GET H.html HTTP/1.1” 200 2762 http://www.cs.sfu.ca/I.html Mozilla/3.04 (Win95,I)
12  142.59.243.146 - - [08/Feb/2001:19:52:45 –800] “GET B.html HTTP/1.1” 200 1300 http://www.cs.sfu.ca/A.html Mozilla/3.04 (Win95,I)
13  142.59.243.146 - - [08/Feb/2001:19:53:01 –800] “GET 2.cgi? HTTP/1.1” 301 152 - Mozilla/3.04 (Win95,I)
14  142.59.243.3 - - [08/Feb/2001:19:53:06 –800] “GET G.html HTTP/1.1” 200 1680 http://www.cs.sfu.ca/B.html Mozilla/3.0 (IE5.0; WinNT)
15  142.59.243.3 - - [08/Feb/2001:19:53:18 –800] “GET F.html HTTP/1.1” 200 2937 http://www.cs.sfu.ca/B.html Mozilla/3.0 (IE5.0; WinNT)
2.3.1 User Identification
In the best case, we can rely on the values in fields rfc931 and/or
authuser to accurately identify a user. But in most cases, fields rfc931
and authuser are empty. In the absence of such information, host name
and user agent information are the only available choices to identify a
user. This assumes that every user has a unique IP address and that
only one type of browser is operated from it. However, this is not
necessarily true. As stated in [SCD+00], the following cases break this
assumption:
• As shown in Figure 2.1, several users may access a server
  through a single proxy, potentially at the same time.
• Some ISPs or privacy tools randomly assign an IP address to
  each user’s request.
• Some repeat users access the Web each time from a different
  machine.
• A user may operate many browsers of different types on the same
  machine, potentially at the same time.
Given the sample Web log shown in Table 2.3 and using the combination
of host name and agent fields, IP address 142.59.243.146 is responsible
for two users (lines 1-3, 8-10 and 4-7, 11-13) and IP address
142.59.243.3 for one (lines 14-15).
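A minimal Python sketch of this user-identification heuristic (an
illustration, not the thesis's code); each log entry is assumed to be a dict
with 'host', 'agent' and a numeric 'time' field, e.g. as produced while
parsing the log:

from collections import defaultdict

def identify_users(entries):
    """Approximate users by the combination of host name and agent
    string, as described above."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["host"], entry.get("agent", ""))].append(entry)
    for clicks in users.values():
        clicks.sort(key=lambda e: e["time"])  # time-order each click-stream
    return users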
2.3.2 Session Identification
Once a user has been identified or approximated, the ordered click-
stream of each user must be divided into server sessions or visits. This is
done by identifying the last page view of each visit. Without an explicit
sign-out event or access to the complete user session, page view time can
be used to determine whether a user has continued the same visit by
selecting another page. Catledge and Pitkow [CP95] have studied user
page view time over the WWW and have recommended thirty minutes of
inactivity as an indication of a sign-out event. Note that since a user may
not be interested only in the pages of one site or potentially leave and re-
enter the same site at different intervals, session identification may also
become a difficult task.
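The thirty-minute rule can be sketched as follows (an illustration under
the assumption that each click carries a numeric 'time' field in seconds):

def split_sessions(clicks, max_gap=30 * 60):
    """Split one user's time-ordered click-stream into visits, starting a
    new visit after more than max_gap seconds of inactivity (thirty
    minutes, following [CP95])."""
    sessions, current = [], []
    for click in clicks:
        if current and click["time"] - current[-1]["time"] > max_gap:
            sessions.append(current)       # previous visit ended by inactivity
            current = []
        current.append(click)
    if current:
        sessions.append(current)
    return sessions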
2.3.3 Episode Identification
Finally, after the user session has been identified, we can optionally
break it into semantically meaningful subsets called episodes. Maximal
Forward Reference is one such relation [WYB98]. It is defined as the set of
page views traversed up to the page view where a backward reference is
made. For example, for a user session that contains the ordered pages
A-B-A-C-D-C, the maximal forward references for the session would be
A-B and A-C-D. In an e-commerce site an episode can also be defined using page
type and site structure, or particular action, such as clicking an
advertising icon or adding an item to the customer's basket. This is based
on the motivation of finding the sequence of pages that lead to a particular
action or page. Cooley et al. [CMS97] also introduce the Reference Length
Module and the Time Window Module as means of episode identification. The
former only allows a set maximum view time for each page, based on its
classified page type of navigation or content. A new episode is started
when the view time of a page exceeds its maximum set view time. The
latter, on the other hand, only allows a maximum time window for each
episode. It is based on the assumption that a meaningful episode has an
average time length associated with it.
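The maximal forward reference split can be sketched as follows (an
illustration, not the thesis's code; it assumes an episode restarts from the
revisited page, consistent with the A-B-A-C-D-C example above):

def maximal_forward_references(page_views):
    """Split a visit's ordered page views into maximal forward references:
    a forward path ends as soon as a page already on it is revisited."""
    episodes, path, extending = [], [], False
    for page in page_views:
        if page in path:
            if extending:
                episodes.append(list(path))       # forward path ended here
            path = path[: path.index(page) + 1]   # back up to the revisited page
            extending = False
        else:
            path.append(page)
            extending = True
    if extending:
        episodes.append(path)                     # last forward path of the visit
    return episodes

print(maximal_forward_references(["A", "B", "A", "C", "D", "C"]))
# [['A', 'B'], ['A', 'C', 'D']]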
2.4 Web Usage Mining
Having introduced the data sources, the terms and the required
preprocessing steps for Web Usage Mining, we now turn our attention to
the recent advances in the field of Web Usage Mining. In recent years
there has been an increasing amount of research in
Web usage mining [MT96, YJM+96, C97, CMS97, CPY98, ZXH98,
WYB98, JJK99, BBA+99, CMS99, MPC99, MPT99, PHM+00, BL00a,
BL00b, KB00, MPT00, SCD+00]. The main motivation of these studies is
to get a better understanding of the reactions and motivations of
customers who shop through the electronic premises of a company on the WWW
or of users who are simply browsing these premises. Some studies also
apply the mining results to improve the design of Web sites, analyze
system performance and network communications or even build adaptive
Web sites. In general, there are two main goals in the application of
discovered knowledge in Web usage mining: General Access Pattern
Tracking for understanding access patterns and trends and Customized
Usage Tracking for adapting and personalizing browsing experience for
the users. The former is the goal of this study.
One of the first works in Web log analysis belongs to [YJM+96],
where a non-sequential, weighted vector of the pages visited by a user is
used to assign the user to an existing cluster of users. The system then
dynamically suggests links based on the set of pages visited by other
users in the same cluster. The authors of [MT96] consider a Web log as a
single sequence of events and propose an algorithm to discover frequent
event sequences, where fields in a Web log record are different attributes
of an event. The authors in [CMS97, CPY98, WYB98] present algorithms
to identify episodes in user sequences as discussed in the previous
section and apply association and sequential pattern mining algorithms
respectively to mine the user episodes. In Chapter 4 we will present in
more detail the related work in sequential pattern mining.
Since the size of Web log files grows quite rapidly, normal
algorithms may not be scalable. In addition, the contents of most web
sites change over time, rendering some parts of Web logs irrelevant for
the current analysis. Also the goal of analysis may change over time with
the business needs. Therefore, on the one hand, there is a need,
perhaps both in terms of memory and disk space, for scalability
improvements, and, on the other hand, for the introduction of
constraints into mining algorithms, such as time constraints, concept
hierarchies and frequent pattern templates, for discovering relevant and
correct knowledge.
One scalable and flexible approach to mining Web log files is with
the use of data warehouses, data cubes and OLAP techniques. On-Line
Analytical Processing (OLAP) and data cubes [GBL+96] have recently
become accepted as powerful tools for strategic analysis of databases in
business settings. Han et al. [ZXH98] have shown that some of the
analysis needs of Web usage data can be done using data warehouses
and OLAP techniques. In the preprocessing stage of this approach, the
parser does not filter out any records. All fields of the Web log records
are loaded into a relational table. Since Web log data tend to be very large,
some level of summarization makes ad hoc query and analysis feasible,
using OLAP operations such as roll-up and drill-down over significant
amounts of data. In Chapter 3 we will present our work on
multidimensional analysis of Web log files.
There are also several commercially available Web server log
analysis tools, such as [WEB95, GEN96, MIN00]. Webtrends provides
limited reporting mechanisms, such as user and page access statistics.
NetGenesis and Easyminer, on the other hand, are more comprehensive
and include sequential and clustering algorithms as well as various
visualization packages in their products. However, they do not allow
integration of Web server log data with existing business data as in
[ZXH98]. Their algorithms are questionable in terms of scalability and do
not provide the facilities for verification-driven data mining such as
OLAP.
Chapter 3
Discovering Web Access Patterns Using OLAP
In this chapter we present the design and implementation of our Web log
data mining system for multi-dimensional analysis of Web log data. It is
a simple yet scalable and effective method for such an analysis. For this
purpose, we use the following data mining techniques: data cubes
[GBL+96], On-Line Analytical Processing (OLAP) [C93, CD97], and a
partially automated discovery-driven method for data exploration.
3.1 Introduction
Web log data cubes are constructed to give the user the flexibility of
viewing data from different perspectives and performing ad hoc analytical
queries. A typical Web log ad hoc analysis example is querying how
overall usage of the Web site has changed in the last quarter, testing whether most
server requests have been answered, hopefully with expected or low level
of errors. If some weeks or days are worse than the others, the user
might navigate further down into those levels, always looking for some
reason to explain the observed anomalies. At each step, the user might
add or remove some dimension, changing their perspective, select subset
of the data at hand, drill down, or roll up, and then inspect the new view
of the data cube again. Each step of this process signifies a query or
hypothesis, and each query follows the result of the previous step.
3.1.1 Web Log Dimensions
Figure 3.1 depicts a Starnet query model of a Web log data cube. This
starnet shows ten dimensions and four measures that can be constructed
directly from the Web log; the hierarchies built for the file dimension (URL
of the requested resource) are based on the site’s directory structure, while
the file type, action and host address dimensions have predefined
hierarchies.
Figure 3.1: Starnet query model of a Web log data cube
For more business-oriented analysis, attributes from other data
sources, such as marketing and sales databases, can be added to this
model. Additional attributes may include user demographic data, such as
address, age, race, education and income, and business data, such as
product, revenue and marketing strategies, or class, conference, tutorial
and instructor.
[Figure 3.1 shows a starnet with the dimensions Time (minute, hour, day,
week, year), File (directory levels 1–3, file name), File type (file extension
and levels), Host/IP address (host name and levels 1–3), Method (method
name), Protocol (protocol name), Server status (status code), User (user id),
Agent (agent) and Action, together with the measures byte size, hits, view
time and session count.]
3.1.2 Motivation
OLAP data cubes, especially those built from Web logs, are often too large to browse for
the purpose of decision support, even after some level of summarization
of the raw data. In the paradigm of assisted discovery-driven exploration
of data, researchers focus on full or partial automation of data
exploration process [K95, SAM98, S99, S00, SS00]. Often an analyst
explores the OLAP data cubes looking for small regions of data that are
indicative of rare events, exceptions or new opportunities. But
sometimes, the analyst explores the data cubes to simply monitor strong
existing trends or to look for new ones. Typically, the user starts by
selecting a set of dimensions at a certain aggregation level, and visually
inspects the data or its graphs. At this point the user may be modifying
his/her hypothesis or needs, or forming new ones. Based on that, the
user changes his/her footprint on the Starnet model and inspects the
new view of the data cube again, and so on. This is a daunting task and,
without help, the user can easily overlook a potential discovery or take
the wrong turn at any point.
3.1.3 Contribution
During such explorations, a user normally can only analyze and
understand a few dimensions at a time. Therefore, most of the existing
works aim at fully automating the process by pre-mining the data. They
provide different statistical models, such as deviation from the norm using
multivariate analysis, to find exceptions, taking into account the effects
of multiple dimensions and their hierarchies. However, most users are
not statistics experts and may not understand the meaning of such
exceptions.
Our aim is to partially automate the process by providing simple aids
based on first-order statistics functions, so that the user can make more
informed decisions at each step of the exploration. We have developed a
Cell Statistics Annotation Add-in component that extends the functionality
of Microsoft Excel’s PivotTable and has been integrated into the OLAP
mining system of DBMiner [HFW+96, HCC+97].
3.1.4 Data set
In this work we have selected the Web access logs gathered by the Computer
Science Department of the University of California, Berkeley as our test data
set. The Web logs span the first two weeks of June 1999 and
display the following characteristics: 165 MB in size, 47,472 unique host
names, 48,600 unique URLs, 296 different file types (extensions) and
900,037 requests. Using the file extensions we have grouped the files
into 9 different categories, including audio, compressed, document, dynamic,
html, images, java and video. For those IP addresses that have been
resolved into host names, we have grouped them according to their
rightmost two or three characters into different Internet domains. The
following 7 dimensions and two measures have been used to construct a
Web log OLAP data cube using Microsoft OLAP Server: time, file, file type,
host, method, protocol and server status dimensions and bytes and hits
measures.
3.2 Partially Automated Exploration of Web Log Data Cubes
Motivated by the widespread popularity of Microsoft Excel, its PivotTable
and Microsoft OLAP Server, which together facilitate powerful multidimensional
analysis of data warehouses, we have developed a Cell Annotation Add-in
component for Microsoft Excel.
3.2.1 Architecture
An Add-in component is written in VBA (Visual Basic for Applications) for a
particular Microsoft Office application; once loaded, the component extends
the functionality of that application.
Figure 3.2: Architecture for assisted exploration of Web log data cube
As shown in Figure 3.2, in this architecture the cell annotation add-in
component exists only within the confines of Microsoft Excel application,
relying on the objects that Excel exposes and its own GUI objects to
communicate with the user.
[Figure 3.2 shows the pipeline: Web log files pass through data cleaning &
transformation, filtering and data integration into a database and data
warehouse; data cubes are served by the OLAP Server or stored as cub
files; the PivotTable API, the Excel user GUI API and the Add-in GUI API
connect them to the Cell Statistics Annotation Add-in.]
PivotTable is another Microsoft object that
allows a multidimensional view of data cubes. We use the embedded
PivotTable object within Excel and its API to query the underlying data
sources, be they OLAP Server data cubes or cub files (off-line data
cubes). Currently, the tasks of cleaning Web logs and creating data
cubes are done by separate processes.
3.2.2 Implementation
The Add-in component provides two types of functionalities: “statistics on
current page” and “assisted drilling”. Figure 3.3 shows the interface for
these functionalities.
Figure 3.3: Partially automated assisted data exploration
Current page statistics provides a summary of different regions of the data
(row, column or plane) that is displayed for the current footprint of
starnet query model. The summaries include min-max, median,
standard deviation, variance and average by row, column and plane. The
summary results and annotated cells, where applicable, help the user to
understand the current data. Assisted drilling provides the means for the
user to choose a drill path. The statistical functions included for this
functionality include: min, max, min median, max median, min variance,
max variance, min std dev, max std dev, min average, and max average.
Once the user has selected a statistical function and dimensions for
drilling, the component performs the drill function on all the members of
the selected dimensions and uses the data in that plane to calculate the
value for the chosen statistical function. By comparing all the calculated
values, Assisted drilling then highlights up to five cells, representing the
potential drill paths.
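The assisted-drilling step can be illustrated with a small Python sketch
(the actual add-in is written in VBA against the PivotTable API; this is only
an illustration, and the hit counts used below are hypothetical). Each
candidate drill path is represented by the list of cell values it would
produce:

from statistics import pstdev

def rank_drill_paths(candidate_planes, stat=pstdev, top_k=5):
    """Score each candidate drill path by the chosen statistic over the
    cell values it yields, and return the top paths to highlight."""
    scored = [(stat(values), path)
              for path, values in candidate_planes.items() if values]
    scored.sort(reverse=True)
    return [path for score, path in scored[:top_k]]

planes = {"200": [105322, 83264, 18221],
          "403": [2281, 763, 93],
          "404": [7219, 3712, 631]}
print(rank_drill_paths(planes, top_k=2))   # paths with maximum std deviation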
3.2.3 Illustrative Example
Figure 3.4 shows part of Berkeley’s Web log cube. In it, the Excel
table shows part of the cube with 2 dimensions: DateTime and
ServerStatus, where DateTime is at level Day. Server status 200 indicates
a successful request, while 403 and 404 mean a request for a forbidden
URL and a request for a URL that was not found, respectively.
Figure 3.4: Hit counts over a two-week period
When browsing this view, the user may pay special attention to areas
that display anomalous trends. We can easily notice that hits during week 24
start much lower than in week 23 but increase sharply over the next two
days and then decrease dramatically. Using the program, we gather the
first order statistics of the current view by column which are summarized
in the following table.
Function    200                403          404
Min         18,221             93           631
Max         105,322            2,281        7,219
Median      83,264 & 83,245    763 & 782    3,712 & 3,810
Average     75,906.2           840.8        3,732.1
Std. dev.   27,754.8           585.0        1,654.7
Table 3.1: First order statistics by column
We can see that the data in all columns are skewed and that the hit counts
for server statuses 403 and 404 are unusually high on day 7 relative to
successful hits. Instead of drilling down on all available paths and
inspecting their standard deviation, the user can invoke the program to
find them. Figure 3.5 highlights the top 5 such drill paths.
Figure 3.5: Top drill paths using maximum standard deviation
Drilling down along the DateTime dimension over day 7 yields 72
different hit counts as shown in Figure 3.6. Notice how the error hits are
high over a period of three hours. This shows how our partially
automated data exploration method, even though not very complex, can
alleviate the problems associated with data exploration.
Figure 3.6: Request hits over 24 hours of day 7
3.3 Summary
Many Web log analysis tools only provide a simple set of statistics, such
as hit counts and distributions based on time, file and geographic
regions. Wwwstat (http://www.ics.uci.edu/pub/websoft/wwwstat) and
Analog (http://www.statlab.cam.ac.uk/~sret1/analog) are among this
type of tools. We believe this type of analysis still provides some benefit
to the user. Therefore, in this chapter we presented a system that
combines multidimensional data cube, OLAP and partially automated
discovery-driven method for data exploration technologies to provide a
more scalable and powerful tool for such analysis.
Chapter 4
Sequential Analysis of Web Traversal Patterns
After cleaning the Web log files, sequential pattern discovery techniques
[MTV95, SA96] can be applied to discover ξ-frequent patterns. In this
chapter we present a new efficient and scalable sequential pattern-
mining algorithm, called PrefixSpan with Pseudo-Projection.
4.1 Introduction
The mining of sequences of Web traversal patterns aims to discover a set of
attributes, such as Web page views, shared across time among a large
number of user sequences in a given Web access log database. For
example, consider the server-level Web access log database, where the
desired attributes represent users' Web page views, and each record
represents a Web user's time-ordered set of visits over a period of
time. The discovered patterns are the page views most frequently
accessed within and across visits by the Web users. For example:
“The Data Webhouse Toolkit” book’s page view occurs before the
“Data Mining Your Website” book’s page view 71% of the time in
the same visit.
The mining task of finding all frequently occurring traversal ξ-patterns in
a large database is quite challenging. As proposed in [Z98, A01], the
search space can be identified with the powerset 2^(A×P), where A is the set
of desired attributes in the database and P is the length of the longest
ξ-frequent Web user sequence that can be discovered. In popular Web
sites such as Yahoo.com and amazon.com with millions of repeat
customers and thousands of available Web pages, the length of
sequences grows very long, increasing the search space exponentially.
Furthermore, not all records in the database necessarily contribute to the
support of ξ-frequent sequences. Therefore, in large databases there also
exists the problem of minimizing disk I/O. Yet, most sequential pattern
mining algorithms [SA96, CPY98, MCP98] are iterative and step-wise in
nature, where after generating longer candidate sequences from the
previously found ξ-frequent sequences, the database is scanned again in
full to check the support of each candidate sequence. These algorithms
use large pointer based dynamic data structures, such as hash trees and
lists, where the task of choosing the place the memory is allocated from
is left to the operating system. It is observed that the access patterns of
these recursive data structures are highly ordered. Therefore, with
respect to access patterns of the internal data structures, the data
locality typically is not optimal in these algorithms [PZL98].
In this chapter, we present a new algorithm to overcome these
limitations. The main contributions are as follows:
1. We use a technique similar to the vertical id-list database format
   proposed in [Z98], called pseudo-projection, to manage projection
   databases and yet maintain high data locality.
2. We introduce a new measure, called selectivity, to control the physical
   materialization of projection databases, thereby minimizing
   unnecessary disk I/O.
Our algorithm not only strives to minimize I/O cost and maintain high
data locality but also reduces the search space by applying a prefix
pattern-growth method, which will be described in detail in Section 4.5.1.
In comparison with the SPADE algorithm of [Z98], which requires
preprocessing overhead to convert the database to the vertical format
and is only efficient when the database fits into memory, our method can
also handle gigabyte-size databases.
The rest of the chapter is organized as follows: In Section 4.2 we
describe the problem of mining sequences of Web traversal patterns,
followed by Section 4.3, where we look at the related work. We look at
our previously developed algorithm in Section 4.4 and then introduce our
new algorithm in Section 4.5. Finally, we conclude this chapter in
Section 4.6 by summarizing our work.
4.2 Problem Statement
Each Web server registers every request it serves in a Web log file. Each
entry of Web log represents an access to a single Web server resource,
such as an image or html file. As explained in Chapter 2, a Web log file
generally follows the Common Log Format (CLF) [LOU95]. In association
rule mining [AIS93, AS94], a transaction is defined as a set of items
bought by a customer in a single purchase. The sequential pattern
mining later introduced in [AS95, SA96] defines a sequence as a time-
ordered set of transactions, whereas in sequential mining of Web
traversal patterns, each Web log entry is a separate customer
transaction. Like Cooley in [C00], we identify a user visit as a set of page
views that are sufficiently close over time, using a maximum time gap
(Δmax t) specified by the user. We identify a page view as an html or a
dynamically generated file that is sufficiently far apart in time from the
previously identified page view, using a minimum time gap (Δmin t)
specified by the user. A Web access sequence is then defined as a
time-ordered set of visits.
Definition 4.2.1: Let R = {r_1, r_2, …, r_n} be the set of ids for the available
resources in a Web site. Let P = {p_1, p_2, …, p_m} be the set of possible
events or page views identified in a Web site, where p_i = {r_i1, r_i2, …, r_is}
and r_ij ∈ R for 1 ≤ i ≤ m, 1 ≤ j ≤ s. Let L = {l_1, l_2, …, l_q} be the set of log
entries (Web access transactions) in the server Web access log file.
Without loss of generality, we define each l ∈ L as a triple
l = (userid, time, resourceid), where l.resourceid ∈ R.
In an ideal situation, we can detect each user’s page view from the server
Web access log. However, due to the presence of cache-hits not all hits
are recorded in L. Therefore, we need to approximate a page view.
Definition 4.2.2: Let E = {e_1, e_2, …, e_m} be the set of identified Web log
page views or events, where each e_i ∈ E is defined as a triple
e_i = (userid, time, hits = (l_i1, l_i2, …, l_iu)) such that, for 1 ≤ i ≤ m and
1 ≤ j ≤ u: l_ij ∈ L, l_ij.userid = e_i.userid, l_i(j-1).time ≤ l_ij.time,
l_ij.time − l_i(j-1).time ≤ Δmin t, and e_i.time = max_{1≤j≤u}(l_ij.time). Let a
user visit be defined as a triple v = (userid, time, (e_1, e_2, …, e_q)), where,
for 1 ≤ i ≤ q: e_i.userid = v.userid, e_(i-1).time ≤ e_i.time,
e_i.time − e_(i-1).time ≤ Δmax t, and v.time = max_{1≤i≤q}(e_i.time). A visit
with k page views or events is said to have length |v| = k, or to be a k-visit.
Note that page views, not resource hits, are the basic minable units in Web
log visits. Figure 4.1 shows the user visits of our running example.
Definition 4.2.3: A user Web access sequence is a time-ordered set of
visits. It is defined as a triple s = (userid, time, (v_1, v_2, …, v_n)), where,
for 1 ≤ i ≤ n: v_i.userid = s.userid, v_(i-1).time < v_i.time, and
s.time = max_{1≤i≤n}(v_i.time). An access sequence s that consists of k page
views (k = Σ_{j=1..n} |v_j|) is also called a k-sequence. The Web Access
Sequence database is then defined as WAS = {s_1, s_2, …, s_m}, where each
s ∈ WAS is a user Web access sequence.
In comparison with the sequential pattern mining problem defined in
[SA96], user Web access sequences are made up of time-ordered item
sets or elements, where each item is a page view or an event. Items are
not necessarily unique in an element, however: repetition is allowed within
elements of a Web access sequence. For example, (1,1,2) and (1,2) are two
different elements.
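Under these definitions, visits and access sequences can be assembled
from identified page-view events with a few lines of code. The following is
an illustrative Python sketch (not the thesis's implementation); it assumes
each event is a (userid, time, page view) tuple and, for simplicity, groups
consecutive events into one visit whenever they are at most max_gap time
units apart:

from collections import defaultdict

def build_access_sequences(events, max_gap):
    """Group page-view events into visits (Definition 4.2.2) and visits
    into Web access sequences (Definition 4.2.3), using a maximum time
    gap between consecutive events of the same user."""
    by_user = defaultdict(list)
    for userid, time, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[userid].append((time, page))

    sequences = {}
    for userid, clicks in by_user.items():
        visits, current = [], [clicks[0][1]]
        for (prev_t, _), (t, page) in zip(clicks, clicks[1:]):
            if t - prev_t > max_gap:
                visits.append(tuple(current))   # a long gap starts a new visit
                current = []
            current.append(page)
        visits.append(tuple(current))
        sequences[userid] = visits
    return sequences

events = [(1, 10, 3), (1, 11, 4), (1, 15, 1), (1, 16, 2), (1, 17, 3)]
print(build_access_sequences(events, max_gap=2))  # {1: [(3, 4), (1, 2, 3)]}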
Definition 4.2.4: A sequence of events V′ = (e′_1, e′_2, …, e′_j) in a visit is a
subsequence of visit V = (e_1, e_2, …, e_k), j ≤ k, and V a super-sequence of
V′, denoted V′ ⊑_v V, if and only if there exist integers
1 ≤ i_1 < i_2 < … < i_j ≤ k such that e′_1 = e_i1, e′_2 = e_i2, …, e′_j = e_ij.
Event sequence V′ is a proper subsequence of V, denoted V′ ⊏_v V, if and
only if V′ ⊑_v V and V′ ≠ V. Similarly, access sequence
S′ = (v′_1, v′_2, …, v′_m) is a subsequence of access sequence
S = (v_1, v_2, …, v_n), m ≤ n, and S a super-sequence of S′, denoted
S′ ⊑_s S, if and only if there exist integers 1 ≤ i_1 < i_2 < … < i_m ≤ n such
that v′_1 ⊑_v v_i1, v′_2 ⊑_v v_i2, …, v′_m ⊑_v v_im. Access sequence S′ is a
proper subsequence of S, denoted S′ ⊏_s S, if and only if S′ ⊑_s S and
S′ ≠ S. Subsequence S′ is also called a prefix of S if and only if v′_i = v_i for
1 ≤ i < m and, for v′_m = (e′_1, e′_2, …, e′_j) and v_m = (e_1, e_2, …, e_k) with
j ≤ k, e′_l = e_l for 1 ≤ l ≤ j. From here on we will refer to the relations ⊑_v,
⊑_s, ⊏_v or ⊏_s simply as ⊑ or ⊏, respectively, when the distinction is
clear from the context.
For example, suppose a given Web user has accessed page views 1, 2, 3,
4, 5 according to the following sequence: S = ((1)(2,3)(4)(5)). This means
that only page views 2 and 3 were accessed in the same visit, and since
|S| = 5, the access sequence is a 5-sequence. Access sequence ((1)(2)) is a
subsequence of S, because (1) ⊑ (1) and (2) ⊑ (2,3). However, the sequence
((2)(3)) is not a subsequence of S. Access sequence ((1)(2)) is also a prefix
of S, but sequence ((1)(3)) is not.
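The subsequence relation and the support measure defined next can be
made concrete with a short Python sketch (an illustration, not the thesis's
code); sequences are represented as lists of visits, each visit a tuple of
page views:

def contains(visit, sub_visit):
    """True if the events of sub_visit occur, in order, within visit."""
    j = 0
    for event in visit:
        if j < len(sub_visit) and event == sub_visit[j]:
            j += 1
    return j == len(sub_visit)

def is_subsequence(sub, seq):
    """True if access sequence `sub` is a subsequence of `seq`
    (Definition 4.2.4)."""
    i = 0
    for visit in seq:
        if i < len(sub) and contains(visit, sub[i]):
            i += 1
    return i == len(sub)

def support(pattern, was):
    """sup(pattern) over a Web access sequence database (Definition 4.2.5)."""
    return sum(is_subsequence(pattern, s) for s in was) / len(was)

S = [(1,), (2, 3), (4,), (5,)]
print(is_subsequence([(1,), (2,)], S))   # True:  (1) in (1) and (2) in (2,3)
print(is_subsequence([(2,), (3,)], S))   # False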
Our aim is to detect frequent Web traversal ξ-patterns. For such a
task, we need to discover every sequence S in the WAS database whose
support value is above the user-defined minimum support value.
Definition 4.2.5: Let the support of access sequence S in the database
WAS = {s_1, s_2, …, s_m} be defined as sup_WAS(S) = |{s_i | S ⊑ s_i}| / m,
also denoted sup(S) when the database is clear from the context. Given a
user-defined minimum support value ξ, sequence S is said to be frequent,
or a ξ-pattern, if the condition sup(S) ≥ ξ holds. A ξ-frequent sequence S
is called maximal if there exists no ξ-frequent sequence S′ such that
S ⊏ S′.
Note that any given sequence in the database only contributes to the
support of sequential pattern S once, even if the subsequence S occurs
many times in that sequence.
Problem Statement: Given a Web access sequence database and a
minimum support threshold ξ, the problem of Web traversal pattern
mining is to enumerate the complete set of ξ-frequent sequences.
Running Example: Consider the Web access sequence database shown
in Figure 4.1 that is used as a running example in this chapter. The
database contains the set of page views {1, 2, 3, 4, 5, 6, 7, 8} identified in
a given Web site, and has 4 users and ten visits in total. The figure also
shows the 0.5-frequent 1-, 2-, 3- and 4-sequences
Figure 4.1: Web Access Sequence Database

User visits:
Userid  Time  Visit
1       10    (3,4)
1       15    (1,2,3)
1       20    (1,2,6)
1       25    (1,3,4,6)
2       15    (1,2,6)
2       20    (5)
3       10    (1,2,6)
4       10    (4,7,8)
4       20    (2,6)
4       25    (1,7,8)

Web access sequence database:
Userid  Time  Sequence
1       25    (3,4)(1,2,3)(1,2,6)(1,3,4,6)
2       20    (1,2,6)(5)
3       10    (1,2,6)
4       25    (4,7,8)(2,6)(1,7,8)

Frequent 1-sequences (sequence : support):
(1) : 4    (2) : 4    (4) : 2    (6) : 4

Frequent 2-sequences:
(1,2) : 3    (1,6) : 3    (2,6) : 4    (2)(1) : 2
(4)(1) : 2   (4)(2) : 2   (4)(6) : 2   (6)(1) : 2

Frequent 3-sequences:
(1,2,6) : 3    (2,6)(1) : 2    (4)(2,6) : 2    (4)(2)(1) : 2    (4)(6)(1) : 2

Frequent 4-sequences:
(4)(2,6)(1) : 2

Lattice of 0.5-frequent sequences (from the empty sequence upwards):
{ }
(1)  (2)  (4)  (6)
(1,2)  (1,6)  (2,6)  (2)(1)  (4)(1)  (4)(2)  (4)(6)  (6)(1)
(1,2,6)  (2,6)(1)  (4)(2,6)  (4)(2)(1)  (4)(6)(1)
(4)(2,6)(1)
It is also clear from the generated lattice of 0.5-frequent sequences that there exist two maximal 0.5-frequent sequences, namely (1,2,6) and (4)(2,6)(1).
4.3 Related Work
The sequential pattern mining problem was first introduced by Agrawal and Srikant in [AS94], in which three algorithms were presented. The algorithm AprioriAll was the only one that found all ξ-frequent patterns and was shown to perform better than or equal to the other two algorithms. The same authors in a later work [SA96] presented the GSP algorithm, which outperforms AprioriAll by up to 20 times.
At nearly the same time, Mannila et al. in [MTV95] presented the problem of finding ξ-frequent episodes in a single long sequence of events. An episode is defined as a set of events occurring in a partially defined order and within a given time bound. In a later work [MT96], they generalized their approach to allow arbitrary unary conditions on individual event attributes, or binary conditions on pairs of event attributes. Their experiments were performed on a Web server-level log file. In contrast to their work, we find ξ-frequent patterns across many sequences of events, and we are interested in finding all of them without any imposed constraints.
Oates and Cohen in [OC96] introduced the problem of detecting strong dependencies among multiple streams of data. Their measure of dependency strength is based on the statistical measure of non-independence. As opposed to our work, the MSDD algorithm proposed in their paper detects unexpectedly frequent or infrequent patterns, and it generates rules rather than ξ-frequent sequences. However, their rule enumeration method is similar in nature to our frequent pattern growth method in PrefixSpan.
4.3.1 GSP Algorithm
Before we proceed, we need to review the GSP algorithm [SA96] in more detail, as it forms the basis of our comparison algorithm. The algorithm is based on the a priori heuristic proposed for association mining [AIS93], which states that no super-pattern of an infrequent ξ-pattern can be ξ-frequent. The two key features of the GSP algorithm are candidate generation followed by a complete pass over the database for support counting. The GSP algorithm is shown in Figure 4.2.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F1 = {frequent 1-sequences}
    F = F1                                  // set of all frequent sequences
    for (k = 2; F(k-1) ≠ ø; k++) do
        Ck = set of candidate k-sequences generated from F(k-1)
        for all sequences s in database S do
            increment the count of every unique α ∈ Ck such that α ⊑ s
        Fk = {α ∈ Ck | α.sup ≥ min_sup}
        F = F ∪ Fk
    return F

Figure 4.2: The GSP algorithm
The first step is to compute the support of each item (the ξ-frequent 1-sequences) in the database. This set is used as the initial seed set for the second step, in which the set of candidate 2-sequences is built. A candidate 2-sequence can consist of a single element with two ξ-frequent items, or of two elements, each having only one ξ-frequent item, such as (1,1) and (1)(1). Therefore, if 10 ξ-frequent 1-sequences were found in the first pass, GSP generates 10 × 10 + 10 × 10 = 200 candidate 2-sequences for Web traversal patterns. Another pass is made over the database to find the actual support of each candidate 2-sequence and prune the non-ξ-frequent 2-sequences. From this point on, any step k uses the ξ-frequent (k-1)-sequences as its seed set and performs the following two phases:
• Candidate generation: for any given pair of ξ-frequent (k-1)-sequences (s, s'), where discarding the first item of s and the last item of s' results in identical sequences, create a new candidate k-sequence by appending the last item of s' to s. All of the k-2 remaining (k-1)-subsequences of the newly created candidate k-sequence must also be ξ-frequent; hence, this step uses the a priori heuristic. The new item is added as a new element if it was a separate element in s'; otherwise it is added to the last element of s. For example, given the three ξ-frequent 2-sequences (1,2), (2,3), and (1,3), the first and the second sequence can be matched by dropping item 1 from the first and item 3 from the second, respectively. Therefore, we can create the candidate sequence (1,2,3). Note that item 3 is added to the same element because it was within the same element in the second sequence (2,3). (A small sketch of both phases is given after this list.)
• Support counting: given the candidate k-sequence set Ck, scan the database once and obtain the actual support of each candidate k-sequence. By discarding the sequences that do not satisfy the minimum required support, the ξ-frequent k-sequence set Fk is formed.
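As a concrete illustration of the two phases, here is a minimal Python sketch (ours; the representation and helper names are our own, not those of [SA96]) of the candidate join for k ≥ 3 and of the subsequence test used during support counting. A sequence is represented as a tuple of element tuples, e.g. ((4,), (2, 6), (1,)).

    def drop_first(seq):
        """Remove the first item of the first element; drop the element if it empties."""
        head = seq[0][1:]
        return ((head,) + seq[1:]) if head else seq[1:]

    def drop_last(seq):
        """Remove the last item of the last element; drop the element if it empties."""
        tail = seq[-1][:-1]
        return (seq[:-1] + (tail,)) if tail else seq[:-1]

    def join_candidates(freq_prev):
        """Join frequent (k-1)-sequences (k >= 3) into candidate k-sequences."""
        cands = set()
        for s in freq_prev:
            for s2 in freq_prev:
                if drop_first(s) != drop_last(s2):
                    continue
                last_item = s2[-1][-1]
                if len(s2[-1]) == 1:                     # item was a separate element in s'
                    cands.add(s + ((last_item,),))
                else:                                    # item shared an element in s'
                    cands.add(s[:-1] + (s[-1] + (last_item,),))
        return cands

    def contains(big, small):
        """True if `small` is an ordered sub-tuple of the element `big`."""
        i = 0
        for x in big:
            if i < len(small) and x == small[i]:
                i += 1
        return i == len(small)

    def is_subsequence(cand, seq):
        """Support-counting test: cand is a subsequence of the database sequence seq."""
        pos = 0
        for elem in cand:
            while pos < len(seq) and not contains(seq[pos], elem):
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True

    # The example from the text: (1,2) and (2,3) join into the candidate (1,2,3).
    print(join_candidates({((1, 2),), ((2, 3),), ((1, 3),)}))   # {((1, 2, 3),)}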
The process continues until no more candidate sequences can be formed. To access candidate sequences efficiently during the support counting phase, the candidate sequences are stored in a hash-tree structure. The algorithm is optimized during the first and second iterations by using array and matrix structures, respectively, to directly access candidate sequences. There are inherent drawbacks to this approach that are independent of the implementation techniques used; they relate to the size of the search space, the number of database scans, and the maintenance of internal data structures.
Lemma 4.3.1: Let n denote the total number of ξ-frequent items. Then the total number of possible k-sequences is 2^(k-1) × n^k.
Proof: First, we prove that a set S of n elements has precisely 2^n subsets, where the empty set and S itself are counted as subsets. The difference between the number of k-element subsets of a set S with n elements and of a set S' with n+1 elements is

    C(n+1, k) − C(n, k) = (n+1)! / (k!(n+1−k)!) − n! / (k!(n−k)!)
                        = n!·((n+1) − (n+1−k)) / (k!(n+1−k)!)
                        = n!·k / (k!(n+1−k)!) = C(n, k−1).

Therefore, the difference between the total number of subsets of S' and of S is

    Σ_{i=0}^{n+1} C(n+1, i) − Σ_{i=0}^{n} C(n, i) = Σ_{i=1}^{n+1} C(n, i−1) = Σ_{i=0}^{n} C(n, i),

which is exactly the total number of subsets of S. Hence the total number of subsets doubles in going from a set S with n elements to a set S' with n+1 elements. Since the total numbers of subsets of sets with 0 and 1 elements are Σ_{i=0}^{0} C(0, i) = 1 = 2^0 and Σ_{i=0}^{1} C(1, i) = 2 = 2^1, respectively, the formula 2^n is true for sets of 0 and 1 elements. Assuming the formula is true for sets of 2, 3, ..., k elements, the proven doubling effect gives the number of subsets of a set with k+1 elements as 2 × Σ_{i=0}^{k} C(k, i) = 2 × 2^k = 2^(k+1).

Second, we count the number of ways in which a k-sequence can be constructed, and then assign items to each arrangement. The number of ways a k-sequence can be constructed is given by the number of ways it can be partitioned into different numbers of visits. A k-sequence has k−1 potential partition points, giving C(k−1, 0), C(k−1, 1), C(k−1, 2), ..., C(k−1, k−1) ways to partition it into 1, 2, 3, ..., k visits, respectively. For example, a 3-sequence has 2 partition points and four possible partitions: (x,x,x), (x)(x,x), (x,x)(x), and (x)(x)(x). Hence, the total number of ways a k-sequence can be partitioned equals the total number of subsets of a set with k−1 elements, which was shown above to be 2^(k−1). We now assign items to each position of each k-sequence. Since repetition is allowed within a visit, we have n choices for each position, or n^k choices per k-sequence. Hence the total number of k-sequences over n frequent items is 2^(k−1) × n^k. ∎
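As a sanity check on this counting argument, the following short Python sketch (ours, for illustration only) enumerates all k-sequences over n items by brute force and compares the result with 2^(k-1) × n^k for small n and k.

    from itertools import product, combinations

    def enumerate_k_sequences(n, k):
        """Enumerate every k-sequence over items {0,...,n-1}: assign an item to each
        of the k positions (repetition allowed), then choose any subset of the k-1
        partition points to split the positions into visits."""
        sequences = set()
        for items in product(range(n), repeat=k):
            for r in range(k):                           # number of partition points used
                for cuts in combinations(range(1, k), r):
                    bounds = (0,) + cuts + (k,)
                    seq = tuple(items[bounds[i]:bounds[i + 1]]
                                for i in range(len(bounds) - 1))
                    sequences.add(seq)
        return sequences

    for n in (2, 3):
        for k in (1, 2, 3, 4):
            assert len(enumerate_k_sequences(n, k)) == 2 ** (k - 1) * n ** k
    print("enumeration agrees with 2^(k-1) * n^k for the cases tested")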
First, by applying the a priori heuristic where possible, we significantly reduce the search space, although we only truly reap its benefits when k > 2. Still, when the number of ξ-frequent 1-sequences is large, say 1000, an a priori-based method generates a very large set of candidate 2-sequences (2,000,000), which requires a significant amount of resources. Moreover, when the ξ-frequent sequences form a very dense lattice, as in DNA databases, the number of candidate k-sequences still grows exponentially despite the benefits of the a priori heuristic.

Second, in each pass k, all k-subsequences of each database sequence are searched for in the candidate set Ck. Therefore, each database pass becomes significantly costlier than the previous one. This becomes even more evident with long database sequences.

Third, maintaining an efficient data structure can be beneficial in both the candidate generation and support counting phases. This is the approach taken by the PSP algorithm [MCP98].
4.3.2 PSP Algorithm
The PSP algorithm still follows the approach of candidate generation followed by support counting. However, the authors use a prefix-tree instead of a hash-tree as the internal data structure. Using our running example in Figure 4.1, Figure 4.3 depicts the internal data structures of GSP and PSP after the candidate 3-sequences have been generated.
Figure 4.3: PSP prefix-tree vs. GSP hash-tree structures
[Figure: the PSP candidate prefix-tree, with intra-item and inter-item edges from the root, and the GSP candidate hash-tree, with hash function h(1)=h(2)=0, h(4)=h(6)=1 and candidates stored in the leaves; both hold the candidate 3-sequences (1,2,6), (4)(1,2), (4)(1,6), (4)(2,6), (4)(2)(1), (2,6)(1), and (4)(6)(1).]
During the support counting phase in GSP, for every k-subsequence of each database sequence, the hash-tree is navigated until a leaf node storing several candidates is reached, and each candidate sequence stored there is then examined for a match. In contrast, using the prefix-tree in PSP, the search for a k-subsequence is terminated as soon as, for some j < k, its j-item prefix is not present in the prefix-tree. Once the leaf node is reached, the only operation to perform is to increment its support value. Other benefits of the prefix-tree are that it requires less storage space and improves efficiency during the candidate generation phase. This approach proves to be more efficient than GSP. For more detail, please refer to [MCP98].
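For illustration, a minimal Python sketch of such a candidate prefix-tree is given below (our own simplification, not the [MCP98] implementation); each edge is keyed by an item together with a flag marking it as an intra-element or inter-element extension, and candidate leaves carry support counters.

    class PrefixTreeNode:
        """Node of a candidate prefix-tree; children are keyed by (item, is_intra)."""
        def __init__(self):
            self.children = {}      # (item, is_intra) -> PrefixTreeNode
            self.support = 0        # incremented at the leaf when a candidate matches

        def insert(self, candidate):
            """Insert a candidate such as ((4,), (2, 6), (1,)); return its leaf node."""
            node = self
            for elem in candidate:
                for offset, item in enumerate(elem):
                    key = (item, offset > 0)       # True = same visit as the previous item
                    node = node.children.setdefault(key, PrefixTreeNode())
            return node

    tree = PrefixTreeNode()
    for cand in [((1, 2, 6),), ((4,), (2, 6)), ((4,), (2,), (1,))]:
        tree.insert(cand)
    # During support counting, a database sequence walks the tree and stops as soon
    # as no child edge matches the next item, then increments the support counter of
    # the leaf it reaches.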
4.3.3 Memory Management
Note that during the candidate generation phase the internal data structures are traversed in a depth-first fashion, whereas during the support counting phase a depth-first search may terminate early if the next item does not exist in the candidate tree. Parthasarathy et al. [PZL98] have shown that memory placement policies that take the internal data structures and their usage into account can enhance performance significantly. The Simple Placement Policy (SPP) of allocating memory from the same region shows an improvement of 35-55% over an algorithm that uses the system's malloc function directly. In addition to SPP, depth-first ordering of the allocated memory adds an overhead of less than 2% of the running time. The gain in locality in small databases with short ξ-frequent sequences is not sufficient to overcome this overhead; but as the database sequences and their ξ-frequent sequences become longer, the payoff quickly adds up in each database pass. Since our experiments use data sets with long sequences and support thresholds that also generate long ξ-frequent sequences, we allocate the internal data structures in a depth-first manner. This forms the basis of our comparison algorithm, the improved PSP algorithm using a depth-first memory placement policy, which we will refer to as PSP+.
4.3.4 SPADE Algorithm
In this section we explain the SPADE algorithm, with major emphasis on lattice decomposition and the vertical database format. For more detail, we refer the reader to [Z98].
All the algorithms explained so far work on a horizontal database format, as depicted in Figure 4.1. In the horizontal format the database contains a list of customers (cid), each with its own list of item sets or elements (tid). The SPADE algorithm, on the other hand, works on a vertical database format, where each k-sequence is associated with an id-list. An id-list contains the customer id (cid) and element id (tid) pairs of all sequences that support its associated k-sequence. Figure 4.4 shows a section of the 0.5-frequent sequence lattice of our running example and its associated id-lists. The support of any k-sequence is determined by intersecting the id-lists of two of its generating (k-1)-sequences that share the same suffix. The number of unique customer ids in the resulting id-list determines the support of the associated k-sequence. Note that the intersection algorithm assumes all items in an element have occurred at the same time.
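As an illustration of the id-list operations just described, the following small Python sketch (ours, simplified; the actual SPADE join distinguishes more cases) intersects the id-lists of (4)(1) and (6)(1) from Figure 4.4 to obtain the id-list and support of (4)(6)(1). An id-list is represented as a list of (cid, tid) pairs, with tid taken from the first element of the occurrence, as in the suffix-class join of the figure.

    def temporal_intersect(idlist_a, idlist_b):
        """Join the id-lists of two (k-1)-sequences sharing the same suffix, e.g.
        (4)(1) and (6)(1) -> (4)(6)(1): keep a (cid, tid) of the first list whenever
        the same cid has a strictly later tid in the second list."""
        result = []
        for cid_a, tid_a in idlist_a:
            for cid_b, tid_b in idlist_b:
                if cid_a == cid_b and tid_a < tid_b:
                    result.append((cid_a, tid_a))
                    break
        return result

    def support(idlist):
        """Support = number of distinct customer ids in the id-list."""
        return len({cid for cid, _ in idlist})

    idlist_41 = [(1, 10), (4, 10)]       # id-list of (4)(1) from Figure 4.4
    idlist_61 = [(1, 20), (4, 20)]       # id-list of (6)(1) from Figure 4.4
    idlist_461 = temporal_intersect(idlist_41, idlist_61)
    print(idlist_461, support(idlist_461))   # [(1, 10), (4, 10)] 2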
An a priori-based algorithm traverses the ξ-frequent sequence lattice using a breadth-first search, where all ξ-frequent (k-1)-sequences are generated before the ξ-frequent k-sequences. However, a depth-first search is also possible.
The ξ-frequent sequence lattice is decomposed into equivalence classes such that each equivalence class can be processed independently. Let [α] denote the equivalence class of the ξ-frequent k-sequence α, such that applying the candidate generation rules described for GSP to its members can generate all ξ-frequent sequences with the prefix α. In fact, both the prefix [Z98] and the suffix [Z00] can be used as the equivalence class relation. Figure 4.4 shows the equivalence classes [(1)], [(4)], and [(6)] using the suffix relation. Note that all ξ-frequent 1-sequences belong to the equivalence class [ø], which has an empty suffix.
Figure 4.4: SPADE Id-list intersection

  Id-lists (cid, tid):
    (1):  (1,15) (1,20) (1,25) (2,15) (3,10) (4,25)
    (4):  (1,10) (1,25) (4,10)
    (6):  (1,20) (1,25) (2,15) (3,10) (4,20)
    (4)(1):  (1,10) (4,10)
    Intersect (6) and (1)        ->  (6)(1):     (1,20) (4,20)
    Intersect (4)(1) and (6)(1)  ->  (4)(6)(1):  (1,10) (4,10)

  Lattice section, grouped into the suffix classes [(1)], [(4)], [(6)]:
    { }
    (1)   (4)   (6)
    (4)(1)   (4)(6)   (6)(1)
    (4)(6)(1)
This process of decomposition can also be applied recursively. For additional details on splitting the search space, we refer the reader to [A01].
Before processing the root equivalence classes from the initial decomposition, however, the id-lists for that class must be read from disk into memory. Obviously, the size of these id-lists shrinks as the sequence length increases. Even if we had enough memory to hold the id-lists of all ξ-frequent 1-sequences, it may be costlier to compute the ξ-frequent 2-sequences using the vertical database format. The question is, what is the suitable level for decomposition? The author suggests following the GSP algorithm with the horizontal database format until the equivalence class of a ξ-frequent k-sequence fits into memory. In the worst case, when following a depth-first approach, we only need to hold two id-lists, one for each of the two consecutive levels of an equivalence class: once we have generated the id-list of the next level, we no longer need the previous level's id-list. There are some drawbacks to this approach, however. First, since we only use two id-lists to generate the next level's ξ-frequent sequences, we do not fully take advantage of the a priori heuristic and may unnecessarily examine a larger candidate sequence search space. Second, due to the limited available memory, this method requires multiple scans of the database to generate the id-lists for the members of the root equivalence class. Suitable memory management techniques may reduce the unnecessary disk I/O.
4.4 FreeSpan: Pattern Growth via frequent item lattice
In our recent study [HPM+00], we developed a projection-based algorithm called FreeSpan, which uses the frequent item lattice to partition the database. In Section 4.4.1, we introduce the idea of pattern growth via the frequent item lattice with an example. The algorithm FreeSpan-1 is then introduced in Section 4.4.2, and its performance is improved by FreeSpan-2, introduced in Section 4.4.3.
4.4.1 Basic Idea
Definition 4.4.1: Let Φ(s) be defined as a function that returns the set of items in sequence s, called the transaction pattern of sequence s. Let Ψ(α, s) = {β | β ⊑ s & Φ(β) = α} be defined as a function that returns the set of subsequences of s that have transaction pattern α.
Figure 4.5 shows the transaction pattern database derived from our
running example database. The FreeSpan algorithm is based on the
following relationship between these two databases.
Observation 4.4.1: A sequence cannot be ξ-frequent if its transaction pattern is not ξ-frequent. The reverse, however, may not be true, as in creating a transaction pattern we discard the ordering and repetition of items in the sequence; therefore, many sequences may contribute to the support of the same transaction pattern.
Our aim is to search the frequent item lattice and use it to partition the database such that the size of the partitions is minimized and each branch can be processed independently.
Figure 4.5: Transaction database of sequence database

  Userid  Time  Sequence                        Transaction pattern
  1       25    (3,4)(1,2,3)(1,2,6)(1,3,4,6)    {1,2,3,4,6}
  2       20    (1,2,6)(5)                      {1,2,5,6}
  3       10    (1,2,6)                         {1,2,6}
  4       25    (4,7,8)(2,6)(1,7,8)             {1,2,4,6,7,8}
At each node of the lattice we only print the ξ-frequent sequential patterns that have that node's transaction pattern. Figure 4.6 shows one possible frequent item lattice for our running example. The question is which frequent item lattice results in greater efficiency? We strive to minimize the size of each database partition.
Figure 4.6: Frequent item lattice of running example

  { }
  {1}   {2}   {6}   {4}
  {1,2}   {1,6}   {2,6}   {1,4}   {2,4}   {4,6}
  {1,2,6}   {1,2,4}   {1,4,6}   {2,4,6}
  {1,2,4,6}
4.4.2 Level by Level FreeSpan Algorithm
In this section we explain the FreeSpan-1 algorithm that traverses the
frequent item lattice level by level, depth-first.
Definition 4.4.2: Let f-list be defined as the list of ξ-frequent items (ξ-frequent 1-sequences) of database S in descending support order. Let Γ1(λ, j, f-list, s) be defined as a function that returns the projected subsequence α ⊑ s such that its transaction pattern Φ(α) contains the items of λ and j, and only items in f-list that have support greater than that of item j. Let the {λ:j}-projected database of S be defined as the set of projected sequences of S with respect to the f-list of S, obtained using the function Γ1(λ, j, f-list, s). We will refer to the {λ:j}-projected database of S as S|ρ, where ρ = λ ∪ {j}, when the distinction between λ and j is not necessary.
In our example database, the f-list equals (1, 2, 6, 4); for example, projecting the first sequence (3,4)(1,2,3)(1,2,6)(1,3,4,6) while retaining only the items 1, 2, and 4 yields the projected subsequence (4)(1,2)(1,2)(1,4). According to this f-list, the complete set of sequential patterns in the database can be partitioned among 4 independent projected databases: the {:1}-projected, {:2}-projected, {:6}-projected, and {:4}-projected databases, as shown in Figure 4.7.
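The projection operation itself is easy to express; the following Python sketch (ours, for illustration) keeps only the allowed items of a sequence and drops visits that become empty.

    def project(sequence, keep_items):
        """Project a Web access sequence (list of visit tuples) onto a set of items:
        drop all other items and any visit that becomes empty."""
        projected = []
        for visit in sequence:
            kept = tuple(item for item in visit if item in keep_items)
            if kept:
                projected.append(kept)
        return projected

    f_list = [1, 2, 6, 4]                              # descending support order
    s1 = [(3, 4), (1, 2, 3), (1, 2, 6), (1, 3, 4, 6)]  # first sequence of Figure 4.1

    # The {:4}-projected database keeps item 4 and every f-list item more frequent than 4.
    keep = set(f_list[:f_list.index(4) + 1])
    print(project(s1, keep))    # [(4,), (1, 2), (1, 2, 6), (1, 4, 6)]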
Figure 4.7: Section of projection databases in FreeSpan-1

  Original DB (f-list = (1, 2, 6, 4)):
    (3,4)(1,2,3)(1,2,6)(1,3,4,6)   (1,2,6)(5)   (1,2,6)   (4,7,8)(2,6)(1,7,8)

  {:1}-Proj DB:    (1)(1)(1), (1), (1), (1)                           f-list = { }      output <(1)>
  {:2}-Proj DB:    (1,2)(1,2)(1), (1,2), (1,2), (2)(1)                f-list = {1}      output <(2)>
  {:6}-Proj DB:    (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = {1,2}    output <(6)>
  {:4}-Proj DB:    (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)                f-list = {1,2,6}  output <(4)>
  {2:1}-Proj DB:   (1,2)(1,2)(1), (1,2), (1,2), (2)(1)                f-list = { }      output <(1,2)>, <(2)(1)>
  {6:1}-Proj DB:   (1)(1,6)(1,6), (1,6), (1,6), (6)(1)                f-list = { }      output <(1,6)>, <(6)(1)>
  {6:2}-Proj DB:   (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = {1}      output <(2,6)>
  {2,6:1}-Proj DB: (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = { }      output <(1,2,6)>, <(2,6)(1)>
  {4:1}-Proj DB:   (4)(1)(1)(1,4), (4)(1)                             f-list = { }      output <(4)(1)>
  {4:2}-Proj DB:   (4)(1,2)(1,2)(1,4), (4)(2)(1)                      f-list = {1}      output <(4)(2)>
  {2,4:1}-Proj DB: (4)(1,2)(1,2)(1,4), (4)(2)(1)                      f-list = { }      output <(4)(2)(1)>
Even though, on average, the sequences in the {:4}-projected database can be longer than the sequences in its sibling projected databases (their transaction patterns contain more items), it has fewer sequences, whereas the {:1}-projected database has more sequences but each sequence is potentially shorter. Also note that, although infrequent items are not present in the projected databases, the total space requirement of all projected databases may still be larger than the original database. As can be seen in Figure 4.7, it is possible for a projected database not to shrink; for example, the {:6}-projected, {6:2}-projected, and {2,6:1}-projected databases all have the same size.
The mining process, as shown in Figure 4.8, consists of four main steps performed recursively on each database S|ρ. Note that when ρ = { }, the projection database S|ρ represents the original database.

• Scan the projection database S|ρ once and find its ξ-frequent items that are not in ρ. At the same time, count the support of all sequences with transaction pattern ρ.
• Print all ξ-frequent sequences with transaction pattern ρ.
• If any ξ-frequent sequence was printed and the f-list is not empty, then for each item j in the f-list create a {ρ:j}-projected database. Scan the projection database S|ρ a second time and populate each {ρ:j}-projected database.
• Recursively mine each newly created {ρ:j}-projected database.

Since the mining task performed at each node of the ξ-frequent item lattice is confined to a smaller database that potentially becomes smaller with each recursion, FreeSpan-1 is more efficient than PSP+.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call FreeSpan-1({ø}, 0, S)
    return F

// ρ : projection pattern of projection database S|ρ
// l : length of the projection pattern ρ
// S|ρ : projection database

Subroutine FreeSpan-1(ρ, l, S|ρ)
    for all sequences s in projected database S|ρ do
        increment the count of each sequence α ∈ Ψ(ρ, s) in candidate set Cρ
        increment the count of each item j | j ∈ Φ(s) & j ∉ ρ in candidate set C1
    Fρ = {α ∈ Cρ | α.sup ≥ min_sup}
    F = F ∪ Fρ
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    if Fρ = ø or F1 = ø then return
    f-list = F1 in descending support order
    for all sequences s in projected database S|ρ do
        for each item j | j ∈ Φ(s) & j ∈ f-list do
            add Γ1(ρ, j, f-list, s) to the {ρ:j}-projected database
    for each i ∈ F1 do
        Call FreeSpan-1(ρ ∪ {i}, l+1, {ρ:i}-projected database)
    return

Figure 4.8: The FreeSpan-1 mining algorithm
The major costs of FreeSpan-1 lie in two main tasks performed in each projection database. First, in the function Ψ(ρ, s), it is costly to find all subsequences that match a transaction pattern, and the candidate set Cρ can become very large; that is why the task of finding ξ-frequent sequences has been divided and delegated among the child projection databases. Second, in terms of storage and I/O costs, it is expensive to create each projection database.

In each recursion of FreeSpan-1, the length of the projection pattern increases by 1. To delay projection, an alternative-level projection method is developed in which, in each recursion, the projection pattern grows by 2.
4.4.3 Alternative-Level FreeSpan Algorithm
In this section we explain the FreeSpan-2 algorithm, which traverses the frequent item lattice two levels at a time in each recursion, depth-first.

Similar to FreeSpan-1, after the first scan of the database we construct an f-list; in our example database, the f-list equals (1, 2, 6, 4). Then, we construct a frequent item matrix F to count the occurrence frequency of each pair of items in the f-list, as follows. For f-list (i_1, i_2, ..., i_n), F is a triangular matrix F[j,k], where 1 < j ≤ n and 1 ≤ k < j. F[j,k] contains one counter for the number of occurrences of items i_j and i_k together, i.e. the number of sequences containing any of the subsequences (i_j)(i_k), (i_k)(i_j), (i_j, i_k), or (i_k, i_j).
In our example database, the f-list contains 4 items, so we construct a 4×4 triangular frequent item matrix, where each counter is initialized to 0. We then scan the database a second time to fill the matrix, as follows. For example, using only the 0.5-frequent items of the first sequence, (4)(1,2)(1,2,6)(1,4,6), we increase the F[4,1] and F[4,3] counters by 1, since the subsequences (4)(1) and (6)(4), respectively, occur in it. Figure 4.9 shows the resulting frequent item matrix after the second pass.
Figure 4.9: Frequent item matrix of running example after second pass

        1   2   6
  2  |  4
  6  |  4   4
  4  |  2   2   2
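A minimal Python sketch of this second scan (ours, for illustration; it counts co-occurrence at the sequence level, as described above) builds the triangular matrix for the running example:

    from collections import defaultdict

    f_list = [1, 2, 6, 4]       # frequent items in descending support order
    sequences = [
        [(3, 4), (1, 2, 3), (1, 2, 6), (1, 3, 4, 6)],
        [(1, 2, 6), (5,)],
        [(1, 2, 6)],
        [(4, 7, 8), (2, 6), (1, 7, 8)],
    ]

    F = defaultdict(int)        # triangular matrix: F[(j, k)] with j after k in f_list
    for seq in sequences:
        items = {i for visit in seq for i in visit if i in f_list}
        for j_pos, j in enumerate(f_list):
            for k in f_list[:j_pos]:
                if j in items and k in items:
                    F[(j, k)] += 1

    print(F[(4, 1)], F[(4, 6)], F[(2, 1)])   # 2 2 4, matching Figure 4.9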
Definition 4.4.3: Let Γ2(λ, j, k, f-list, s), where j ≠ k and Order_f-list(j) > Order_f-list(k), be defined as a function that returns the projected subsequence α ⊑ s such that its transaction pattern Φ(α) contains the items of λ, j, and k, and only items in f-list that have support greater than that of item j. Let the {λ:j,k}-projected database of S be defined as the set of projected sequences of S with respect to the f-list of S, obtained using the function Γ2(λ, j, k, f-list, s). We will refer to the {λ:j,k}-projected database of S as S|ρ, where ρ = λ ∪ {j, k}, when the distinction between λ and {j, k} is not necessary.
Figure 4.10: Projection databases in FreeSpan-2

  Original DB:  (3,4)(1,2,3)(1,2,6)(1,3,4,6), (1,2,6)(5), (1,2,6), (4,7,8)(2,6)(1,7,8)
                f-list = (1,2,6,4)                                  output <(1)>, <(2)>, <(6)>, <(4)>

  {:1,2}-Proj DB:   (1,2)(1,2)(1), (1,2), (1,2), (2)(1)             f-list = { }    output <(1,2)>, <(2)(1)>
  {:1,6}-Proj DB:   (1)(1,6)(1,6), (1,6), (1,6), (6)(1)             f-list = { }    output <(1,6)>, <(6)(1)>
  {:2,6}-Proj DB:   (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)   f-list = {1}    output <(2,6)>, <(1,2,6)>, <(2,6)(1)>
  {:1,4}-Proj DB:   (4)(1)(1)(1,4), (4)(1)                          f-list = { }    output <(4)(1)>
  {:2,4}-Proj DB:   (4)(1,2)(1,2)(1,4), (4)(2)(1)                   f-list = {1}    output <(4)(2)>, <(4)(2)(1)>
  {:6,4}-Proj DB:   (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)             f-list = {1,2}  output <(4)(6)>, <(4)(2,6)>, <(4)(6)(1)>
  {6,4:1,2}-Proj DB: (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)            f-list = { }    output <(4)(2,6)(1)>
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call FreeSpan-2({ø}, 0, S)
    return F

// ρ : projection pattern of projection database S|ρ
// l : length of the projection pattern ρ
// S|ρ : projection database

Subroutine FreeSpan-2(ρ, l, S|ρ)
    for all sequences s in projected database S|ρ do
        increment the count of each sequence α ∈ Ψ(ρ, s) in candidate set Cρ
        increment the count of each item j | j ∈ Φ(s) & j ∉ ρ in candidate set C1
    Fρ = {α ∈ Cρ | α.sup ≥ min_sup}
    F = F ∪ Fρ
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    if Fρ = ø or F1 = ø then return
    f-list = F1 in descending support order
    for all sequences s in projected database S|ρ do
        fill up the frequent item matrix
        for each item j | j ∈ Φ(s) & j ∈ f-list do
            increment the count of each sequence α ∈ Ψ(ρ ∪ {j}, s) in candidate set Cρ+1
    Fρ+1 = {α ∈ Cρ+1 | α.sup ≥ min_sup}
    if Fρ+1 = ø then return
    F = F ∪ Fρ+1
    for all sequences s in projected database S|ρ do
        for each item pair j, k | j, k ∈ Φ(s) & j, k ∈ f-list do
            add Γ2(ρ, j, k, f-list, s) to the {ρ:j,k}-projected database
    for each frequent pair {j, k} in the frequent item matrix do
        Call FreeSpan-2(ρ ∪ {j, k}, l+2, {ρ:j,k}-projected database)
    return

Figure 4.11: The alternative-level FreeSpan mining algorithm
The frequent item matrix is used to generate a set of projected databases. Using the frequent item matrix in Figure 4.9, the database can be partitioned into 6 independent projected databases: the {:1,2}-projected, {:1,6}-projected, {:2,6}-projected, {:1,4}-projected, {:2,4}-projected, and {:6,4}-projected databases, as shown in Figure 4.10.

The FreeSpan-2 algorithm is presented in Figure 4.11. Each recursion costs 3 database scans. The main steps performed on each database S|ρ are outlined below. Note that when ρ = { }, the projection database S|ρ represents the original database.
• Scan the projection database S|ρ once and find its ξ-frequent items that are not in ρ. At the same time, count the support of all sequences with transaction pattern ρ.
• Print all ξ-frequent sequences with transaction pattern ρ.
• If any ξ-frequent sequence was printed and the f-list is not empty, then scan the database a second time and fill up the frequent item matrix. At the same time, count the support of all sequences with transaction pattern ρ ∪ {i}, for each item i in the f-list.
• Print all ξ-frequent sequences with transaction pattern ρ ∪ {i}.
• If any ξ-frequent sequence was printed and the frequent item matrix contains ξ-frequent item sets of length 2, then for each ξ-frequent pair (j, k) with Order_f-list(j) > Order_f-list(k) in the matrix, create a {ρ:j,k}-projected database. Scan the projection database S|ρ a third time and populate each {ρ:j,k}-projected database.
• Recursively mine each newly created {ρ:j,k}-projected database.
The degree to which a projected database shrinks with respect to its immediate parent depends on the selectivity of the projection pattern and on the number of infrequent patterns present. Unfortunately, this selectivity factor is database dependent. In the next section we present an algorithm that displays a greater degree of selectivity per recursion.
4.5 PrefixSpan: Pattern Growth via frequent sequence lattice
In this section, we introduce a new projection-based method for mining sequential patterns, called PrefixSpan [HPM+01], which uses the frequent sequence lattice to partition the database. Section 4.5.1 introduces the theory behind the algorithm. The level-by-level PrefixSpan algorithm is then presented in Section 4.5.2. To improve its efficiency, a pseudo projection technique is proposed in Section 4.5.3. Finally, Section 4.5.4 addresses issues concerning its scalability.
4.5.1 Basic Idea
Definition 4.5.1: Following Definition 4.2.4, let access sequence S' = v'_1, v'_2, ..., v'_m be a prefix of access sequence S = v_1, v_2, ..., v_n (m ≤ n), such that v'_i = v_i for 1 ≤ i < m and, for v'_m = e'_1, e'_2, ..., e'_j and v_m = e_1, e_2, ..., e_k (j ≤ k), e'_l = e_l for 1 ≤ l ≤ j. We call the access sequence S'' = v''_m, v_{m+1}, ..., v_n a suffix of S with respect to S', where v''_m = e_{j+1}, e_{j+2}, ..., e_k. Let this relation be denoted as S'' = S / S' or S = S' · S''. If the element v''_m is empty, we call the suffix sequence an inter-suffix sequence; otherwise it is an intra-suffix sequence, written with its first item underlined (rendered here with a leading underscore, e.g. _S''). Items within v''_m are called the intra-extension items of the prefix S', while items within v_i (m+1 ≤ i ≤ n) are called inter-extension items of the prefix S'. To distinguish the element v''_m from other elements, we underline its first item, as in (_e_{j+1}, e_{j+2}, ..., e_k).
For example, given the sequence S = (1)(2,3)(4)(5), the sequences (2,3)(4)(5), (_3)(4)(5), and (4)(5) are the suffix sequences of S with respect to the prefix sequences (1), (1)(2), and (1)(2,3), respectively. Note that the sequence (1) cannot be extended with any intra-item, while the sequence (1)(2) can only be extended with the intra-item 3.
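The following small Python sketch (ours, simplified) computes the suffix of an access sequence with respect to a prefix according to Definition 4.5.1, returning None when the prefix does not match; the underlining of an intra-suffix is represented by a boolean flag.

    def suffix(sequence, prefix):
        """Return (is_intra, suffix_elements) of `sequence` w.r.t. `prefix`
        (Definition 4.5.1), or None if `prefix` is not a prefix of `sequence`."""
        m = len(prefix)
        if m == 0 or m > len(sequence):
            return None
        for i in range(m - 1):                      # all but the last element must match exactly
            if prefix[i] != sequence[i]:
                return None
        last_p, last_s = prefix[-1], sequence[m - 1]
        if last_s[:len(last_p)] != last_p:          # the last prefix element must start the element
            return None
        rest = last_s[len(last_p):]                 # intra-extension items, possibly empty
        tail = list(sequence[m:])
        if rest:
            return True, [rest] + tail              # intra-suffix: first element is underlined
        return False, tail                          # inter-suffix

    S = [(1,), (2, 3), (4,), (5,)]
    print(suffix(S, [(1,), (2,)]))   # (True, [(3,), (4,), (5,)])
    print(suffix(S, [(1,)]))         # (False, [(2, 3), (4,), (5,)])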
Lemma 4.5.1: All prefix sequences of a ξ-frequent sequential pattern are ξ-frequent.

This is obvious from the a priori heuristic. Let us turn our attention to the running example in Figure 4.1. Based on the above lemma, all ξ-frequent sequences can be enumerated by traversing the frequent sequence lattice and, for each ξ-frequent k-sequence (k > 1), only extending its prefix (k-1)-sequence.
Figure 4.12: Prefix-based traversal of frequent sequence lattice

  { }
  (1)   (2)   (4)   (6)
  (1,2)   (1,6)   (2,6)   (2)(1)   (4)(1)   (4)(2)   (4)(6)   (6)(1)
  (1,2,6)   (2,6)(1)   (4)(2,6)   (4)(2)(1)   (4)(6)(1)
  (4)(2,6)(1)
As mentioned in Section 4.3.4, a suffix-based traversal of the frequent sequence lattice is also possible. Figure 4.12 shows the prefix-based traversal of our running example's frequent sequence lattice.
What remains now is to partition the database such that, at each node of Figure 4.12, only the relevant data is processed to find out whether a ξ-frequent suffix sequential pattern exists.
Definition 4.5.2: Let access sequence S' = v'_1, v'_2, ..., v'_m be a prefix-match of access sequence S = v_1, v_2, ..., v_n (m ≤ n), such that there exist integers 1 ≤ i_1 ≤ i_2 ≤ ... ≤ i_m ≤ n for which v'_1 ⊑ v_{i_1}, v'_2 ⊑ v_{i_2}, ..., v'_{m-1} ⊑ v_{i_{m-1}} and, for v'_m = e'_1, e'_2, ..., e'_j and v_{i_m} = e_1, e_2, ..., e_k (j ≤ k), there exist integers 1 ≤ l_1 < l_2 < ... < l_j ≤ k such that e'_1 = e_{l_1}, e'_2 = e_{l_2}, ..., e'_j = e_{l_j}. We call the access sequence S'' = v''_m, v_{i_m+1}, ..., v_n a suffix-match of S with respect to S', where v''_m = e_{l_j+1}, ..., e_k. Let this relation be denoted as S' ⋈ S''. If the element v''_m is not empty, the suffix-match sequence is denoted as _S''. Let the prefix-projection of a sequence s with respect to a prefix σ be defined by the function Π(s, σ) as the set of suffix-match sequences of s with respect to σ.
Note that a prefix-projection set contains all intra-suffix-match sequences and only those inter-suffix-match sequences that are not subsequences of the rest. For example, given the sequence S = (3,4)(1,2,3)(1,2,6)(1,3,4,6), its prefix-projection sequences with respect to the prefix sequences (3), (3)(1), and (3)(1)(1) are the sequence sets {(4)(1,2,3)(1,2,6)(1,3,4,6), (4,6)}, {(2,3)(1,2,6)(1,3,4,6), (2,6)(1,3,4,6), (3,4,6)}, and {(2,6)(1,3,4,6), (3,4,6)}, respectively. For the sake of abbreviation, from here on we overlap all suffix-match sequences in a prefix-projection set, as follows: the starting elements of all intra-suffix-matches that are subsequences of another are indicated by an underline in the maximal super-sequence of the set. Therefore, the previous example's prefix-projections are (4)(1,2,3)(1,2,6)(1,3,4,6), (2,3)(1,2,6)(1,3,4,6), and (2,6)(1,3,4,6), with the starting items of the embedded suffix-matches underlined.
Definition 4.5.3: Let the σ-projected database, denoted as S|σ, be the collection of prefix-projection sequences returned by the function Π(s, σ) for the sequences s of S. For example, Figure 4.13 shows the projected databases of each 0.5-frequent 1-sequence of the running example, where all infrequent items have also been filtered out.
Definition 4.5.4: Let σ be a ξ-frequent sequential pattern in database S. The support of the sequential pattern β = σ·λ (or β = σ·_λ) in S|σ is |{s ∈ S|σ | λ ⊑ s}| (or |{s ∈ S|σ | _λ ⊑ s}|).

For example, let us consider the process of counting the 0.5-frequent 1-sequences in the <(1)>-projected database of Figure 4.13. The sequence <(2)(1,2,6)(1,4,6)> can extend the prefix <(1)> with the inter-extension items {1,2,4,6} and the intra-extension items {2,4,6}. Therefore, we increment the inter- and intra-counters of each item accordingly.
Figure 4.13: Projected databases of 0.5-frequent 1-sequences

  Prefix   Projected Database
  <(1)>    <(2)(1,2,6)(1,4,6)>, <(2,6)>, <(2,6)>
  <(2)>    <(1,2,6)(1,4,6)>, <(6)>, <(6)>, <(6)(1)>
  <(4)>    <(1,2)(1,2,6)(1,4,6)>, <(2,6)(1)>
  <(6)>    <(1,4,6)>, <(1)>
Lemma 4.5.2: Let σ be a ξ-frequent sequential pattern and β = σ·λ (or β = σ·_λ) a sequential pattern in sequence database S. Then sup_S(β) = sup_{S|σ}(λ) (or sup_S(β) = sup_{S|σ}(_λ)).

Proof: From Definition 4.5.3, we know each σ-projected database only contains sequences that can extend the prefix σ, which are exactly the sequences contributing to sup_S(β). Furthermore, the support counting procedure of Definition 4.5.4 ensures that we only increment the counters that contribute to the support of β = σ·λ (or β = σ·_λ). ∎
4.5.2 Level-by-Level PrefixSpan Algorithm
In this section we explain the PrefixSpan-1 algorithm, which traverses the frequent sequence lattice level-by-level, depth-first.

In each projection, this algorithm attempts to increase the length of the ξ-frequent sequential pattern (the projection prefix) by one. The mining process, as shown in Figure 4.14, consists of three main steps performed recursively on each database S|σ. Note that when σ = { }, the projection database S|σ represents the original database.
• Scan the projection database S|σ once and find its ξ-frequent inter- and intra- 1-sequences.
• For each ξ-frequent 1-sequence, extend the prefix σ accordingly and print the resulting sequential pattern.
• If any ξ-frequent sequential pattern was printed, then scan the projection database S|σ a second time and, for each unique inter- (α ∈ F1) or intra- (_α ∈ _F1) ξ-frequent 1-sequence found in each sequence s ∈ S|σ, add its prefix-projection Π(s, α) (filtering out infrequent items) to the projection database S|σ·α (or S|σ·_α).
• Recursively mine each newly created projected database if its number of sequences is greater than the minimum support count.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call PrefixSpan-1({ø}, 0, S)
    return F

// σ : projection pattern (prefix) of projection database S|σ
// l : length of the projection pattern σ
// S|σ : projection database

Subroutine PrefixSpan-1(σ, l, S|σ)
    for all sequences s in projected database S|σ do
        increment the count of each intra 1-sequence of s in candidate set _C1
        increment the count of each inter 1-sequence of s in candidate set C1
    _F1 = {α ∈ _C1 | α.sup ≥ min_sup}
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    F = F ∪ {σ·_α | _α ∈ _F1} ∪ {σ·α | α ∈ F1}
    if _F1 ∪ F1 = ø then return
    for all sequences s in projected database S|σ do
        for each unique intra 1-sequence _α ∈ _F1 of s, add Π(s, _α) to S|σ·_α
        for each unique inter 1-sequence α ∈ F1 of s, add Π(s, α) to S|σ·α
    for each _α ∈ _F1 do
        if |S|σ·_α| ≥ min_sup then Call PrefixSpan-1(σ·_α, l+1, S|σ·_α)
    for each α ∈ F1 do
        if |S|σ·α| ≥ min_sup then Call PrefixSpan-1(σ·α, l+1, S|σ·α)
    return

Figure 4.14: The PrefixSpan-1 mining algorithm
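To complement the pseudocode, here is a compact Python sketch of the prefix-growth idea (our own simplification: it is restricted to sequences with a single page view per visit, so the inter/intra distinction disappears, and it uses physical rather than pseudo projection).

    from collections import defaultdict

    def prefixspan(database, min_count, prefix=None):
        """Simplified PrefixSpan for sequences of single page views: recursively
        grow the prefix by one frequent item and project by the remaining suffix."""
        if prefix is None:
            prefix = []
        patterns = []
        counts = defaultdict(int)
        for seq in database:
            for item in set(seq):                 # each sequence counts an item once
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, count))
            projected = []
            for seq in database:
                if item in seq:
                    tail = seq[seq.index(item) + 1:]   # suffix after the first occurrence
                    if tail:
                        projected.append(tail)
            if len(projected) >= min_count:
                patterns.extend(prefixspan(projected, min_count, new_prefix))
        return patterns

    db = [[1, 2, 3], [1, 2], [2, 3], [1, 3, 2]]
    print(prefixspan(db, 2))
    # [([1], 3), ([1, 2], 3), ([1, 3], 2), ([2], 4), ([2, 3], 2), ([3], 3)]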
Figure 4.15 shows all the projection databases generated, and the 0.5-frequent sequences printed at each projection, by PrefixSpan-1 in our running example. In comparison to FreeSpan, there is significantly less work to be done per projection; on the other hand, there are more projections to process. The number of counters that must be updated for 1-sequences in PrefixSpan is twice as many as in FreeSpan, but this amounts to a very small memory increase.
  Prefix         Projected Database                                                          Freq. inter-items  Freq. intra-items  Frequent ξ-patterns
  <>             <(3,4)(1,2,3)(1,2,6)(1,3,4,6)>, <(1,2,6)(5)>, <(1,2,6)>, <(4,7,8)(2,6)(1,7,8)>   1,2,4,6        -                  <(1)>, <(2)>, <(4)>, <(6)>
  <(1)>          <(2)(1,2,6)(1,4,6)>, <(2,6)>, <(2,6)>                                            -              2,6                <(1,2)>, <(1,6)>
  <(1,2)>        <(2,6)(6)>, <(6)>, <(6)>                                                         -              6                  <(1,2,6)>
  <(1,2,6)>      <(6)>                                                                            -              -                  -
  <(1,6)>        <(6)>                                                                            -              -                  -
  <(2)>          <(1,2,6)(1,4,6)>, <(6)>, <(6)>, <(6)(1)>                                         1              6                  <(2)(1)>, <(2,6)>
  <(2)(1)>       <(6)(1,6)>                                                                       -              -                  -
  <(2,6)>        <(1,6)>, <(1)>                                                                   1              -                  <(2,6)(1)>
  <(2,6)(1)>     -                                                                                -              -                  -
  <(4)>          <(1,2)(1,2,6)(1,4,6)>, <(2,6)(1)>                                                1,2,6          -                  <(4)(1)>, <(4)(2)>, <(4)(6)>
  <(4)(1)>       <(2)(1,2,6)(1,6)>                                                                -              -                  -
  <(4)(2)>       <(1,2,6)(1,6)>, <(6)(1)>                                                         1              6                  <(4)(2)(1)>, <(4)(2,6)>
  <(4)(2)(1)>    <(6)(1,6)>                                                                       -              -                  -
  <(4)(2,6)>     <(1,6)>, <(1)>                                                                   1              -                  <(4)(2,6)(1)>
  <(4)(2,6)(1)>  -                                                                                -              -                  -
  <(4)(6)>       <(1,6)>, <(1)>                                                                   1              -                  <(4)(6)(1)>
  <(4)(6)(1)>    -                                                                                -              -                  -
  <(6)>          <(1,4,6)>, <(1)>                                                                 1              -                  <(6)(1)>
  <(6)(1)>       -                                                                                -              -                  -
Figure 4.15: Projection databases in PrefixSpan-1
In FreeSpan we could not guarantee that a projected database would shrink in size, whereas PrefixSpan demonstrates a greater selectivity. In practice, only a small set of sequential patterns grows very long; hence, the size of the projected database shrinks rapidly as the projection prefix grows. Also, since PrefixSpan projects suffix sequences, a sequence must shrink in length by at least one item. Note that the number of projections in PrefixSpan-1 equals the number of ξ-frequent sequences, which can be greater than in FreeSpan. Having to process more projections, even though smaller in size, could still add up to a non-trivial cost. In the next section we develop a strategy to reduce the cost of creating a projection.
4.5.3 Pseudo Projection
The main idea behind pseudo projection is to reduce the storage required to represent a projection database by enabling different projection databases to share the same physical database while each accesses only the parts that are relevant to it. This also reduces the cost of creating a projection database.
To this end we use pointers into the same physical database. Projecting a sequence shrinks it in two ways: by removing infrequent items and by removing a section of its prefix (refer to Figure 4.15).
Figure 4.16: Pseudo Projection with virtual-memory window
[Figure: the <(1)>- and <(2)>-projected databases represented as lists of pointers ({p1, p2, p3}, {p1}, ...) into the original sequences, which reside in inter-process and machine-sharable virtual memory; the virtual-memory window slides and a page fault is issued to load the needed section of the database.]
By sharing the same physical database, we prohibit ourselves from applying any modifications to the database, such as removing infrequent items. Therefore, in pseudo projection databases, we only allow a sequence to shrink by ignoring a section of its prefix. Let us look at the first sequence in the original database in Figure 4.15. It is projected into the <(1)>-projected database at three points, all of which are intra-suffix-matches. Hence, we represent the first sequence in the <(1)>-projected database as three pointers to different items of the first sequence in the original database, while the second and third sequences are represented with a single pointer each. In this fashion we reduce the memory requirement for creating a projection database, at the cost of recounting the parent projection's infrequent items. This is a small cost to pay compared to the cost of creating physical projections at each recursion, not to mention their storage requirement.
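The following Python sketch (ours; a simplification that flattens each sequence to a list of page views) represents a pseudo-projected database as a list of (sequence index, position) pointers into the shared physical database, so that creating a child projection allocates only pointers rather than copies of sequences.

    # Shared physical database: each sequence flattened to a list of page views.
    DB = [
        [4, 1, 2, 1, 2, 6, 1, 4, 6],
        [1, 2, 6],
        [1, 2, 6],
        [4, 2, 6, 1],
    ]

    def pseudo_project(pointers, item):
        """A pseudo-projected database is a list of (seq_idx, pos) pointers into DB.
        Growing the prefix by `item` just advances each pointer past the first
        occurrence of `item`; no sequence data is copied."""
        child = []
        for seq_idx, pos in pointers:
            seq = DB[seq_idx]
            try:
                hit = seq.index(item, pos)
            except ValueError:
                continue                      # this sequence cannot extend the prefix
            if hit + 1 < len(seq):            # drop sequences whose suffix is empty
                child.append((seq_idx, hit + 1))
        return child

    root = [(i, 0) for i in range(len(DB))]   # the unprojected database
    proj_1 = pseudo_project(root, 1)          # <(1)>-projected database as pointers
    print(proj_1)                             # [(0, 2), (1, 1), (2, 1)]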
In practice, databases are larger than the available memory, and only a small portion of that memory, called the system I/O buffer, is used to load the database into memory. Repeated use of pseudo projection can create large irrelevant regions between pointers, and different projections can also randomly access different sections of the database. As pseudo-projected databases become randomly scattered chunks of sequences, we incur significant unnecessary disk I/O as we load and reload database pages while using only small sections of them.
First, to manage the loading of database pages more efficiently, we leverage the available memory management technology and create a virtual-memory window over the database, thereby exercising more control over the number of database pages loaded and over the page-fault strategy used by the system. Popular operating systems, such as Unix, Linux, and NT, provide high-priority kernel-level primitives to implement these objects. In NT these kernel-level objects, called Section Objects (SO), are also sharable across processes and machines. As illustrated in Figure 4.16, each pointer is valid within a specific virtual-memory window. Our virtual-memory window manager is responsible for loading the correct database pages before we de-reference the first item in each sequence.
The second inherent weakness of this approach is due to the growing chunks of irrelevant data between pointers. To overcome this, we first measure the density of a pseudo-projected database, expressed as the ratio of the number of sequences over the number of system pages it spans (their size is system dependent, usually 4 MB). Second, we measure the selectivity of a pseudo-projected database, expressed as the ratio of the density of the pseudo-projected database over that of its physical database. A physical projection operation is performed intermittently when the selectivity drops below a certain threshold. In practice, when the database size falls below the size of a few system pages (usually 12 MB, but this depends on the available memory), it does not get swapped out by the system and remains in memory; at this point, performing further physical projections becomes unnecessary. Therefore, the virtual-memory window size is set to this proj_min_size threshold.
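A brief sketch of this projection policy (ours; the page size and thresholds below are illustrative parameters only, not measurements from our implementation):

    PAGE_SIZE_BYTES = 4 * 2**20        # illustrative system page/region size
    PROJ_MIN_SIZE = 12 * 2**20         # below this size, the data stays resident in memory
    SELECTIVITY_THRESHOLD = 0.35       # see the discussion that follows

    def density(num_sequences, size_bytes):
        """Sequences per system page spanned by a (pseudo-)projected database."""
        pages = max(1, -(-size_bytes // PAGE_SIZE_BYTES))   # ceiling division
        return num_sequences / pages

    def should_project_physically(pseudo_seqs, pseudo_span_bytes,
                                  phys_seqs, phys_size_bytes):
        """Materialize a physical projection when the pseudo projection has become
        too sparse relative to its physical database, unless it is already small
        enough to stay in memory."""
        if pseudo_span_bytes <= PROJ_MIN_SIZE:
            return False
        selectivity = density(pseudo_seqs, pseudo_span_bytes) / \
                      density(phys_seqs, phys_size_bytes)
        return selectivity < SELECTIVITY_THRESHOLD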
However, there is a tradeoff in the value selected for the selectivity threshold. It is impractical to perform too many physical projections, as disk write operations are costly. As the selectivity threshold increases, the cost of projection rises, since more nodes at the lower levels of the frequent sequence lattice will also need to be physically projected until their size reaches proj_min_size. On the other hand, the density of the projected databases is guaranteed to be higher and their size smaller, and thus the cost of subsequent projection and counting in that sub-tree is reduced. In practice, we found that there is only a small difference between choosing a selectivity threshold of 0.35 and of 1. Below 0.35, the cost of unnecessary disk I/O is high because physical projections are not performed often enough; between 0.35 and 1, the cost of physical projection is balanced by the reduced unnecessary disk I/O. The reason for this result is that the selectivity value depends mainly on the support of the ξ-frequent 1-sequences. In real databases, many 1-sequences have low support and low correlation with their prefix, and thus any selectivity threshold above their supports results in creating small physical projections. After a few recursions, this quickly reduces the size of the physical projections to below the proj_min_size threshold, and all subsequent operations in that sub-tree are done from memory.
4.5.4 Scaling up
Another major cost of projection-based algorithms is their storage requirement. With the size of Web logs gathered daily reaching gigabytes and growing, and despite the availability of inexpensive and large storage devices and the growing size of main memory, memory and disk limitations still impose great restrictions on projection-based algorithms. As analyzed in Section 4.3.4, the SPADE algorithm can continue with only two projections present per level, at the cost of processing a greater search space. In PrefixSpan with pseudo projection, all the information necessary for the mining process is contained within a projection, and thus the algorithm only requires one projection per level, without suffering any increase in its search space. When there is a disk or memory limitation, similarly to the SPADE algorithm, we pay the cost of performing multiple database scans to create the child projections in stages.
4.6 Summary
In this chapter, we described in detail PrefixSpan with pseudo projection, a new algorithm for the fast mining of Web traversal patterns in Web log files. In this approach we progressively partition the database into smaller sub-databases using the frequent sequence lattice. Depending on the selectivity of the prefix sequence, it is very likely that its projected database can be processed in main memory. In the next chapter we present our experimental results.
Chapter 5
Performance Analysis and Discussions
In this chapter we compare the performance of PrefixSpan with pseudo projection against the PSP+ and FreeSpan algorithms. The PSP+ algorithm is implemented as described in Section 4.3.3. Experiments were performed on a 647 MHz Pentium III PC with 192 MB of main memory, 576 MB of virtual memory, and 10 GB of local disk space (Fujitsu MHM2100AT), running Microsoft Windows 2000 Server. We implemented all the algorithms using Microsoft Visual C++ 6.0.
  Short form  Description                                                   Parameter
  D           Number of customers                                           -ncust
  C           Avg. number of transactions per customer                      -slen
  T           Avg. number of items per transaction                          -tlen
  S           Avg. number of items in maximal potential frequent sequences  -seq.patlen
  I           Avg. number of items in maximal potential frequent item sets  -lit.patlen
  NS          Number of maximal potential frequent sequences                -seq.npats
  NI          Number of maximal potential frequent item sets                -lit.npats
  N           Number of items                                               -nitems

Table 5.1: Synthetic data generation program's parameters
5.1 Synthetic Datasets
Our synthetic datasets are generated using the publicly available synthetic data generation program of the IBM Quest data mining project [IQ01], which has been used in most sequential pattern mining studies [SA96, HPM+01, MCP98, Z00]. The datasets consist of sequences of item sets, where each item set represents a market-basket transaction. For the purpose of Web log mining we do respect the order of items within each transaction. Table 5.1 shows the parameters we set for the synthetic data generation program. The generator works as follows. First, N items, numbered 0 to N-1, are used to create NI maximal item sets of average length I. Then, NS maximal sequences of average length S are created from the members of the NI maximal item sets. Next, an average of C members of the NS maximal sequences are added to a customer sequence as transactions, ensuring an average transaction length of T. The process continues until D customer sequences have been created. Table 5.2 shows our test data sets and their parameter settings. For additional detail on the data generation program, the interested reader can refer to [AS95]. Similar to [SA96, MCP98], we set NS = 5,000 and NI = 25,000. As reading a binary file is more efficient than parsing a text file, all datasets were generated in binary format.
  Data set                  N   C   T    S  I     D     Size (MB)
  C10-T2.5-S4-I1.25-D100K   1K  10  2.5  4  1.25  100K  11.44
  C10-T5-S4-I1.25-D100K     1K  10  5    4  1.25  100K  20.06
  C10-T5-S4-I2.5-D100K      1K  10  5    4  2.5   100K  19.41
  C20-T2.5-S4-I1.25-D100K   1K  20  2.5  4  1.25  100K  24.36
  C20-T2.5-S4-I2.5-D100K    1K  20  2.5  4  2.5   100K  22.15
  C20-T2.5-S8-I1.25-D100K   1K  20  2.5  8  1.25  100K  23.88
  C10-T8-S7-I3-D100K        1K  10  8    7  3     100K  20.07
  C10-T5-S5-I1.5-D100K      1K  10  5    5  1.5   100K  19.21

Table 5.2: Synthetic data sets
5.2 Berkeley Web Log Dataset
The Web log file was collected from the University of California, Berkeley's website. The site hosts a variety of information, ranging from university, course, and department information to individual students' websites. The Web log covers the year 1999 and has 103,680 distinct URLs, 1,079,621 distinct hosts, and a total of 1,079,771 sequences. Its size is 45.79 MB and its average sequence length is 7. Figure 5.1 shows the distribution of its sequence lengths.
Figure 5.1: Distribution of Web log sequence length
5.3 Comparison of PrefixSpan with PSP+ and FreeSpan
The scalability of PrefixSpan, PSP+ and FreeSpan as the support threshold decreases from 1% to 0.25% is shown in Figures 5.2 and 5.3. It is easy to see that PrefixSpan scales much better than PSP+ and FreeSpan. As shown in the figures, as the support threshold goes down, the number and the length of ξ-frequent sequences increase. The change in the density of the database's frequent sequence lattice from one support threshold to the next determines the degree of this increase, which is completely database dependent. The density of the frequent sequence lattice directly affects the number of candidates in PSP+.
Figure 5.2: Performance Comparison: Synthetic Dataset
[Four panels, one per data set — N1K-C10-T2.5-S4-I1.25-D100K (11.44 MB), N1K-C10-T5-S4-I1.25-D100K (20.06 MB), N1K-C10-T5-S4-I2.5-D100K (19.41 MB), and N1K-C20-T2.5-S4-I1.25-D100K (24.36 MB) — each showing run time (seconds) vs. minimum support (1%-0.25%) for PrefixSpan-1, PSP+, FreeSpan-1, and FreeSpan-2, together with the count (thousands) of frequent k-sequences by length at each support threshold.]
Figure 5.3: Performance Comparison: Synthetic Datasets
[Two panels — N1K-C20-T2.5-S4-I2.5-D100K (22.15 MB) and N1K-C20-T2.5-S8-I1.25-D100K (23.88 MB) — with the same layout as Figure 5.2: run time vs. minimum support for the four algorithms, and frequent k-sequence counts by length.]
The longer and more numerous the candidates, the greater the size of the candidate tree and the more expensive it is for PSP+ to do pattern matching for each database sequence, particularly for long sequences.

For each sequence of a projected database, FreeSpan needs to count the frequency of all subsequences that match the projected pattern of its projected database. This is an extra cost per projected database over PrefixSpan, and it increases exponentially with the length of the sequence. This extra cost is compensated for only if the number of projections in FreeSpan is significantly smaller than in PrefixSpan. Obviously, this depends on the repetition of items in ξ-frequent sequences. In the worst case, where all ξ-frequent sequences have a unique projected pattern, the number of projected databases could be greater than the number of ξ-frequent sequences, since every projected database must be processed
and some projected databases (that is, some leaf nodes in the projection database tree) may not generate any ξ-frequent sequence. In most of our experiments with FreeSpan, the number of projected databases was greater than or close to the number of ξ-frequent sequences. In PrefixSpan, however, we can guarantee that the number of projection databases is equal to or less than the number of ξ-frequent sequences. Even though FreeSpan-2 reduces the number of projected databases, most of the time it holds more projection databases in memory at a given time. As shown in Figures 5.2 and 5.3, the number of ξ-frequent 2-sequences is extremely large for low support thresholds. This results in memory swapping, a significant overhead for FreeSpan-2. We have addressed this drawback in PrefixSpan with pseudo projection in Section 4.5.4.
Figure 5.4: Performance Study: Long Synthetic Datasets
[Four panels: PrefixSpan-1 run time (seconds) and projections processed per second vs. minimum support (1%-0.25%) for N1K-C10-T8-S7-I3-D100K (20.07 MB) and N1K-C10-T5-S5-I1.5-D100K (19.21 MB), together with the count (thousands) of frequent k-sequences by length for each data set at each support threshold.]
Figure 5.4 shows the scalability of PrefixSpan-1 on databases with long sequences and a denser frequent sequence lattice. The performance of the other algorithms is not included, as it was far worse. The figure also shows the number of projected databases processed per second by PrefixSpan. As the support threshold is lowered, this number goes up dramatically, especially for data set C10-T5-S5-I1.5-D100K. This fact explains the scalability of PrefixSpan and can be attributed to the following: as the sequential patterns become longer, the projected databases become smaller, much smaller than in FreeSpan, and therefore the corresponding processing time also decreases. There are several other reasons why PrefixSpan outperforms FreeSpan and PSP+.
1. PrefixSpan performs two simple scans of the projected database to find the inter- and intra- ξ-frequent items.
2. No complicated data structures are used, and there is no overhead of searching for subsequences.
3. Pseudo projection ensures low memory usage and reduces the cost of creating a projection; controlled physical projections ensure dense data sets; the virtual-memory window ensures low disk I/O and high data locality; and the partial materialization of pseudo projections (Section 4.5.4) ensures no memory swapping.
Having addressed the strengths of PrefixSpan, let us look at the database characteristic that most significantly affects its performance. Even though PrefixSpan guarantees that the projected databases become smaller as the sequential patterns become longer, this selectivity factor is database dependent. As shown in Figure 5.4, data set C10-T5-S5-I1.5-D100K shows a much lower cost per projection, which can be explained by the selectivity factor of its ξ-frequent sequences.
We also compared the performance of the algorithms on Berkeley's Web log database; the results are shown in Figure 5.5. Even at the highest support threshold of 2%, the other algorithms would take over 2000 seconds, which is explained by the excessively long sequences of the Web log; therefore, we did not include them in the graph.
Figure 5.5: Performance Study: Web Log Dataset
(Panels: PrefixSpan-1 running time in seconds on the UCBerkeley 1999 Web log (45.79 MB) at minimum supports 1.25%, 1.00%, 0.75% and 0.50%, and the number of frequent k-sequences found at each of those supports.)
5.4 Scale up
Figure 5.6: Scale up: Number of Customers
(Panel (a): relative running time of PSP+, PrefixSpan-1, FreeSpan-1 and FreeSpan-2 on N1K-C10-T2.5-S4-I1.25 (11.44-57.2 MB) at 1% and 0.5% minimum support as the number of customers grows from 100K to 500K. Panel (b): relative running time of PrefixSpan-1 on N1K-C10-T5-S5-I3 (11.47-92.23 MB) at 0.75%, 0.5% and 0.25% minimum support.)
Figure 5.6(a) shows the scalability of PrefixSpan, FreeSpan and PSP+ as
the number of data sequences is increased. All the algorithms were run
on the N1K-C10-T2.5-S4-I1.25 data set with different minimum support
thresholds ranging from 2% to 0.5% as the number of customers was
increased ten times from 100,000 to 1 million. Figure 5.6(b) shows the
scalability of PrefixSpan on the N1K-C10-T5-S5-I3 data set, which has
somewhat longer sequences. It can be seen that all the algorithms are
linearly scalable within the range shown in the figure.
Chapter 6
Conclusions and Future Works
The information content of the WWW is growing at an exponential rate,
and it is not surprising to find users having difficulty navigating it
and finding relevant information, or e-commerce sites having difficulty
observing potential customers. In this study we have applied data
mining techniques to the Web access log files of a specific Web site to
perform such analysis. In the following sections, our work is
summarized and research directions are discussed.
6.1 Conclusions
In this thesis we have studied two problems associated with scalable
Web Usage Mining, namely multidimensional analysis and sequential
pattern mining of Web log files. First we introduced a scalable system
architecture that uses static multidimensional aggregations of OLAP to
do simple frequency analysis. Next we developed a novel, scalable and
efficient sequential pattern discovery algorithm to mine frequent
sequential patterns. We have experimentally evaluated our approach in
both cases. The main contributions of this thesis are:
1. Implemented a system that incorporates multidimensional data
cubes and OLAP technology to interactively extract implicit
knowledge from Web log files.
2. Created a new algorithm for efficient and scalable mining of Web
sequential patterns, called PrefixSpan with pseudo projection.
6.2 Future Works
The use of the World Wide Web as a means of marketing and selling has
increased dramatically in the recent past and has intensified the need
to understand users’ online behavior. Web Usage Mining is a new
research field that is avidly followed by many scholars and commercial
businesses. However, the available collections of information are very
limited and are not designed for such research, and there are no
established data mining techniques for such data sources. The list of
open problems and opportunities, however, is also long. We will now
introduce some of them within the scope of our research context.
6.2.1 Real-time Multidimensional Analysis
The construction of multidimensional data cubes for OLAP and
knowledge discovery is very time-consuming. Due to the volatile nature
of Web log data, most users require only simple usage statistics and
canned reports as events occur. As such, we need to improve our Web log
multidimensional analysis and Web log parser to address real business
needs.
6.2.2 Scalability
Web logs will continue to grow in size, and new sources of data will be
created for Web Usage Mining. PrefixSpan with pseudo projection lends
itself naturally to extension into a parallel processing algorithm. Our
goal should be to handle extremely large data sets.
6.2.3 Incremental Algorithm
As Web sites or business needs change over time, some sequential
patterns become invalid and others need to be updated. We need to
extend our algorithm to solve this problem.
6.2.4 Sequential Pattern Mining with Constraints
Sequential pattern mining algorithms tend to generate a huge number of
sequences, and at any given time not all of them are of interest to the
user. For example, a marketing analyst may be interested only in the
activity of those online customers who have visited certain pages in a
specific time period. In general, the discovered patterns must meet
certain rules and conditions, which we categorize as follows; a small
illustrative sketch of such constraints follows the list.
• Constraint on page-view attribute: Page-view attributes may include
page type, page name, access date, view time, and activities
associated with the page view, such as buying or traversal of a
certain hyperlink. These constraints are unary and enforce
certain limitations on a desired page view.
• Constraint on user-visit attribute: User-visit attributes may include
length, duration, and minimum and maximum gap between page
views. The constraint may also include a regular expression to
impose restrictions on the pattern of page views, their view times and
gaps.
• Constraint on user-sequence attribute: User-sequence attributes
may include duration, length, number of visits, average visit
length and duration, and minimum and maximum gap between
visits. The constraint may also include a regular expression to
impose restrictions on repeated patterns among visits, visit durations
and gaps.
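As a small illustrative sketch, the Python fragment below shows how the first two categories might be written down as simple predicates that a constrained miner (or a post-filter on its output) could apply to a user visit; the attribute names (page, kind, time) and the concrete limits are hypothetical and not part of the thesis.

    def satisfies_page_view_constraint(view, allowed_kinds):
        """Unary constraint on a single page view, e.g. restricting its page type."""
        return view["kind"] in allowed_kinds

    def satisfies_visit_constraint(visit, max_gap_seconds, min_length):
        """Constraint on a user visit: minimum length and maximum gap between views."""
        if len(visit) < min_length:
            return False
        times = [view["time"] for view in visit]
        return all(later - earlier <= max_gap_seconds
                   for earlier, later in zip(times, times[1:]))

    # Example: keep a visit only if every page view is a catalogue or purchase
    # page and consecutive views are at most 30 seconds apart.
    visit = [{"page": "/catalogue", "kind": "catalogue", "time": 0},
             {"page": "/item42",    "kind": "catalogue", "time": 12},
             {"page": "/buy",       "kind": "purchase",  "time": 25}]
    accepted = (all(satisfies_page_view_constraint(v, {"catalogue", "purchase"}) for v in visit)
                and satisfies_visit_constraint(visit, max_gap_seconds=30, min_length=2))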
In conclusion, the importance of Web Usage Mining will continue to grow
with the popularity of the WWW, and it will undoubtedly have a
significant impact on the study of online user behavior.
Bibliography
[A01] J.-M. Adamo. Data Mining for Association Rules and Sequential Patterns. Springer Verlag, New York, 2001.
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the 1993 ACM SIGMOD Conference, pages 207-216, Washington DC, USA, May 1993.
[AS94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th International Conference on Very Large Databases (VLDB’94), pages 487-499, Santiago, Chile, September 1994.
[AS95] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th International Conference on Data Engineering (ICDE’95), pages 3-14, Taipei, Taiwan, March 1995.
[BBA+99] A. G. Büchner, M. Baumgarten, S. S. Anand, M. D. Mulvenna, and J. G. Hughes. Navigation Pattern Discovery from Internet Data. In KDD Workshop on Web Usage Analysis and User Profiling (WebKDD’99), pages 25-30, San Diego, CA, USA, 1999.
[BL00a] J. Borges and M. Levene. Data Mining of User Navigation Patterns. In Web Usage Mining and User Profiling, B. Masand and M. Spliliopoulou, editors, Lecture Notes in Artificial Intelligence (LNAI 1836), pages 92-111, Springer Verlag, Berlin, 2000.
[BL00b] J. Borges and M. Levene. A Heuristic to Capture Longer User Web Navigation Patterns. In Proc. of the first International Conference on Electronic Commerce and Web Technologies (EC-Web’00), pages 155-164, London-Greenwich, U.K., September 2000.
[C93] E. F. Codd. Providing OLAP (On-Line Analytical Processing) to User-analysts: An IT Mandate. Technical Report TR-9300011, E. F. Codd and Associates, 1993.
[C97] J. Graham-Cumming. Hits and Misses: A Year Watching the Web. In Proc. 6th Int’l World Wide Web Conf., Santa Clara, CA, USA, April 1997.
[C00] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, University of Minnesota, May 2000.
[CD97] S. Chaudhuri and U. Dayal. An Overview of Data Warehouse and OLAP Technology. SIGMOD Record, 26(1):65-74, March 1997.
[CMS97] R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach, CA, USA, November 1997.
[CMS99] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, 1(1):5-32, February 1999.
[CP95] L. Catledge and J. Pitkow. Characterizing Browsing Behaviors on the World Wide Web. Computer Networks and ISDN Systems, 27(6), North-Holland, 1995.
[CPY98] M.-S. Chen, J. S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Trans. on Knowledge and Data Engineering (TKDE), 10(2):209-221, March 1998.
[EV97] S. Elo-Dean and M. Viveros. Data Mining the IBM Official 1996 Olympics Web Site. Technical report, IBM T.J. Watson Research Center, 1997.
[FPS96a] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, G. Piatetsky-Shapiro and J. Frawley, editors, AAAI Press, Menlo Park, CA, 1996.
[FPS96b] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. In Communications of the ACM – Data Mining and Knowledge Discovery in Databases, 39(11):27-34, November 1996.
[GBL+96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-tabs and Sub-totals. In Proc. of the 12th Int’l Conference on Data Engineering (ICDE’96), pages 152-159, New Orleans, USA, 1996.
[GEN96] NetGenesis. http://www.netgen.com, 1996.
[HCC+97] J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, G. Liu, K. Koperski, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaïane, S. Zhang, and H. Zhu. DBMiner: A System for Data Mining in Relational Databases and Data Warehouses. In Proc. CASCON’97: Meeting of Minds, pages 249-260, Toronto, Canada, November 1997.
[HFW+96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, and O. R. Zaïane. DBMiner: A System for Mining Knowledge in Large Relational Databases. In Proc. 1996 Int’l Conference Data Mining and Knowledge Discovery (KDD’96), pages 250-255, Portland, Oregon, USA, August 1996.
[HPM+00] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu. FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355-359, Boston, MA, USA, Aug. 2000.
[HPM+01] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proc. 2001 Int. Conf. on Data Engineering (ICDE’01), Heidelberg, Germany, April 2001.
[IQ01] IBM Quest Data Mining Project Synthetic Data Generation Program: http://www.almaden.ibm.com/cs/quest/syndata.html.
[K95] W. Klösgen. Efficient Discovery of Interesting Statements in Databases. Journal of Intelligent Information Systems (JIIS), 4(1):53-69, January 1995.
[KB00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. In SIGKDD Explorations, 2(1):1-15, July 2000.
[LOU95] A. Luotonen. The Common Logfile Format. http://www.w3.org/Daemon/User/Config/Logging.html, 1995.
[MCP98] F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proc. European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’98), pages 176-184, Nantes, France, September 1998.
[MIN00] Mineit Software Ltd. Easyminer. http://www.mineit.com, 2000.
[MPC99] F. Masseglia, P. Poncelet, and R. Cicchetti. An Efficient Algorithm for Web Usage Mining. Networking and Information Systems Journal (NIS), 2(5-6):571-603, 1999.
[MPT99] F. Masseglia, P. Poncelet and M. Teisseire. Using Data Mining Techniques on Web Access Logs to Dynamically Improve Hypertext Structure. In ACM SigWeb Letters, 8(3):13-19, October 1999.
[MPT00] F. Masseglia, P. Poncelet, and M. Teisseire. Web Usage Mining: How to Efficiently Manage new Transactions and New Clients. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD'00), pages 530-535, Lyon, France, September 2000.
[MT96] H. Mannila and H. Toivonen. Discovering Generalized Episodes Using Minimal Occurrences. In Proc. of the Second Int’l Conference on Knowledge Discovery and Data Mining (KDD’96), pages 146-151, Portland, Oregon, August 2-4, 1996.
[MTV95] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of the First Int’l Conference on Knowledge Discovery and Data Mining (KDD’95), pages 210-215, Montreal, Quebec, 1995.
[OC96] T. Oates and P. R. Cohen. Searching for Structure in Multiple Streams of Data. In Proc. of the Thirteenth Int’l Conference on Machine Learning (ICML’96), pages 346-354, Bari, Italy, July 1996.
[P97] J. Pitkow. In Search of Reliable Usage Data on the WWW. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, USA, April 1997.
[PHM+00] J. Pei, J. Han, B. Mortazavi-Asl and H. Zhu. Mining Access Patterns Efficiently from Web Logs. In Proceedings Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), pages 396-407, Kyoto, Japan, April 2000.
[PZL98] S. Parthasarathy, M. J. Zaki, and W. Li. Memory Placement Techniques for Parallel Association Mining. In 4th Int’l Conference on Knowledge Discovery and Data Mining (KDD’98), pages 304-308, New York, New York, August 1998.
[S99] S. Sarawagi. Explaining Differences in Multidimensional Aggregates. In Proc. of the 25th Int’l Conference on Very Large Data Bases (VLDB’99), pages 42-53, Edinburgh, Scotland, U.K., September 1999.
[S00] S. Sarawagi. User-Adaptive Exploration of Multidimensional Data. In Proc. of the 26th Int’l Conference on Very Large Data Bases (VLDB’00), pages 307-316, Cairo, Egypt, September 2000.
[SA96] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Fifth Int’l Conference on Extending Database Technology (EDBT’96), pages 3-17, Avignon, France, March 1996.
[SAM98] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven Exploration of OLAP Data Cubes. In Proc. of Extending Database Technology (EDBT’98), pages 168-182, Valencia, Spain, March 1998.
[SCD+00] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web Usage Mining: Discovery and Application of Usage Patterns from Web Data. SIGKDD Explorations, 1(2):12-23, January 2000.
[SS00] S. Sarawagi and G. Sathe. I3: Intelligent, Interactive Investigation of OLAP Cubes. In Proc. of the 2000 ACM SIGMOD Int’l Conference on Management of Data, page 589, Dallas, Texas, USA, May 2000.
[SZA+97] C. Shahabi, A. Zarkesh, J. Adibi, and V. Shah. Knowledge Discovery from Users Web-Page Navigation. In Proceedings of the IEEE RIDE’97 Workshop, pages 20-29, Birmingham, England, April 1997.
[WCA99] World Wide Web Committee Web Usage Characterization Activity. http://www.w3c.com/WCA, 1999.
[WEB95] Software Inc. Webtrends. http://www.webtrends.com, 1995.
[WYB98] K. Wu, P. S. Yu, and A. Ballman. SpeedTracer: A Web Usage Mining and Analysis Tool. IBM Systems Journal, 37(1):89-105, 1998.
[YJM+96] T. W. Yan, M. Jacobsen, H. G. Molina, and U. Dayal. From User Access Patterns to Dynamic Hypertext Linking. In Proceedings of the 5th International World-Wide Web Conference, pages 7-11, Paris, France, May 1996.
[Z98] M. J. Zaki. Scalable Data Mining for Rules. Ph.D. Thesis, University of Rochester, 1998.
[Z99] O. R. Zaïane. Resource and Knowledge Discovery from the Internet and Multimedia Repositories. PhD thesis, School of Computing Science, Simon Fraser University, March 1999.
[Z00] M. J. Zaki. Parallel Sequence Mining on Shared-Memory Machines. In Large-Scale Parallel Data Mining, M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Artificial Intelligence (LNAI 1759), Springer-Verlag, Berlin, 2000.
[ZXH98] O. R. Zaïane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. In Proceedings of Advances in Digital Libraries Conference (ADL’98), pages 19-29, Santa Barbara, CA, USA, April 1998.