DISCOVERING AND MINING USER WEB-PAGE TRAVERSAL
PATTERNS
by
Behzad Mortazavi-Asl
B.Sc., Simon Fraser University, 1999
THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
Master of Science
in the School
of
Computing Science
© Behzad Mortazavi-Asl 2001
SIMON FRASER UNIVERSITY
April 2001
All rights reserved. This work may not be
reproduced in whole or in part, by photocopy or other means, without permission of the author.
Approval
Name: Behzad Mortazavi-Asl
Degree: Master of Science
Title of thesis: Discovering and Mining User Web-Page Traversal
Patterns
Examining Committee:
Dr. Lou Hafer
Chair
Dr. Jiawei Han, Senior Supervisor
Dr. Tsunehiko (Tiko) Kameda, Supervisor
Dr. Ke Wang, External Examiner
Date Approved: ______________________________
Abstract
As the popularity of the WWW explodes, a massive amount of data is
gathered by Web servers in the form of Web access logs. This is a rich
source of information for understanding Web user surfing behavior. Web
Usage Mining, also known as Web Log Mining, is an application of data
mining algorithms to Web access logs to find trends and regularities in
Web users’ traversal patterns. The results of Web Usage Mining have
been used in improving Web site design, business and marketing
decision support, user profiling, and Web server system performance.
In this thesis we study the application of assisted exploration of
OLAP data cubes and scalable sequential pattern mining algorithms to
Web log analysis. In multidimensional OLAP analysis, standard
statistical measures are applied to assist the user at each step to explore
the interesting parts of the cube. In addition, a scalable sequential
pattern mining algorithm is developed to discover commonly traversed
paths in large data sets. Our experimental and performance studies
have demonstrated the effectiveness and efficiency of the algorithm in
comparison to previously developed sequential pattern mining
algorithms. Finally, some further research avenues in Web usage
mining are identified.
Dedication
To my parents
Acknowledgments
I would like to thank my supervisor Dr. Jiawei Han for his support,
sharing of his knowledge and the opportunities that he gave me. His
dedication and perseverance has always been exemplary to me. I am
also grateful to TeleLearning for getting me started in Web Log Analysis.
I owe a debt of gratitude to Dr. Jiawei Han, Dr. Tiko Kameda and
Dr. Wo-shun Luk for supporting my decision to continue my graduate
studies. I am also grateful to Dr. Tiko Kameda for accepting to be my
supervisory committee member. He has been most generous and
understanding with his time to read this thesis carefully and make
insightful comments and suggestions. I would like to thank Dr. Ke Wang
for being my external examiner. It was with his sharing of knowledge
and experience that I was able to improve the performance of the
algorithms to the level they are at now.
Many people have helped and contributed their time to the
research of this thesis. Many thanks go to Jian Pei, who significantly
influenced the direction of my research. I am grateful to him for the
countless hours of constructive discussions, the opportunities for
research collaboration, and his informative reviews of my presentation. I
offer my deepest thanks to Kum Hoe Tung who unselfishly helped me
over the years by sharing his knowledge.
I would like to especially thank all who have made my study at
School of Computing Science at Simon Fraser University possible and
have inspired me over the years and have made this experience a
memorable one.
Table of Contents
Approval .......................................................................................... ii
Abstract ........................................................................................... iii
Dedication ....................................................................................... iv
Acknowledgments ............................................................................ v
1. Introduction................................................................................. 1
1.1. Knowledge Discovery and Data Mining............................. 1
1.2. Motivation ....................................................................... 2
1.3. Thesis Outline ................................................................. 5
2. Related Work ............................................................................... 6
2.1. Data Sources ................................................................... 7
2.2. Web Usage Terms ............................................................ 9
2.3. Data Preprocessing .......................................................... 11
2.3.1. User Identification ............................................... 13
2.3.2. Session Identification .......................................... 13
2.3.3. Episode Identification .......................................... 14
2.4. Web Usage Mining ........................................................... 15
3. Discovering Web Access Patterns Using OLAP .............................. 18
3.1. Introduction .................................................................... 18
3.1.1. Web Log Dimensions ........................................... 18
3.1.2. Motivation ........................................................... 20
3.1.3. Contribution........................................................ 20
3.1.4. Data set .............................................................. 21
3.2. Partially Automated Exploration of Web Log Data Cubes.. 21
3.2.1. Architecture ........................................................ 22
3.2.2. Implementation ................................................... 23
3.2.3. Illustrative Example ............................................ 24
3.3. Summary......................................................................... 26
4. Sequential Analysis of Web Traversal Patterns.............................. 27
4.1. Introduction .................................................................... 27
4.2. Problem Statement .......................................................... 29
4.3. Related Work ................................................................... 34
4.3.1. GSP Algorithm..................................................... 35
4.3.2. PSP Algorithm ..................................................... 39
4.3.3. Memory Management .......................................... 40
4.3.4. SPADE Algorithm ................................................ 41
4.4. FreeSpan: Pattern Growth via frequent item lattice .......... 43
4.4.1. Basic Idea ........................................................... 44
4.4.2. Level by Level FreeSpan Algorithm....................... 45
4.4.3. Alternative-Level FreeSpan Algorithm .................. 49
4.5. PrefixSpan: Pattern Growth via frequent sequence lattice . 53
4.5.1. Basic Idea ........................................................... 53
4.5.2. Level-by-Level PrefixSpan Algorithm .................... 57
4.5.3. Pseudo Projection ................................................ 60
4.5.4. Scaling up ........................................................... 63
4.6. Summary......................................................................... 64
5. Performance Analysis and Discussions......................................... 65
5.1. Synthetic Datasets........................................................... 65
5.2. Berkeley Web Log Dataset................................................ 66
5.3. Comparison of PrefixSpan with PSP+ and FreeSpan ......... 67
5.4. Scale up .......................................................................... 72
6. Conclusions and Future Work...................................................... 74
6.1. Conclusions..................................................................... 74
6.2. Future Works .................................................................. 75
6.2.1. Real-time Multidimensional Analysis ................... 75
6.2.2. Scalability ........................................................... 75
6.2.3. Incremental Algorithm......................................... 76
6.2.4. Sequential Pattern Mining with Constraints......... 76
Bibliography .................................................................................... 78
List of Figures
1.1 The core steps of knowledge discovery process ........................ 1
2.1 Web Usage Data Sources ........................................................ 7
3.1 Starnet query model of a Web log data cube............................ 19
3.2 Architecture for assisted exploration of Web log data cube ...... 22
3.3 Partially automated assisted data exploration ......................... 23
3.4 Hit counts over two weeks period............................................ 24
3.5 Top drill paths using maximum standard deviation................. 25
3.6 Request hits over 24 hours of day 7 ........................................ 26
4.1 Web Access Sequence Database.............................................. 33
4.2 The GSP algorithm.................................................................. 35
4.3 PSP prefix-tree vs. GSP hash-tree structures........................... 39
4.4 SPADE Id-list intersection....................................................... 42
4.5 Transaction database of sequence database............................ 44
4.6 Frequent item lattice of running example................................ 45
4.7 Section of projection databases in FreeSpan-1 ........................ 46
4.8 The FreeSpan-1 mining algorithm........................................... 48
4.9 Frequent item matrix of running example after second pass.... 49
4.10 Projection databases in FreeSpan-2............................... 50
4.11 The alternative-level FreeSpan mining algorithm ........... 51
4.12 Prefix-based traversal of frequent sequence lattice......... 54
4.13 Projected databases of frequent 1-sequences ................. 56
4.14 The PrefixSpan-1 mining algorithm ............................... 58
4.15 Projection databases in PrefixSpan-1............................. 59
4.16 Pseudo Projection with virtual-memory window ............. 60
5.1 Distribution of Web log sequence length ................................. 67
5.2 Performance Comparison: Synthetic Dataset........................... 68
5.3 Performance Comparison: Synthetic Datasets ......................... 69
5.4 Performance Study: Long Synthetic Datasets .......................... 70
5.5 Performance Study: Web Log Dataset...................................... 72
5.6 Scale up: Number of Customers.............................................. 72
List of Tables
2.1 Web Log Field Description....................................................... 8
2.2 Web Usage Terms and Definitions........................................... 10
2.3 Sample ECLF Web log file ....................................................... 12
3.1 First order statistics by column .............................................. 25
5.1 Synthetic data generation program’s parameters..................... 65
5.2 Synthetic data sets ................................................................. 66
Chapter 1
Introduction
Web Usage Mining is the automatic discovery of user access patterns
from Web servers. Organizations collect large volumes of data in their
daily operations, generated automatically by Web servers and collected
in Web access log files. Analysis of these access data can
provide useful information for server performance enhancements,
restructuring a Web site, and direct marketing in e-commerce. In this
thesis we study the multidimensional analysis of Web logs and automatic
discovery of frequently traversed paths in Web sites.
1.1 Knowledge Discovery and Data Mining
Figure 1.1: The core steps of knowledge discovery process
[Figure 1.1 depicts the core steps: data → data integration → data
warehouse → selection of task-relevant data → data mining → pattern
evaluation.]
Data Mining and Knowledge Discovery in Databases (KDD) is defined as
the process of automatic extraction of implicit, novel, useful, and
understandable patterns in large databases. There are many steps
involved in the Data Mining process, which include data cleaning and
preprocessing, data integration, data selection, data transformation and
reduction, data-mining task and algorithm selection, and lastly post-
processing and interpretation of discovered knowledge [FPS 96a; FPS
96b]. This process tends to be highly interactive, incremental and
iterative. Figure 1.1 illustrates the core steps of the knowledge
discovery process.
In general there are two levels of data mining. The descriptive level
is more interactive, ad-hoc and query driven [FPS 96a]. It involves the
traditional multidimensional analysis and reporting operations that are
provided by OLAP techniques, such as drilling down, rolling up, slicing
and dicing, over the existing data in data warehouses. Predictive-level
data mining, however, is more automatic. It involves the application
of different data mining algorithms to task-relevant data to discover new
implicit patterns. The mining tasks performed by such algorithms
include: association rule mining, sequential rule mining, classification,
clustering and similarity search. In this thesis we look at the application
of both types of techniques to Web Usage Mining.
1.2 Motivation
The use of the World Wide Web as a means for marketing and selling has
increased dramatically in recent years. Almost every major company has
its own Web site that acts at least as a promotional tool to increase
awareness of the company and its products. As e-commerce activities
become more important, organizations must spend
more time to provide the right level of information to their customers.
How can one tell what content is being read, whether a Web site is
effective, or how users read the information?
Web Usage Mining is the application of established data mining
techniques to analyze Web site usage. For an e-commerce company this
means detecting future customers likely to make a large number of
purchases, or predicting which online visitors will click on what ads or
banners based on observation of prior visitors who have behaved both
positively and negatively to the advertisement banners.
The sources of data involved in such an analysis may include
the content and structure of the Web site, demographic data about the users,
and Web site usage data gathered in Web logs. In this thesis we mainly
focus on the available information in Web log files. Web log files,
however, are transaction oriented and were not designed with these
questions in mind. Therefore, Web log files need to be cleaned,
summarized and transformed before processing. In Chapter 2 we look at
the available sources of Web logs and the preprocessing tasks required
for these types of data sources. We will also look at some of the recent
and interesting work done in this field.
Most server analysis packages lack the ability to provide any true
business insights about visitors’ online behavior. Current traffic analysis
tools, like Accrue, Andromedia, HitList, NetIntellect, NetTracker, and
WebTrends provide high-level predefined reports about domain names, IP
addresses, browsers, cookies, and other server activities. These types of
reports aim at providing information on the activity of the server rather
than the user. Because of the time-variant and multidimensional
nature of Web logs, we can leverage existing OLAP technology and
provide a multidimensional browsing capability [ZXH98].
In Chapter 3 we explore the multidimensional analysis of Web log files,
which could also provide insight into the user behavior.
In addition, detecting user navigation paths and analyzing them
may result in a better understanding of how users visit a site, help identify
users with similar information needs, or even improve the quality of
information delivery on the WWW using dynamic or personalized Web pages.
Unfortunately, it is difficult to perform such user-oriented data mining
directly on raw user access logs, as these logs tend to be error-prone
(missing fields or values), incomplete and ambiguous. They are
incomplete due to the presence of cache hits, proxy servers, and the
stateless nature of the HTTP transport protocol, which make the task of
identifying users and their visits less precise. They are ambiguous because
they are transaction oriented and were designed simply to record the access
of each server resource, whereas we need to extract abstractions that are
suitable for Web user characterization analysis. They also lack the business
information required as a frame of context. Therefore, a careful
preprocessing stage is required to clean and prepare the data for Web
traversal analysis. We will expand on these shortcomings more in
Chapter 2, and in Chapter 4 we will develop a sequential pattern mining
algorithm suited for the long sequences and large data sets that can be
generated from the Web log files.
In those Chapters we will explain in more detail our motivations for
each part of the work.
1.3 Thesis Outline
This thesis is organized into 6 chapters. Chapter 2 introduces Web
Usage Mining and the most recent research related to our work. In
Chapter 3 we present our implementation of the multidimensional
analysis of Web log files. In Chapter 4 we analyze the existing sequential
pattern mining algorithms and develop a fast and efficient algorithm for
path traversal analysis of Web logs. In Chapter 5 we present our
experimental results and discuss the limitations and strengths of our
approach. In Chapter 6 we summarize the technical contributions of
this thesis and present possible directions for future research.
Chapter 2
Related Work
Web Mining is the application of data mining techniques to content or
activities related to WWW. Zaïane in his PhD thesis [Z99] classifies Web
Mining into three domains: Web Content Mining, Web Structure Mining
and Web Usage Mining. Web Content Mining is the process of extracting
knowledge from the content of Web sites, for example the contents of
documents or their descriptions. Web Structure Mining, on the other
hand, uses the links and references in Web pages to infer interesting
knowledge, such as identifying authoritative pages and hub pages. Authoritative
or high-quality pages are inferred from the number of hyperlinks pointing
to the page from different page authors on the Web. A hyperlink in this
case is considered the author's endorsement of the other page. The
collective endorsement of a given page by different authors may indicate
the importance of the page. A hub, on the other hand, is a Web page or
set of Web pages that provides a collection of links to authoritative pages.
Web Usage Mining, also known as Web Log Mining, mines Web access
logs for interesting patterns in WWW traffic. In this chapter, we will
review the recent advances in the field of Web Usage Mining. In Section
2.1 we introduce the available data sources for such a study. Section 2.2
identifies the published Web usage terms. The required preprocessing
steps are presented in Section 2.3. Finally, we study some of the
interesting works done in Web Usage Mining in Section 2.4.
2.1 Data Sources
Any type of Web usage mining requires having an accurate picture of the
WWW traffic. This section explores the available data sources and their
properties. As shown in Figure 2.1 the data sets commonly used for Web
usage mining are collected at the server-level, proxy-level or client-level.
Each data source differs in terms of format, accuracy, scope and method
of implementation.
Figure 2.1: Web Usage Data Sources
Client-level logs hold the most accurate account of user behavior
over the WWW. They can be implemented using applets, JavaScript,
cookies and modified browsers, and are by far the most controversial
methods in terms of user privacy issues [SZA+97, EV97]. If a client
connection is through an Internet Service Provider (ISP) or is located
behind a firewall, its activities may be logged at this level. The primary
function of proxy servers and firewalls is to serve either as a measure of
security to block unwanted users or as a cache resource to reduce
network traffic by reusing their most recently fetched files. Their log files
may include many clients accessing many Web servers. In the log files,
their client request records are interleaved in their received order. The
process of logging is automatic and requires less intervention compared
to client-level logging. Its format is dependent on the logging software.
Its accuracy is however diminished by the client-level cache as some
requests are not received by the proxy and are served from the most
recently fetched files stored at the client computer. Server-level logs are
the most commonly used source for Web usage mining. Most Web
servers provide the option of storing log files in the Common Log Format
(CLF), Extended Common Log Format (ECLF) or proprietary format
[LOU95]. Table 2.1 describes the fields in both formats.
Table 2.1: Web Log Field Description

Term         Description
Remote host  remote host name or IP address
Rfc931       remote login name of the client
Auth. user   server-authenticated client name
Date         date and time of the request
Offset       local time offset from Greenwich time
Method       method of the request (GET, POST, HEAD, etc.)
URI          full page address or request as it came from the client
Protocol     HTTP communication protocol used by the client
Status       HTTP server status sent to the client
Bytes        number of bytes transferred
Referrer     URI that the request originated from
Agent        OS and browser software at the client

Every CLF log entry has the following format:
HostID rfc931 authuser [date offset] “method URI protocol” status bytes
ECLF logs additionally contain the referrer and agent fields. The following
is a sample ECLF log entry; note that the rfc931 and authuser fields are
empty (indicated by the value ‘-’):
142.107.30.180 - - [25/Mar/1999:23:01:41 –800] “GET my.html HTTP/1.1” 200
4219 /Index.html Mozilla(IE4.2, Win95)
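In practice, such entries are parsed field by field before any analysis. The
following is a minimal, illustrative Python sketch (not part of the original
thesis) of parsing an ECLF entry with a regular expression; the field names
follow Table 2.1, and the pattern assumes well-formed entries without
embedded spaces in the URI:

import re

# ECLF entry: host rfc931 authuser [date offset] "method URI protocol"
#             status bytes referrer agent
ECLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: (?P<referrer>\S+) (?P<agent>.+))?$'
)

def parse_eclf_line(line):
    """Return a dict of ECLF fields, or None if the line does not match."""
    match = ECLF_PATTERN.match(line.strip())
    return match.groupdict() if match else None

entry = parse_eclf_line(
    '142.107.30.180 - - [25/Mar/1999:23:01:41 -800] '
    '"GET my.html HTTP/1.1" 200 4219 /Index.html Mozilla(IE4.2, Win95)'
)
print(entry["host"], entry["uri"], entry["status"])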
However, there exist some drawbacks to using server-level Web logs:
• Due to client and proxy-level caches, not all hits are captured in
  the server-level Web logs.
• Page view time duration may be inaccurate. If a hit is answered by
  a client or proxy cache (a cache hit), then the view time of the
  previous page will be interpreted as longer than it actually was.
• When the user id is not available and clients are behind a proxy,
  many hits will be recorded with the same host name (the proxy’s host
  name). As a result, page views may appear erratic, with very short
  viewing times.
One method to reduce cache hits is to ensure that the page view name
(URL) or the request line is unique across different visits. This can be
accomplished by dynamically generating each Web page and appending a
unique session id to each hyperlinked Web page name, as sketched after
this paragraph. For a complete
discussion of the shortcomings of the current log standard and potential
solutions the interested reader can refer to [P97].
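As an illustration only (not from the thesis), a short Python sketch of
appending a session id to hyperlinks so that every visit produces a distinct
request line; the parameter name 'sid' is a hypothetical choice:

from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def tag_link(href, session_id):
    """Append a session id to a hyperlink so repeated visits generate
    distinct request lines, reducing undetected cache hits."""
    parts = urlparse(href)
    query = dict(parse_qsl(parts.query))
    query["sid"] = session_id          # 'sid' is an illustrative name
    return urlunparse(parts._replace(query=urlencode(query)))

print(tag_link("/catalog/books.html", "a91f3c"))
# /catalog/books.html?sid=a91f3c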
2.2 Web Usage Terms
In order to create some consistency in the discussions that follow, we
adopt the Web term definitions published by the World Wide Web Consortium's
Web Characterization Activity (W3C WCA) [WCA99] that are relevant
to Web usage mining. They are listed in Table 2.2.
A user is identified as an individual or an automated application,
such as a Web Crawler, that is accessing files from different servers over
WWW. As mentioned in the previous section, due to proxy servers many
users may have the same host name and in most cases the rfc931 and
authuser fields in Web logs may be empty. In such cases, only the host
name, in combination with the agent information if available, is used to
identify users.
Term            Description
Server          A role adopted by an application supplying resources.
Proxy           An intermediary application acting both as a server and a client.
Client          A role assumed by an application when retrieving resources from the server.
User            A person using a client application to interact with and retrieve resources from the server.
User session    Set of user clicks across one or more servers.
Server session  Set of user clicks as recorded by a single server (also known as a visit).
Episode         Subset of related user clicks in a user session.
Web page        Collection of resources identified by a single URI.
Page view       The rendered Web page in a specific client application.
Click-stream    A sequential series of page views by a user.

Table 2.2: Web Usage Terms and Definitions
A page view consists of a set of resources, such as one or more
html files, graphics, etc., used to render a requested URI in the user's
browser. Each resource is logged separately in the Web log as it is
delivered. A page view is usually the result of a user's single mouse
click on a hyperlink. Most page views contain multiple frames, html and
graphics files, some of which may be used in multiple page views. This
makes the job of identifying a page view more difficult. In addition, more
and more Web pages are generated dynamically using, for example, a CGI script
with different parameters. In these cases, we must also take into
consideration the parameters of the call in order to map each request to
a different page view.
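A hedged illustration (not from the thesis) of mapping a requested URI to
a page-view identifier, where dynamically generated pages are distinguished
by their query parameters; the example parameter names are hypothetical:

from urllib.parse import urlparse, parse_qsl

def page_view_id(uri):
    """Map a requested URI to a page-view identifier; CGI calls with
    different parameters map to different page views."""
    parts = urlparse(uri)
    params = tuple(sorted(parse_qsl(parts.query)))
    return (parts.path, params)

print(page_view_id("/cgi-bin/2.cgi?course=354&week=3"))
print(page_view_id("/cgi-bin/2.cgi?course=354&week=4"))  # a different page view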
A click-stream is defined as a time-ordered list of page views.
A user's click-stream over the entire Web is called the user session,
whereas the server session is defined as the subset of clicks made by a
user on a particular server, also known as a visit. As mentioned
in the previous section, in the presence of cache hits it is difficult to
reconstruct an accurate picture of the user’s click-stream.
Within the click-stream of a visit or a user session, an episode is
defined as a set of sequentially or semantically related clicks. This
relation depends on the goals of the study.
2.3 Data Preprocessing
One of the important core steps of knowledge discovery is data
preprocessing. The main goal of this step is to create minable objects for
knowledge discovery despite the presence of ambiguities and
incompleteness in data. This step is highly data-source dependent. The
techniques used to overcome these shortcomings may vary greatly from
one data source to the other. Therefore, in this section we focus on
techniques used to preprocess server-level Web access log files, namely
CLF and ECLF. A sample ECLF Web log is shown in Table 2.3.
In the previous section we mentioned some of the shortcomings of
using Web access log files with respect to W3C WCA Web usage terms.
In general, there are three tasks to be performed in the preprocessing
stage: User, session and, optionally, episode identification.
Table 2.3: Sample ECLF Web log file
(fields: host name, rfc931, auth. user, date, “method URI protocol”,
status, bytes, referrer, agent)

 1  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
 2  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET A.gif HTTP/1.1” 200 2890 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
 3  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET B.html HTTP/1.1” 200 1300 http://www.cs.sfu.ca/I.html Mozilla/3.0 (IE5.0; WinNT)
 4  142.59.243.146 - - [08/Feb/2001:19:51:06 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.04 (Win95,I)
 5  142.59.243.146 - - [08/Feb/2001:19:51:07 –800] “GET A.gif HTTP/1.1” 304 - http://www.cs.sfu.ca/index.html Mozilla/3.04 (Win95,I)
 6  142.59.243.146 - - [08/Feb/2001:19:51:07 –800] “GET D.html HTTP/1.1” 200 1814 http://www.cs.sfu.ca/I.html Mozilla/3.04 (Win95,I)
 7  142.59.243.146 - - [08/Feb/2001:19:51:04 –800] “GET C.html HTTP/1.1” 200 1210 http://www.cs.sfu.ca/D.html Mozilla/3.04 (Win95,I)
 8  142.59.243.146 - - [08/Feb/2001:19:52:24 –800] “GET E.html HTTP/1.1” 404 - - Mozilla/3.0 (IE5.0; WinNT)
 9  142.59.243.146 - - [08/Feb/2001:19:52:25 –800] “GET I.html HTTP/1.1” 200 1152 http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
10  142.59.243.146 - - [08/Feb/2001:19:52:34 –800] “GET A.gif HTTP/1.1” 304 - http://www.cs.sfu.ca/index.html Mozilla/3.0 (IE5.0; WinNT)
11  142.59.243.146 - - [08/Feb/2001:19:52:34 –800] “GET H.html HTTP/1.1” 200 2762 http://www.cs.sfu.ca/I.html Mozilla/3.04 (Win95,I)
12  142.59.243.146 - - [08/Feb/2001:19:52:45 –800] “GET B.html HTTP/1.1” 200 1300 http://www.cs.sfu.ca/A.html Mozilla/3.04 (Win95,I)
13  142.59.243.146 - - [08/Feb/2001:19:53:01 –800] “GET 2.cgi? HTTP/1.1” 301 152 - Mozilla/3.04 (Win95,I)
14  142.59.243.3 - - [08/Feb/2001:19:53:06 –800] “GET G.html HTTP/1.1” 200 1680 http://www.cs.sfu.ca/B.html Mozilla/3.0 (IE5.0; WinNT)
15  142.59.243.3 - - [08/Feb/2001:19:53:18 –800] “GET F.html HTTP/1.1” 200 2937 http://www.cs.sfu.ca/B.html Mozilla/3.0 (IE5.0; WinNT)
2.3.1 User Identification
In the best case, we can rely on the values in fields rfc931 and/or
authuser to accurately identify a user. But in most cases, fields rfc931
and authuser are empty. In the absence of such information, host name
and user agent information are the only available choices to identify a
user. This assumes that every user has a unique IP address and that
only one type of browser is operated from it. However, this is not
necessarily true. As stated in [SCD+00], the following cases break this
assumption:
• As shown in Figure 2.1, several users may access a server
  through a single proxy, potentially at the same time.
• Some ISPs or privacy tools randomly assign an IP address to
  each user’s request.
• Some repeat users access the Web each time from a different
  machine.
• A user may operate many browsers of different types on the same
  machine, potentially at the same time.
Given the sample Web log shown in Table 2.3 and using the combination
of host name and agent fields, IP address 142.59.243.146 is responsible
for two users (lines 1-3, 8-10 and 4-7, 11-13) and IP address
142.59.243.3 for one (lines 14-15).
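A minimal Python sketch of this user-identification heuristic (an
illustration, not the thesis's code); each log entry is assumed to be a dict
with 'host', 'agent' and a numeric 'time' field, e.g. as produced while
parsing the log:

from collections import defaultdict

def identify_users(entries):
    """Approximate users by the combination of host name and agent
    string, as described above."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["host"], entry.get("agent", ""))].append(entry)
    for clicks in users.values():
        clicks.sort(key=lambda e: e["time"])  # time-order each click-stream
    return users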
2.3.2 Session Identification
Once a user has been identified or approximated, the ordered click-
stream of each user must be divided into server sessions or visits. This is
done by identifying the last page view of each visit. Without an explicit
sign-out event or access to the complete user session, page view time can
be used to determine whether a user has continued the same visit by
selecting another page. Catledge and Pitkow [CP95] have studied user
page view time over the WWW and have recommended thirty minutes of
inactivity as an indication of a sign-out event. Note that since a user may
not be interested only in the pages of one site or potentially leave and re-
enter the same site at different intervals, session identification may also
become a difficult task.
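The thirty-minute rule can be sketched as follows (an illustration under
the assumption that each click carries a numeric 'time' field in seconds):

def split_sessions(clicks, max_gap=30 * 60):
    """Split one user's time-ordered click-stream into visits, starting a
    new visit after more than max_gap seconds of inactivity (thirty
    minutes, following [CP95])."""
    sessions, current = [], []
    for click in clicks:
        if current and click["time"] - current[-1]["time"] > max_gap:
            sessions.append(current)       # previous visit ended by inactivity
            current = []
        current.append(click)
    if current:
        sessions.append(current)
    return sessions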
2.3.3 Episode Identification
Finally, after the user session has been identified, we can optionally
break it into semantically meaningful subsets called episodes. Maximal
Forward Reference is one such relation [WYB98]. It is defined as the set of
page views traversed up to the page view where a backward reference is
made. For example, for a user session that contains the ordered pages
A-B-A-C-D-C, the maximal forward references for the session would be
A-B and A-C-D. In an e-commerce site an episode can also be defined using page
type and site structure, or particular action, such as clicking an
advertising icon or adding an item to the customer's basket. This is based
on the motivation of finding the sequence of pages that lead to a particular
action or page. Cooley et al. [CMS97] also introduce the Reference Length
Module and the Time Window Module as means of episode identification. The
former only allows a set maximum view time for each page, based on its
classified page type of navigation or content. A new episode is started
when the view time of a page exceeds its maximum set view time. The
latter, on the other hand, only allows a maximum time window for each
episode. It is based on the assumption that a meaningful episode has an
average time length associated with it.
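The maximal forward reference split can be sketched as follows (an
illustration, not the thesis's code; it assumes an episode restarts from the
revisited page, consistent with the A-B-A-C-D-C example above):

def maximal_forward_references(page_views):
    """Split a visit's ordered page views into maximal forward references:
    a forward path ends as soon as a page already on it is revisited."""
    episodes, path, extending = [], [], False
    for page in page_views:
        if page in path:
            if extending:
                episodes.append(list(path))       # forward path ended here
            path = path[: path.index(page) + 1]   # back up to the revisited page
            extending = False
        else:
            path.append(page)
            extending = True
    if extending:
        episodes.append(path)                     # last forward path of the visit
    return episodes

print(maximal_forward_references(["A", "B", "A", "C", "D", "C"]))
# [['A', 'B'], ['A', 'C', 'D']]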
2.4 Web Usage Mining
Having introduced the data sources, the terms and the required
preprocessing steps for Web Usage Mining, we now turn our attention to
the recent advances in the field of Web Usage Mining. In recent years
there has been an increasing amount of research in
Web usage mining [MT96, YJM+96, C97, CMS97, CPY98, ZXH98,
WYB98, JJK99, BBA+99, CMS99, MPC99, MPT99, PHM+00, BL00a,
BL00b, KB00, MPT00, SCD+00]. The main motivation of these studies is
to get a better understanding of the reactions and motivations of
customers who shop through the electronic premises of a company on the WWW
or of users who are simply browsing these premises. Some studies also
apply the mining results to improve the design of Web sites, analyze
system performance and network communications or even build adaptive
Web sites. In general, there are two main goals in the application of
discovered knowledge in Web usage mining: General Access Pattern
Tracking for understanding access patterns and trends and Customized
Usage Tracking for adapting and personalizing browsing experience for
the users. The former is the goal of this study.
One of the first works in Web log analysis belongs to [YJM+96],
where a non-sequential, weighted vector of the pages visited by a user is
used to assign the user to an existing cluster of users. The system then
dynamically suggests links based on the set of pages visited by other
users in the same cluster. The authors of [MT96] consider a Web log as a
single sequence of events and propose an algorithm to discover frequent
event sequences, where fields in a Web log record are different attributes
of an event. The authors in [CMS97, CPY98, WYB98] present algorithms
to identify episodes in user sequences as discussed in the previous
section and apply association and sequential pattern mining algorithms
respectively to mine the user episodes. In Chapter 4 we will present in
more detail the related work in sequential pattern mining.
Since the size of Web log files grows quite rapidly, normal
algorithms may not be scalable. In addition, the contents of most web
sites change over time, rendering some parts of Web logs irrelevant for
the current analysis. Also the goal of analysis may change over time with
the business needs. Therefore, on the one hand, there is a need,
perhaps both in terms of memory and disk space, for scalability
improvements, and, on the other hand, for the introduction of
constraints into mining algorithms, such as time constraints, concept
hierarchies and frequent pattern templates, for discovering relevant and
correct knowledge.
One scalable and flexible approach to mining Web log files is with
the use of data warehouses, data cubes and OLAP techniques. On-Line
Analytical Processing (OLAP) and data cubes [GBL+96] have recently
become accepted as powerful tools for strategic analysis of databases in
business settings. Han et al. [ZXH98] have shown that some of the
analysis needs of Web usage data can be done using data warehouses
and OLAP techniques. In the preprocessing stage of this approach, the
parser does not filter out any records. All fields of the Web log records
are loaded into a relational table. Since Web log data tend to be very large,
some level of summarization makes ad hoc query and analysis feasible,
using OLAP operations such as roll-up and drill-down over significant
amounts of data. In Chapter 3 we will present our work on
multidimensional analysis of Web log files.
There are also several commercially available Web server log
analysis tools, such as [WEB95, GEN96, MIN00]. Webtrends provides
limited reporting mechanisms, such as user and page access statistics.
NetGenesis and Easyminer, on the other hand, are more comprehensive
and include sequential and clustering algorithms as well as various
visualization packages in their products. However, they do not allow
integration of Web server log data with existing business data as in
[ZXH98]. Their algorithms are questionable in terms of scalability and do
not provide the facilities for verification-driven data mining such as
OLAP.
Chapter 3
Discovering Web Access Patterns Using OLAP
In this chapter we present the design and implementation of our Web log
data mining system for multi-dimensional analysis of Web log data. It is
a simple yet scalable and effective method for such an analysis. For this
purpose, we use the following data mining techniques: data cubes
[GBL+96], On-Line Analytical Processing (OLAP) [C93, CD97], and a
partially automated discovery-driven method for data exploration.
3.1 Introduction
Web log data cubes are constructed to give the user the flexibility of
viewing data from different perspectives and performing ad hoc analytical
queries. A typical Web log ad hoc analysis example is querying how
overall usage of the Web site has changed in the last quarter, testing whether most
server requests have been answered, hopefully with expected or low level
of errors. If some weeks or days are worse than the others, the user
might navigate further down into those levels, always looking for some
reason to explain the observed anomalies. At each step, the user might
add or remove some dimension, changing their perspective, select subset
of the data at hand, drill down, or roll up, and then inspect the new view
of the data cube again. Each step of this process signifies a query or
hypothesis, and each query follows the result of the previous step.
3.1.1 Web Log Dimensions
Figure 3.1 depicts a Starnet query model of a Web log data cube. This
starnet shows ten dimensions and four measures that can be constructed
directly from the Web log; the hierarchies built for the file dimension (URL
of the requested resource) are based on the site’s directory structure, while
the file type, action and host address dimensions have predefined
hierarchies.
Figure 3.1: Starnet query model of a Web log data cube
For more business-oriented analysis, attributes from other data
sources, such as marketing and sales databases, can be added to this
model. Additional attributes may include user demographic data, such as
address, age, race, education and income, and business data, such as
product, revenue and marketing strategies, or class, conference, tutorial
and instructor.
[Figure 3.1 shows a starnet with the dimensions Time (minute, hour, day,
week, year), File (directory levels 1–3, file name), File type (file extension
and levels), Host/IP address (host name and levels 1–3), Method (method
name), Protocol (protocol name), Server status (status code), User (user id),
Agent (agent) and Action, together with the measures byte size, hits, view
time and session count.]
3.1.2 Motivation
OLAP data cubes, especially those built from Web logs, are often too large to browse for
the purpose of decision support, even after some level of summarization
of the raw data. In the paradigm of assisted discovery-driven exploration
of data, researchers focus on full or partial automation of data
exploration process [K95, SAM98, S99, S00, SS00]. Often an analyst
explores the OLAP data cubes looking for small regions of data that are
indicative of rare events, exceptions or new opportunities. But
sometimes, the analyst explores the data cubes to simply monitor strong
existing trends or to look for new ones. Typically, the user starts by
selecting a set of dimensions at a certain aggregation level, and visually
inspects the data or its graphs. At this point the user may be modifying
his/her hypothesis or needs, or forming new ones. Based on that, the
user changes his/her footprint on the Starnet model and inspects the
new view of the data cube again, and so on. This is a daunting task and,
without help, the user can easily overlook a potential discovery or take
the wrong turn at any point.
3.1.3 Contribution
During such explorations, a user normally can only analyze and
understand a few dimensions at a time. Therefore, most of the existing
works aim at fully automating the process by pre-mining the data. They
provide different statistical models, such as deviation from the norm using
multivariate analysis, to find exceptions, taking into account the effects
of multiple dimensions and their hierarchies. However, most users are
not statistics experts and may not understand the meaning of such
exceptions.
Our aim is to partially automate the process by providing simple aids
based on first-order statistics functions, so that the user can make more
informed decisions at each step of the exploration. We have developed a
Cell Statistics Annotation Add-in component that extends the functionality
of Microsoft Excel’s PivotTable and has been integrated into the OLAP
mining system of DBMiner [HFW+96, HCC+97].
3.1.4 Data set
In this work we have selected the Web access logs gathered by the Computer
Science Department of the University of California, Berkeley as our test data
set. The Web logs span the first two weeks of June 1999 and
display the following characteristics: 165 MB in size, 47,472 unique host
names, 48,600 unique URLs, 296 different file types (extensions) and
900,037 requests. Using the file extensions we have grouped the files
into 9 different categories, including audio, compressed, document, dynamic,
html, images, java and video. For those IP addresses that have been
resolved into host names, we have grouped them according to their
rightmost two or three characters into different Internet domains. The
following 7 dimensions and two measures have been used to construct a
Web log OLAP data cube using Microsoft OLAP Server: time, file, file type,
host, method, protocol and server status dimensions and bytes and hits
measures.
3.2 Partially Automated Exploration of Web Log Data Cubes
Motivated by the widespread popularity of Microsoft Excel, its PivotTable
and Microsoft OLAP Server, which together facilitate powerful multidimensional
analysis of data warehouses, we have developed a Cell Annotation Add-in
component for Microsoft Excel.
3.2.1 Architecture
An Add-in component is written in VBA (Visual Basic for Applications) for a
particular Microsoft Office application; once loaded, the component extends
the functionality of that application.
Figure 3.2: Architecture for assisted exploration of Web log data cube
As shown in Figure 3.2, in this architecture the cell annotation add-in
component exists only within the confines of Microsoft Excel application,
relying on the objects that Excel exposes and its own GUI objects to
communicate with the user.
[Figure 3.2 shows the pipeline: Web log files pass through data cleaning &
transformation, filtering and data integration into a database and data
warehouse; data cubes are served by the OLAP Server or stored as cub
files; the PivotTable API, the Excel user GUI API and the Add-in GUI API
connect them to the Cell Statistics Annotation Add-in.]
PivotTable is another Microsoft object that
allows a multidimensional view of data cubes. We use the embedded
PivotTable object within Excel and its API to query the underlying data
sources, be they OLAP Server data cubes or cub files (off-line data
cubes). Currently, the tasks of cleaning Web logs and creating data
cubes are done by separate processes.
3.2.2 Implementation
The Add-in component provides two types of functionalities: “statistics on
current page” and “assisted drilling”. Figure 3.3 shows the interface for
these functionalities.
Figure 3.3: Partially automated assisted data exploration
Current page statistics provides a summary of different regions of the data
(row, column or plane) that is displayed for the current footprint of
starnet query model. The summaries include min-max, median,
standard deviation, variance and average by row, column and plane. The
summary results and annotated cells, where applicable, help the user to
understand the current data. Assisted drilling provides the means for the
user to choose a drill path. The statistical functions included for this
functionality include: min, max, min median, max median, min variance,
max variance, min std dev, max std dev, min average, and max average.
Once the user has selected a statistical function and dimensions for
drilling, the component performs the drill function on all the members of
the selected dimensions and uses the data in that plane to calculate the
value for the chosen statistical function. By comparing all the calculated
values, Assisted drilling then highlights up to five cells, representing the
potential drill paths.
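The assisted-drilling step can be illustrated with a small Python sketch
(the actual add-in is written in VBA against the PivotTable API; this is only
an illustration, and the hit counts used below are hypothetical). Each
candidate drill path is represented by the list of cell values it would
produce:

from statistics import pstdev

def rank_drill_paths(candidate_planes, stat=pstdev, top_k=5):
    """Score each candidate drill path by the chosen statistic over the
    cell values it yields, and return the top paths to highlight."""
    scored = [(stat(values), path)
              for path, values in candidate_planes.items() if values]
    scored.sort(reverse=True)
    return [path for score, path in scored[:top_k]]

planes = {"200": [105322, 83264, 18221],
          "403": [2281, 763, 93],
          "404": [7219, 3712, 631]}
print(rank_drill_paths(planes, top_k=2))   # paths with maximum std deviation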
3.2.3 Illustrative Example
Figure 3.4 shows part of Berkeley’s Web log cube. In it, the Excel
table shows part of the cube with 2 dimensions: DateTime and
ServerStatus, where DateTime is at level Day. Server status 200 indicates
a successful request, while 403 and 404 mean a request for a forbidden
URL and a request for a URL that was not found, respectively.
Figure 3.4: Hit counts over a two-week period
When browsing this view, the user may pay special attention to areas
that display anomalous trends. We can easily notice that hits during week 24
start much lower than in week 23 but increase sharply over the next two
days and then decrease dramatically. Using the program, we gather the
first order statistics of the current view by column which are summarized
in the following table.
Function    200                403          404
Min         18,221             93           631
Max         105,322            2,281        7,219
Median      83,264 & 83,245    763 & 782    3,712 & 3,810
Average     75,906.2           840.8        3,732.1
Std. dev.   27,754.8           585.0        1,654.7
Table 3.1: First order statistics by column
We can see that the data in all columns are skewed and that the hit counts
for server statuses 403 and 404 are unusually high on day 7 relative to
successful hits. Instead of drilling down on all available paths and
inspecting their standard deviation, the user can invoke the program to
find them. Figure 3.5 highlights the top 5 such drill paths.
Figure 3.5: Top drill paths using maximum standard deviation
Drilling down along the DateTime dimension over day 7 yields 72
different hit counts as shown in Figure 3.6. Notice how the error hits are
high over a period of three hours. This shows how our partially
automated data exploration method, even though not very complex, can
alleviate the problems associated with data exploration.
Figure 3.6: Request hits over 24 hours of day 7
3.3 Summary
Many Web log analysis tools only provide a simple set of statistics, such
as hit counts and distributions based on time, file and geographic
regions. Wwwstat (http://www.ics.uci.edu/pub/websoft/wwwstat) and
Analog (http://www.statlab.cam.ac.uk/~sret1/analog) are among this
type of tools. We believe this type of analysis still provides some benefit
to the user. Therefore, in this chapter we presented a system that
combines multidimensional data cube, OLAP and partially automated
discovery-driven method for data exploration technologies to provide a
more scalable and powerful tool for such analysis.
Chapter 4
Sequential Analysis of Web Traversal Patterns
After cleaning the Web log files, sequential pattern discovery techniques
[MTV95, SA96] can be applied to discover ξ-frequent patterns. In this
chapter we present a new efficient and scalable sequential pattern-
mining algorithm, called PrefixSpan with Pseudo-Projection.
4.1 Introduction
The mining of sequences of Web traversal patterns aims to discover a set of
attributes, such as Web page views, shared across time among a large
number of user sequences in a given Web access log database. For
example, consider the server-level Web access log database, where the
desired attributes represent users' Web page views, and each record
represents a Web user's time-ordered set of visits over a period of
time. The discovered patterns are the page views most frequently
accessed within and across visits by the Web users. For example:
“The Data Webhouse Toolkit” book’s page view occurs before the
“Data Mining Your Website” book’s page view 71% of the time in
the same visit.
The mining task of finding all frequently occurring traversal ξ-patterns in
a large database is quite challenging. As proposed in [Z98, A01], the
search space can be identified with the powerset 2^(A×P), where A is the set
of desired attributes in the database and P is the length of the longest
ξ-frequent Web user sequence that can be discovered. In popular Web
sites such as Yahoo.com and amazon.com with millions of repeat
customers and thousands of available Web pages, the length of
sequences grows very long, increasing the search space exponentially.
Furthermore, not all records in the database necessarily contribute to the
support of ξ-frequent sequences. Therefore, in large databases there also
exists the problem of minimizing disk I/O. Yet, most sequential pattern
mining algorithms [SA96, CPY98, MCP98] are iterative and step-wise in
nature, where after generating longer candidate sequences from the
previously found ξ-frequent sequences, the database is scanned again in
full to check the support of each candidate sequence. These algorithms
use large pointer based dynamic data structures, such as hash trees and
lists, where the task of choosing the place the memory is allocated from
is left to the operating system. It is observed that the access patterns of
these recursive data structures are highly ordered. Therefore, with
respect to access patterns of the internal data structures, the data
locality typically is not optimal in these algorithms [PZL98].
In this chapter, we present a new algorithm to overcome these
limitations. The main contributions are as follows:
1. We use a technique similar to the vertical id-list database format
   proposed in [Z98], called pseudo-projection, to manage projection
   databases and yet maintain high data locality.
2. We introduce a new measure, called selectivity, to control the physical
   materialization of projection databases, thereby minimizing
   unnecessary disk I/O.
Our algorithm not only strives to minimize I/O cost and maintain high
data locality but also reduces the search space by applying a prefix
pattern-growth method, which will be described in detail in Section 4.5.1.
In comparison with the SPADE algorithm of [Z98], which requires
preprocessing overhead to convert the database to the vertical format
and is only efficient when the database fits into memory, our method can
also handle gigabyte-size databases.
The rest of the chapter is organized as follows: In Section 4.2 we
describe the problem of mining sequences of Web traversal patterns,
followed by Section 4.3, where we look at the related work. We look at
our previously developed algorithm in Section 4.4 and then introduce our
new algorithm in Section 4.5. Finally, we conclude this chapter in
Section 4.6 by summarizing our work.
4.2 Problem Statement
Each Web server registers every request it serves in a Web log file. Each
entry of Web log represents an access to a single Web server resource,
such as an image or html file. As explained in Chapter 2, a Web log file
generally follows the Common Log Format (CLF) [LOU95]. In association
rule mining [AIS93, AS94], a transaction is defined as a set of items
bought by a customer in a single purchase. The sequential pattern
mining later introduced in [AS95, SA96] defines a sequence as a time-
ordered set of transactions, whereas in sequential mining of Web
traversal patterns, each Web log entry is a separate customer
transaction. Like Cooley in [C00], we identify a user visit as a set of page
views that are sufficiently close over time, using a maximum time gap
(Δmax t) specified by the user. We identify a page view as an html or a
dynamically generated file that is sufficiently far apart in time from the
previously identified page view, using a minimum time gap (Δmin t)
specified by the user. A Web access sequence is then defined as a
time-ordered set of visits.
Definition 4.2.1: Let R = {r_1, r_2, …, r_n} be the set of ids for the available
resources in a Web site. Let P = {p_1, p_2, …, p_m} be the set of possible
events or page views identified in a Web site, where p_i = {r_i1, r_i2, …, r_is}
and r_ij ∈ R for 1 ≤ i ≤ m, 1 ≤ j ≤ s. Let L = {l_1, l_2, …, l_q} be the set of log
entries (Web access transactions) in the server Web access log file.
Without loss of generality, we define each l ∈ L as a triple
l = (userid, time, resourceid), where l.resourceid ∈ R.
In an ideal situation, we can detect each user’s page view from the server
Web access log. However, due to the presence of cache-hits not all hits
are recorded in L. Therefore, we need to approximate a page view.
Definition 4.2.2: Let E = {e_1, e_2, …, e_m} be the set of identified Web log
page views or events, where each e_i ∈ E is defined as a triple
e_i = (userid, time, hits = (l_i1, l_i2, …, l_iu)) such that, for 1 ≤ i ≤ m and
1 ≤ j ≤ u: l_ij ∈ L, l_ij.userid = e_i.userid, l_i(j-1).time ≤ l_ij.time,
l_ij.time − l_i(j-1).time ≤ Δmin t, and e_i.time = max_{1≤j≤u}(l_ij.time). Let a
user visit be defined as a triple v = (userid, time, (e_1, e_2, …, e_q)), where,
for 1 ≤ i ≤ q: e_i.userid = v.userid, e_(i-1).time ≤ e_i.time,
e_i.time − e_(i-1).time ≤ Δmax t, and v.time = max_{1≤i≤q}(e_i.time). A visit
with k page views or events is said to have length |v| = k, or to be a k-visit.
Note that page views, not resource hits, are the basic minable units in Web
log visits. Figure 4.1 shows the user visits of our running example.
Definition 4.2.3: A user Web access sequence is a time-ordered set of
visits. It is defined as a triple s = (userid, time, (v_1, v_2, …, v_n)), where,
for 1 ≤ i ≤ n: v_i.userid = s.userid, v_(i-1).time < v_i.time, and
s.time = max_{1≤i≤n}(v_i.time). An access sequence s that consists of k page
views (k = Σ_{j=1..n} |v_j|) is also called a k-sequence. The Web Access
Sequence database is then defined as WAS = {s_1, s_2, …, s_m}, where each
s ∈ WAS is a user Web access sequence.
In comparison with the sequential pattern mining problem defined in
[SA96], user Web access sequences are made up of time-ordered item
sets or elements, where each item is a page view or an event. Items are
not necessarily unique in an element, however: repetition is allowed within
elements of a Web access sequence. For example, (1,1,2) and (1,2) are two
different elements.
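Under these definitions, visits and access sequences can be assembled
from identified page-view events with a few lines of code. The following is
an illustrative Python sketch (not the thesis's implementation); it assumes
each event is a (userid, time, page view) tuple and, for simplicity, groups
consecutive events into one visit whenever they are at most max_gap time
units apart:

from collections import defaultdict

def build_access_sequences(events, max_gap):
    """Group page-view events into visits (Definition 4.2.2) and visits
    into Web access sequences (Definition 4.2.3), using a maximum time
    gap between consecutive events of the same user."""
    by_user = defaultdict(list)
    for userid, time, page in sorted(events, key=lambda e: (e[0], e[1])):
        by_user[userid].append((time, page))

    sequences = {}
    for userid, clicks in by_user.items():
        visits, current = [], [clicks[0][1]]
        for (prev_t, _), (t, page) in zip(clicks, clicks[1:]):
            if t - prev_t > max_gap:
                visits.append(tuple(current))   # a long gap starts a new visit
                current = []
            current.append(page)
        visits.append(tuple(current))
        sequences[userid] = visits
    return sequences

events = [(1, 10, 3), (1, 11, 4), (1, 15, 1), (1, 16, 2), (1, 17, 3)]
print(build_access_sequences(events, max_gap=2))  # {1: [(3, 4), (1, 2, 3)]}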
Definition 4.2.4: A sequence of events V′ = (e′_1, e′_2, …, e′_j) in a visit is a
subsequence of visit V = (e_1, e_2, …, e_k), j ≤ k, and V a super-sequence of
V′, denoted V′ ⊑_v V, if and only if there exist integers
1 ≤ i_1 < i_2 < … < i_j ≤ k such that e′_1 = e_i1, e′_2 = e_i2, …, e′_j = e_ij.
Event sequence V′ is a proper subsequence of V, denoted V′ ⊏_v V, if and
only if V′ ⊑_v V and V′ ≠ V. Similarly, access sequence
S′ = (v′_1, v′_2, …, v′_m) is a subsequence of access sequence
S = (v_1, v_2, …, v_n), m ≤ n, and S a super-sequence of S′, denoted
S′ ⊑_s S, if and only if there exist integers 1 ≤ i_1 < i_2 < … < i_m ≤ n such
that v′_1 ⊑_v v_i1, v′_2 ⊑_v v_i2, …, v′_m ⊑_v v_im. Access sequence S′ is a
proper subsequence of S, denoted S′ ⊏_s S, if and only if S′ ⊑_s S and
S′ ≠ S. Subsequence S′ is also called a prefix of S if and only if v′_i = v_i for
1 ≤ i < m and, for v′_m = (e′_1, e′_2, …, e′_j) and v_m = (e_1, e_2, …, e_k) with
j ≤ k, e′_l = e_l for 1 ≤ l ≤ j. From here on we will refer to the relations ⊑_v,
⊑_s, ⊏_v or ⊏_s simply as ⊑ or ⊏, respectively, when the distinction is
clear from the context.
For example, suppose a given Web user has accessed page views 1, 2, 3,
4, 5 according to the following sequence: S = ((1)(2,3)(4)(5)). This means
that only page views 2 and 3 were accessed in the same visit, and since
|S| = 5, the access sequence is a 5-sequence. Access sequence ((1)(2)) is a
subsequence of S, because (1) ⊑ (1) and (2) ⊑ (2,3). However, the sequence
((2)(3)) is not a subsequence of S. Access sequence ((1)(2)) is also a prefix
of S, but sequence ((1)(3)) is not.
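The subsequence relation and the support measure defined next can be
made concrete with a short Python sketch (an illustration, not the thesis's
code); sequences are represented as lists of visits, each visit a tuple of
page views:

def contains(visit, sub_visit):
    """True if the events of sub_visit occur, in order, within visit."""
    j = 0
    for event in visit:
        if j < len(sub_visit) and event == sub_visit[j]:
            j += 1
    return j == len(sub_visit)

def is_subsequence(sub, seq):
    """True if access sequence `sub` is a subsequence of `seq`
    (Definition 4.2.4)."""
    i = 0
    for visit in seq:
        if i < len(sub) and contains(visit, sub[i]):
            i += 1
    return i == len(sub)

def support(pattern, was):
    """sup(pattern) over a Web access sequence database (Definition 4.2.5)."""
    return sum(is_subsequence(pattern, s) for s in was) / len(was)

S = [(1,), (2, 3), (4,), (5,)]
print(is_subsequence([(1,), (2,)], S))   # True:  (1) in (1) and (2) in (2,3)
print(is_subsequence([(2,), (3,)], S))   # False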
Our aim is to detect frequent Web traversal ξ-patterns. For such a
task, we need to discover every sequence S in the WAS database whose
support value is above the user-defined minimum support value.
Definition 4.2.5: Let the support of access sequence S in the database
WAS = {s_1, s_2, …, s_m} be defined as sup_WAS(S) = |{s_i | S ⊑ s_i}| / m,
also denoted sup(S) when the database is clear from the context. Given a
user-defined minimum support value ξ, sequence S is said to be frequent,
or a ξ-pattern, if the condition sup(S) ≥ ξ holds. A ξ-frequent sequence S
is called maximal if there exists no ξ-frequent sequence S′ such that
S ⊏ S′.
Note that any given sequence in the database only contributes to the
support of sequential pattern S once, even if the subsequence S occurs
many times in that sequence.
Problem Statement: Given a Web access sequence database and a
minimum support threshold ξ, the problem of Web traversal pattern
mining is to enumerate the complete set of ξ-frequent sequences.
Running Example: Consider the Web access sequence database shown
in Figure 4.1 that is used as a running example in this chapter. The
database contains the set of page views {1, 2, 3, 4, 5, 6, 7, 8} identified in
a given Web site, and has 4 users and ten visits in total. The figure also
shows the 0.5-frequent 1-, 2-, 3- and 4-sequences
Figure 4.1: Web Access Sequence Database

User visits:
Userid  Time  Visit
1       10    (3,4)
1       15    (1,2,3)
1       20    (1,2,6)
1       25    (1,3,4,6)
2       15    (1,2,6)
2       20    (5)
3       10    (1,2,6)
4       10    (4,7,8)
4       20    (2,6)
4       25    (1,7,8)

Web access sequence database:
Userid  Time  Sequence
1       25    (3,4)(1,2,3)(1,2,6)(1,3,4,6)
2       20    (1,2,6)(5)
3       10    (1,2,6)
4       25    (4,7,8)(2,6)(1,7,8)

Frequent 1-sequences (sequence : support):
(1) : 4    (2) : 4    (4) : 2    (6) : 4

Frequent 2-sequences:
(1,2) : 3    (1,6) : 3    (2,6) : 4    (2)(1) : 2
(4)(1) : 2   (4)(2) : 2   (4)(6) : 2   (6)(1) : 2

Frequent 3-sequences:
(1,2,6) : 3    (2,6)(1) : 2    (4)(2,6) : 2    (4)(2)(1) : 2    (4)(6)(1) : 2

Frequent 4-sequences:
(4)(2,6)(1) : 2

Lattice of 0.5-frequent sequences (from the empty sequence upwards):
{ }
(1)  (2)  (4)  (6)
(1,2)  (1,6)  (2,6)  (2)(1)  (4)(1)  (4)(2)  (4)(6)  (6)(1)
(1,2,6)  (2,6)(1)  (4)(2,6)  (4)(2)(1)  (4)(6)(1)
(4)(2,6)(1)
It is also clear from the generated lattice of 0.5-frequent sequences that there exist two maximal 0.5-frequent sequences, namely (1,2,6) and (4)(2,6)(1).
4.3 Related Work
The sequential pattern mining problem was first introduced by Agrawal and Srikant in [AS94], in which three algorithms were presented. The algorithm AprioriAll was the only one that found all ξ-frequent patterns and was shown to perform better than or equal to the other two algorithms. The same authors in a later work [SA96] presented the GSP algorithm, which outperforms AprioriAll by up to 20 times.
At nearly the same time, Mannila et al. in [MTV95] presented the problem of finding ξ-frequent episodes in a single long sequence of events. An episode is defined as a set of events occurring in a partially defined order and within a given time bound. In a later work [MT96], they generalized their approach to allow arbitrary unary conditions on individual event attributes, or binary conditions on pairs of event attributes. Their experiments were performed on a Web server-level log file. In contrast to their work, we find ξ-frequent patterns across many sequences of events, and we are interested in finding all of them without any imposed constraints.
Oates and Cohen in [OC96] introduced the problem of detecting strong dependencies among multiple streams of data. Their measure of dependency strength is based on the statistical measure of non-independence. As opposed to our work, the MSDD algorithm proposed in their paper detects unexpectedly frequent or infrequent patterns, and it generates rules rather than ξ-frequent sequences. However, their rule enumeration method is similar in nature to our frequent pattern growth method in PrefixSpan.
4.3.1 GSP Algorithm
Before we proceed, we need to review the GSP algorithm [SA96] in more detail, as it forms the basis of our comparison algorithm. The algorithm is based on the a priori heuristic proposed for association mining [AIS93], which states that no super-pattern of an infrequent ξ-pattern can be ξ-frequent. The two key features of the GSP algorithm are candidate generation followed by a complete pass over the database for support counting. The GSP algorithm is shown in Figure 4.2.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F1 = {frequent 1-sequences}
    F = F1                                  // set of all frequent sequences
    for (k = 2; F(k-1) ≠ ø; k++) do
        Ck = set of candidate k-sequences generated from F(k-1)
        for all sequences s in database S do
            increment the count of every unique α ∈ Ck such that α ⊑ s
        Fk = {α ∈ Ck | α.sup ≥ min_sup}
        F = F ∪ Fk
    return F

Figure 4.2: The GSP algorithm
The first step is to compute the support of each item (the ξ-frequent 1-sequences) in the database. This set is used as the initial seed set for the second step, in which the set of candidate 2-sequences is built. A candidate 2-sequence can consist of a single element with two ξ-frequent items, or of two elements, each having only one ξ-frequent item, such as (1,1) and (1)(1). Therefore, if 10 ξ-frequent 1-sequences were found in the first pass, GSP generates 10 × 10 + 10 × 10 = 200 candidate 2-sequences for Web traversal patterns. Another pass is made over the database to find the actual support of each candidate 2-sequence and prune the non-ξ-frequent 2-sequences. From this point on, any step k uses the ξ-frequent (k-1)-sequences as its seed set and performs the following two phases:
• Candidate generation: for any given pair of ξ-frequent (k-1)-sequences (s, s'), where discarding the first item of s and the last item of s' results in identical sequences, create a new candidate k-sequence by appending the last item of s' to s. All of the k-2 remaining (k-1)-subsequences of the newly created candidate k-sequence must also be ξ-frequent; hence, this step uses the a priori heuristic. The new item is added as a new element if it was a separate element in s'; otherwise it is added to the last element of s. For example, given the three ξ-frequent 2-sequences (1,2), (2,3), and (1,3), the first and the second sequence can be matched by dropping item 1 from the first and item 3 from the second, respectively. Therefore, we can create the candidate sequence (1,2,3). Note that item 3 is added to the same element because it was within the same element in the second sequence (2,3). (A small sketch of both phases is given after this list.)
• Support counting: given the candidate k-sequence set Ck, scan the database once and obtain the actual support of each candidate k-sequence. By discarding the sequences that do not satisfy the minimum required support, the ξ-frequent k-sequence set Fk is formed.
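As a concrete illustration of the two phases, here is a minimal Python sketch (ours; the representation and helper names are our own, not those of [SA96]) of the candidate join for k ≥ 3 and of the subsequence test used during support counting. A sequence is represented as a tuple of element tuples, e.g. ((4,), (2, 6), (1,)).

    def drop_first(seq):
        """Remove the first item of the first element; drop the element if it empties."""
        head = seq[0][1:]
        return ((head,) + seq[1:]) if head else seq[1:]

    def drop_last(seq):
        """Remove the last item of the last element; drop the element if it empties."""
        tail = seq[-1][:-1]
        return (seq[:-1] + (tail,)) if tail else seq[:-1]

    def join_candidates(freq_prev):
        """Join frequent (k-1)-sequences (k >= 3) into candidate k-sequences."""
        cands = set()
        for s in freq_prev:
            for s2 in freq_prev:
                if drop_first(s) != drop_last(s2):
                    continue
                last_item = s2[-1][-1]
                if len(s2[-1]) == 1:                     # item was a separate element in s'
                    cands.add(s + ((last_item,),))
                else:                                    # item shared an element in s'
                    cands.add(s[:-1] + (s[-1] + (last_item,),))
        return cands

    def contains(big, small):
        """True if `small` is an ordered sub-tuple of the element `big`."""
        i = 0
        for x in big:
            if i < len(small) and x == small[i]:
                i += 1
        return i == len(small)

    def is_subsequence(cand, seq):
        """Support-counting test: cand is a subsequence of the database sequence seq."""
        pos = 0
        for elem in cand:
            while pos < len(seq) and not contains(seq[pos], elem):
                pos += 1
            if pos == len(seq):
                return False
            pos += 1
        return True

    # The example from the text: (1,2) and (2,3) join into the candidate (1,2,3).
    print(join_candidates({((1, 2),), ((2, 3),), ((1, 3),)}))   # {((1, 2, 3),)}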
The process continues until no more candidate sequences can be formed. To access candidate sequences efficiently during the support counting phase, the candidate sequences are stored in a hash-tree structure. The algorithm is optimized during the first and second iterations by using array and matrix structures, respectively, to directly access candidate sequences. There are inherent drawbacks to this approach that are independent of the implementation techniques used; they relate to the size of the search space, the number of database scans, and the maintenance of internal data structures.
Lemma 4.3.1: Let n denote the total number of ξ-frequent items. Then the total number of possible k-sequences is 2^(k-1) × n^k.
Proof: First, we prove that a set S of n elements has precisely 2^n subsets, where the empty set and S itself are counted as subsets. The difference between the number of k-element subsets of a set S with n elements and of a set S' with n+1 elements is

    C(n+1, k) − C(n, k) = (n+1)! / (k!(n+1−k)!) − n! / (k!(n−k)!)
                        = n!·((n+1) − (n+1−k)) / (k!(n+1−k)!)
                        = n!·k / (k!(n+1−k)!) = C(n, k−1).

Therefore, the difference between the total number of subsets of S' and of S is

    Σ_{i=0}^{n+1} C(n+1, i) − Σ_{i=0}^{n} C(n, i) = Σ_{i=1}^{n+1} C(n, i−1) = Σ_{i=0}^{n} C(n, i),

which is exactly the total number of subsets of S. Hence the total number of subsets doubles in going from a set S with n elements to a set S' with n+1 elements. Since the total numbers of subsets of sets with 0 and 1 elements are Σ_{i=0}^{0} C(0, i) = 1 = 2^0 and Σ_{i=0}^{1} C(1, i) = 2 = 2^1, respectively, the formula 2^n is true for sets of 0 and 1 elements. Assuming the formula is true for sets of 2, 3, ..., k elements, the proven doubling effect gives the number of subsets of a set with k+1 elements as 2 × Σ_{i=0}^{k} C(k, i) = 2 × 2^k = 2^(k+1).

Second, we count the number of ways in which a k-sequence can be constructed, and then assign items to each arrangement. The number of ways a k-sequence can be constructed is given by the number of ways it can be partitioned into different numbers of visits. A k-sequence has k−1 potential partition points, giving C(k−1, 0), C(k−1, 1), C(k−1, 2), ..., C(k−1, k−1) ways to partition it into 1, 2, 3, ..., k visits, respectively. For example, a 3-sequence has 2 partition points and four possible partitions: (x,x,x), (x)(x,x), (x,x)(x), and (x)(x)(x). Hence, the total number of ways a k-sequence can be partitioned equals the total number of subsets of a set with k−1 elements, which was shown above to be 2^(k−1). We now assign items to each position of each k-sequence. Since repetition is allowed within a visit, we have n choices for each position, or n^k choices per k-sequence. Hence the total number of k-sequences over n frequent items is 2^(k−1) × n^k. ∎
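As a sanity check on this counting argument, the following short Python sketch (ours, for illustration only) enumerates all k-sequences over n items by brute force and compares the result with 2^(k-1) × n^k for small n and k.

    from itertools import product, combinations

    def enumerate_k_sequences(n, k):
        """Enumerate every k-sequence over items {0,...,n-1}: assign an item to each
        of the k positions (repetition allowed), then choose any subset of the k-1
        partition points to split the positions into visits."""
        sequences = set()
        for items in product(range(n), repeat=k):
            for r in range(k):                           # number of partition points used
                for cuts in combinations(range(1, k), r):
                    bounds = (0,) + cuts + (k,)
                    seq = tuple(items[bounds[i]:bounds[i + 1]]
                                for i in range(len(bounds) - 1))
                    sequences.add(seq)
        return sequences

    for n in (2, 3):
        for k in (1, 2, 3, 4):
            assert len(enumerate_k_sequences(n, k)) == 2 ** (k - 1) * n ** k
    print("enumeration agrees with 2^(k-1) * n^k for the cases tested")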
First, by applying the a priori heuristic where possible, we significantly reduce the search space, although we only truly reap its benefits when k > 2. Still, when the number of ξ-frequent 1-sequences is large, say 1000, an a priori-based method generates a very large set of candidate 2-sequences (2,000,000), which requires a significant amount of resources. Moreover, when the ξ-frequent sequences form a very dense lattice, as in DNA databases, the number of candidate k-sequences still grows exponentially despite the benefits of the a priori heuristic.

Second, in each pass k, all k-subsequences of each database sequence are searched for in the candidate set Ck. Therefore, each database pass becomes significantly costlier than the previous one. This becomes even more evident with long database sequences.

Third, maintaining an efficient data structure can be beneficial in both the candidate generation and support counting phases. This is the approach taken by the PSP algorithm [MCP98].
4.3.2 PSP Algorithm
The PSP algorithm still follows the approach of candidate generation followed by support counting. However, the authors use a prefix-tree instead of a hash-tree as the internal data structure. Using our running example in Figure 4.1, Figure 4.3 depicts the internal data structures of GSP and PSP after the candidate 3-sequences have been generated.
Figure 4.3: PSP prefix-tree vs. GSP hash-tree structures
[Figure: the PSP candidate prefix-tree, with intra-item and inter-item edges from the root, and the GSP candidate hash-tree, with hash function h(1)=h(2)=0, h(4)=h(6)=1 and candidates stored in the leaves; both hold the candidate 3-sequences (1,2,6), (4)(1,2), (4)(1,6), (4)(2,6), (4)(2)(1), (2,6)(1), and (4)(6)(1).]
During the support counting phase in GSP, for every k-subsequence of each database sequence, the hash-tree is navigated until a leaf node storing several candidates is reached, and each candidate sequence stored there is then examined for a match. In contrast, using the prefix-tree in PSP, the search for a k-subsequence is terminated as soon as, for some j < k, its j-item prefix is not present in the prefix-tree. Once the leaf node is reached, the only operation to perform is to increment its support value. Other benefits of the prefix-tree are that it requires less storage space and improves efficiency during the candidate generation phase. This approach proves to be more efficient than GSP. For more detail, please refer to [MCP98].
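For illustration, a minimal Python sketch of such a candidate prefix-tree is given below (our own simplification, not the [MCP98] implementation); each edge is keyed by an item together with a flag marking it as an intra-element or inter-element extension, and candidate leaves carry support counters.

    class PrefixTreeNode:
        """Node of a candidate prefix-tree; children are keyed by (item, is_intra)."""
        def __init__(self):
            self.children = {}      # (item, is_intra) -> PrefixTreeNode
            self.support = 0        # incremented at the leaf when a candidate matches

        def insert(self, candidate):
            """Insert a candidate such as ((4,), (2, 6), (1,)); return its leaf node."""
            node = self
            for elem in candidate:
                for offset, item in enumerate(elem):
                    key = (item, offset > 0)       # True = same visit as the previous item
                    node = node.children.setdefault(key, PrefixTreeNode())
            return node

    tree = PrefixTreeNode()
    for cand in [((1, 2, 6),), ((4,), (2, 6)), ((4,), (2,), (1,))]:
        tree.insert(cand)
    # During support counting, a database sequence walks the tree and stops as soon
    # as no child edge matches the next item, then increments the support counter of
    # the leaf it reaches.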
4.3.3 Memory Management
Note that during the candidate generation phase the internal data structures are traversed in a depth-first fashion, whereas during the support counting phase a depth-first search may terminate early if the next item does not exist in the candidate tree. Parthasarathy et al. [PZL98] have shown that memory placement policies that take the internal data structures and their usage into account can enhance performance significantly. The Simple Placement Policy (SPP) of allocating memory from the same region shows an improvement of 35-55% over an algorithm that uses the system's malloc function directly. In addition to SPP, depth-first ordering of the allocated memory adds an overhead of less than 2% of the running time. The gain in locality in small databases with short ξ-frequent sequences is not sufficient to overcome this overhead; but as the database sequences and their ξ-frequent sequences become longer, the payoff quickly adds up in each database pass. Since our experiments use data sets with long sequences and support thresholds that also generate long ξ-frequent sequences, we allocate the internal data structures in a depth-first manner. This forms the basis of our comparison algorithm, the improved PSP algorithm using a depth-first memory placement policy, which we will refer to as PSP+.
4.3.4 SPADE Algorithm
In this section we explain the SPADE algorithm, with major emphasis on lattice decomposition and the vertical database format. For more detail, we refer the reader to [Z98].
All the algorithms explained so far work on a horizontal database format, as depicted in Figure 4.1. In the horizontal format the database contains a list of customers (cid), each with its own list of item sets or elements (tid). The SPADE algorithm, on the other hand, works on a vertical database format, where each k-sequence is associated with an id-list. An id-list contains the customer id (cid) and element id (tid) pairs of all sequences that support its associated k-sequence. Figure 4.4 shows a section of the 0.5-frequent sequence lattice of our running example and its associated id-lists. The support of any k-sequence is determined by intersecting the id-lists of two of its generating (k-1)-sequences that share the same suffix. The number of unique customer ids in the resulting id-list determines the support of the associated k-sequence. Note that the intersection algorithm assumes all items in an element have occurred at the same time.
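As an illustration of the id-list operations just described, the following small Python sketch (ours, simplified; the actual SPADE join distinguishes more cases) intersects the id-lists of (4)(1) and (6)(1) from Figure 4.4 to obtain the id-list and support of (4)(6)(1). An id-list is represented as a list of (cid, tid) pairs, with tid taken from the first element of the occurrence, as in the suffix-class join of the figure.

    def temporal_intersect(idlist_a, idlist_b):
        """Join the id-lists of two (k-1)-sequences sharing the same suffix, e.g.
        (4)(1) and (6)(1) -> (4)(6)(1): keep a (cid, tid) of the first list whenever
        the same cid has a strictly later tid in the second list."""
        result = []
        for cid_a, tid_a in idlist_a:
            for cid_b, tid_b in idlist_b:
                if cid_a == cid_b and tid_a < tid_b:
                    result.append((cid_a, tid_a))
                    break
        return result

    def support(idlist):
        """Support = number of distinct customer ids in the id-list."""
        return len({cid for cid, _ in idlist})

    idlist_41 = [(1, 10), (4, 10)]       # id-list of (4)(1) from Figure 4.4
    idlist_61 = [(1, 20), (4, 20)]       # id-list of (6)(1) from Figure 4.4
    idlist_461 = temporal_intersect(idlist_41, idlist_61)
    print(idlist_461, support(idlist_461))   # [(1, 10), (4, 10)] 2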
An a priori-based algorithm traverses the ξ-frequent sequence lattice using a breadth-first search, where all ξ-frequent (k-1)-sequences are generated before the ξ-frequent k-sequences. However, a depth-first search is also possible.
The ξ-frequent sequence lattice is decomposed into equivalence classes such that each equivalence class can be processed independently. Let [α] denote the equivalence class of the ξ-frequent k-sequence α, such that applying the candidate generation rules described for GSP to its members can generate all ξ-frequent sequences with the prefix α. In fact, both the prefix [Z98] and the suffix [Z00] can be used as the equivalence class relation. Figure 4.4 shows the equivalence classes [(1)], [(4)], and [(6)] using the suffix relation. Note that all ξ-frequent 1-sequences belong to the equivalence class [ø], which has an empty suffix.
Figure 4.4: SPADE Id-list intersection

  Id-lists (cid, tid):
    (1):  (1,15) (1,20) (1,25) (2,15) (3,10) (4,25)
    (4):  (1,10) (1,25) (4,10)
    (6):  (1,20) (1,25) (2,15) (3,10) (4,20)
    (4)(1):  (1,10) (4,10)
    Intersect (6) and (1)        ->  (6)(1):     (1,20) (4,20)
    Intersect (4)(1) and (6)(1)  ->  (4)(6)(1):  (1,10) (4,10)

  Lattice section, grouped into the suffix classes [(1)], [(4)], [(6)]:
    { }
    (1)   (4)   (6)
    (4)(1)   (4)(6)   (6)(1)
    (4)(6)(1)
This process of decomposition can also be applied recursively. For additional details on splitting the search space, we refer the reader to [A01].
Before processing the root equivalence classes from the initial decomposition, however, the id-lists for that class must be read from disk into memory. Obviously, the size of these id-lists shrinks as the sequence length increases. Even if we had enough memory to hold the id-lists of all ξ-frequent 1-sequences, it may be costlier to compute the ξ-frequent 2-sequences using the vertical database format. The question is, what is the suitable level for decomposition? The author suggests following the GSP algorithm with the horizontal database format until the equivalence class of a ξ-frequent k-sequence fits into memory. In the worst case, when following a depth-first approach, we only need to hold two id-lists, one for each of the two consecutive levels of an equivalence class: once we have generated the id-list of the next level, we no longer need the previous level's id-list. There are some drawbacks to this approach, however. First, since we only use two id-lists to generate the next level's ξ-frequent sequences, we do not fully take advantage of the a priori heuristic and may unnecessarily examine a larger candidate sequence search space. Second, due to the limited available memory, this method requires multiple scans of the database to generate the id-lists for the members of the root equivalence class. Suitable memory management techniques may reduce the unnecessary disk I/O.
4.4 FreeSpan: Pattern Growth via frequent item lattice
In our recent study [HPM+00], we developed a projection-based algorithm called FreeSpan, which uses the frequent item lattice to partition the database. In Section 4.4.1, we introduce the idea of pattern growth via the frequent item lattice with an example. The algorithm FreeSpan-1 is then introduced in Section 4.4.2, and its performance is improved by FreeSpan-2, introduced in Section 4.4.3.
4.4.1 Basic Idea
Definition 4.4.1: Let Φ(s) be defined as a function that returns the set of items in sequence s, called the transaction pattern of sequence s. Let Ψ(α, s) = {β | β ⊑ s & Φ(β) = α} be defined as a function that returns the set of subsequences of s that have transaction pattern α.
Figure 4.5 shows the transaction pattern database derived from our
running example database. The FreeSpan algorithm is based on the
following relationship between these two databases.
Observation 4.4.1: A sequence cannot be ξ-frequent if its transaction pattern is not ξ-frequent. The reverse, however, may not be true, as in creating a transaction pattern we discard the ordering and repetition of items in the sequence; therefore, many sequences may contribute to the support of the same transaction pattern.
Our aim is to search the frequent item lattice and use it to partition the database such that the size of the partitions is minimized and each branch can be processed independently.
Figure 4.5: Transaction database of sequence database

  Userid  Time  Sequence                        Transaction pattern
  1       25    (3,4)(1,2,3)(1,2,6)(1,3,4,6)    {1,2,3,4,6}
  2       20    (1,2,6)(5)                      {1,2,5,6}
  3       10    (1,2,6)                         {1,2,6}
  4       25    (4,7,8)(2,6)(1,7,8)             {1,2,4,6,7,8}
At each node of the lattice we only print the ξ-frequent sequential patterns that have that node's transaction pattern. Figure 4.6 shows one possible frequent item lattice for our running example. The question is which frequent item lattice results in greater efficiency? We strive to minimize the size of each database partition.
Figure 4.6: Frequent item lattice of running example

  { }
  {1}   {2}   {6}   {4}
  {1,2}   {1,6}   {2,6}   {1,4}   {2,4}   {4,6}
  {1,2,6}   {1,2,4}   {1,4,6}   {2,4,6}
  {1,2,4,6}
4.4.2 Level by Level FreeSpan Algorithm
In this section we explain the FreeSpan-1 algorithm that traverses the
frequent item lattice level by level, depth-first.
Definition 4.4.2: Let f-list be defined as the list of ξ-frequent items (ξ-frequent 1-sequences) of database S in descending support order. Let Γ1(λ, j, f-list, s) be defined as a function that returns the projected subsequence α ⊑ s such that its transaction pattern Φ(α) contains the items of λ and j, and only items in f-list that have support greater than that of item j. Let the {λ:j}-projected database of S be defined as the set of projected sequences of S with respect to the f-list of S, obtained using the function Γ1(λ, j, f-list, s). We will refer to the {λ:j}-projected database of S as S|ρ, where ρ = λ ∪ {j}, when the distinction between λ and j is not necessary.
In our example database, the f-list equals (1, 2, 6, 4); for example, projecting the first sequence (3,4)(1,2,3)(1,2,6)(1,3,4,6) while retaining only the items 1, 2, and 4 yields the projected subsequence (4)(1,2)(1,2)(1,4). According to this f-list, the complete set of sequential patterns in the database can be partitioned among 4 independent projected databases: the {:1}-projected, {:2}-projected, {:6}-projected, and {:4}-projected databases, as shown in Figure 4.7.
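The projection operation itself is easy to express; the following Python sketch (ours, for illustration) keeps only the allowed items of a sequence and drops visits that become empty.

    def project(sequence, keep_items):
        """Project a Web access sequence (list of visit tuples) onto a set of items:
        drop all other items and any visit that becomes empty."""
        projected = []
        for visit in sequence:
            kept = tuple(item for item in visit if item in keep_items)
            if kept:
                projected.append(kept)
        return projected

    f_list = [1, 2, 6, 4]                              # descending support order
    s1 = [(3, 4), (1, 2, 3), (1, 2, 6), (1, 3, 4, 6)]  # first sequence of Figure 4.1

    # The {:4}-projected database keeps item 4 and every f-list item more frequent than 4.
    keep = set(f_list[:f_list.index(4) + 1])
    print(project(s1, keep))    # [(4,), (1, 2), (1, 2, 6), (1, 4, 6)]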
Figure 4.7: Section of projection databases in FreeSpan-1

  Original DB (f-list = (1, 2, 6, 4)):
    (3,4)(1,2,3)(1,2,6)(1,3,4,6)   (1,2,6)(5)   (1,2,6)   (4,7,8)(2,6)(1,7,8)

  {:1}-Proj DB:    (1)(1)(1), (1), (1), (1)                           f-list = { }      output <(1)>
  {:2}-Proj DB:    (1,2)(1,2)(1), (1,2), (1,2), (2)(1)                f-list = {1}      output <(2)>
  {:6}-Proj DB:    (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = {1,2}    output <(6)>
  {:4}-Proj DB:    (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)                f-list = {1,2,6}  output <(4)>
  {2:1}-Proj DB:   (1,2)(1,2)(1), (1,2), (1,2), (2)(1)                f-list = { }      output <(1,2)>, <(2)(1)>
  {6:1}-Proj DB:   (1)(1,6)(1,6), (1,6), (1,6), (6)(1)                f-list = { }      output <(1,6)>, <(6)(1)>
  {6:2}-Proj DB:   (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = {1}      output <(2,6)>
  {2,6:1}-Proj DB: (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)      f-list = { }      output <(1,2,6)>, <(2,6)(1)>
  {4:1}-Proj DB:   (4)(1)(1)(1,4), (4)(1)                             f-list = { }      output <(4)(1)>
  {4:2}-Proj DB:   (4)(1,2)(1,2)(1,4), (4)(2)(1)                      f-list = {1}      output <(4)(2)>
  {2,4:1}-Proj DB: (4)(1,2)(1,2)(1,4), (4)(2)(1)                      f-list = { }      output <(4)(2)(1)>
Even though, on average, the sequences in the {:4}-projected database can be longer than the sequences in its sibling projected databases (their transaction patterns contain more items), it has fewer sequences, whereas the {:1}-projected database has more sequences but each sequence is potentially shorter. Also note that, although infrequent items are not present in the projected databases, the total space requirement of all projected databases may still be larger than the original database. As can be seen in Figure 4.7, it is possible for a projected database not to shrink; for example, the {:6}-projected, {6:2}-projected, and {2,6:1}-projected databases all have the same size.
The mining process, as shown in Figure 4.8, consists of four main steps performed recursively on each database S|ρ. Note that when ρ = { }, the projection database S|ρ represents the original database.

• Scan the projection database S|ρ once and find its ξ-frequent items that are not in ρ. At the same time, count the support of all sequences with transaction pattern ρ.
• Print all ξ-frequent sequences with transaction pattern ρ.
• If any ξ-frequent sequence was printed and the f-list is not empty, then for each item j in the f-list create a {ρ:j}-projected database. Scan the projection database S|ρ a second time and populate each {ρ:j}-projected database.
• Recursively mine each newly created {ρ:j}-projected database.

Since the mining task performed at each node of the ξ-frequent item lattice is confined to a smaller database that potentially becomes smaller with each recursion, FreeSpan-1 is more efficient than PSP+.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call FreeSpan-1({ø}, 0, S)
    return F

// ρ : projection pattern of projection database S|ρ
// l : length of the projection pattern ρ
// S|ρ : projection database

Subroutine FreeSpan-1(ρ, l, S|ρ)
    for all sequences s in projected database S|ρ do
        increment the count of each sequence α ∈ Ψ(ρ, s) in candidate set Cρ
        increment the count of each item j | j ∈ Φ(s) & j ∉ ρ in candidate set C1
    Fρ = {α ∈ Cρ | α.sup ≥ min_sup}
    F = F ∪ Fρ
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    if Fρ = ø or F1 = ø then return
    f-list = F1 in descending support order
    for all sequences s in projected database S|ρ do
        for each item j | j ∈ Φ(s) & j ∈ f-list do
            add Γ1(ρ, j, f-list, s) to the {ρ:j}-projected database
    for each i ∈ F1 do
        Call FreeSpan-1(ρ ∪ {i}, l+1, {ρ:i}-projected database)
    return

Figure 4.8: The FreeSpan-1 mining algorithm
The major costs of FreeSpan-1 lie in two main tasks performed in each projection database. First, in the function Ψ(ρ, s), it is costly to find all subsequences that match a transaction pattern, and the candidate set Cρ can become very large; that is why the task of finding ξ-frequent sequences has been divided and delegated among the child projection databases. Second, in terms of storage and I/O costs, it is expensive to create each projection database.

In each recursion of FreeSpan-1, the length of the projection pattern increases by 1. To delay projection, an alternative-level projection method is developed in which, in each recursion, the projection pattern grows by 2.
4.4.3 Alternative-Level FreeSpan Algorithm
In this section we explain the FreeSpan-2 algorithm, which traverses the frequent item lattice two levels at a time in each recursion, depth-first.

Similar to FreeSpan-1, after the first scan of the database we construct an f-list; in our example database, the f-list equals (1, 2, 6, 4). Then, we construct a frequent item matrix F to count the occurrence frequency of each pair of items in the f-list, as follows. For f-list (i_1, i_2, ..., i_n), F is a triangular matrix F[j,k], where 1 < j ≤ n and 1 ≤ k < j. F[j,k] contains one counter for the number of occurrences of items i_j and i_k together, i.e. the number of sequences containing any of the subsequences (i_j)(i_k), (i_k)(i_j), (i_j, i_k), or (i_k, i_j).
In our example database, the f-list contains 4 items, so we construct a 4×4 triangular frequent item matrix, where each counter is initialized to 0. We then scan the database a second time to fill the matrix, as follows. For example, using only the 0.5-frequent items of the first sequence, (4)(1,2)(1,2,6)(1,4,6), we increase the F[4,1] and F[4,3] counters by 1, since the subsequences (4)(1) and (6)(4), respectively, occur in it. Figure 4.9 shows the resulting frequent item matrix after the second pass.
Figure 4.9: Frequent item matrix of running example after second pass

        1   2   6
  2  |  4
  6  |  4   4
  4  |  2   2   2
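A minimal Python sketch of this second scan (ours, for illustration; it counts co-occurrence at the sequence level, as described above) builds the triangular matrix for the running example:

    from collections import defaultdict

    f_list = [1, 2, 6, 4]       # frequent items in descending support order
    sequences = [
        [(3, 4), (1, 2, 3), (1, 2, 6), (1, 3, 4, 6)],
        [(1, 2, 6), (5,)],
        [(1, 2, 6)],
        [(4, 7, 8), (2, 6), (1, 7, 8)],
    ]

    F = defaultdict(int)        # triangular matrix: F[(j, k)] with j after k in f_list
    for seq in sequences:
        items = {i for visit in seq for i in visit if i in f_list}
        for j_pos, j in enumerate(f_list):
            for k in f_list[:j_pos]:
                if j in items and k in items:
                    F[(j, k)] += 1

    print(F[(4, 1)], F[(4, 6)], F[(2, 1)])   # 2 2 4, matching Figure 4.9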
Definition 4.4.3: Let Γ2(λ, j, k, f-list, s), where j ≠ k and Order_f-list(j) > Order_f-list(k), be defined as a function that returns the projected subsequence α ⊑ s such that its transaction pattern Φ(α) contains the items of λ, j, and k, and only items in f-list that have support greater than that of item j. Let the {λ:j,k}-projected database of S be defined as the set of projected sequences of S with respect to the f-list of S, obtained using the function Γ2(λ, j, k, f-list, s). We will refer to the {λ:j,k}-projected database of S as S|ρ, where ρ = λ ∪ {j, k}, when the distinction between λ and {j, k} is not necessary.
Figure 4.10: Projection databases in FreeSpan-2

  Original DB:  (3,4)(1,2,3)(1,2,6)(1,3,4,6), (1,2,6)(5), (1,2,6), (4,7,8)(2,6)(1,7,8)
                f-list = (1,2,6,4)                                  output <(1)>, <(2)>, <(6)>, <(4)>

  {:1,2}-Proj DB:   (1,2)(1,2)(1), (1,2), (1,2), (2)(1)             f-list = { }    output <(1,2)>, <(2)(1)>
  {:1,6}-Proj DB:   (1)(1,6)(1,6), (1,6), (1,6), (6)(1)             f-list = { }    output <(1,6)>, <(6)(1)>
  {:2,6}-Proj DB:   (1,2)(1,2,6)(1,6), (1,2,6), (1,2,6), (2,6)(1)   f-list = {1}    output <(2,6)>, <(1,2,6)>, <(2,6)(1)>
  {:1,4}-Proj DB:   (4)(1)(1)(1,4), (4)(1)                          f-list = { }    output <(4)(1)>
  {:2,4}-Proj DB:   (4)(1,2)(1,2)(1,4), (4)(2)(1)                   f-list = {1}    output <(4)(2)>, <(4)(2)(1)>
  {:6,4}-Proj DB:   (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)             f-list = {1,2}  output <(4)(6)>, <(4)(2,6)>, <(4)(6)(1)>
  {6,4:1,2}-Proj DB: (4)(1,2)(1,2,6)(1,4,6), (4)(2,6)(1)            f-list = { }    output <(4)(2,6)(1)>
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call FreeSpan-2({ø}, 0, S)
    return F

// ρ : projection pattern of projection database S|ρ
// l : length of the projection pattern ρ
// S|ρ : projection database

Subroutine FreeSpan-2(ρ, l, S|ρ)
    for all sequences s in projected database S|ρ do
        increment the count of each sequence α ∈ Ψ(ρ, s) in candidate set Cρ
        increment the count of each item j | j ∈ Φ(s) & j ∉ ρ in candidate set C1
    Fρ = {α ∈ Cρ | α.sup ≥ min_sup}
    F = F ∪ Fρ
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    if Fρ = ø or F1 = ø then return
    f-list = F1 in descending support order
    for all sequences s in projected database S|ρ do
        fill up the frequent item matrix
        for each item j | j ∈ Φ(s) & j ∈ f-list do
            increment the count of each sequence α ∈ Ψ(ρ ∪ {j}, s) in candidate set Cρ+1
    Fρ+1 = {α ∈ Cρ+1 | α.sup ≥ min_sup}
    if Fρ+1 = ø then return
    F = F ∪ Fρ+1
    for all sequences s in projected database S|ρ do
        for each item pair j, k | j, k ∈ Φ(s) & j, k ∈ f-list do
            add Γ2(ρ, j, k, f-list, s) to the {ρ:j,k}-projected database
    for each frequent pair {j, k} in the frequent item matrix do
        Call FreeSpan-2(ρ ∪ {j, k}, l+2, {ρ:j,k}-projected database)
    return

Figure 4.11: The alternative-level FreeSpan mining algorithm
The frequent item matrix is used to generate a set of projected databases. Using the frequent item matrix in Figure 4.9, the database can be partitioned into 6 independent projected databases: the {:1,2}-projected, {:1,6}-projected, {:2,6}-projected, {:1,4}-projected, {:2,4}-projected, and {:6,4}-projected databases, as shown in Figure 4.10.

The FreeSpan-2 algorithm is presented in Figure 4.11. Each recursion costs 3 database scans. The main steps performed on each database S|ρ are outlined below. Note that when ρ = { }, the projection database S|ρ represents the original database.
• Scan the projection database S|ρ once and find its ξ-frequent items that are not in ρ. At the same time, count the support of all sequences with transaction pattern ρ.
• Print all ξ-frequent sequences with transaction pattern ρ.
• If any ξ-frequent sequence was printed and the f-list is not empty, then scan the database a second time and fill up the frequent item matrix. At the same time, count the support of all sequences with transaction pattern ρ ∪ {i}, for each item i in the f-list.
• Print all ξ-frequent sequences with transaction pattern ρ ∪ {i}.
• If any ξ-frequent sequence was printed and the frequent item matrix contains ξ-frequent item sets of length 2, then for each ξ-frequent pair (j, k) with Order_f-list(j) > Order_f-list(k) in the matrix, create a {ρ:j,k}-projected database. Scan the projection database S|ρ a third time and populate each {ρ:j,k}-projected database.
• Recursively mine each newly created {ρ:j,k}-projected database.
The degree to which a projected database shrinks with respect to its immediate parent depends on the selectivity of the projection pattern and on the number of infrequent patterns present. Unfortunately, this selectivity factor is database dependent. In the next section we present an algorithm that displays a greater degree of selectivity per recursion.
4.5 PrefixSpan: Pattern Growth via frequent sequence lattice
In this section, we introduce a new projection-based method for mining sequential patterns, called PrefixSpan [HPM+01], which uses the frequent sequence lattice to partition the database. Section 4.5.1 introduces the theory behind the algorithm. The level-by-level PrefixSpan algorithm is then presented in Section 4.5.2. To improve its efficiency, a pseudo projection technique is proposed in Section 4.5.3. Finally, Section 4.5.4 addresses issues concerning its scalability.
4.5.1 Basic Idea
Definition 4.5.1: Following Definition 4.2.4, let access sequence S' = v'_1, v'_2, ..., v'_m be a prefix of access sequence S = v_1, v_2, ..., v_n (m ≤ n), such that v'_i = v_i for 1 ≤ i < m and, for v'_m = e'_1, e'_2, ..., e'_j and v_m = e_1, e_2, ..., e_k (j ≤ k), e'_l = e_l for 1 ≤ l ≤ j. We call the access sequence S'' = v''_m, v_{m+1}, ..., v_n a suffix of S with respect to S', where v''_m = e_{j+1}, e_{j+2}, ..., e_k. Let this relation be denoted as S'' = S / S' or S = S' · S''. If the element v''_m is empty, we call the suffix sequence an inter-suffix sequence; otherwise it is an intra-suffix sequence, written with its first item underlined (rendered here with a leading underscore, e.g. _S''). Items within v''_m are called the intra-extension items of the prefix S', while items within v_i (m+1 ≤ i ≤ n) are called inter-extension items of the prefix S'. To distinguish the element v''_m from other elements, we underline its first item, as in (_e_{j+1}, e_{j+2}, ..., e_k).
For example, given the sequence S = (1)(2,3)(4)(5), the sequences (2,3)(4)(5), (_3)(4)(5), and (4)(5) are the suffix sequences of S with respect to the prefix sequences (1), (1)(2), and (1)(2,3), respectively. Note that the sequence (1) cannot be extended with any intra-item, while the sequence (1)(2) can only be extended with the intra-item 3.
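The following small Python sketch (ours, simplified) computes the suffix of an access sequence with respect to a prefix according to Definition 4.5.1, returning None when the prefix does not match; the underlining of an intra-suffix is represented by a boolean flag.

    def suffix(sequence, prefix):
        """Return (is_intra, suffix_elements) of `sequence` w.r.t. `prefix`
        (Definition 4.5.1), or None if `prefix` is not a prefix of `sequence`."""
        m = len(prefix)
        if m == 0 or m > len(sequence):
            return None
        for i in range(m - 1):                      # all but the last element must match exactly
            if prefix[i] != sequence[i]:
                return None
        last_p, last_s = prefix[-1], sequence[m - 1]
        if last_s[:len(last_p)] != last_p:          # the last prefix element must start the element
            return None
        rest = last_s[len(last_p):]                 # intra-extension items, possibly empty
        tail = list(sequence[m:])
        if rest:
            return True, [rest] + tail              # intra-suffix: first element is underlined
        return False, tail                          # inter-suffix

    S = [(1,), (2, 3), (4,), (5,)]
    print(suffix(S, [(1,), (2,)]))   # (True, [(3,), (4,), (5,)])
    print(suffix(S, [(1,)]))         # (False, [(2, 3), (4,), (5,)])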
Lemma 4.5.1: All prefix sequences of a ξ-frequent sequential pattern are ξ-frequent.

This is obvious from the a priori heuristic. Let us turn our attention to the running example in Figure 4.1. Based on the above lemma, all ξ-frequent sequences can be enumerated by traversing the frequent sequence lattice and, for each ξ-frequent k-sequence (k > 1), only extending its prefix (k-1)-sequence.
Figure 4.12: Prefix-based traversal of frequent sequence lattice

  { }
  (1)   (2)   (4)   (6)
  (1,2)   (1,6)   (2,6)   (2)(1)   (4)(1)   (4)(2)   (4)(6)   (6)(1)
  (1,2,6)   (2,6)(1)   (4)(2,6)   (4)(2)(1)   (4)(6)(1)
  (4)(2,6)(1)
As mentioned in Section 4.3.4, a suffix-based traversal of the frequent sequence lattice is also possible. Figure 4.12 shows the prefix-based traversal of our running example's frequent sequence lattice.
What remains now is to partition the database such that, at each node of Figure 4.12, only the relevant data is processed to find out whether a ξ-frequent suffix sequential pattern exists.
Definition 4.5.2: Let access sequence S' = v'_1, v'_2, ..., v'_m be a prefix-match of access sequence S = v_1, v_2, ..., v_n (m ≤ n), such that there exist integers 1 ≤ i_1 ≤ i_2 ≤ ... ≤ i_m ≤ n for which v'_1 ⊑ v_{i_1}, v'_2 ⊑ v_{i_2}, ..., v'_{m-1} ⊑ v_{i_{m-1}} and, for v'_m = e'_1, e'_2, ..., e'_j and v_{i_m} = e_1, e_2, ..., e_k (j ≤ k), there exist integers 1 ≤ l_1 < l_2 < ... < l_j ≤ k such that e'_1 = e_{l_1}, e'_2 = e_{l_2}, ..., e'_j = e_{l_j}. We call the access sequence S'' = v''_m, v_{i_m+1}, ..., v_n a suffix-match of S with respect to S', where v''_m = e_{l_j+1}, ..., e_k. Let this relation be denoted as S' ⋈ S''. If the element v''_m is not empty, the suffix-match sequence is denoted as _S''. Let the prefix-projection of a sequence s with respect to a prefix σ be defined by the function Π(s, σ) as the set of suffix-match sequences of s with respect to σ.
Note that a prefix-projection set contains all intra-suffix-match sequences and only those inter-suffix-match sequences that are not subsequences of the rest. For example, given the sequence S = (3,4)(1,2,3)(1,2,6)(1,3,4,6), its prefix-projection sequences with respect to the prefix sequences (3), (3)(1), and (3)(1)(1) are the sequence sets {(4)(1,2,3)(1,2,6)(1,3,4,6), (4,6)}, {(2,3)(1,2,6)(1,3,4,6), (2,6)(1,3,4,6), (3,4,6)}, and {(2,6)(1,3,4,6), (3,4,6)}, respectively. For the sake of abbreviation, from here on we overlap all suffix-match sequences in a prefix-projection set, as follows: the starting elements of all intra-suffix-matches that are subsequences of another are indicated by an underline in the maximal super-sequence of the set. Therefore, the previous example's prefix-projections are (4)(1,2,3)(1,2,6)(1,3,4,6), (2,3)(1,2,6)(1,3,4,6), and (2,6)(1,3,4,6), with the starting items of the embedded suffix-matches underlined.
Definition 4.5.3: Let the σ-projected database, denoted as S|σ, be the collection of prefix-projection sequences returned by the function Π(s, σ) for the sequences s of S. For example, Figure 4.13 shows the projected databases of each 0.5-frequent 1-sequence of the running example, where all infrequent items have also been filtered out.
Definition 4.5.4: Let σ be a ξ-frequent sequential pattern in database S. The support of the sequential pattern β = σ·λ (or β = σ·_λ) in S|σ is |{s ∈ S|σ | λ ⊑ s}| (or |{s ∈ S|σ | _λ ⊑ s}|).

For example, let us consider the process of counting the 0.5-frequent 1-sequences in the <(1)>-projected database of Figure 4.13. The sequence <(2)(1,2,6)(1,4,6)> can extend the prefix <(1)> with the inter-extension items {1,2,4,6} and the intra-extension items {2,4,6}. Therefore, we increment the inter- and intra-counters of each item accordingly.
Figure 4.13: Projected databases of 0.5-frequent 1-sequences

  Prefix   Projected Database
  <(1)>    <(2)(1,2,6)(1,4,6)>, <(2,6)>, <(2,6)>
  <(2)>    <(1,2,6)(1,4,6)>, <(6)>, <(6)>, <(6)(1)>
  <(4)>    <(1,2)(1,2,6)(1,4,6)>, <(2,6)(1)>
  <(6)>    <(1,4,6)>, <(1)>
Lemma 4.5.2: Let σ be a ξ-frequent sequential pattern and β = σ·λ (or β = σ·_λ) a sequential pattern in sequence database S. Then sup_S(β) = sup_{S|σ}(λ) (or sup_S(β) = sup_{S|σ}(_λ)).

Proof: From Definition 4.5.3, we know each σ-projected database only contains sequences that can extend the prefix σ, which are exactly the sequences contributing to sup_S(β). Furthermore, the support counting procedure of Definition 4.5.4 ensures that we only increment the counters that contribute to the support of β = σ·λ (or β = σ·_λ). ∎
4.5.2 Level-by-Level PrefixSpan Algorithm
In this section we explain the PrefixSpan-1 algorithm, which traverses the frequent sequence lattice level-by-level, depth-first.

In each projection, this algorithm attempts to increase the length of the ξ-frequent sequential pattern (the projection prefix) by one. The mining process, as shown in Figure 4.14, consists of three main steps performed recursively on each database S|σ. Note that when σ = { }, the projection database S|σ represents the original database.
• Scan the projection database S|σ once and find its ξ-frequent inter- and intra- 1-sequences.
• For each ξ-frequent 1-sequence, extend the prefix σ accordingly and print the resulting sequential pattern.
• If any ξ-frequent sequential pattern was printed, then scan the projection database S|σ a second time and, for each unique inter- (α ∈ F1) or intra- (_α ∈ _F1) ξ-frequent 1-sequence found in each sequence s ∈ S|σ, add its prefix-projection Π(s, α) (filtering out infrequent items) to the projection database S|σ·α (or S|σ·_α).
• Recursively mine each newly created projected database if its number of sequences is greater than the minimum support count.
Input: A sequence database S, and a minimum support threshold min_sup
Output: The complete set of frequent sequential patterns
Method:
    F = {ø}
    Call PrefixSpan-1({ø}, 0, S)
    return F

// σ : projection pattern (prefix) of projection database S|σ
// l : length of the projection pattern σ
// S|σ : projection database

Subroutine PrefixSpan-1(σ, l, S|σ)
    for all sequences s in projected database S|σ do
        increment the count of each intra 1-sequence of s in candidate set _C1
        increment the count of each inter 1-sequence of s in candidate set C1
    _F1 = {α ∈ _C1 | α.sup ≥ min_sup}
    F1 = {α ∈ C1 | α.sup ≥ min_sup}
    F = F ∪ {σ·_α | _α ∈ _F1} ∪ {σ·α | α ∈ F1}
    if _F1 ∪ F1 = ø then return
    for all sequences s in projected database S|σ do
        for each unique intra 1-sequence _α ∈ _F1 of s, add Π(s, _α) to S|σ·_α
        for each unique inter 1-sequence α ∈ F1 of s, add Π(s, α) to S|σ·α
    for each _α ∈ _F1 do
        if |S|σ·_α| ≥ min_sup then Call PrefixSpan-1(σ·_α, l+1, S|σ·_α)
    for each α ∈ F1 do
        if |S|σ·α| ≥ min_sup then Call PrefixSpan-1(σ·α, l+1, S|σ·α)
    return

Figure 4.14: The PrefixSpan-1 mining algorithm
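To complement the pseudocode, here is a compact Python sketch of the prefix-growth idea (our own simplification: it is restricted to sequences with a single page view per visit, so the inter/intra distinction disappears, and it uses physical rather than pseudo projection).

    from collections import defaultdict

    def prefixspan(database, min_count, prefix=None):
        """Simplified PrefixSpan for sequences of single page views: recursively
        grow the prefix by one frequent item and project by the remaining suffix."""
        if prefix is None:
            prefix = []
        patterns = []
        counts = defaultdict(int)
        for seq in database:
            for item in set(seq):                 # each sequence counts an item once
                counts[item] += 1
        for item, count in sorted(counts.items()):
            if count < min_count:
                continue
            new_prefix = prefix + [item]
            patterns.append((new_prefix, count))
            projected = []
            for seq in database:
                if item in seq:
                    tail = seq[seq.index(item) + 1:]   # suffix after the first occurrence
                    if tail:
                        projected.append(tail)
            if len(projected) >= min_count:
                patterns.extend(prefixspan(projected, min_count, new_prefix))
        return patterns

    db = [[1, 2, 3], [1, 2], [2, 3], [1, 3, 2]]
    print(prefixspan(db, 2))
    # [([1], 3), ([1, 2], 3), ([1, 3], 2), ([2], 4), ([2, 3], 2), ([3], 3)]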
Figure 4.15 shows all the projection databases generated, and the 0.5-frequent sequences printed at each projection, by PrefixSpan-1 in our running example. In comparison to FreeSpan, there is significantly less work to be done per projection; on the other hand, there are more projections to process. The number of counters that must be updated for 1-sequences in PrefixSpan is twice as many as in FreeSpan, but this amounts to a very small memory increase.
  Prefix         Projected Database                                                          Freq. inter-items  Freq. intra-items  Frequent ξ-patterns
  <>             <(3,4)(1,2,3)(1,2,6)(1,3,4,6)>, <(1,2,6)(5)>, <(1,2,6)>, <(4,7,8)(2,6)(1,7,8)>   1,2,4,6        -                  <(1)>, <(2)>, <(4)>, <(6)>
  <(1)>          <(2)(1,2,6)(1,4,6)>, <(2,6)>, <(2,6)>                                            -              2,6                <(1,2)>, <(1,6)>
  <(1,2)>        <(2,6)(6)>, <(6)>, <(6)>                                                         -              6                  <(1,2,6)>
  <(1,2,6)>      <(6)>                                                                            -              -                  -
  <(1,6)>        <(6)>                                                                            -              -                  -
  <(2)>          <(1,2,6)(1,4,6)>, <(6)>, <(6)>, <(6)(1)>                                         1              6                  <(2)(1)>, <(2,6)>
  <(2)(1)>       <(6)(1,6)>                                                                       -              -                  -
  <(2,6)>        <(1,6)>, <(1)>                                                                   1              -                  <(2,6)(1)>
  <(2,6)(1)>     -                                                                                -              -                  -
  <(4)>          <(1,2)(1,2,6)(1,4,6)>, <(2,6)(1)>                                                1,2,6          -                  <(4)(1)>, <(4)(2)>, <(4)(6)>
  <(4)(1)>       <(2)(1,2,6)(1,6)>                                                                -              -                  -
  <(4)(2)>       <(1,2,6)(1,6)>, <(6)(1)>                                                         1              6                  <(4)(2)(1)>, <(4)(2,6)>
  <(4)(2)(1)>    <(6)(1,6)>                                                                       -              -                  -
  <(4)(2,6)>     <(1,6)>, <(1)>                                                                   1              -                  <(4)(2,6)(1)>
  <(4)(2,6)(1)>  -                                                                                -              -                  -
  <(4)(6)>       <(1,6)>, <(1)>                                                                   1              -                  <(4)(6)(1)>
  <(4)(6)(1)>    -                                                                                -              -                  -
  <(6)>          <(1,4,6)>, <(1)>                                                                 1              -                  <(6)(1)>
  <(6)(1)>       -                                                                                -              -                  -
Figure 4.15: Projection databases in PrefixSpan-1
In FreeSpan we could not guarantee that a projected database would shrink in size, whereas PrefixSpan demonstrates a greater selectivity. In practice, only a small set of sequential patterns grows very long; hence, the size of the projected database shrinks rapidly as the projection prefix grows. Also, since PrefixSpan projects suffix sequences, a sequence must shrink in length by at least one item. Note that the number of projections in PrefixSpan-1 equals the number of ξ-frequent sequences, which can be greater than in FreeSpan. Having to process more projections, even though smaller in size, could still add up to a non-trivial cost. In the next section we develop a strategy to reduce the cost of creating a projection.
4.5.3 Pseudo Projection
The main idea behind pseudo projection is to reduce the storage required to represent a projection database by enabling different projection databases to share the same physical database while each accesses only the parts that are relevant to it. This also reduces the cost of creating a projection database.
To this end we use pointers into the same physical database. Projecting a sequence shrinks it in two ways: by removing infrequent items and by removing a section of its prefix (refer to Figure 4.15).
Figure 4.16: Pseudo Projection with virtual-memory window
[Figure: the <(1)>- and <(2)>-projected databases represented as lists of pointers ({p1, p2, p3}, {p1}, ...) into the original sequences, which reside in inter-process and machine-sharable virtual memory; the virtual-memory window slides and a page fault is issued to load the needed section of the database.]
By sharing the same physical database, we prohibit ourselves from applying any modifications to the database, such as removing infrequent items. Therefore, in pseudo projection databases, we only allow a sequence to shrink by ignoring a section of its prefix. Let us look at the first sequence in the original database in Figure 4.15. It is projected into the <(1)>-projected database at three points, all of which are intra-suffix-matches. Hence, we represent the first sequence in the <(1)>-projected database as three pointers to different items of the first sequence in the original database, while the second and third sequences are represented with a single pointer each. In this fashion we reduce the memory requirement for creating a projection database, at the cost of recounting the parent projection's infrequent items. This is a small cost to pay compared to the cost of creating physical projections at each recursion, not to mention their storage requirement.
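The following Python sketch (ours; a simplification that flattens each sequence to a list of page views) represents a pseudo-projected database as a list of (sequence index, position) pointers into the shared physical database, so that creating a child projection allocates only pointers rather than copies of sequences.

    # Shared physical database: each sequence flattened to a list of page views.
    DB = [
        [4, 1, 2, 1, 2, 6, 1, 4, 6],
        [1, 2, 6],
        [1, 2, 6],
        [4, 2, 6, 1],
    ]

    def pseudo_project(pointers, item):
        """A pseudo-projected database is a list of (seq_idx, pos) pointers into DB.
        Growing the prefix by `item` just advances each pointer past the first
        occurrence of `item`; no sequence data is copied."""
        child = []
        for seq_idx, pos in pointers:
            seq = DB[seq_idx]
            try:
                hit = seq.index(item, pos)
            except ValueError:
                continue                      # this sequence cannot extend the prefix
            if hit + 1 < len(seq):            # drop sequences whose suffix is empty
                child.append((seq_idx, hit + 1))
        return child

    root = [(i, 0) for i in range(len(DB))]   # the unprojected database
    proj_1 = pseudo_project(root, 1)          # <(1)>-projected database as pointers
    print(proj_1)                             # [(0, 2), (1, 1), (2, 1)]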
In practice, databases are larger than the available memory, and only a small portion of that memory, called the system I/O buffer, is used to load the database into memory. Repeated use of pseudo projection can create large irrelevant regions between pointers, and different projections can also randomly access different sections of the database. As pseudo-projected databases become randomly scattered chunks of sequences, we incur significant unnecessary disk I/O as we load and reload database pages while using only small sections of them.
First, to manage the loading of database pages more efficiently, we leverage the available memory management technology and create a virtual-memory window over the database, thereby exercising more control over the number of database pages loaded and over the page-fault strategy used by the system. Popular operating systems, such as Unix, Linux, and NT, provide high-priority kernel-level primitives to implement these objects. In NT these kernel-level objects, called Section Objects (SO), are also sharable across processes and machines. As illustrated in Figure 4.16, each pointer is valid within a specific virtual-memory window. Our virtual-memory window manager is responsible for loading the correct database pages before we de-reference the first item in each sequence.
The second inherent weakness of this approach is due to the growing chunks of irrelevant data between pointers. To overcome this, we first measure the density of a pseudo-projected database, expressed as the ratio of the number of sequences over the number of system pages it spans (their size is system dependent, usually 4 MB). Second, we measure the selectivity of a pseudo-projected database, expressed as the ratio of the density of the pseudo-projected database over that of its physical database. A physical projection operation is performed intermittently when the selectivity drops below a certain threshold. In practice, when the database size falls below the size of a few system pages (usually 12 MB, but this depends on the available memory), it does not get swapped out by the system and remains in memory; at this point, performing further physical projections becomes unnecessary. Therefore, the virtual-memory window size is set to this proj_min_size threshold.
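A brief sketch of this projection policy (ours; the page size and thresholds below are illustrative parameters only, not measurements from our implementation):

    PAGE_SIZE_BYTES = 4 * 2**20        # illustrative system page/region size
    PROJ_MIN_SIZE = 12 * 2**20         # below this size, the data stays resident in memory
    SELECTIVITY_THRESHOLD = 0.35       # see the discussion that follows

    def density(num_sequences, size_bytes):
        """Sequences per system page spanned by a (pseudo-)projected database."""
        pages = max(1, -(-size_bytes // PAGE_SIZE_BYTES))   # ceiling division
        return num_sequences / pages

    def should_project_physically(pseudo_seqs, pseudo_span_bytes,
                                  phys_seqs, phys_size_bytes):
        """Materialize a physical projection when the pseudo projection has become
        too sparse relative to its physical database, unless it is already small
        enough to stay in memory."""
        if pseudo_span_bytes <= PROJ_MIN_SIZE:
            return False
        selectivity = density(pseudo_seqs, pseudo_span_bytes) / \
                      density(phys_seqs, phys_size_bytes)
        return selectivity < SELECTIVITY_THRESHOLD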
However, there is a tradeoff in the value selected for the selectivity threshold. It is impractical to perform too many physical projections, as disk write operations are costly. As the selectivity threshold increases, the cost of projection rises, since more nodes at the lower levels of the frequent sequence lattice will also need to be physically projected until their size reaches proj_min_size. On the other hand, the density of the projected databases is guaranteed to be higher and their size smaller, and thus the cost of subsequent projection and counting in that sub-tree is reduced. In practice, we found that there is only a small difference between choosing a selectivity threshold of 0.35 and of 1. Below 0.35, the cost of unnecessary disk I/O is high because physical projections are not performed often enough; between 0.35 and 1, the cost of physical projection is balanced by the reduced unnecessary disk I/O. The reason for this result is that the selectivity value depends mainly on the support of the ξ-frequent 1-sequences. In real databases, many 1-sequences have low support and low correlation with their prefix, and thus any selectivity threshold above their supports results in creating small physical projections. After a few recursions, this quickly reduces the size of the physical projections to below the proj_min_size threshold, and all subsequent operations in that sub-tree are done from memory.
4.5.4 Scaling up
Another major cost of projection-based algorithms is their storage requirement. With the size of Web logs gathered daily reaching gigabytes and growing, and despite the availability of inexpensive and large storage devices and the growing size of main memory, memory and disk limitations still impose great restrictions on projection-based algorithms. As analyzed in Section 4.3.4, the SPADE algorithm can continue with only two projections present per level, at the cost of processing a greater search space. In PrefixSpan with pseudo projection, all the information necessary for the mining process is contained within a projection, and thus the algorithm only requires one projection per level, without suffering any increase in its search space. When there is a disk or memory limitation, similarly to the SPADE algorithm, we pay the cost of performing multiple database scans to create the child projections in stages.
4.6 Summary
In this chapter, we described in detail PrefixSpan with pseudo projection, a new algorithm for the fast mining of Web traversal patterns in Web log files. In this approach we progressively partition the database into smaller sub-databases using the frequent sequence lattice. Depending on the selectivity of the prefix sequence, it is very likely that its projected database can be processed in main memory. In the next chapter we present our experimental results.
Chapter 5
Performance Analysis and Discussions
In this chapter we compare the performance of PrefixSpan with pseudo projection against the PSP+ and FreeSpan algorithms. The PSP+ algorithm is implemented as described in Section 4.3.3. Experiments were performed on a 647 MHz Pentium III PC with 192 MB of main memory, 576 MB of virtual memory, and 10 GB of local disk space (Fujitsu MHM2100AT), running Microsoft Windows 2000 Server. We implemented all the algorithms using Microsoft Visual C++ 6.0.
  Short form  Description                                                   Parameter
  D           Number of customers                                           -ncust
  C           Avg. number of transactions per customer                      -slen
  T           Avg. number of items per transaction                          -tlen
  S           Avg. number of items in maximal potential frequent sequences  -seq.patlen
  I           Avg. number of items in maximal potential frequent item sets  -lit.patlen
  NS          Number of maximal potential frequent sequences                -seq.npats
  NI          Number of maximal potential frequent item sets                -lit.npats
  N           Number of items                                               -nitems

Table 5.1: Synthetic data generation program's parameters
5.1 Synthetic Datasets
Our synthetic datasets are generated using the publicly available synthetic data generation program of the IBM Quest data mining project [IQ01], which has been used in most sequential pattern mining studies [SA96, HPM+01, MCP98, Z00]. The datasets consist of sequences of item sets, where each item set represents a market-basket transaction. For the purpose of Web log mining we do respect the order of items within each transaction. Table 5.1 shows the parameters we set for the synthetic data generation program. The generator works as follows. First, N items, numbered 0 to N-1, are used to create NI maximal item sets of average length I. Then, NS maximal sequences of average length S are created from the members of the NI maximal item sets. Next, an average of C members of the NS maximal sequences are added to a customer sequence as transactions, ensuring an average transaction length of T. The process continues until D customer sequences have been created. Table 5.2 shows our test data sets and their parameter settings. For additional detail on the data generation program, the interested reader can refer to [AS95]. Similar to [SA96, MCP98], we set NS = 5,000 and NI = 25,000. As reading a binary file is more efficient than parsing a text file, all datasets were generated in binary format.
  Data set                  N   C   T    S  I     D     Size (MB)
  C10-T2.5-S4-I1.25-D100K   1K  10  2.5  4  1.25  100K  11.44
  C10-T5-S4-I1.25-D100K     1K  10  5    4  1.25  100K  20.06
  C10-T5-S4-I2.5-D100K      1K  10  5    4  2.5   100K  19.41
  C20-T2.5-S4-I1.25-D100K   1K  20  2.5  4  1.25  100K  24.36
  C20-T2.5-S4-I2.5-D100K    1K  20  2.5  4  2.5   100K  22.15
  C20-T2.5-S8-I1.25-D100K   1K  20  2.5  8  1.25  100K  23.88
  C10-T8-S7-I3-D100K        1K  10  8    7  3     100K  20.07
  C10-T5-S5-I1.5-D100K      1K  10  5    5  1.5   100K  19.21

Table 5.2: Synthetic data sets
5.2 Berkeley Web Log Dataset
The Web log file was collected from the University of California, Berkeley's website. The site hosts a variety of information, ranging from university, course, and department information to individual students' websites. The Web log covers the year 1999 and has 103,680 distinct URLs, 1,079,621 distinct hosts, and a total of 1,079,771 sequences. Its size is 45.79 MB and its average sequence length is 7. Figure 5.1 shows the distribution of its sequence lengths.
Figure 5.1: Distribution of Web log sequence length
5.3 Comparison of PrefixSpan with PSP+ and FreeSpan
The scalability of PrefixSpan, PSP+ and FreeSpan as the support threshold decreases from 1% to 0.25% is shown in Figures 5.2 and 5.3. It is easy to see that PrefixSpan scales much better than PSP+ and FreeSpan. As shown in the figures, as the support threshold goes down, the number and the length of ξ-frequent sequences increase. The change in the density of the database's frequent sequence lattice from one support threshold to the next determines the degree of this increase, which is completely database dependent. The density of the frequent sequence lattice directly affects the number of candidates in PSP+.
Figure 5.2: Performance Comparison: Synthetic Dataset
[Four panels, one per data set — N1K-C10-T2.5-S4-I1.25-D100K (11.44 MB), N1K-C10-T5-S4-I1.25-D100K (20.06 MB), N1K-C10-T5-S4-I2.5-D100K (19.41 MB), and N1K-C20-T2.5-S4-I1.25-D100K (24.36 MB) — each showing run time (seconds) vs. minimum support (1%-0.25%) for PrefixSpan-1, PSP+, FreeSpan-1, and FreeSpan-2, together with the count (thousands) of frequent k-sequences by length at each support threshold.]
Figure 5.3: Performance Comparison: Synthetic Datasets
[Two panels — N1K-C20-T2.5-S4-I2.5-D100K (22.15 MB) and N1K-C20-T2.5-S8-I1.25-D100K (23.88 MB) — with the same layout as Figure 5.2: run time vs. minimum support for the four algorithms, and frequent k-sequence counts by length.]
The longer and more numerous the candidates, the greater the size of the candidate tree and the more expensive it is for PSP+ to do pattern matching for each database sequence, particularly for long sequences.

For each sequence of a projected database, FreeSpan needs to count the frequency of all subsequences that match the projected pattern of its projected database. This is an extra cost per projected database over PrefixSpan, and it increases exponentially with the length of the sequence. This extra cost is compensated for only if the number of projections in FreeSpan is significantly smaller than in PrefixSpan. Obviously, this depends on the repetition of items in ξ-frequent sequences. In the worst case, where all ξ-frequent sequences have a unique projected pattern, the number of projected databases could be greater than the number of ξ-frequent sequences, since every projected database must be processed
and some projected databases (that is, some leaf nodes in the projection database tree) may not generate any ξ-frequent sequence. In most of our experiments with FreeSpan, the number of projected databases was greater than or close to the number of ξ-frequent sequences. In PrefixSpan, however, we can guarantee that the number of projection databases is equal to or less than the number of ξ-frequent sequences. Even though FreeSpan-2 reduces the number of projected databases, most of the time it holds more projection databases in memory at a given time. As shown in Figures 5.2 and 5.3, the number of ξ-frequent 2-sequences is extremely large for low support thresholds. This results in memory swapping, a significant overhead for FreeSpan-2. We have addressed this drawback in PrefixSpan with pseudo projection in Section 4.5.4.
Figure 5.4: Performance Study: Long Synthetic Datasets
[Four panels: PrefixSpan-1 run time (seconds) and projections processed per second vs. minimum support (1%-0.25%) for N1K-C10-T8-S7-I3-D100K (20.07 MB) and N1K-C10-T5-S5-I1.5-D100K (19.21 MB), together with the count (thousands) of frequent k-sequences by length for each data set at each support threshold.]
Figure 5.4 shows the scalability of PrefixSpan-1 on databases with long sequences and a denser frequent sequence lattice. The performance of the other algorithms is not included, as it was far worse. The figure also shows the number of projected databases processed per second by PrefixSpan. As the support threshold is lowered, this number goes up dramatically, especially for data set C10-T5-S5-I1.5-D100K. This fact explains the scalability of PrefixSpan and can be attributed to the following: as the sequential patterns become longer, the projected databases become smaller, much smaller than in FreeSpan, and therefore the corresponding processing time also decreases. There are several other reasons why PrefixSpan outperforms FreeSpan and PSP+.
1. PrefixSpan performs two simple scans of the projected database to find the inter- and intra- ξ-frequent items.
2. No complicated data structures are used, and there is no overhead of searching for subsequences.
3. Pseudo projection ensures low memory usage and reduces the cost of creating a projection; controlled physical projections ensure dense data sets; the virtual-memory window ensures low disk I/O and high data locality; and the partial materialization of pseudo projections (Section 4.5.4) ensures no memory swapping.
Having addressed the strengths of PrefixSpan, let us look at the database characteristic that most significantly affects its performance. Even though PrefixSpan guarantees that the projected databases become smaller as the sequential patterns become longer, this selectivity factor is database dependent. As shown in Figure 5.4, data set C10-T5-S5-I1.5-D100K shows a much lower cost per projection, which can be explained by the selectivity factor of its ξ-frequent sequences.
We also compared the performance of the algorithms on Berkeley's Web log database; the results are shown in Figure 5.5. Even at the highest support threshold of 2%, the other algorithms would take over 2000 seconds, which is explained by the excessively long sequences of the Web log; therefore, we did not include them in the graph.
Figure 5.5: Performance Study: Web Log Dataset
(Panels: PrefixSpan-1 running time in seconds on the UCBerkeley 1999 Web log (45.79 MB) at minimum supports 1.25%, 1.00%, 0.75% and 0.50%, and the number of frequent k-sequences found at each of those supports.)
5.4 Scale up
Figure 5.6: Scale up: Number of Customers
(Panel (a): relative running time of PSP+, PrefixSpan-1, FreeSpan-1 and FreeSpan-2 on N1K-C10-T2.5-S4-I1.25 (11.44-57.2 MB) at 1% and 0.5% minimum support as the number of customers grows from 100K to 500K. Panel (b): relative running time of PrefixSpan-1 on N1K-C10-T5-S5-I3 (11.47-92.23 MB) at 0.75%, 0.5% and 0.25% minimum support.)
Figure 5.6(a) shows the scalability of PrefixSpan, FreeSpan and PSP+ as
the number of data sequences is increased. All the algorithms were run
on the N1K-C10-T2.5-S4-I1.25 data set with different minimum support
thresholds ranging from 2% to 0.5% as the number of customers was
increased ten times from 100,000 to 1 million. Figure 5.6(b) shows the
scalability of PrefixSpan on the N1K-C10-T5-S5-I3 data set, which has
somewhat longer sequences. It can be seen that all the algorithms are
linearly scalable within the range shown in the figure.
Chapter 6
Conclusions and Future Works
The information content of the WWW is growing at an exponential rate,
and it is not surprising to find users having difficulty navigating it
and finding relevant information, or e-commerce sites having difficulty
observing potential customers. In this study we have applied data
mining techniques to the Web access log files of a specific Web site to
perform such analysis. In the following sections, our work is
summarized and research directions are discussed.
6.1 Conclusions
In this thesis we have studied two problems associated with scalable
Web Usage Mining, namely multidimensional analysis and sequential
pattern mining of Web log files. First we introduced a scalable system
architecture that uses static multidimensional aggregations of OLAP to
do simple frequency analysis. Next we developed a novel, scalable and
efficient sequential pattern discovery algorithm to mine frequent
sequential patterns. We have experimentally evaluated our approach in
both cases. The main contributions of this thesis are:
1. Implemented a system that incorporates multidimensional data
cubes and OLAP technology to interactively extract implicit
knowledge from Web log files.
2. Created a new algorithm for efficient and scalable mining of Web
sequential patterns, called PrefixSpan with pseudo projection.
6.2 Future Works
The use of the World Wide Web as a means of marketing and selling has
increased dramatically in the recent past and has intensified the need
to understand users’ online behavior. Web Usage Mining is a new
research field that is avidly followed by many scholars and commercial
businesses. However, the available collections of information are very
limited and are not designed for such research, and there are no
established data mining techniques for such data sources. The list of
open problems and opportunities, however, is also long. We will now
introduce some of them within the scope of our research context.
6.2.1 Real-time Multidimensional Analysis
The construction of multidimensional data cubes for OLAP and
knowledge discovery is very time-consuming. Due to the volatile nature
of Web log data, most users require only simple usage statistics and
canned reports as events occur. As such, we need to improve our Web log
multidimensional analysis and Web log parser to address real business
needs.
6.2.2 Scalability
Web logs will continue to grow in size, and new sources of data will be
created for Web Usage Mining. PrefixSpan with pseudo projection lends
itself naturally to extension into a parallel processing algorithm. Our
goal should be to handle extremely large data sets.
6.2.3 Incremental Algorithm
As Web sites or business needs change over time, some sequential
patterns become invalid and others need to be updated. We need to
extend our algorithm to solve this problem.
6.2.4 Sequential Pattern Mining with Constraints
Sequential pattern mining algorithms tend to generate a huge number of
sequences, and at any given time not all of them are of interest to the
user. For example, a marketing analyst may be interested only in the
activity of those online customers who have visited certain pages in a
specific time period. In general, the discovered patterns must meet
certain rules and conditions, which we categorize as follows; a small
illustrative sketch of such constraints follows the list.
• Constraint on page-view attribute: Page-view attributes may include
page type, page name, access date, view time, and activities
associated with the page view, such as buying or traversal of a
certain hyperlink. These constraints are unary and enforce
certain limitations on a desired page view.
• Constraint on user-visit attribute: User-visit attributes may include
length, duration, and minimum and maximum gap between page
views. The constraint may also include a regular expression to
impose restrictions on the pattern of page views, their view times and
gaps.
• Constraint on user-sequence attribute: User-sequence attributes
may include duration, length, number of visits, average visit
length and duration, and minimum and maximum gap between
visits. The constraint may also include a regular expression to
impose restrictions on repeated patterns among visits, visit durations
and gaps.
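As a small illustrative sketch, the Python fragment below shows how the first two categories might be written down as simple predicates that a constrained miner (or a post-filter on its output) could apply to a user visit; the attribute names (page, kind, time) and the concrete limits are hypothetical and not part of the thesis.

    def satisfies_page_view_constraint(view, allowed_kinds):
        """Unary constraint on a single page view, e.g. restricting its page type."""
        return view["kind"] in allowed_kinds

    def satisfies_visit_constraint(visit, max_gap_seconds, min_length):
        """Constraint on a user visit: minimum length and maximum gap between views."""
        if len(visit) < min_length:
            return False
        times = [view["time"] for view in visit]
        return all(later - earlier <= max_gap_seconds
                   for earlier, later in zip(times, times[1:]))

    # Example: keep a visit only if every page view is a catalogue or purchase
    # page and consecutive views are at most 30 seconds apart.
    visit = [{"page": "/catalogue", "kind": "catalogue", "time": 0},
             {"page": "/item42",    "kind": "catalogue", "time": 12},
             {"page": "/buy",       "kind": "purchase",  "time": 25}]
    accepted = (all(satisfies_page_view_constraint(v, {"catalogue", "purchase"}) for v in visit)
                and satisfies_visit_constraint(visit, max_gap_seconds=30, min_length=2))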
In conclusion, the importance of Web Usage Mining will continue to grow
with the popularity of the WWW, and it will undoubtedly have a
significant impact on the study of online user behavior.
Bibliography
[A01] J.-M. Adamo. Data Mining for Association Rules and Sequential Patterns. Springer Verlag, New York, 2001.
[AIS93] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between Sets of Items in Large Databases. In Proc. of the 1993 ACM SIGMOD Conference, pages 207-216, Washington DC, USA, May 1993.
[AS94] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th International Conference on Very Large Databases (VLDB’94), pages 487-499, Santiago, Chile, September 1994.
[AS95] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proc. of the 11th International Conference on Data Engineering (ICDE’95), pages 3-14, Taipei, Taiwan, March 1995.
[BBA+99] A. G. Büchner, M. Baumgarten, S. S. Anand, M. D. Mulvenna, and J. G. Hughes. Navigation Pattern Discovery from Internet Data. In KDD Workshop on Web Usage Analysis and User Profiling (WebKDD’99), pages 25-30, San Diego, CA, USA, 1999.
[BL00a] J. Borges and M. Levene. Data Mining of User Navigation Patterns. In Web Usage Mining and User Profiling, B. Masand and M. Spliliopoulou, editors, Lecture Notes in Artificial Intelligence (LNAI 1836), pages 92-111, Springer Verlag, Berlin, 2000.
[BL00b] J. Borges and M. Levene. A Heuristic to Capture Longer User Web Navigation Patterns. In Proc. of the first International Conference on Electronic Commerce and Web Technologies (EC-Web’00), pages 155-164, London-Greenwich, U.K., September 2000.
[C93] E. F. Codd. Providing OLAP (On-Line Analytical Processing) to User-analysts: An IT Mandate. Technical Report TR-9300011, E. F. Codd and Associates, 1993.
[C97] J. Graham-Cumming. Hits and Misses: A Year Watching the Web. In Proc. 6th Int’l World Wide Web Conf., Santa Clara, CA, USA, April 1997.
[C00] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. Ph.D. Thesis, University of Minnesota, May 2000.
[CD97] S. Chaudhuri and U. Dayal. An Overview of Data Warehouse and OLAP Technology. SIGMOD Record, 26(1):65-74, March 1997.
[CMS97] R. Cooley, B. Mobasher, and J. Srivastava. Web Mining: Information and Pattern Discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’97), Newport Beach, CA, USA, November 1997.
[CMS99] R. Cooley, B. Mobasher, and J. Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, 1(1):5-32, February 1999.
[CP95] L. Catledge and J. Pitkow. Characterizing Browsing Behaviors on the World Wide Web. Computer Networks and ISDN Systems, 27(6), North-Holland, 1995.
[CPY98] M.-S. Chen, J. S. Park, and P. S. Yu. Efficient Data Mining for Path Traversal Patterns. IEEE Trans. on Knowledge and Data Engineering (TKDE), 10(2):209-221, March 1998.
[EV97] S. Elo-Dean and M. Viveros. Data Mining the IBM Official 1996 Olympics Web Site. Technical report, IBM T.J. Watson Research Center, 1997.
[FPS96a] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, G. Piatetsky-Shapiro and J. Frawley, editors, AAAI Press, Menlo Park, CA, 1996.
[FPS96b] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD Process for Extracting Useful Knowledge from Volumes of Data. In Communications of the ACM – Data Mining and Knowledge Discovery in Databases, 39(11):27-34, November 1996.
[GBL+96] J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross-tabs and Sub-totals. In Proc. of the 12th Int’l Conference on Data Engineering (ICDE’96), pages 152-159, New Orleans, USA, 1996.
[GEN96] NetGenesis. http://www.netgen.com, 1996.
[HCC+97] J. Han, J. Chiang, S. Chee, J. Chen, Q. Chen, S. Cheng, W. Gong, M. Kamber, G. Liu, K. Koperski, Y. Lu, N. Stefanovic, L. Winstone, B. Xia, O. R. Zaïane, S. Zhang, and H. Zhu. DBMiner: A System for Data Mining in Relational Databases and Data Warehouses. In Proc. CASCON’97: Meeting of Minds, pages 249-260, Toronto, Canada, November 1997.
[HFW+96] J. Han, Y. Fu, W. Wang, J. Chiang, W. Gong, K. Koperski, D. Li, Y. Lu, A. Rajan, N. Stefanovic, B. Xia, and O. R. Zaïane. DBMiner: A System for Mining Knowledge in Large Relational Databases. In Proc. 1996 Int’l Conference Data Mining and Knowledge Discovery (KDD’96), pages 250-255, Portland, Oregon, USA, August 1996.
[HPM+00] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. C. Hsu. FreeSpan: Frequent Pattern-Projected Sequential Pattern Mining. In Proc. 2000 Int. Conf. Knowledge Discovery and Data Mining (KDD’00), pages 355-359, Boston, MA, USA, Aug. 2000.
[HPM+01] J. Han, J. Pei, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proc. 2001 Int. Conf. on Data Engineering (ICDE’01), Heidelberg, Germany, April 2001.
[IQ01] IBM Quest Data Mining Project Synthetic Data Generation Program: http://www.almaden.ibm.com/cs/quest/syndata.html.
[K95] W. Klösgen. Efficient Discovery of Interesting Statements in Databases. Journal of Intelligent Information Systems (JIIS), 4(1):53-69, January 1995.
[KB00] R. Kosala and H. Blockeel. Web Mining Research: A Survey. In SIGKDD Explorations, 2(1):1-15, July 2000.
[LOU95] A. Luotonen. The Common Logfile Format. http://www.w3.org/Daemon/User/Config/Logging.html, 1995.
[MCP98] F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proc. European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD’98), pages 176-184, Nantes, France, September 1998.
[MIN00] Mineit Software Ltd. Easyminer. http://www.mineit.com, 2000.
[MPC99] F. Masseglia, P. Poncelet, and R. Cicchetti. An Efficient Algorithm for Web Usage Mining. Networking and Information Systems Journal (NIS), 2(5-6):571-603, 1999.
[MPT99] F. Masseglia, P. Poncelet and M. Teisseire. Using Data Mining Techniques on Web Access Logs to Dynamically Improve Hypertext Structure. In ACM SigWeb Letters, 8(3):13-19, October 1999.
[MPT00] F. Masseglia, P. Poncelet, and M. Teisseire. Web Usage Mining: How to Efficiently Manage new Transactions and New Clients. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD'00), pages 530-535, Lyon, France, September 2000.
[MT96] H. Mannila and H. Toivonen. Discovering Generalized Episodes Using Minimal Occurrences. In Proc. of the Second Int’l Conference on Knowledge Discovery and Data Mining (KDD’96), pages 146-151, Portland, Oregon, August 2-4, 1996.
[MTV95] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovering Frequent Episodes in Sequences. In Proc. of the First Int’l Conference on Knowledge Discovery and Data Mining (KDD’95), pages 210-215, Montreal, Quebec, 1995.
[OC96] T. Oates and P. R. Cohen. Searching for Structure in Multiple Streams of Data. In Proc. of the Thirteenth Int’l Conference on Machine Learning (ICML’96), pages 346-354, Bari, Italy, July 1996.
[P97] J. Pitkow. In Search of Reliable Usage Data on the WWW. In Sixth International World Wide Web Conference, pages 451-463, Santa Clara, CA, USA, April 1997.
[PHM+00] J. Pei, J. Han, B. Mortazavi-Asl and H. Zhu. Mining Access Patterns Efficiently from Web Logs. In Proceedings Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00), pages 396-407, Kyoto, Japan, April 2000.
[PZL98] S. Parthasarathy, M. J. Zaki, and W. Li. Memory Placement Techniques for Parallel Association Mining. In 4th Int’l Conference on Knowledge Discovery and Data Mining (KDD’98), pages 304-308, New York, New York, August 1998.
[S99] S. Sarawagi. Explaining Differences in Multidimensional Aggregates. In Proc. of the 25th Int’l Conference on Very Large Data Bases (VLDB’99), pages 42-53, Edinburgh, Scotland, U.K., September 1999.
[S00] S. Sarawagi. User-Adaptive Exploration of Multidimensional Data. In Proc. of the 26th Int’l Conference on Very Large Data Bases (VLDB’00), pages 307-316, Cairo, Egypt, September 2000.
[SA96] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Fifth Int’l Conference on Extending Database Technology (EDBT’96), pages 3-17, Avignon, France, March 1996.
[SAM98] S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven Exploration of OLAP Data Cubes. In Proc. of Extending Database Technology (EDBT’98), pages 168-182, Valencia, Spain, March 1998.
[SCD+00] J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan. Web Usage Mining: Discovery and Application of Usage Patterns from Web Data. SIGKDD Explorations, 1(2):12-23, January 2000.
[SS00] S. Sarawagi and G. Sathe. I3: Intelligent, Interactive Investigation of OLAP Cubes. In Proc. of the 2000 ACM SIGMOD Int’l Conference on Management of Data, page 589, Dallas, Texas, USA, May 2000.
[SZA+97] C. Shahabi, A. Zarkesh, J. Adibi, and V. Shah. Knowledge Discovery from Users Web-Page Navigation. In Proceedings of the IEEE RIDE’97 Workshop, pages 20-29, Birmingham, England, April 1997.
[WCA99] World Wide Web Committee Web Usage Characterization Activity. http://www.w3c.com/WCA, 1999.
[WEB95] Software Inc. Webtrends. http://www.webtrends.com, 1995.
[WYB98] K. Wu, P. S. Yu, and A. Ballman. SpeedTracer: A Web Usage Mining and Analysis Tool. IBM Systems Journal, 37(1):89-105, 1998.
[YJM+96] T. W. Yan, M. Jacobsen, H. G. Molina, and U. Dayal. From User Access Patterns to Dynamic Hypertext Linking. In Proceedings of the 5th International World-Wide Web Conference, pages 7-11, Paris, France, May 1996.
[Z98] M. J. Zaki. Scalable Data Mining for Rules. Ph.D. Thesis, University of Rochester, 1998.
[Z99] O. R. Zaïane. Resource and Knowledge Discovery from the Internet and Multimedia Repositories. PhD thesis, School of Computing Science, Simon Fraser University, March 1999.
[Z00] M. J. Zaki. Parallel Sequence Mining on Shared-Memory Machines. In Large-Scale Parallel Data Mining, M. J. Zaki and C.-T. Ho, editors, Lecture Notes in Artificial Intelligence (LNAI 1759), Springer-Verlag, Berlin, 2000.
[ZXH98] O. R. Zaïane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. In Proceedings of Advances in Digital Libraries Conference (ADL’98), pages 19-29, Santa Barbara, CA, USA, April 1998.