Web Usage Mining

Chris Yang

Three Phases of Web Usage Mining

Discover usage patterns from Web data to understand and better serve the needs of Web-based applications (Srivastava et al., 2000)

Three phases
• Preprocessing
• Pattern discovery
• Pattern analysis

Motivation of Web Usage Mining

Bring vendors and end customers in electronic commerce closer together

Mass customization
• A vendor may personalize its product message for individual customers at a massive scale

Data Sources

Server
• The Web server log explicitly records the browsing behavior of site visitors and reflects access to a Web site by multiple users
• Formats
  – Common log format
  – Extended log format
• The Web log may not be completely reliable
  – Caching: files stored at the client are viewed without being requested from the server
  – Information passed through the POST method is not available in a server log
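To make the common log format concrete, the following is a minimal sketch of reading one log line. It assumes the usual field order (host, identity, user, timestamp, request line, status, size). The function name `parse_clf_line` and the sample line are made up for illustration and are not taken from any real server.

```python
import re
from datetime import datetime

# Common log format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    # The request line holds the method, the URL path, and the protocol version.
    parts = entry["request"].split()
    if len(parts) == 3:
        entry["method"], entry["url"], entry["protocol"] = parts
    else:
        entry["method"] = entry["url"] = entry["protocol"] = None
    return entry

# Hypothetical sample line, made up for illustration.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_clf_line(sample)
```

Note that nothing in such a line shows what a POST request carried in its body, and a page served from a client cache produces no line at all.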

HTTP: the Web's RPC on top of TCP/IP
• It is stateless: each request is handled independently of earlier ones, and in HTTP/1.0 a separate connection is made for every request
• Simple to implement, yet incurs overhead

Each HTTP client/server interaction consists of a single request/reply interchange
• HTTP request
• HTTP response

HTTP request message consists of:
1. Request line
   a) Method or command to apply to a server resource, e.g. GET, POST
   b) URL (without protocol and server domain name)
   c) The protocol version used by the client, e.g. HTTP/1.0
2. Request header fields
   – Pass additional information about the request and the client itself to the server, much like RPC parameters
   – Each header field consists of a name, followed by “:” and the field value
3. The entity body (optional)
   – Clients use it to pass bulk information to the server (CGI)

• Examples of HTTP methods
   • GET - retrieve the specified URL
   • POST - send this data to the specified URL
• Examples of HTTP header fields
   • Accept - lists acceptable MIME type/subtype contents
   • User-Agent - provides client browser information

Note: crlf: carriage-return/line-feed
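The three parts of a request message can be illustrated with a small sketch that splits a raw request at its crlf boundaries. The function `parse_http_request`, the URL `/cgi-bin/order`, and the browser name are hypothetical. A real server would need much more validation than this.

```python
def parse_http_request(raw):
    """Split a raw HTTP request into request line, header fields, and entity body."""
    head, _, body = raw.partition("\r\n\r\n")   # a blank line ends the header fields
    lines = head.split("\r\n")
    method, url, version = lines[0].split(" ")  # 1. request line
    headers = {}
    for line in lines[1:]:                      # 2. request header fields
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {"method": method, "url": url, "version": version,
            "headers": headers, "body": body}   # 3. entity body (may be empty)

# Hypothetical request, made up for illustration.
raw_request = (
    "POST /cgi-bin/order HTTP/1.0\r\n"
    "Accept: text/html\r\n"
    "User-Agent: ExampleBrowser/1.0\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "item=42&qty=1"
)
request = parse_http_request(raw_request)
```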

HTTP response message
1. Response header line
   – HTTP version, the status of the response, and an explanation of the returned status
2. Response header fields
   – Information that describes the server's attributes and the returned HTML document to the client
3. Entity body
   – Contains an HTML document that a client has requested

Each HTML document needs a separate request message (stateless)

• The result code 200 indicates that the request is successful.

Data Source - Server

Web server log in extended log format

Data Source - Server

Packet sniffing
• Monitor network traffic coming to a Web server
• Extract usage data directly from TCP/IP packets

Cookies
• Tokens generated by the Web server for individual client browsers to automatically track the site visitor
• The HTTP protocol is stateless, which makes tracking individual users difficult
• Cookies rely on implicit user cooperation

Query data
• CGI scripts
• The URI for a CGI program may contain additional parameter values to be passed to the CGI application
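As a small sketch of how query data can be recovered from a logged URI, the standard library's `urllib.parse` can separate the path from the parameter values. The helper name `cgi_parameters` and the example URI are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs

def cgi_parameters(uri):
    """Return the path of a URI and its query parameters as a dict of lists."""
    parts = urlsplit(uri)
    return parts.path, parse_qs(parts.query)

# Hypothetical URI, made up for illustration.
path, params = cgi_parameters("/cgi-bin/search?category=music&page=2")
```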

Data Source - Client

Client
• Remote agent (e.g. JavaScript or Java applets)
• Modifying the source code of an existing browser to enhance data collection capabilities
• Difficulty: requires client cooperation, either to enable JavaScript and Java applets or to voluntarily use the modified browser

Data Source - Proxy

Proxy
• Caching between client browsers and Web servers
• Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers
• They help to characterize the browsing behavior of a group of anonymous users sharing a common proxy server

Data Abstractions

Data from server, client and proxy help us to construct data abstractions
• Users, server sessions, episodes, click-streams, and page views

The W3C Web Characterization Activity (WCA) has drafted Web term definitions relevant to Web usage (http://www.w3.org/WCA)
• User – a single individual who is accessing files from one or more Web servers through a browser
  – Users are difficult to identify: a user may access the Web through different machines or use more than one agent on a single machine
• Page view – every file that contributes to the display on a user’s browser at one time
  – Includes several files such as frames, graphics, and scripts
  – When users download a “Web page” by clicking an anchor text or submitting a URL, they are not aware of how many frames, graphics, images, or scripts they are receiving
• Click-stream – a sequential series of page view requests
  – The server may not have all the information needed to obtain the click-stream
  – Page views served from a client or proxy-level cache are not available at the server
• User session – the click-stream of page views for a single user across the entire Web
  – In practice, only the portion of a user session that accesses a particular site can be identified
• Server session – the set of page views in a user session for a particular Web site
• Episode – any semantically meaningful subset of a user or server session

Phase 1 – Preprocessing

Usage Preprocessing

Due to the incompleteness of the available data, usage preprocessing is a difficult task

Typical problems
• Unless client-side tracking is used, only the IP address, agent, and server-side click-stream are available
• Single IP address / multiple server sessions
  – Internet service providers (ISPs) have a pool of proxy servers
  – A proxy server may have several users accessing a Web site, potentially over the same time period
• Multiple IP addresses / single server session
  – Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses
• Multiple IP addresses / single user
  – A user accesses the Web from different machines (a different IP address from session to session)
• Multiple agents / single user
  – A user who uses more than one browser appears as multiple users
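A common server-side heuristic, sketched here under the assumption that only the IP address and agent are available, is to treat each distinct (IP address, agent) pair as one user. The field names `ip`, `agent`, and `url` and the sample entries are hypothetical.

```python
from collections import defaultdict

def group_by_ip_and_agent(entries):
    """Group log entries by (IP address, user agent), a rough stand-in for a user."""
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry["url"])
    return dict(users)

# Hypothetical log entries, made up for illustration.
log_entries = [
    {"ip": "192.0.2.10", "agent": "BrowserA", "url": "/index.html"},
    {"ip": "192.0.2.10", "agent": "BrowserB", "url": "/products.html"},
    {"ip": "192.0.2.10", "agent": "BrowserA", "url": "/contact.html"},
]
users = group_by_ip_and_agent(log_entries)
```

The sketch fails in exactly the ways listed above: one person with two browsers is counted as two users, and two people behind the same proxy with the same browser are merged into one.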

Usage Preprocessing

Segmenting the click-stream into sessions
• It is difficult to know when a user leaves a Web site
• A thirty-minute timeout is often used (Catledge and Pitkow, 1995)
• In some cases, a session ID is embedded in each URI and the session is defined by the content server

Content from user action
• Content servers maintain state variables for each active session, so the information needed to determine the content served by a user request is not always available
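The thirty-minute timeout rule can be sketched as follows. It assumes the requests of one user are already grouped and sorted by time. The function name `split_into_sessions` and the sample clicks are made up for illustration.

```python
from datetime import datetime, timedelta

def split_into_sessions(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, url) requests into sessions."""
    sessions = []
    previous_time = None
    for timestamp, url in requests:
        # Start a new session on the first request or after a gap longer than the timeout.
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(url)
        previous_time = timestamp
    return sessions

# Hypothetical requests from one user, made up for illustration.
clicks = [
    (datetime(2000, 10, 10, 13, 0), "/index.html"),
    (datetime(2000, 10, 10, 13, 10), "/products.html"),
    (datetime(2000, 10, 10, 14, 5), "/index.html"),
    (datetime(2000, 10, 10, 14, 6), "/contact.html"),
]
sessions = split_into_sessions(clicks)
```

Here the 55-minute gap between the second and third request splits the click-stream into two sessions.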

Using referrer and agent information, 4 sessions are determined

Content Preprocessing and Structure Preprocessing

Content Preprocessing
• Converting the text, images, scripts, and other multimedia files into forms that are useful for Web usage mining
• Classification
  – By content
  – By intended use (Cooley et al., 1999; Pirolli et al., 1996): convey information, gather information from the user, allow navigation, or a combination

Structure Preprocessing
• Hyperlinks between page views

Phase 2 – Pattern Discovery

Statistical Analysis
• Perform descriptive statistical analysis (such as mean, median, frequency, etc.) on page views, viewing time, and length of a navigational path from the session file
• Web traffic analysis tools produce periodic reports
  – Most frequently accessed pages
  – Average view time of a page
  – Average length of a path through a site
• Useful for improving system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions
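The three report items can be computed with a few lines of standard-library code. This is a sketch over hypothetical sessions, assuming each page view is stored as a (page, seconds viewed) pair.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical server sessions as lists of (page, seconds viewed), made up for illustration.
sessions = [
    [("/index.html", 10), ("/products.html", 40)],
    [("/index.html", 20), ("/products.html", 60), ("/order.html", 90)],
    [("/index.html", 30)],
]

# Most frequently accessed page and its number of views.
page_counts = Counter(page for session in sessions for page, _ in session)
most_frequent = page_counts.most_common(1)[0]

# Average view time of one page.
index_times = [secs for session in sessions for page, secs in session
               if page == "/index.html"]
average_index_time = mean(index_times)

# Average and median length of a path through the site.
path_lengths = [len(session) for session in sessions]
average_path_length = mean(path_lengths)
median_path_length = median(path_lengths)
```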

Association Rules
• Relate pages that are most often referenced together in a single server session
• Sets of pages that are accessed together with a support value exceeding some specified threshold
• These pages may not be directly connected by hyperlinks
• Useful for Web designers to restructure their Web sites
• These rules serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site
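As a minimal sketch of the support computation, the code below counts page pairs that occur together in a session and keeps those whose support reaches a threshold. It is limited to pairs for brevity and is not a full association rule miner such as Apriori. The sessions and the function name `frequent_page_pairs` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_page_pairs(sessions, min_support):
    """Return {pair: support} for page pairs whose support reaches min_support."""
    pair_counts = Counter()
    for session in sessions:
        # Count each unordered pair at most once per session.
        for pair in combinations(sorted(set(session)), 2):
            pair_counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in pair_counts.items()
            if count / total >= min_support}

# Hypothetical server sessions, made up for illustration.
sessions = [
    ["/index.html", "/music.html", "/order.html"],
    ["/index.html", "/music.html"],
    ["/index.html", "/books.html"],
    ["/music.html", "/order.html"],
]
pairs = frequent_page_pairs(sessions, min_support=0.5)
```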

Clustering
• Group together a set of items having similar characteristics
• Usage clusters
  – Establish groups of users exhibiting similar browsing patterns
  – Useful for inferring user demographics in order to perform market segmentation
• Page clusters
  – Discover groups of pages that have related content
  – Useful for search engines and Web assistance providers
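One simple way to form usage clusters, shown here only as an illustration, is to compare the sets of pages users visited with Jaccard similarity and assign each user to the first cluster whose first member is similar enough. This "leader" scheme and the 0.5 threshold are assumptions chosen for brevity and are not a technique named in these slides. The users and pages are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets of pages."""
    return len(a & b) / len(a | b)

def leader_clusters(page_sets, threshold):
    """Assign each user to the first cluster whose leader is similar enough."""
    clusters = []   # each cluster is a list of user ids; its first member is the leader
    for user, pages in page_sets.items():
        for cluster in clusters:
            if jaccard(pages, page_sets[cluster[0]]) >= threshold:
                cluster.append(user)
                break
        else:
            # No existing leader is similar enough, so this user starts a new cluster.
            clusters.append([user])
    return clusters

# Hypothetical users and the sets of pages they visited, made up for illustration.
visited = {
    "u1": {"/music.html", "/order.html"},
    "u2": {"/music.html", "/order.html", "/index.html"},
    "u3": {"/books.html", "/help.html"},
    "u4": {"/books.html"},
}
clusters = leader_clusters(visited, threshold=0.5)
```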

Classification
• Mapping a data item into one of several predefined classes
• Develop a profile of users belonging to a particular class or category
• Requires feature extraction and selection that best describe the properties of a given class or category
• Techniques
  – Decision tree classifiers, naïve Bayesian classifiers, k-nearest neighbor classifiers, support vector machines, etc.
• E.g. 30% of users who place online orders in /Product/Music are in the 19-25 age group and live on the West Coast

Sequential Pattern
• Find inter-session patterns
  – The presence of a set of items is followed by another item in a time-ordered set of sessions or episodes
• Useful for predicting future patterns in order to place advertisements for certain user groups
• Temporal analysis
  – Trend analysis, change point detection, or similarity analysis

Dependency Modeling
• Develop a model capable of representing significant dependencies among the various variables in the Web domain
• E.g. a model representing the different stages a visitor undergoes while shopping in an online store based on the action chosen (from casual visitor to a serious potential buyer)
• Techniques
  – Hidden Markov models, Bayesian belief networks
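As a rough illustration of modeling dependencies between stages, the sketch below estimates a plain first-order Markov chain from sessions, that is, the probability of the next page given the current one. This is a simpler relative of the hidden Markov models named above, swapped in for brevity, since here the states are directly observed. The stage names and sessions are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(sessions):
    """Estimate P(next page | current page) from consecutive page views."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return {page: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
            for page, nexts in counts.items()}

# Hypothetical shopping sessions, made up for illustration.
sessions = [
    ["home", "product", "cart", "checkout"],
    ["home", "product", "home"],
    ["home", "product", "cart"],
    ["home", "help"],
]
model = transition_probabilities(sessions)
```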

Phase 3 – Pattern Analysis

Filter out uninteresting rules or patterns from the set found in the pattern discovery phase

Major Application Areas for Web Usage Mining (Srivastava et al., 2000)

Architecture of the WebSIFT system (Cooley et al., 1999)

WUM – Web Usage Miner

Navigation behavior in Web sites (Berendt and Spiliopoulou, 2000)
• A Web site is a network of structurally or semantically interrelated nodes (built in a way that reflects the designers’ intuition).

Quality of a Web site
• The conformance of the Web site’s structure to the intuition of each group of visitors accessing the site.
• The intuition of visitors is indirectly reflected in their navigation behavior (represented in the browsing pattern)

Measures of the quality of a Web site
• Quality of service (e.g. response time)
• Quality of navigation
• Accessibility
• Information utility
• Ease of use
• Attractiveness of the presentation metaphor

Sequence Mining
• Sequence mining supports the discovery of frequent paths composed of not necessarily adjacent pages
• Given a collection of transactions ordered in time (each transaction contains a set of items), discover sequences of maximal length with support above a given threshold
• A sequence is an ordered list of elements, an element being a set of items appearing together in a transaction
• Elements need not be adjacent in time, but their ordering in a sequence must not violate the time ordering of the supporting transactions

Example
• Consider a Web site with pages W, A, B, C, D, E in which there is a link from W to D
• Observed paths: WABC (1000 times), WDBC (100 times), WABDEC (400 times)
• Frequency threshold = 25%
• WD appears 500 (400 + 100) times out of 1500 (= 33%), which is above the threshold

In the above example, the link from W to D is used in only 1 out of 5 of the cases in which D follows W (100 of 500). Therefore, sequence mining is not useful in understanding the usefulness of a hyperlink.
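The arithmetic of this example can be checked with a short sketch that separates "D follows W somewhere later" from "D follows W immediately". The path frequencies are those of the example. The helper names are made up for illustration.

```python
def is_subsequence(pattern, path):
    """True if the pages of pattern occur in path in order, not necessarily adjacent."""
    remaining = iter(path)
    return all(page in remaining for page in pattern)

def is_adjacent(pattern, path):
    """True if pattern occurs in path as a contiguous run of pages."""
    return pattern in path

# Path frequencies taken from the example above.
paths = {"WABC": 1000, "WDBC": 100, "WABDEC": 400}
total = sum(paths.values())

wd_any = sum(n for path, n in paths.items() if is_subsequence("WD", path))
wd_link = sum(n for path, n in paths.items() if is_adjacent("WD", path))

support = wd_any / total          # share of paths in which D follows W
link_share = wd_link / wd_any     # share of those that used the W-to-D link directly
```

The sequence WD passes the 25% threshold with a support of one third, yet only one fifth of its occurrences come from the direct link.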

In WUM, a navigation pattern is a directed acyclic graph composed of a group of sequences that conform to a template

The purpose is to determine which links' usage is responsible for the frequency of sequences

WUM – Navigation Sequences and Navigation Patterns

A session is a directed list of page accesses performed by a user during his/her visit to a site

A navigation pattern is a structure that
• Emphasizes the common parts among the sessions
• Does not purge the dissimilar parts
• Annotates both common and non-common parts with quantitative information

• P is the set of Web pages in the site
  – If the site is dynamic in nature, P is the set of all pages that can be generated
• D is a dataset of sessions
• A session is a directed list of elements from P
• A sequence of length n is a vector s ∈ (P × N)^n (N is the set of positive integers)
• U = P × N

Example
• P = {a,b,c,d,e,f,g,h}
• ab, ac, abcde, bcbf, abdfhe are sessions appearing in D

No.  Session  Sequence                              Appearances
1    ab       (a,1) (b,1)                           40
2    ac       (a,1) (c,1)                           20
3    abcde    (a,1) (b,1) (c,1) (d,1) (e,1)         30
4    bcbf     (b,1) (c,1) (b,2) (f,1)               5
5    abdfhe   (a,1) (b,1) (d,1) (f,1) (h,1) (e,1)   10
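The Sequence column can be derived from the Session column by pairing each page with the number of times it has occurred so far in the session. A minimal sketch, with the function name `to_sequence` chosen for illustration:

```python
from collections import Counter

def to_sequence(session):
    """Turn a session such as "bcbf" into [(page, occurrence number), ...]."""
    seen = Counter()
    sequence = []
    for page in session:
        seen[page] += 1
        sequence.append((page, seen[page]))
    return sequence

sequence = to_sequence("bcbf")
```

For session 4, the second visit to page b becomes (b,2), as in the table.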

Generalized sequences
• “wildcard” [low; high] is matched by any sequence of elements that has length at least low and at most high (low ≥ 0, high ≥ low)
• “wildcard” – its range is not of interest
• A generalized sequence g is a vector g_1 g_2 … g_n
• The number of non-wildcard elements in g is the length of g, length(g)

Example
• (a,1) (b,1) [2;4] (e,1) matches Sessions 3 and 5

• The group of sequences that match g constitutes the “navigation pattern of g”, navp(g)
• The hits of g, hits(g), is the number of sequences that are matched by g
• confidence(g_i, g_j, g) = hits(g_1 … g_{i-1} g_i) / hits(g_1 … g_j)

g = (a,1) (b,1) [2;4] (e,1)
hits(g) = 30 + 10 = 40
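The value hits(g) = 40 can be reproduced with a small matcher. This is a sketch, not WUM's actual implementation: it assumes that g may match any contiguous portion of a sequence, and it represents a wildcard [low; high] as a two-element Python list. All function names are hypothetical.

```python
from collections import Counter

def to_sequence(session):
    """Turn a session string into [(page, occurrence number), ...]."""
    seen, sequence = Counter(), []
    for page in session:
        seen[page] += 1
        sequence.append((page, seen[page]))
    return sequence

def matches_at(g, seq, start):
    """True if generalized sequence g matches seq beginning exactly at index start."""
    if not g:
        return True
    head, rest = g[0], g[1:]
    if isinstance(head, list):                       # wildcard written as [low, high]
        low, high = head
        return any(matches_at(rest, seq, start + skip)
                   for skip in range(low, high + 1)
                   if start + skip <= len(seq))
    return (start < len(seq) and seq[start] == head
            and matches_at(rest, seq, start + 1))

def matches(g, seq):
    # Assumption: g may match any contiguous portion of the sequence.
    return any(matches_at(g, seq, start) for start in range(len(seq) + 1))

# Sessions and their appearances, from the table above.
log = {"ab": 40, "ac": 20, "abcde": 30, "bcbf": 5, "abdfhe": 10}
g = [("a", 1), ("b", 1), [2, 4], ("e", 1)]
matched = [session for session in log if matches(g, to_sequence(session))]
hits = sum(log[session] for session in matched)
```

Only sessions 3 (abcde) and 5 (abdfhe) match, so hits(g) is 30 + 10.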


Aggregate tree and log

navp(g) is modeled as a tree structure (aggregate tree)

Aggregate log

No.  Session  Sequence                              Appearances
1    ab       (a,1) (b,1)                           40
2    ac       (a,1) (c,1)                           20
3    abcde    (a,1) (b,1) (c,1) (d,1) (e,1)         30
4    bcbf     (b,1) (c,1) (b,2) (f,1)               5
5    abdfhe   (a,1) (b,1) (d,1) (f,1) (h,1) (e,1)   10
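One way to picture the aggregate tree, sketched here under the assumption that sessions sharing a common prefix are merged into one branch, is a prefix tree in which every node carries the number of sessions passing through it. The dictionary layout and the function name `build_aggregate_tree` are made up for illustration.

```python
def build_aggregate_tree(log):
    """Merge sessions with a common prefix into one tree; each node keeps a support count."""
    root = {"support": 0, "children": {}}
    for session, appearances in log.items():
        root["support"] += appearances
        node, seen = root, {}
        for page in session:
            seen[page] = seen.get(page, 0) + 1
            key = (page, seen[page])                 # (page, occurrence number)
            node = node["children"].setdefault(key, {"support": 0, "children": {}})
            node["support"] += appearances
    return root

# Aggregate log from the table above: session -> appearances.
log = {"ab": 40, "ac": 20, "abcde": 30, "bcbf": 5, "abdfhe": 10}
tree = build_aggregate_tree(log)
a = tree["children"][("a", 1)]
b = a["children"][("b", 1)]
```

In this sketch the node (a,1) has support 100 (sessions 1, 2, 3 and 5) and its child (b,1) has support 80 (sessions 1, 3 and 5).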

Discover navigation pattern

A “template” is a vector composed of variables ranging over the domain U and of wildcards

A mining query is a template declaration accompanied by a conjunction of constraints on the permissible values of the template variables

Example
NODE AS x y z
TEMPLATE x y [2;4] z AS t
WHERE x.support ≥ 85
AND (y.support / x.support) ≥ 0.8