data miing and knowledge discvoery - web data mining

Web Mining Research at the Center for Web Intelligence

Bamshad MobasherCenter for Web Intelligence

School of Computer Science, Telecommunication, and Information SystemsDePaul University, Chicago

Bamshad MobasherCenter for Web Intelligence

School of Computer Science, Telecommunication, and Information SystemsDePaul University, Chicago

2

What is Web Mining

i From its very beginning, the potential of extracting valuable knowledge from the Web has been quite evident4Web mining is the collection of technologies to fulfill this potential.

Web Mining Definition

application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

application of data mining and machine learning techniques to extract useful knowledge from the content, structure, and usage of Web resources.

3

A Taxonomy for Web Mining

Web Mining

DatabaseApproachDatabaseApproach

Web ContentMining

IR / IAApproach

Intelligent Search ToolsInfo. Filtering/CategorizationPersonalized Info. AgentsCo-Citation AnalysisContent-Based Filtering

Web Query SystemsConcept HierarchiesStructure DiscoveryOntology LearningWeb Info. Integration

E-Commerce Data AnalysisSite Evaluation/optimizationE-Metrics and Web AnalyticsUser Modeling/ProfilingWeb PersonalizationWeb Communities

Web Structure Mining

Web Structure Mining

Web UsageMining

Web UsageMining

4

Web Usage Mining

i The Problem4 Analyze Web navigational data to find how the Web site is used by Web

usersi Current Approaches

4 Statistical reports4 OLAP4 Data mining/Machine Learning techniques

i Applications4 Web server caching4 Link prediction and analysis4 Web personalization4 Web site adaptation4 Web site evaluation and reorganization

i Challenge4 Quantitatively capture Web users’ common interests and characterize their

underlying tasks

5

Web Usage Mining as a Process

Web &ApplicationServer Logs

Data CleaningPageview Identification

SessionizationData Integration

Data Transformation

Data Preprocessing

UserTransactionDatabase

User/Session ClusteringPageview ClusteringCorrelation Analysis

Association Rule MiningSequential Pattern Mining

Usage Mining

Patterns

Pattern FilteringAggregation

Characterization

Pattern Analysis

Site Content& Structure

Domain Knowledge

AggregateUser Models

Data Preparation Phase Pattern Discovery Phase

6

Common Data Mining Tasks

iClustering:4 Automatically group together users with similar purchasing or navigational

patternshUser / customer segments

4 Automatically group together items based on co-occurrence in user sessions4 Automatic creation of concept or functional hierarchies for the site

iClassification4 categorizing pages or items into a concept hierarchy 4 classifying users into behavioral groups (e.g., browser, likely to purchase,

loyal customer, etc.)

7

Common Data Mining Tasks

iAssociation Rules4 Associating presence of a set of items with other sets of items

hX Y, where X and Y are sets of itemshSupport of the itemset X ∪ Y: Pr(X,Y); Confidence of rule: Pr(Y|X)

4 Examples:h30% of clients who accessed /special-offer.html, placed an online order in

/products/software/

iSequential Patterns / Path Analysis4 Finding common sequences of events/items appearing frequently in transactions

h “x% of the time, when ABC appears in a transaction, C appears within z transactions”)

h40% of people who bought the book “How to cheat IRS” booked a flight to South America 6 months later

h15% of visitors followed the path home > * > software > * > shopping cart > checkout

8

Example: Path Analysis for Ecommerce

Visit

Search(64% successful)

No Search

Last Search SucceededLast Search Failed

10%90%

Avg sale per visit: 2.2X

Avg sale per visit: $X

Avg sale per visit: 2.8XAvg sale per visit: 0.9X

70% 30%

9

Example: Association Analysis for Ecommerce

Product Association Lift Confidence

WebsiteRecommended Products

J Jasper Towels

FullyReversibleMats

456 41%Egyptian CottonTowels

White CottonT-Shirt Bra

PlungeT-Shirt Bra 246 25%

Black embroidered underwired bra

Confidence 1.4%

Confidence 1%

i Confidence: 41% who purchased Fully Reversible Mats also purchased Egyptian Cotton Towelsi Lift: People who purchased Fully Reversible Mats were 456 times more likely to purchase the

Egyptian Cotton Towels compared to the general population

10

Web Usage Mining: clustering example

i Transaction Clusters: 4 Clustering similar user transactions and using centroid of each cluster as

a usage profile (representative for a user segment)

Current Degree Descriptions 2002/programs/0.88

M.S. in Distributed Systems program description

/programs/2002/gradds2002.asp0.82

SE 450 course description in SE program

/programs/courses.asp?depcode=96&deptmne=se&courseid=450

0.85

Web page of a lecturer who thought the above course

/people/facultyinfo.asp?id=2900.97

SE 450 Object-Oriented Development class syllabus

/courses/syllabus.asp?course=450-96-303&q=3&y=2002&id=290

1.00Pageview DescriptionURLSupport

Sample cluster centroid from dept. Web site (cluster size =330)

11

Latent Variable Models

i Assume the existence of a set of latent (unobserved) variables (or factors) which “explain” the underlying relationships between two sets of observed variables.

Probabilistic Latent Semantic Analysis (PLSA) :

Probabilistically determine the association between each latent factor and pages, or between each factor and users.

Latent factors correspond to distinguishable patterns usually associated with performing certain navigational task.

p1 p2 pn

u1 u2 um

p3

u3

z1 z2 zk

Pageviews

Latent Factors

User sessions

12

“Task-Oriented” User TrackingExample from CTI Data

0.3994Task 25

0.0489Task 20

Online application - finishDepartment main page

Top factors given this user

Online application – step 2

0.4527Task 3

Application – status check

A real user session (page listed in the order of being visited)

Admission main pageWelcome information – Chinese versionAdmission info for international students

Admission - requirementsAdmission – mail request

Admission – orientation infoAdmission – F1 visa and I20 info

Online application - startOnline application – step 1

0.0458Task 26

ProgramsOnline application – step 1

PageNameDepartment main page

Admission requirementsAdmission main page

Admission costs

Admission – international students…

…

PageNameOnline application – startOnline application – step1Online application – step2Online application - finish

Department main page

Online application - submit

13

Web Personalization

i The Problem4 dynamically serve customized content (pages, products, recommendations,

etc.) to users based on their profiles, preferences, or expected interests

i Current Approaches4 collaborative filtering

hgive recommendations to a user based on ratings of “similar” usershmost successful and wide-spread approachhbut, suffers from scalability problems given many items and users

4 content-based filteringh track pages the user visits and recommend other pages with similar contenthmay miss relationships among objects not based on content similarity

4 rule-based filteringhprovide content to users based on predefined rules (e.g., “if user has

clicked on A and the user’s zip code is 90210, then add a link to C”)h relies on static profiles for users obtained through explicit feedback

14

Web Mining Approach to Personalization

i Basic Idea4 generate aggregate user models (usage profiles) by discovering user

access patterns through Web usage mining (offline process)

4 match a user’s active session against the discovered models to provide dynamic content (online process)

i Advantages4 no explicit user ratings or interaction with users4 helps preserve user privacy, by making effective use of anonymous data4 enhance the effectiveness and scalability of collaborative filtering4 more accurate and broader recommendations than content-only approaches

15

Automatic Web Personalization:

Recommendation EngineRecommendation Engine

Web Server Client BrowserActive Session

RecommendationsIntegrated User Profile

AggregateUsage Profiles

<user,item1,item2,…>

Stored User Profile

Domain Knowledge

16

Quantitative Evaluation of Recommendation Effectiveness

i Two important factors in evaluating recommendations4 Precision: measures the ratio of “correct” recommendations to all

recommendations produced by the system

hlow precision would result in angry or frustrated users

4 Coverage: measures the ratio of “correct” recommendations to all pages/items that will be accessed by user

hlow coverage would inhibit the ability of the system to give relevant recommendations at critical points in user navigation

i Transactions Divided into Training & Evaluation Setshtraining set is used to build models (generation of aggregate profiles,

neighborhood formation)

hevaluation set is used to measure precision & coverage

h10-Fold Cross Validation generally used in the experiments

17

Problems with Web Usage Mining

i New item problem4 Patterns will not capture new items recently added4 Bad for dynamic Web sites, and in the context of collaborative filtering

i Poor machine interpretability4 Hard to generalize and reason about patterns

hNo domain knowledge used to enhance resultshE.g., knowing a student is interested in a particular program, we could

recommend the prerequisites, core or popular courses in this program, related courses in other programs, etc.

i Poor insight into the patterns themselves 4 The nature of the relationships among items or users in a pattern is not directly

available

i Solution: Integrate Semantic Knowledge with Web Usage Mining

18

Ontology-Based Usage Mining

i Approach 1: Ontology-Enhanced Transactions4 Initial transaction vector: t = <weight(p1,t), …, weight(Pn,t)>4 Transform into content-enhanced transaction:

t = <weight(o1,t), …, weight(or,t)>4 The structured objects o1, …, or are instances on ontology entities

extracted from pages p1, …, pn in the transaction4 Now mining tasks can be performed based on ontological similarity among

user transactions

i Approach 2: Ontology-Enhanced Patterns4 Discover usage patterns in the standard way4 Transform patterns by creating an aggregate representation of the

patterns based on the ontologyhRequires the categorization of similar objects into ontology classeshAlso requires the specification of different aggregation/combination

function for each attribute of each class in the ontology

19

Ontology-Based Pattern Aggregation

Year

{2000}

{1999}

{2002}

{S: 0.6; W: 0.4}

{S: 0.5; T:0.5}

{S: 0.7; T: 0.2; U: 0.1}

Name

Genre-All

Romance Comedy

Romantic Comedy

Kid&Family

Romance

Genre-All

Comedy

Genre-All

[1999, 2002]

Year

{S: 0.58; T: 0.27; W:0.09; U: 0.05}

{A: 0.5; B: 0.35; C: 0.15}

ActorGenreName

Genre-All

Romance Comedy

Object Extraction

Ontology-Based

Aggregation

Comedy

Genre ActorUsage profile

{A}Movie 1:0.50 Movie1.html0.35 Movie2.html0.15 Movie3.html

{B}Movie 2:

{C}Movie 3:

Semantic Usage pattern

20

Example: Semantically Enhanced Collaborative Filtering

i Basic Idea:4 Extend item-based collaborative filtering to incorporate both similarity

based on ratings (or usage) as well as semantic similarity based on domain knowledge

i Structured semantic knowledge about items4 Extracted automatically from the Web based on domain-specific reference

ontologies4 Used in conjunction with user-item mappings to create a combined

similarity measure for item comparisons4 Singular value decomposition used to reduce noise in the semantic data

i Semantic combination threshold4 Used to determine the proportion of semantic and rating (or usage)

similarities in the combined measure

21

Web Content Mining

i Discovery of meaningful patterns from page content and meta data across the Web or in a particular Web site

i Examples of Techniques Used4 text mining4 document categorization and clustering4 semantic analysis

i Applications4 content-based filtering4 Information extraction4 intelligent interface agents4 discovery of user profiles4 ontology acquisition and learning4 recommendation systems4 opinion extraction from the Web4 contextual information access

22

Contextual Information Access

i Integrating multiple sources of evidence4 Short-term user context; Long-term user profiles; Domain knowledge

23

ARCH Background

i ARCH - Adaptive Agent for Retrieval Based on Concept Hierarchies

4 Essence of the system is to incorporate domain-specific concept hierarchies with interactive query formulation

4 Query enhancement in ARCH uses two mutually-supporting techniques:

hSemantic – using a concept hierarchy to interactively disambiguate and expand querieshBehavioral – observing user’s past browsing behavior for user profiling

and automatic query enhancement

Domain Knowledge & Search Context

25

Overview of the System

i The system consists of an offline and an online component

iOffline component:4 Handles the learning of the concept hierarchy4 Handles the learning of the user profiles

iOnline component:4 Displays the concept hierarchy to the user4 Allows the user to select/deselect nodes4 Generates the enhanced query based on the user’s interaction with the

concept hierarchy

26

Query Enhancement Mechanism

27

Example from Yahoo Hierarchy

Term Vector for "Genres:"

music: 1.000blue: 0.15new: 0.14artist: 0.13jazz: 0.12review: 0.12band: 0.11polka: 0.10festiv: 0.10celtic: 0.10freestyl: 0.10

Term Vector for "Genres:"

music: 1.000blue: 0.15new: 0.14artist: 0.13jazz: 0.12review: 0.12band: 0.11polka: 0.10festiv: 0.10celtic: 0.10freestyl: 0.10

28

Aggregate Representation of Nodes in the Hierarchy

i A node is represented as a weighted term vector:4centroid of all documents and subcategories indexed under the node

n = node in the concept hierarchyDn = collection of individual documents Sn = subcategories under nTd = weighted term vector for document d indexed under node nTs = the term vector for subcategory s of node n

(( ) / | | / | | 1n n

n d n s nd D s S

T T D T S∈ ∈

= + +

∑ ∑

29

Online Component – User Interaction with the Hierarchy

i The initial user query is mapped to the relevant portions of hierarchy4 user enters a keyword query4 system matches the term vectors representing each node in the hierarchy

with the keyword query4 nodes which exceed a similarity threshold are displayed to the user, along

with other adjacent nodes.

i Semi-automatic derivation of user context4 ambiguous keyword might cause the system to display several different

portions of the hierarchy 4 user selects categories which are relevant to the intended query, and

deselects categories which are not

30

Generating the Enhanced Query

i Based on an adaptation of Rocchio's method for relevance feedback4 Using the selected and deselected nodes, the system produces a

refined query Q2:

4 each Tsel is a term vector for one of the nodes selected by the user, 4 each Tdesel is a term vector for one of the deselected nodes4 factors α, β, and γ are tuning parameters representing the relative

weights associated with the initial query, positive feedback, and negative feedback, respectively such that α + β - γ = 1.

2 1 sel deselQ Q T Tα β γ= ⋅ + ⋅ − ⋅∑ ∑

31

An Example

-

MusicMusic

GenresGenresArtistsArtists New ReleasesNew Releases

BluesBlues JazzJazz New AgeNew Age . . .

DixielandDixieland

+

+

+ . . .

Initial Query “music, jazz”

Selected Categories“Music”, “jazz”, “Dixieland”

Deselected Category“Blues”

Portion of the resulting term vector:

music: 1.00, jazz: 0.44, dixieland: 0.20, tradition: 0.11,band: 0.10, inform: 0.10, new: 0.07, artist: 0.06

music: 1.00, jazz: 0.44, dixieland: 0.20, tradition: 0.11,band: 0.10, inform: 0.10, new: 0.07, artist: 0.06

32

Another Example – ARCH Interface

i Initial query = pythoni Intent for search = python as a snakei User selects Pythons under Reptilesi User deselects Python under

Programming and Development and Monty Python under Entertainment

i Enhanced query:

33

Experimental Evaluation

i Performed testing using a small subset of the Yahoo hierarchy4 Approximately 50 nodes4 Domains: Business and Economy, Computers and Internet,

Entertainment, Science

i Search scenarios4 Single keyword: python, bat, hardware, bug, mouse4 Two keywords: python snake, bat mammal, baseball bat, computer

hardware, home hardware, bug surveillance, computer mouse4 Selection and deselection of nodes based on specific search intent

i Ten signal documents and ten noise documents collected for each search scenarios4 indexed using tf.idf weights

34

Experiment Evaluation, cont.

i Two types of search

4 Simple Query SearchhUsed the user’s original query

4 Enhanced Query SearchhUsed the enhanced query that was generated by ARCHhSelection and deselection of nodes based on search scenarios

iMatching - cosine similarity measure

i Comparison - precision and recall

35

Results

Average precision for Enhanced Query versus Simple Query Search using two keywords

Enhanced ARCH Query vs. Simple Query Search

0.00.20.40.60.81.01.2

0 10 20 30 40 50 60 70 80 90 100

Similarity Threshold (%)

Pre

cisi

on EnhancedSimple Precision = proportion of retrieved

documents actually relevant

Average recall for Enhanced Query versus Simple Query Search using two keywordsEnhanced ARCH Query vs. Simple Query Search

0.00.20.40.60.81.01.2

0 10 20 30 40 50 60 70 80 90 100

Similarity Threshold (%)

Rec

all Enhanced

Simple Recall = proportion of relevant documents actually retrieved

36

User Profiles & Information Context

i Can user profiles replace the need for user interaction?4 Instead of explicit user feedback, the user profiles are used for the

selection and deselection of concepts4 Each individual profile is compared to the original user query for

similarity4 Those profiles which satisfy a similarity threshold are then compared to

the matching nodes in the concept hierarchyhmatching nodes include those that exceeded a similarity threshold

when compared to the user’s original keyword query. 4 The node with the highest similarity score is used for automatic

selection; nodes with relatively low similarity scores are used for automatic deselection

37

Generation of User Profiles

i Profile Generation Component of ARCH4 passively observe user’s browsing behavior4 use heuristics to determine which pages user finds “interesting”

htime spent on the page (or similar pages)hfrequency of visit to the page or the sitehother factors, e.g., bookmarking a page, etc.

4 implemented as a client-side proxy server

i Clustering of “Interesting” Documents4 ARCH extracts feature vectors for each profile document4 documents are clustered into semantically related categories

hwe use a clustering algorithm that supports overlapping categories to capture relationships across clustershalgorithms: overlapping version of k-means; hypergraph partitioning

4 profiles are the significant features in the centroid of each cluster

38

Results Based on User Profiles

Simple vs. Enhanced Query Search

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

0 10 20 30 40 50 60 70 80 90 100Threshold (%)

Re

call

Simple Query Single Keyword

Simple Query Two Keywords

Enhanced Query with User Profiles

Simple vs. Enhanced Query Search

0.0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

1.1

0 10 20 30 40 50 60 70 80 90 100

Threshold (%)

Pre

cisi

on

Simple Query Single KeywordSimple Query Two KeywordsEnhanced Query with User Profiles

39

Future Work on ARCH

i Integrate the system with a specific search engine on the World Wide Web4Weighted term vector queries must be translated into Boolean queries

i Experiment using other concept hierarchies4 Specific domains (e.g., medical)4 Design is intended to be independent of specific concept hierarchy

i Fully integrate and evaluate the User Profile Modulei Additional information4 http://www.ahusieg.com/arch

40

Some Ongoing Projects at CWI

iWeb usage mining and Web analytics4 Intelligent techniques for sessionization and pageview identification4 Integration of usage, content, and structure4 Ontology-based Web usage mining (Semantic Web Usage Mining)

iWeb personalization4 Data mining techniques for personalization4 Integrating domain knowledge (ontologies) in user modeling4 Hybrid recommendation systems4 Secure Personalization

iWeb content mining / Information Agents4 Automatically learning domain-specific ontologies from Web sites4 Integrating domain ontologies, learned user profiles, and short-term user

interests for “contextual information access” on the WebhThe ARCH project