Web classification Ontology and Taxonomy

Post on 20-Dec-2015


Page 1: Web classification Ontology and Taxonomy. 2 References Using Ontologies to Discover Domain-Level Web Usage Profiles {hdai,mobasher}@cs.depaul.edu Learning

Web classification

Ontology and Taxonomy


References

Using Ontologies to Discover Domain-Level Web Usage Profiles. {hdai,mobasher}@cs.depaul.edu

Learning to Construct Knowledge Bases from the World Wide Web. {M. Craven, D. DiPasquo, T. Mitchell, K. Nigam, S. Slattery}, Carnegie Mellon University, Pittsburgh, USA; {D. Freitag, A. McCallum}, Just Research, Pittsburgh, USA


Definitions

Ontology: an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

Taxonomy: a classification of organisms into groups based on similarities of structure, origin, etc.


Goal

Capture and model behavioral patterns and profiles of users interacting with a web site.

Why?
  Collaborative filtering
  Personalization systems
  Improve the organization and structure of the site
  Provide dynamic recommendations (www.recommend-me.com)


Algorithm 0 (by Rafa’s brother: Gabriel)

Recommend pages viewed by other users with similar page ranks.

Problems
  New item problem
  Doesn't consider content similarity nor item-to-item relationships.


User session

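User session $s = \langle w(p_1,s), w(p_2,s), \ldots, w(p_n,s) \rangle$, where $w(p_i,s)$ is the weight associated with page $p_i$ in session $s$.

Session clusters $\{cl_1, cl_2, \ldots\}$, where each $cl_i$ is a subset of the set of sessions.

Usage profile: $pr_{cl} = \{\langle p, weight(p, pr_{cl}) \rangle : weight(p, pr_{cl}) \ge \mu\}$, with $weight(p, pr_{cl}) = \frac{1}{|cl|} \sum_{s \in cl} w(p,s)$.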
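The usage profile definition can be sketched in a few lines of Python. This is a minimal illustration, assuming each session is stored as a dictionary from page to weight and that a page missing from a session has weight 0; the names `usage_profile`, `cluster` and `mu` are illustrative and do not come from the slides.

```python
from collections import defaultdict

def usage_profile(cluster, mu):
    """Return {page: mean weight} for pages whose mean weight in the cluster is >= mu."""
    totals = defaultdict(float)
    for session in cluster:                 # session: dict page -> w(p, s)
        for page, weight in session.items():
            totals[page] += weight
    size = len(cluster)                     # |cl|
    return {page: total / size for page, total in totals.items() if total / size >= mu}

# Hypothetical cluster of four sessions.
cluster = [
    {"p1": 1.0, "p2": 1.0},
    {"p1": 1.0, "p2": 1.0},
    {"p1": 1.0},
    {"p1": 1.0, "p3": 0.5},
]
profile = usage_profile(cluster, mu=0.5)
print(profile)
```

Page p3 is dropped because its mean weight (0.125) falls below the threshold.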


Algorithm 1
1. For every session, create a vector containing the viewed pages and a weight for each page.
2. Each vector represents a point in an N-dimensional space, so we can identify clusters.
3. For a new session, check which cluster its vector/point belongs to, and recommend the high-scoring pages of that cluster.

Problems
  New item problem
  Doesn't consider content similarity nor item-to-item relationships
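Step 3 of Algorithm 1 might look like the sketch below. The slides do not say which clustering method or distance is used, so the clusters are taken as given and cosine similarity to each cluster centroid is an assumption; all function names and the sample data are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(p, 0.0) for p, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def centroid(cluster):
    """Mean weight of every page over the sessions of one cluster."""
    pages = {p for session in cluster for p in session}
    return {p: sum(s.get(p, 0.0) for s in cluster) / len(cluster) for p in pages}

def recommend(session, clusters, top_n=2):
    """Pick the closest cluster and return its highest-weighted unseen pages."""
    best = max((centroid(c) for c in clusters), key=lambda c: cosine(session, c))
    unseen = [(p, w) for p, w in best.items() if p not in session]
    unseen.sort(key=lambda pw: (-pw[1], pw[0]))
    return [p for p, _ in unseen[:top_n]]

# Hypothetical, already-clustered sessions.
clusters = [
    [{"home": 1.0, "movies": 1.0, "actors": 0.5}, {"movies": 1.0, "actors": 1.0}],
    [{"home": 1.0, "houses": 1.0, "loans": 0.5}, {"houses": 1.0, "loans": 1.0}],
]
suggestions = recommend({"movies": 1.0}, clusters)
print(suggestions)
```

A page that no earlier session has viewed has weight 0 in every centroid, so it can never be suggested, which is the new item problem listed above.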


Algorithm 2: keyword search
  Solves the new item problem.
  Not good enough:
    A page can contain information about more than one object.
    Fundamental data may be linked from the page rather than included in it.
    What exactly is a keyword?

Solution: domain ontologies for objects


Domain Ontologies

Domain-Level Aggregate Profile: a set of pseudo objects, each characterizing objects of a different type that occur commonly across the user sessions.

Class: $C$
Attributes: $a = \langle D_a, T_a, \le_a, \Psi_a \rangle$
  $T_a$: type of the attribute
  $D_a$: domain of the values for $a$ (red, blue, ...)
  $\le_a$: ordering relation among $D_a$
  $\Psi_a$: combination function


Example: movie web site

Classes: movies, actors, directors, etc.

Attributes:
  Movies: title, genre, starring actors
  Actors: name, filmography, gender, nationality

Functions:
  $\Psi_{actor}(\langle\{S: 0.7;\ T: 0.2;\ U: 0.1\}, 1\rangle, \langle\{S: 0.5;\ T: 0.5\}, 0.7\rangle) = \sum_i (w_i \cdot w_o) / \sum_i w_i$
  $\Psi_{year}(\{1991\}, \{1994\}) = \{1991, 1994\}$
  $\Psi_{is\_a}(\{person, student\}, \{person, TA\}) = \{person\}$
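The three combination functions can be sketched as follows. Several readings here are assumptions rather than statements from the slides: in the actor formula, w_i is taken to be the weight of object i in the profile and w_o the actor's weight inside that object; the year function is taken to be a set union; and the is_a function is taken to be the longest common prefix of two concept paths listed from the root.

```python
def psi_actor(objects):
    """objects: list of (actor_weights, object_weight) pairs; returns combined actor weights."""
    total = sum(w_obj for _, w_obj in objects)
    combined = {}
    for actors, w_obj in objects:
        for name, w in actors.items():
            combined[name] = combined.get(name, 0.0) + w * w_obj
    return {name: s / total for name, s in combined.items()}

def psi_year(*year_sets):
    """Union of the year sets (assumed reading of the slide's example)."""
    return set().union(*year_sets)

def psi_is_a(*paths):
    """Longest common prefix of concept paths, e.g. ['person', 'student']."""
    common = []
    for level in zip(*paths):
        if len(set(level)) != 1:
            break
        common.append(level[0])
    return common

actors = psi_actor([({"S": 0.7, "T": 0.2, "U": 0.1}, 1.0),
                    ({"S": 0.5, "T": 0.5}, 0.7)])
years = psi_year({1991}, {1994})
concept = psi_is_a(["person", "student"], ["person", "TA"])
print(actors, years, concept)
```

Under this reading the combined actor weights still sum to 1 whenever each input distribution does, which is one reason to normalize by the sum of the object weights.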


Movie

Title       | Genre                      | Actor                                            | Year
About a Boy | {Romantic; Comedy; Family} | {H. Grant: 0.6; R. Weisz: 0.1; T. Collette: 0.3} | 2002


Creating an Aggregated Representation of a usage profile

$pr = \{\langle o_1, w_{o_1}\rangle, \ldots, \langle o_n, w_{o_n}\rangle\}$

$o_i$: object; $w_{o_i}$: significance of $o_i$ in the profile $pr$

Assume all the objects are instances of the same class.

Create a new virtual object $o'$ with attributes $a_i' = \Psi_i(o_1, \ldots, o_n)$


Item level usage profile

Name            | Genre                                                     | Actor                                | Year
{A}             | Genre-all, Romance, Romance Comedy, Comedy, Kids & family | {S: 0.7; T: 0.2; U: 0.1}             | {2002}
{B}             | Genre-all, Romance, Comedy                                | {S: 0.5; T: 0.5}                     | {1999}
{C}             | Genre-all, Romance                                        | {W: 0.6; S: 0.4}                     | {2001}
{A:1; B:1; C:1} | Genre-all, Romance                                        | {S: 0.58; T: 0.27; W: 0.09; U: 0.05} | {1999, 2002}


A real (estate property) example

Property

Price  | Location  | Room num
{300K} | {Chicago} | {5}


Item Level Usage Profile

Weight | Price | Location            | Room num
1      | 475K  | Chicago             | 5
0.7    | 299K  | Chicago             | 4
0.18   | 272K  | Evanston            | 4
0.18   | 99K   | Chicago             | 3
1      | 365K  | {Chicago, Evanston} | 4
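The last row of the item-level usage profile above can be reproduced from the four rows before it. The sketch assumes that the combination function for price and room count is a weight-weighted mean and that the one for location is a set union; the slides do not state this explicitly, but under these assumptions the numbers come out as shown on the slide.

```python
# (weight, price in thousands, location, rooms), copied from the profile above.
profile = [
    (1.0, 475, "Chicago", 5),
    (0.7, 299, "Chicago", 4),
    (0.18, 272, "Evanston", 4),
    (0.18, 99, "Chicago", 3),
]

def weighted_mean(pairs):
    """Mean of the values x in (w, x) pairs, weighted by w."""
    total = sum(w for w, _ in pairs)
    return sum(w * x for w, x in pairs) / total

price = weighted_mean([(w, p) for w, p, _, _ in profile])
rooms = weighted_mean([(w, r) for w, _, _, r in profile])
pseudo_object = {
    "price": round(price),                           # about 364.6, i.e. 365K
    "location": {loc for _, _, loc, _ in profile},   # assumed: set union
    "rooms": round(rooms),                           # about 4.4, i.e. 4
}
print(pseudo_object)
```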


Algorithm 2
Do not just recommend items viewed by other users; recommend items similar to the class representative.

Advantages:
  More accurate
  Needs fewer examples
  No new item problem
  Also considers content similarity (item-to-item relationships).


Item Level Usage Profile

Weight | Price | Location            | Room #
1      | 475K  | Chicago             | 5
0.7    | 299K  | Chicago             | 4
0.18   | 272K  | Evanston            | 4
0.18   | 99K   | Chicago             | 3
1      | 365K  | {Chicago, Evanston} | 4
1      | 370K  | Chicago             | 4


Final Algorithm

Given a web site:
1. Classify its contents into classes and attributes.
2. Merge the objects of each user profile and create a pseudo object.
3. Recommend according to this pseudo object.
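Step 3 can be illustrated by ranking catalog items by their similarity to the pseudo object. The similarity function below is purely hypothetical (the slides do not define one): it averages a relative price closeness, a room-count closeness and a location match. The catalog entries are made up, apart from the 370K Chicago property that appears in the slides.

```python
def similarity(item, pseudo):
    """Hypothetical similarity in [0, 1] between a listing and the pseudo object."""
    price_sim = 1.0 - min(abs(item["price"] - pseudo["price"]) / pseudo["price"], 1.0)
    rooms_sim = 1.0 / (1.0 + abs(item["rooms"] - pseudo["rooms"]))
    location_sim = 1.0 if item["location"] in pseudo["location"] else 0.0
    return (price_sim + rooms_sim + location_sim) / 3.0

pseudo = {"price": 365, "location": {"Chicago", "Evanston"}, "rooms": 4}
catalog = [
    {"id": "h1", "price": 370, "location": "Chicago", "rooms": 4},
    {"id": "h2", "price": 150, "location": "Chicago", "rooms": 2},
    {"id": "h3", "price": 360, "location": "Skokie", "rooms": 4},
]
ranked = sorted(catalog, key=lambda item: similarity(item, pseudo), reverse=True)
print([item["id"] for item in ranked])
```

Because the ranking only looks at item attributes, a listing that no user has viewed yet can still be recommended.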


Problems
  A per-topic solution
  Found patterns can be incomplete
  User patterns may change with time: the (for movies) "I loved ET" problem.
  Needs cookies and other methods to identify users.
  How is the weight calculated?
  Can need many examples: the "I loved American Beauty" problem.
  How to automatically group the web pages?


Break?


Constructing Knowledge Base from WWW

Goal: Automatically create a computer-understandable knowledge base from the web.

Why?
  To use in the previously described work, and similar
  "Find all universities that offer Java Programming courses"
  "Make me hotel and flight arrangements for the upcoming Linux conference"


…Constructing Knowledge Base from WWW

How?
  Use machine learning to create information extraction methods for each of the desired types of knowledge.
  Apply them to extract symbolic, probabilistic statements directly from the web: Student-of(Rafa, sdbi) = 99%

Used method
  Provide an initial ontology (classes and relations)
  Training examples: 3 out of 4 university sites (8000 web pages, 1400 web-page pairs)


Example of web pages

Fundamentals of CS Home Page
  Instructors: Jim, Tom

Jim's Home Page
  I teach several courses: Fundamental of CS, Intro to AI
  My research includes: Intelligent web agents

Classes: Faculty, Research-project, Student, Staff, (Person), Course, Department, Other
Relations: instructor-of, members-of-project, department-of.


Ontology (diagram)
  Entity: Homepage, Homepage title
  Activity
  Other
  Person: Department of, Project of, Course taught by, Name of
  Course: Instructor of, TAs of
  Faculty: Project led by, Student of
  Research Project: Members of project

Web KB instances (diagram)
  Jim: Courses taught by: Fundamental of CS, Intro to AI; Home-page: …
  Fundamental of CS: Instructor of: jim, tom; Home-page: ….


Problem Assumption

Class instance: one instance per web page
  ≠ Multiple instances in one web page
  ≠ Multiple linked/related web pages for one instance
  ≠ Elvis problem

Relation R(A,B) is represented by:
  Hyperlinks A→B or A→C→D→…→B
  Inclusion in a particular context ("I teach Intro2cs")
  Statistical model of typical words


To Learn

1. Recognizing class instances by classifying bodies of hypertext

2. Recognizing relation instances by classifying chains of hyperlinks

3. Extracting text fields


Recognizing class instances by classifying bodies of hypertext

1. Statistical bag-of-words approach
   1. Full text
   2. Hyperlinks
   3. Title/Head
2. Learning first order rules

Combine the previous 4 methods


Statistical bag-of-words approach

Context-less classification
Given a set of classes $C = \{c_1, c_2, \ldots, c_N\}$
Given a document consisting of $n \le 2000$ words $\{w_1, w_2, \ldots, w_n\}$
$c^* = \arg\max_c \Pr(c \mid w_1, \ldots, w_n)$
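One common way to approximate this argmax is a naive Bayes classifier, which assumes the words are independent given the class. The slide only states the argmax itself, so the independence assumption, the add-one smoothing and the tiny training set below are all illustrative choices, not the method of the original work.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (class_label, list_of_words). Returns the counts the classifier needs."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, words in docs:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    """Return argmax_c of log Pr(c) + sum_i log Pr(w_i | c), with add-one smoothing."""
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best, best_score = None, -math.inf
    for c in class_counts:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)
        for w in words:
            if w in vocab:                  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical training pages, reduced to a few words each.
model = train([
    ("course", ["homework", "assignment", "due", "class"]),
    ("course", ["course", "hours", "assignment"]),
    ("student", ["my", "home", "page", "university"]),
    ("faculty", ["professor", "research", "university"]),
])
label = classify(["assignment", "due"], model)
print(label)
```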


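Rows: predicted class. Columns: actual class.

Predicted   course  student  faculty  staff  research  dept  other  Accuracy
Course         202       17        0      0         1     0    552      26.2
Student          0      421       14     17         2     0    519      43.3
Faculty          5       56      118     16         3     0    264      17.9
Staff            0       15        1      4         0     0     45       6.2
Research         8        9       10      5        62     0    384      13
Dept            10        8        3      1         5     4    209       1.7
Other           19       32        7      3        12     0   1064      93.6
Coverage      82.8     75.4     77.1    8.7      72.9   100     35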
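Accuracy and coverage can be recomputed from the counts. The sketch assumes that rows are predicted classes and columns are actual classes, that accuracy is the diagonal entry divided by its row sum, and that coverage is the diagonal entry divided by its column sum. The split of the run-together digits into cells is also a reading of the slide, checked against the reported percentages.

```python
classes = ["course", "student", "faculty", "staff", "research", "dept", "other"]

# matrix[i][j]: pages predicted as classes[i] whose actual class is classes[j].
matrix = [
    [202, 17, 0, 0, 1, 0, 552],
    [0, 421, 14, 17, 2, 0, 519],
    [5, 56, 118, 16, 3, 0, 264],
    [0, 15, 1, 4, 0, 0, 45],
    [8, 9, 10, 5, 62, 0, 384],
    [10, 8, 3, 1, 5, 4, 209],
    [19, 32, 7, 3, 12, 0, 1064],
]

# Accuracy: share of the pages predicted as c that really are c (row-wise).
accuracy = {c: 100 * matrix[i][i] / sum(matrix[i]) for i, c in enumerate(classes)}
# Coverage: share of the actual c pages that were predicted as c (column-wise).
coverage = {c: 100 * matrix[i][i] / sum(row[i] for row in matrix)
            for i, c in enumerate(classes)}

for c in classes:
    print(f"{c:10s} accuracy={accuracy[c]:5.1f} coverage={coverage[c]:5.1f}")
```

Under this reading the recomputed values agree with the slide everywhere except the faculty accuracy, which comes out near 25.5 rather than 17.9; it is unclear which of the two is a transcription slip.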


Statistical bag-of-words approach: $\Pr(w_i \mid c) \log \frac{\Pr(w_i \mid c)}{\Pr(w_i \mid \neg c)}$

student              faculty              course
my          0.0247   DDDD       0.0138    course       0.0151
page        0.0109   of         0.0113    DD:DD        0.013
home        0.0104   and        0.0109    homework     0.0106
am          0.0085   professor  0.0088    will         0.0088
university  0.0061   computer   0.0073    D            0.008
computer    0.006    research   0.006     assignments  0.0079
science     0.0059   science    0.0057    class        0.0073
me          0.0058   university 0.0049    hours        0.0059
at          0.0049   DDD        0.0042    assignment   0.0058
here        0.0046   systems    0.0042    due          0.0058

research-project     department              other
group       0.006    department     0.0179   D       0.0374
project     0.0049   science        0.0153   DD      0.0246
research    0.0049   computer       0.0111   the     0.0153
of          0.003    faculty        0.007    eros    0.001
laboratory  0.0029   information    0.0069   hplayD  0.0097
systems     0.0028   undergraduate  0.0058   uDDb    0.0067
and         0.0027   graduate       0.0047   to      0.0064
our         0.0026   staff          0.0045   bluto   0.0052
system      0.0024   server         0.0042   gt      0.005
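The word score in the title of this slide can be computed directly once the two conditional probabilities are estimated. The probabilities below are hypothetical values chosen only to show the behavior; they are not taken from the tables.

```python
import math

def word_score(p_in_class, p_outside):
    """Pr(w|c) * log(Pr(w|c) / Pr(w|~c)) for a single word."""
    return p_in_class * math.log(p_in_class / p_outside)

# Hypothetical (Pr(w|c), Pr(w|~c)) estimates for a "course" class.
estimates = {
    "homework": (0.010, 0.001),
    "the": (0.050, 0.050),
    "assignment": (0.008, 0.001),
}
ranked = sorted(estimates, key=lambda w: word_score(*estimates[w]), reverse=True)
print(ranked)
```

A frequent word such as "the" scores 0 when it is equally likely inside and outside the class, while a rarer but class-specific word scores high.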


Accuracy/Coverage tradeoff for full-text classifiers


Accuracy/coverage tradeoff for hyperlinks classifiers


Accuracy/Coverage for title heading classifiers


Learning first order rules

The previous method doesn’t consider relations between pages

Example: a page is a course home page if it contains the words "textbook" and "TA" and points to a page containing the word "assignment".

FOIL is a learning system that constructs Horn clause programs from examples


Relations
  Has_word(Page). Stemmed words: computer = computing = comput. 200 occurrences but less than 30% in other class pages.
  Link_to(page, page)

m-estimate accuracy $= \frac{n_c + m \cdot p}{n + m}$
  $n_c$: number of instances correctly classified by the rule
  $n$: total number of instances classified by the rule
  $m = 2$
  $p$: proportion of instances in the training set that belong to that class

Predict each class with confidence = best_match / total_#_of_matches
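The m-estimate is a one-line function; the sketch below uses m = 2 as on the slide, while the example counts and the prior are made-up values.

```python
def m_estimate(n_correct, n_covered, prior, m=2):
    """(n_c + m * p) / (n + m): rule accuracy smoothed toward the class prior p."""
    return (n_correct + m * prior) / (n_covered + m)

# Hypothetical rules for a class that makes up 20% of the training set.
broad_rule = m_estimate(n_correct=9, n_covered=10, prior=0.2)   # 9.4 / 12
narrow_rule = m_estimate(n_correct=1, n_covered=1, prior=0.2)   # 1.4 / 3
print(broad_rule, narrow_rule)
```

The rule that is right on its single match has a raw accuracy of 100%, yet its m-estimate stays below that of the rule that is right on 9 of 10 matches, because little evidence keeps the estimate close to the prior.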


New learned rules

student(A) :- not(has_data(A)), not(has_comment(A)), link_to(B,A), has_jame(B), has_paul(B), not(has_mail(B)).

faculty(A) :- has_professor(A), has_ph(A), link_to(B,A), has_faculti(B).

course(A) :- has_instructor(A), not(has_good(A)), link_to(A,B), not(link_to(B, 1)),has_assign(B).


Accuracy/coverage for FOIL page classifiers


Boosting

Which classifier gives the best prediction depends on the class.

Combine the predictions using the confidence measure.


Accuracy/coverage tradeoff for combined classifiers (2000 words vocabulary)


Boosting

Disappointing: somehow it is not uniformly better.

Possible solutions
  Using reduced-size dictionaries (next)
  Using other methods for combining predictions (voting instead of best_match / total_#_of_matches)
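The two ways of combining predictions mentioned here can be contrasted in a short sketch. It assumes each classifier outputs a (class, confidence) pair; the function names and the sample predictions are hypothetical.

```python
from collections import Counter

def by_confidence(predictions):
    """Return the class of the single most confident prediction."""
    return max(predictions, key=lambda p: p[1])[0]

def by_vote(predictions):
    """Return the class predicted by the most classifiers (ties: first seen)."""
    return Counter(label for label, _ in predictions).most_common(1)[0][0]

# Hypothetical outputs of three classifiers for the same page.
predictions = [("student", 0.55), ("faculty", 0.90), ("student", 0.60)]
print(by_confidence(predictions), by_vote(predictions))
```

The two strategies can disagree, as they do here: one confident classifier wins under the confidence rule, while two moderately confident ones win the vote.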


Accuracy/coverage tradeoff for combined classifiers (200 words vocabulary)


Multi-Page segments

The group is the longest prefix (indicated in parentheses):
  (@/{user,faculty,people,home,projects}/*)/*.{html,htm}
  (@/{cs???,www/,*})/*.{html,htm}
  (@/{cs???,www/,*})/
  …

A primary page is any page whose URL matches:
  @/index.{html,htm}
  @/home.{html,htm}
  @/%1/%1.{html,htm}
  …

If no page in the group matches one of these patterns, then the page with the highest score for any non-Other class is a primary page.

Any non-primary page is tagged as Other.
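The primary-page heuristic can be sketched with regular expressions. Two readings are assumptions here: "@" is taken to stand for the group prefix, and "%1" for the same name appearing twice. The grouping step is taken as given, and `scores` stands for a hypothetical classifier output (the best non-Other score of each page).

```python
import re

# Patterns are matched against the part of the URL that follows the group prefix.
PRIMARY_PATTERNS = [r"index\.html?", r"home\.html?", r"([^/]+)/\1\.html?"]

def primary_pages(prefix, urls, scores):
    """Return the primary pages of one URL group; all other pages would be tagged Other."""
    primary = []
    for url in urls:
        rest = url[len(prefix):].lstrip("/")
        if any(re.fullmatch(pattern, rest) for pattern in PRIMARY_PATTERNS):
            primary.append(url)
    if not primary:
        # Fallback: the page with the highest score for any non-Other class.
        primary = [max(urls, key=lambda u: scores[u])]
    return primary

prefix = "http://cs.example.edu/user/jim"
group = [prefix + "/index.html", prefix + "/pubs.html", prefix + "/cv.htm"]
scores = {group[0]: 0.4, group[1]: 0.7, group[2]: 0.2}
with_index = primary_pages(prefix, group, scores)
without_index = primary_pages(prefix, group[1:], scores)
print(with_index, without_index)
```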


Accuracy/coverage tradeoff for the full text after URL grouping heuristics


Conclusion: Recognizing Classes

Hypertext provides redundant information.

We can classify using several methods:
  Full text
  Heading/title
  Hyperlinks
  Text in neighboring pages
  + Grouping pages

No method alone is good enough.

Combining the predictions of the classification methods gives a better result.


Learning to Recognize Relation Instances

Assume: relations are represented by hyperlinks.

Given the following background relations:
  Class(Page)
  Link-to(Hyperlink, P1, P2)
  Has-word(H): the word is part of the hyperlink
  All-words-capitalized(H)
  Has-alphanumeric-word(H): "I Teach CS2765"
  Has-neighborhood-word(H): neighborhood = paragraph


…Learning to Recognize Relation Instances

Try to learn the following:
  Members_of_project(P1,P2)
  Instructors_of_course(P1,P2)
  Department_of_person(P1,P2)


Learned relations

instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).
Test Set: 133 Pos, 5 Neg

department_of(A,B) :- person(A), department(B), link_to(C,D,A), link_to(E,F,D), link_to(G,B,F), has_neighborhood_word_graduate(E).
Test Set: 371 Pos, 4 Neg

members_of_project(A,B) :- research_project(A), person(B), link_to(C,A,D), link_to(E,D,B), has_neighborhood_word_people(C).
Test Set: 18 Pos, 0 Neg
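The simplest of these rules, the one for instructors of a course, can be evaluated over a set of link facts as sketched below. It assumes that Link-to(Hyperlink, P1, P2) means a hyperlink from page P1 to page P2, so the rule fires when a person's page links to a course page. The page identifiers are hypothetical, loosely following the earlier "Jim's Home Page" example.

```python
def instructors_of(courses, persons, links):
    """Apply: instructors_of(A,B) :- course(A), person(B), link_to(C,B,A).

    links: iterable of (hyperlink_id, source_page, target_page) facts.
    Returns the set of (course, person) pairs derived by the rule.
    """
    return {(target, source) for _, source, target in links
            if target in courses and source in persons}

courses = {"fundamentals-of-cs", "intro-to-ai"}
persons = {"jim", "tom"}
links = [
    ("h1", "jim", "fundamentals-of-cs"),   # Jim's page links to a course page
    ("h2", "jim", "intro-to-ai"),
    ("h3", "fundamentals-of-cs", "jim"),   # course page links back: rule does not fire
    ("h4", "jim", "tom"),                  # person to person: rule does not fire
]
pairs = instructors_of(courses, persons, links)
print(sorted(pairs))
```

The rule depends on the page classes being recognized first, so errors of the page classifiers carry over to the extracted relations.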


Accuracy/Coverage tradeoff for learned relation rules


Learning to Extract Text Fields

Sometimes we want a small fragment of text, not the whole web page or class (like Jon, Peter, etc.)

"Make me hotel and flight arrangements for the upcoming Linux conference"


Predefined predicates

Let F = w1, w2, …, wj be a fragment of text.
  length({<,>,=,…}, N)
  some(Var, Path, Feat, Value), e.g. some(A, [next_token, next_token], numeric, true)
  position(Var, From, Relop, N)
  relpos(Var1, Var2, Relop, N)


A wrong example

ownername(Fragment) :-
    some(A, [prev_token], word, "gmt"),
    some(A, [ ], in_title, true),
    some(A, [ ], word, unknown),
    some(A, [ ], quadrupletonp, false),
    length(<, 3).

Last-Modified: Wednesday, 26-Jun-96 01:37:46 GMT
<title>

Bruce Randall Donald

</title><h1><img src="ftp://ftp.cs.cornell.edu/pub/brd/images/brd.gif"><p>Bruce Randall Donald<br>Associate Professor<br>


Accuracy/coverage tradeoff for Name Extraction


Conclusions

Used machine learning algorithms to create information extraction methods for each desired type of knowledge.

WebKB achieves 70% accuracy at 30% coverage.

Bag-of-words (hyperlinks, web pages and full text) and first order learning can be used to boost the confidence.

First order learning can be used to look outward from the page and consider its neighbors.


Problems
  Not as accurate as we want
    You can get more accuracy at the cost of coverage
    Use linguistic features (verbs)
    Add new methods to the booster (predict the department of a professor based on the department of his student advisees)
  A per-topic, per-language, per-… method.
  Needs hand-made labeling to learn.
  Learners with high accuracy can be used to teach learners with low accuracy.