1 knowledge discovery on the web georgios paliouras software and knowledge engineering laboratory...

103
1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R. “Demokritos” [email protected] http://www.iit.demokritos.gr/~paliourg

Upload: nelson-jack-horn

Post on 29-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

1

Knowledge discovery on the WebGeorgios Paliouras

Software and Knowledge Engineering LaboratoryInstitute of Informatics and Telecommunications

N.C.S.R. “Demokritos”

[email protected]://www.iit.demokritos.gr/~paliourg

Page 2: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 2

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 3: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 3

Contents

IntroductionInformation overloadThe need for intelligent information accessKnowledge discovery approaches

Knowledge discovery from text & linksKnowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 4: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 4

WWW: the new face of the Net

Once upon a time, the Internet was a forum for exchanging information. Then … …came

the Web.The Web introduced new capabilities …

…and attracted many more people …

…increasing commercial interest …

…and turning the Net into a real forum …

Page 5: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 5

Information overload

…as more people started using it ...

…the quantity of information on the Web increased...

…attracting even more people ...

…increasing the quantity of online information further...

…and leading to the overload of information for the users ...

Page 6: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 6

WWW: an expanding forum

The Web is large and volatile:More than 600.000.000 users onlineMore than 800.000 sign up every dayMore than 9.000.000 Web sitesMore than 300.000.000.000 pages onlineLess than 50% of Web sites will be there next year

… leading to the abundance problem:“99% of online information is of no

interest to 99% of the people”

Page 7: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 7

Information access services

A number of services aim to help the user gain access to online information and products ...

… but can they really cope?

Page 8: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 8

New requirements

Current indexing does not allow for wide coverage: Less than 5% of the Web covered by search engines.What I want is hardly ever ranked high enough.Product information in catalogues is often biased towards specific suppliers and outdated.Product descriptions are incomplete and insufficient for comparison purposes.‘E’ in ‘E-commerce’ stands for ‘English’: More than 70% of the Web is English.… and many more problems lead to the conclusion ...

… that more intelligent solutions are needed!

Page 9: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 9

A new generation of services

Some have already made their way to the market…

… many more are being developed as I speak …

Page 10: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 10

Approaches to Web mining

Primary data (Web content):Mainly text,with some multimedia content (increasing)and mark-up commands including hyperlinks.Underlying databases (not directly accessible).

Knowledge discovery from text and linksPattern discovery in unstructured textual data.Pattern discovery in the Web graph / hypertext.

Page 11: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 11

Approaches to Web mining

Secondary data (Web usage):Access logs collected by servers,potentially using cookies,and a variety of navigational information collected by Web clients (mainly JavaScript agents).

Knowledge Discovery from usage dataDiscovery of interesting usage patterns, mainly from server logs.Web personalization & Web intelligence.

Page 12: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 12

ContentsIntroductionKnowledge discovery from text & links

IntroductionInformation filtering and retrievalInformation extraction Ontology learning

Knowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 13: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 13

Information access

Goals:Organize documents into categories.Assign new documents to the categories.Retrieve information that matches a user query.

Long history of manually-constructed and statistical document category modelsDominating statistical idea:TFIDF=term frequency * inverse document frequency

Problems on the Web: Huge scale and high volatility demand automation.

Page 14: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 14

Text mining

Knowledge (pattern) discovery in textual data.Clarifying common misconceptions:

Text mining is NOT about assigning documents to thematic categories, but about learning document classifiers. Text mining is NOT about extracting information from text, but about learning information extraction patterns.

Difficulty: unstructured format of textual data.

Page 15: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 15

Approaches to text mining

Combination of language engineering (LE), machine learning (ML) and statistical methods:

LEML-Stats

ML-Stats

LE

Page 16: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 16

Hyperlink information is useful

Information access can be improved by identifying: authoritative pages (authorities) and resource index pages (hubs).Linked pages often contain complementary information (e.g. product offers).Thematically related pages are often linked, either directly or indirectly.

Page 17: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 17

Contents

IntroductionKnowledge discovery from text & links

IntroductionInformation filtering and retrievalInformation extraction Ontology learning

Knowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 18: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 18

Document category modelling

Training documents (pre-classified)

Pre-processing

Machine Learning

Category models (classifiers)

Stopword removal (and, the, etc.)Stemming (‘played’ ‘play’)Bag-of-words coding

Statistical selection/combination of characteristic terms (MI, PCA)

Supervised classifier learning

Dimensionality reduction

Page 19: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 19

Document category modelling

Machine Learning methods used:Memory-based learning (k-nearest neighbor).Decision-tree and decision-rule induction.Probabilistic learning (naive Bayes classifiers).Support vector machines.Boosting (combined usually with decision trees).Maximum entropy modeling.Neural networks (multi-layered perceptrons).

Problems: high dimensionality (sparseness), large training sets (scale), overlapping categories.

Page 20: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 20

Document category modelling

Example: Filtering spam email.Task: classify incoming email as spam and legitimate (2 document categories).Simple blacklist and keyword-based methods have failed.More intelligent, adaptive approaches are needed (e.g. naive Bayesian category modeling).

Page 21: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 21

Document category modelling

Step 1 (linguistic pre-processing): Tokenization, removal of stopwords, stemming/lemmatization.Step 2 (vector representation): bag-of-words or n-gram modeling (n=2,3).Step 3 (feature selection): information gain evaluation.Step 4 (machine learning): Bayesian modeling, using word/n-gram frequency.

Page 22: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 22

Link structure analysis

Improve information retrieval by scoring Web pages according to their importance in the Web or a thematic sub-domain of it.Nodes with large fan-in (authorities) provide high quality information.Nodes with large fan-out (hubs) are good starting points.

Page 23: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 23

Link structure analysisThe HITS algorithm [Kleinberg, ACM Journal 1999]:

Given a set of Web pages, e.g. as generated by a query,expand the base set by including pages that are linked to by the ones in the initial set or link to them,assign a hub and an authority weight to each page, initialised to 1,update the authority weight of page p according to the hub weights of the pages that link to it:

update the hub weight of page p according to the authority weights of the pages that it links to:

repeat the weight update for a given number of times,return a list of the pages ranked by their weights.

qpq qp ah|

pqq qp ha|

Page 24: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 24

Link structure analysis

Interesting issues:Does the social network hypothesis hold, i.e., “authorities are highly cited”? This may be unrealistic in competitive commercial domains.What happens if link structure adapts to the method, e.g. unrelated pages link to each other to increase their rating?What about interesting new pages? How will people get to them?

Page 25: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 25

Focused crawling & spidering

Crawling/Spidering: Automatic navigation through the Web by robots with the aim of indexing the Web.Crawling v. Spidering (subjective): inter-site v. intra-site navigation.Focused crawling/spidering: Efficient, thematic indexing of relevant Web pages, e.g. maintenance of a thematic portal.Underlying assumption similar to HITS: thematically similar pages are linked.

Page 26: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 26

Focused crawling & spidering

Focused crawling [Chakrabarti et al., WWW 1999]:Given an initial set of Web pages about a topic, e.g. as found in a Web directory,use document category modelling to build a topic classifier,extract the hyperlinks within the initial set of pages and add them to a queue of pages to be visited,retrieve pages from the queue,use the classifier to assess the relevance of retrieved pages,use a variant of HITS to assign a hub score to pages and the hyperlinks in the queue,re-sort the links in the queue according to their hub score,continue the retrieval of new pages, periodically updating the score of hyperlinks in the queue.

Page 27: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 27

Focused crawling & spidering

Domain-specific spidering:Goal: retrieve interesting pages, without traversing the whole site.Differences from crawling:

The site is much more restricted in size and thematic diversity than the whole of the Web.Social network analysis is less relevant within a site (no hubs and authorities).

Requirement: link scoring using local features, e.g. the anchor text and the textual context.

Page 28: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 28

Focused crawling & spidering

Domain-specific spidering [Rennie & McCallum, ICML 1999]:TRAINING:

Given a set of Web sites, within which interesting pages have been manually identified,use a simplified version of Q-learning (reinforcement learning) to learn a good navigation policy:

Assign Q values to hyperlinks, according to their distance from interesting pages.Learn a model to predict the Q value of a link using local features.

USAGE:Start at the home page of a site.Extract all links, predict their Q values and add them to a queue.Visit the first page in the queue and repeat the process.

Page 29: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 29

Focused crawling & spidering

Interesting issues:Does the social network hypothesis hold, i.e., “related pages link to each other”? This may be unrealistic in competitive commercial domains.How dependant is focused crawling on the initial set of Web pages?What happens with multi-theme hubs?What about interesting new pages? Will they be discovered?What are good descriptive features for link scoring in domain-specific spidering? Can the same features help to improve focused crawling?

Page 30: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 30

ContentsIntroductionKnowledge discovery from text & links

IntroductionInformation filtering and retrievalInformation extraction Ontology learning

Knowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 31: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 31

Information extraction

Goals:Identify interesting “events” in unstructured text. Extract information related to the events and store it in structured templates.

Typical application:Information extraction from newsfeeds.

Difficulties:Deals with unstructured or semi-structured text.Identification of entities and relations.Usually requires some understanding of the text.

Page 32: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 32

A typical extraction system

Morphology

Syntax

Semantics

Discourse

Unstructured text and database schema (event templates)Lemmatization (‘said’ ‘say’),Sentence and word separation.Part-of-speech tagging, etc.Shallow syntactic parsing.

Named-entity recognition.Co-reference resolution.Sense disambiguation.Pattern matching.

Structured data (filled templates)

Page 33: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 33

Information extraction

Long history before the birth of the Web.One of the hardest Language Engineering tasks.ARPA Message Understanding Conferences have pushed the field forward.Information overload on the Web has increased the need for IE systems.IE is achievable for very narrow domains (e.g. mergers and acquisitions).Manual construction of IE systems is time-consuming.Machine learning can help solve this problem.

Page 34: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 34

Customization of IE systems

Learning can be used to customize individual modules of an IE system:

Sentence splitting (rule learning).Part-of-speech tagging (Brill tagger).Named-entity recognition (HMMs).Co-reference resolution (rule learning).Word sense disambiguation (rule learning).

Learning speeds up the customization to new domains and new languages!

Page 35: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 35

Extraction pattern discovery

Morphology

Syntax

Semantics

Pattern Discovery

Unstructured text and database schema (event templates)Lemmatization (‘said’ ‘say),Sentence and word separation.Part-of-speech tagging, etc.Shallow syntactic parsing.Named-entity recognition.Co-reference resolution.Sense disambiguation.

IE pattern discovery.

IE patterns

Page 36: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 36

Extraction pattern discovery

Machine Learning methods used:Decision-rule learning.Grammar induction.Co-training (semi-supervised learning).Clustering.

Difficulties: Difficult to produce hand-tagged training data.Extensive use of LE.Lack of sufficient background knowledge.

Page 37: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 37

Wrappers/fact extraction

Simplified information extraction: Extract interesting facts from Web documents. Assumes structure in the documents (usually

dynamically generated from databases). Reduced demand for pre-processing and LE.

Typical application:Product comparison services (price, availability, …).

Difficulties: Semi-structured data. Different underlying database schemata and

presentation formats.

Page 38: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 38

Wrappers/fact extraction<HTML><TITLE> Some Country Codes </TITLE> <BODY><B> Some Country Codes </B> <P><B> Congo </B> <I> 242 </I><B> Egypt </B> <I> 20 </I><B> Greece </B> <I> 30 </I><B> Spain </B> <I> 34 </I><HR> <B> End </B> </BODY> </HTML>

Wrapper (page P) Skip past first occurrence of <P> in P While (next <B> is before next <HR> in P) For each <l, r> { (<B>, </B>}) , (<I>, </I>) } Extract the text between l and r return <country, code > extracted pairs

Country

Code

Congo 242

Egypt 20

Greece 30

Spain 34

Example:

Page 39: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 39

Wrapper induction

Training documents (semi-structured)

Machine Learning Structural/sequence learning

Fact extraction patterns (wrapper)

Data pre-processing Abstraction of mark-up structure (often omitted)

Database schema (interesting facts)

Page 40: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 40

Wrapper inductionSTALKER [Muslea et. al., AAAI 1998]

Sample Page:Name: Taco Bell <br> <p> -LA: 400 Pico; (213) 323-5545, (800) 222-1111. 211 Flower; (213) 424-7645. <p>-Venice: 20 Vernon; (310) 888-1010. <p> <hr>

Features: - Predefined hierarchical structure of the page (EC)

- Single-slot extraction patterns - Landmark-based extraction patterns

Embedded Catalog Tree (EC):Doc ::= Restaurant LIST(City)City ::= CityName LIST(Loc)Loc ::= Number Street LIST(Phone)Phone ::= AreaCode PhoneNumberExtraction patterns:Restaurant: *’Name:’(*)’<br>’ LIST(City) : *’<p>’(*)’<hr>’ City (iteration): *’-‘(*)’<p>’ CityName: *(*)’:’ LIST(Loc): …… etc.

Page 41: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 41

Wrapper induction

Each document is modelled by a stochastic process

A label sequence associated with each training sequence

O = {The software company ITC acquired the …} (Observed)

L = { B B P T S B…} (Hidden)

S1 S3S2 S4

O1 O2 O3

B TP S

O4

HMM states are labelled:B = “background”, P = “prefix”, T = “target”, S = “suffix”

Compute P({S1, ..ST} | {O1, .. OT}). (Viterbi algorithm)

Hidden Markov Models [McCallum et. al., ICML 2000]

Page 42: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 42

Wrapper induction

A separate HMM trained for each fieldHMM structure fixede.g. 1 Background state, 2 Prefix, 2 Suffix,

2 Target

Baum-Welch not needed to train the HMM Parameter values acquired by relative frequencies:aij =

Transition Probabilities

Vvvjc

kjc

)(

)(bj(k) =

Emission Probabilities

( )

( )s S

c i j

c i s

Page 43: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 43

Wrapper induction

Hard issues:How much LE? Traditional IE is a source of ideas.How domain-specific?How unstructured are the data/texts?Generalization to unseen sites?Multiple descriptions per page?Descriptions that span more than one pages?How much labelled data? Can we learn the template? Could active learning help?

Page 44: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 44

ContentsIntroductionKnowledge discovery from text & links

IntroductionInformation filtering and retrievalInformation extraction Ontology learning

Knowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Page 45: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 45

Ontology learning

Training documents (unclassified)

Pre-processing

Dimensionality reduction

Machine Learning

Ontologies

Stopword removal (and, the, etc.)Stemming (‘played’ ‘play’)Syntactic/Semantic analysisBag-of-words coding

Unsupervised learning (clustering and association discovery)

Hand-made thesauri (Wordnet)Term co-occurrence (LSI)

Page 46: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 46

Ontology learning

Hierarchical clustering is most suitable:Agglomerative clusteringConceptual clustering (COBWEB)Model-based clustering (EM-type: MCLUST)

… but flat clustering can also be adapted:K-means and its variantsBayesian clustering (Autoclass)Neural networks (self-organizing maps)

Association discovery (e.g. Apriori) for non-taxonomic relations.

Page 47: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 47

Ontology learning

Example: Acquisition of an ontology for tourist information. [based on Maedche & Staab, ECAI 2000]

Page 48: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 48

Ontology learning

Source data: Web pages of tourist sites.Background knowledge: generic and domain-specific ontologies.Target users: Tourist directories, large travel agencies.Goals:

Identify types of page (e.g. room descriptions) and terms/entities inside pages (e.g. hotel addresses).Identify taxonomic relations between concepts (e.g. accommodation – hotel).Identify non-taxonomic relations between concepts (e.g. accommodation – area).

Page 49: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 49

Ontology learning

Heavy linguistic pre-processing:Syntactic analysis,e.g. verb subcategorization frames:verb(arrive) -> prep(at), dir_obj(Torino).Semantic analysis, e.g. named entity recognition:

‘Via Lagrange’ -> Street namee.g. special dependency relations:

‘Hotel Concord in Torino’

Page 50: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 50

Ontology learning

Various types of clustering methods have been used for document clustering.The ASIUM algorithm [Nédellec, LLL 1999] creates a taxonomy from subcategorization frames, using variabilization, predicate invention, and conceptual clustering.In [Maedche & Staab, ECAI 2000] non-taxonomic relations are discovered using association rule mining on special binary relations of concepts.

Page 51: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 51

Ontology learning

Hard issues:How much LE?How domain-specific?Reliance on existing ontologies.Which non-taxonomic relations?What is an instance (document, paragraph, entity)?Labelling of new concepts and relations.Unlikely to have labelled data.Evaluation.

Page 52: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 52

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage data

Personalization on the WebData collection and preparation issuesPersonalized assistantsDiscovering generic user modelsSequential pattern discovery

Knowledge discovery in actionImportant open issues

Page 53: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 53

Web personalization

Basic principles [Schwartz, Webonomics, 1997]:

“The Web is ultimately a personal medium in which every user's experience is different than any other's.”

“The quantity of people visiting your site is less important than the quality of their experience.”

Page 54: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 54

Personalized information access

sourcespersonalization

serverreceivers

Page 55: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 55

Personalization v. intelligence

Better service for the user:Reduction of the information overload.More accurate information retrieval and extraction.Recommendation and guidance.

Customer relationship management:Customer segmentation and targeted advertisement.Customer attraction and retention.Service improvement (site structure and content).

Page 56: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 56

User modelling

Basic elements:Constructing models that can be used to adapt the system to the user’s requirements.Different types of requirement: interests (sports and finance news), knowledge level (novice - expert), preferences (no-frame GUI), etc.Different types of model: personal – generic.

Knowledge discovery facilitates the acquisition of user models from data.

Page 57: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 57

User Models

User model (type A): [PERSONAL]User x -> sports, stock market

User model (type B): [PERSONAL]User x, Age 26, Male -> sports, stock market

User community: [GENERIC] Users {x,y,z} -> sports, stock market

User stereotype: [GENERIC]Users {x,y,z}, Age [20..30], Male -> sports, stock

market

Page 58: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 58

Learning user models

Community 1 Community 2 User communities

User 1 User 2 User 3 User 4 User 5

Observation of the users interacting with the system.

User models

Page 59: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 59

Knowledge discovery process

Data collection

Data pre-processing

Pattern discovery

Knowledge post-processing

Collection of usage data by the server and the client.

Data cleaning, user identification, session identification

Construction of user models

Report generation, visualization, personalization module.

Page 60: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 60

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage data

Personalization on the WebData collection and preparation issuesPersonalized assistantsDiscovering generic user modelsSequential pattern discovery

Knowledge discovery in actionImportant open issues

Page 61: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 61

Usage data sources

Server-side data: access logs in Common or Extended Log Format, user queries, cookies.Client-side data: Java and Javascript agents.Intermediary data: proxy logs, packet sniffers.Registration forms: personal information and preferences supplied by the user.Demographic information: provided by census databases.

Page 62: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 62

Pre-processing usage data

Cleaning: Log entries that correspond to error responses.Trails of robots.Pages that have not been requested explicitly by the user (mainly image files, loaded automatically). Should be domain-specific.

User identification: Identification by log-in.Cookies and Javascript.Extended Log Format (browser and OS version).Bookmark user-specific URL.Various other heuristics.

Page 63: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 63

Pre-processing usage dataUser session/Transaction identification in log files:

Time-based methods, e.g. 30 min silence interval. Problems with cache. Partial solutions: special HTTP headers, Java agents.Context-based methods: e.g. separate pages into navigational and content and impose heuristics on the type of page that a user session may consist of.User sessions can be subdivided into smaller transaction sequences, e.g. by identifying a “backward reference” in the sequence of requests.

Encoding of training data:Bag-of-pages representation of sessions/transactions.Transition-based representation of sessions/transactions.Manually determined features of interest.

Page 64: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 64

Collection and preparation

Problems:Privacy and security issues:

The user must be aware of the data collected.Cookies and client-side agents are often disabled.

Caching on the client or an intermediate proxy causes data loss on the server side.Registration forms are a nuisance and they are not reliable sources.Client-side agents increase response time.User and session identification from server logs is hard.Different data required for different user models.

Page 65: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 65

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage data

Personalization on the WebData collection and preparation issuesPersonalized assistantsDiscovering generic user modelsSequential pattern discovery

Knowledge discovery in actionImportant open issues

Page 66: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 66

Personalized assistants

Construction of a separate model for each user and use of this model to:

help focus on interesting Web sites (personalized crawling),modify the structure and content of a site,adapt the Web interface.

Various methods developed outside the Web are applicable here, e.g. student modelling.Collection of sufficient usage data is difficult.User identification essential; most often log-in.

Page 67: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 67

Personalized assistants

Personalized crawling [Liebermann et al., ACM Comm.,

2000]:The system knows the user (log-in).It uses heuristics to extract “important” terms from the Web pages that the user visits and add them to thematic profiles.Each time the user views a page, the system:

searches the Web for related pages, filters them according to the relevant thematic profile,and constructs a list of recommended links for the user.

The Letizia version of the system searches the Web locally, following outgoing links from the current page.The Powerscout version uses a search engine to explore the Web.

Page 68: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 68

Personalized assistants

Adapting site structure [Schwarzkopf, UM 2001]:The system knows the user (log-in).Session identification is done by means of bookmarking a personal URL.Bayesian networks define taxonomic relations between topics covered by a Web site.A different network is maintained for each user and the probabilities map the user’s interests.Dynamic hyperlinks and recommendations are derived by the user’s model and included in the site’s home page.

Page 69: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 69

Personalized assistants

Adaptive Web interfaces [Jörding, UM 1999]:The TELLIM system collects user information, (e.g. a selection of a link) using a Java applet .User information is used as training data in order to create generic models reflecting the users’ interest in different products.The system creates short-term personal models using the generic models and the current user’s behavior.Web pages containing more detailed information about these products, together with multimedia content and VRML presentations are created dynamically and presented to the users.

Page 70: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 70

Personalized assistants

Problems:Users suspicious of the agent mishandling their personal data. Especially for server-side systems.Limited amount of training data per user.Usually require heuristics or the user’s involvement, in order to accelerate learning.

Page 71: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 71

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage data

Personalization on the WebData collection and preparation issuesPersonalized assistantsDiscovering generic user modelsSequential pattern discovery

Knowledge discovery in actionImportant open issues

Page 72: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 72

Generic user models

Stereotypes: Models that represent a type of user, associating personal characteristics with parameters of the system,e.g. Male users of age 20-30 are interested in

sports and politics.

Communities: Models that represent a group of users with common preferences,e.g. Users that are interested in sports and

politics.

Page 73: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 73

Learning user stereotypes

Constructing stereotypes from personal user models:

Given a set of models for the users who have chosen each stereotype,use supervised learning (e.g. decision trees) to construct a model that distinguishes each stereotype from all others.Each user may belong in more than one stereotype.

User sessions may be used instead of personal user models.

Page 74: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 74

Collaborative filtering

Information filtering according to the choices of similar users.Avoids semantic content analysis.Cold-start problem with new users.Approaches:

memory-based learning,model-based clustering,item-based recommendation.

Page 75: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 75

Memory-based learning

Nearest-neighbour approach:Construct a model for each user. Often use explicit user ratings for each item.Index the user in the space of system parameters, e.g. item ratings.For each new user,

index the user in the same space, andfind the k closest neighbours.Simple metrics to measure the similarity between users, e.g. Pearson correlation.

Recommend the items that the new user has not seen and are popular among the neighbours.

Page 76: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 76

Memory-based learning

Finance news

Sport

s n

ew

s

0 1

1

Page 77: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 77

Model-based clustering

Clustering users into communities.Methods used:

Conceptual clustering (COBWEB).Graph-based clustering (Cluster mining).Statistical clustering (Autoclass).Neural Networks (Self-Organising Maps).Model-based clustering (EM-type).BIRCH.

Community models: cluster descriptions.

Page 78: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 78

Model-based clustering

Clique-based clustering:Construct a model for each user.Calculate the similarity between users, in terms of the common items in their models.Construct a graph of users, where the edges are labeled by user similarity.Remove edges according to a similarity threshold.Identify cliques in the graph.

Important: each user may belong in more than one community.

Page 79: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 79

0,50,5

0,10,1

0,80,8

0,90,9

0,90,9

0,40,4

Model-based clustering

Page 80: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 80

Item-based recommendation

Focus on item usage in the profiles, instead of the users themselves.Practically useful in e-commerce, e.g. cross-sell recommendations.Simple modification to the clique-based clustering method: graph of items instead of graph of users.Related to frequent itemset discovery in association rule mining.

Page 81: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 81

0,50,5

0,10,1

0,80,8

0,90,9

0,90,9

0,40,4

Item-based recommendation

Sports

Finance

Politics

World

Page 82: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 82

Generic user models

Issues:Acquisition of personal information for stereotypes.Cold-start for collaborative filtering.Scalability of memory-based collaborative filtering.Cluster analysis for model-based methods.Evaluation measures.Higher-order representation of user sessions.Community-specific item-based filtering.

Page 83: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 83

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage data

Personalization on the WebData collection and preparation issuesPersonalized assistantsDiscovering generic user modelsSequential pattern discovery

Knowledge discovery in actionImportant open issues

Page 84: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 84

Sequential pattern discovery

Identifying navigational patterns, rather than “bag-of-page” models. Methods:

Clustering transitions between pages.First-order Markov models.Probabilistic grammar induction.Association-rule sequence mining.Path traversal through graphs.

Personal and community navigation models.

Page 85: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 85

Sequential pattern discovery

Clique-based transition clustering; small modification of the model-based item clustering approach: an item is a transition between pages.

0,50,5

0,10,1

0,80,8

0,90,9

0,90,9

0,40,4

Sports-

>Politics

Finance-

>Politics

Sports-

>Finance

Finance->Sports

Page 86: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 86

Sequential pattern discovery

Probabilistic grammar induction [Borges and

Levene, WebKDD 1999]:Represent each session by a string of symbols.Construct a probabilistic automaton, where:

The probability of each state corresponds to the frequency of the corresponding page in the sessions.The edge probabilities correspond to the transition frequencies.

Generate exhaustively all trails that can be generated by the probabilistic grammar, which have a trail probability above a heuristic threshold. Equivalent to ordered frequent itemsets.Higher-order Markovian modelling is also possible.

Page 87: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 87

Sequential pattern discovery

Hard issues:Sequential modelling is more prone to errors in the data collection phase (e.g. broken sequences due to caches).Higher-order modelling is usually computationally intractable.Evaluation measures.

Page 88: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 88

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage dataKnowledge discovery in action

Software and Knowledge Engineering LabThe CROSSMARC project

Important open issues

Page 89: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 89

SKEL background & interests

Our background:Language Engineering (Information Extraction).User Modeling (for IE and IR systems).Image Analysis.Machine Learning (neural, statistical, symbolic).

Research statement: Reducing the information overload, by

facilitating personalized access to information on the Web.

More at: http://www.iit.demokritos.gr/skel/

Page 90: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 90

Research @ SKEL

Text Categorization and Filtering. Information Extraction. Natural Language Generation.Ontology engineering.Knowledge DiscoveryGrammar InductionUser Modelling / PersonalizationWeb usage mining / Web IntelligenceMultimedia Knowledge Fusion.

Page 91: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 91

Technology @ SKELThe Ellogon text engineering environment (http://www.iit.demokritos.gr/skel/Ellogon/).The PServer generic personalization server.The KOINOTITES Web usage mining environment.The Filterix Web proxy filter for obscene content(http://www.iit.demokritos.gr/skel/i-config/).The Filtron personalized spam filter.(http://www.iit.demokritos.gr/skel/i-config/).Focused crawling and spidering tools.Multilingual information extraction systems.Multilingual natural language generation systems.

Page 92: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 92

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage dataKnowledge discovery in action

Software and Knowledge Engineering LabThe CROSSMARC project

Important open issues

Page 93: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 93

The CROSSMARC project

FP5 IST project (3/2001 – 9/2003)CROSSMARC: CROSS-lingual Multi-Agent Retail ComparisonObjective: A prototype multi-agent system for cross-lingual retrieval and extraction of information from product description Web pages.

Page 94: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 94

The CROSSMARC consortiumPartner Type Countr

y

NCSR “Demokritos” C ELVeltiNet A.E. P EL

University of Edinburgh P UK

Universita di Roma Tor Vergata P I

Lingway P F

Page 95: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 95

The basic features of CROSSMARC

Innovative information retrieval and extraction technologies for the Web. Cross-lingual, multi-agent open architecture.Customisability to new domains.Personalized interface.

Page 96: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 96

CROSSMARC: Overview

domain ontology

named-entity

recognition and

matching

fact extraction annotated

pages Extraction

interesting

pages

database

XML records

domain-specific

spidering

focused crawling

Retrievalinterestin

g Web sites

Personalized interface

user profiles

Page 97: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 97

Contents

IntroductionKnowledge discovery from text & linksKnowledge discovery from usage dataKnowledge discovery in actionImportant open issues

Multimedia contentLarge scale knowledge discoveryDiscovery in graphsUser privacy issues

Page 98: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 98

Multimedia Web data

Page 99: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 99

Scaling up to unexplored sizes

Until recently machine learning research has considered 10,000 examples a large dataset.On the Web 10,000,000-record databases are not rare.Knowledge discovery algorithms should operate under space and time constraints. The Web is naturally dynamic. Knowledge discovery algorithms should allow incremental refinement of extracted models.

Page 100: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 100

Discovery in the Web graph

The Web is a graph and Web sites are subgraphs.Many resources on the Web have a graphical or hierarchical structure, e.g. Web directories.Knowledge discovery algorithms should be aware of the graphical structure.Discovery algorithms for structured and graphical data can lead to new interesting applications.

Page 101: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 101

Respecting the user’s privacy

Data collection and use should be transparent to the user.Careless use of personal data will (at the best) scare users off.Navigational data is personal, when associated with an individual.“Unobtrusive personalization” should be exercised with cautiousness.Technology can help safeguarding privacy.

Page 102: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

© Georgios Paliouras (March 2003) 102

References

J. Borges and M. Levene, Data mining of user navigation patterns. Proceedings of Workshop on Web Usage Analysis and User Profiling (WEBKDD), in conjunction with ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Diego, CA., pp. 31-36.S. Chakrabarti, M. H. van den Berg, B. E. Dom, Focused Crawling: a new approach to topic-specific Web resource discovery, Proceedings of the Eighth International World Wide Web Conference (WWW), Toronto, Canada, May 1999.T. Jörding, T, A Temporary User Modeling Approach for Adaptive Shopping on the We`, In Proceedings of the 2nd Workshop on Adaptive Systems and User Modeling on the WWW, UM'99, Banff, Canada, 1999.J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, v. 46, 1999. H. Lieberman, C. Fry and L. Weitzman. Exploring the Web with Reconnaissance Agents, Communications of the ACM, August 2001, pp. 69-75.A. Maedche, S. Staab. Discovering Conceptual Relations from Text. In: W.Horn (ed.): ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence (ECAI), Berlin, August 21-25, 2000. A. McCallum, D. Freitag and F. Pereira, Maximum Entropy Markov Models for Information Extraction and Segmentation, Proceedings of the International Conference on Machine Learning (ICML), Stanford, CA, 2000, pp. 591-598.I. Muslea , S. Minton and C. Knoblock , STALKER: Learning extraction rules for semistructured Web-based information sources. Proceedings of the National Conference on Artificial Intelligence (AAAI), Madison, Wisconsin, 1998. C. Nédellec, Corpus-based learning of semantic relations by the ILP system, Asium, Learning Language in Logic, Cussens J. and Dzeroski S. (Eds.), Springer Verlag, September 2000. J. Rennie and A. McCallum. Efficient Web Spidering with Reinforcement Learning. Proceedings of the International Conference on Machine Learning (ICML), 1999.E. I. Schwartz. Webonomics. New York: Broadway books, 1997.E. Schwarzkopf, An adaptive Web site for the UM2001 conference. Proceedings of the Workshop on Machine Learning for User Modeling, in conjunction with the International Conference on User modelling (UM), pp 77-86, 2001.

Page 103: 1 Knowledge discovery on the Web Georgios Paliouras Software and Knowledge Engineering Laboratory Institute of Informatics and Telecommunications N.C.S.R

103

Knowledge discovery on the WebGeorgios Paliouras

Software and Knowledge Engineering LaboratoryInstitute of Informatics and Telecommunications

N.C.S.R. “Demokritos”

[email protected]://www.iit.demokritos.gr/~paliourg