
Page 1: NCSR “Demokritos” Institute of Informatics & Telecommunications Knowledge discovery on the Web Georgios Paliouras Email: paliourg@iit.demokritos.gr paliourg@iit.demokritos.gr

NCSR “Demokritos”

Institute of Informatics & Telecommunications

Knowledge discovery on the Web

Georgios Paliouras

Email: paliourg@iit.demokritos.gr

WWW: http://www.iit.demokritos.gr/~paliourg

HERMES seminar, February 17, 2001


© Georgios Paliouras (February 2001) 2

A Short Story About the Web: where we are and how we got here…


WWW: the new face of the Net

Once upon a time, the Internet was a forum for exchanging information. Then …

…came the Web.

The Web introduced new capabilities …

…and attracted many more people …

…increasing commercial interest …

…and turning the Net into a real forum …


Information sources

Providing information is still the main service …

Commercial: CNN, Reuters, Times, Yahoo

Non-Commercial: CORDIS, NCSTRL, MLNET, Library


Information overload

…as more people started using it ...

…the quantity of information on the Web increased...

…attracting even more people ...

…increasing the quantity of online information further...

…and leading to the overload of information for the users ...


WWW: an expanding forum

The Web is large and expanding:
  100,000 people sign up every day
  about 300,000,000 people online
  1,000,000 new pages added every day
  600 GB of pages change every month

… leading to the abundance problem:
“99% of online information is of no interest to 99% of the people”


Information access services

A number of services aim to help the user gain access to online information and products ...

… but can they really cope?


New requirements

Manual indexing does not allow for wide coverage: 18% of the Web covered by the largest engines.
What I want is hardly ever ranked high enough.
Product information is often biased towards specific suppliers and outdated.
Product descriptions are incomplete and insufficient for comparison purposes.
‘E’ in ‘E-commerce’ stands for ‘English’.

… and many more problems lead to the conclusion ...

… that more intelligent solutions are needed!


A new generation of services

Some have already made their way to the market…

… many more are being developed as I speak …


Data mining

[Process diagram]
Stages: Problem understanding; Data selection and pre-processing; Machine Learning; Knowledge; Post-processing and evaluation; Deployment.
Loops: technology loop; application loop.


Data on the Web

Primary data (Web content):
  Mainly text, with some multimedia content and mark-up commands.
  Underlying databases (not directly accessible).

Secondary data (Web usage):
  Access logs collected by a Web server and a variety of navigational information collected by Web clients.


Approaches to Web mining

Web content mining
  Pattern discovery in Web content data.
  Mainly mining unstructured textual data.

Web structure mining
  Pattern discovery in the Web graph.
  The graph is defined by the hyperlinks.

Web usage mining
  Discovery of interesting usage patterns.
  Mainly in server logs.


Web Content Mining


Web content mining tasks

Information Access
  Document category modelling.
  Construction of thematic hierarchies.

Fact Extraction
  Extraction of product information, presented in different formats.

Information Extraction
  Structured “event” summaries from large textual corpora.


Text mining

Knowledge (pattern) discovery in textual data.

Clarifying common misconceptions:
  Text mining is NOT about assigning documents to thematic categories, but about learning document classifiers.
  Text mining is NOT about extracting information from text, but about learning information extraction patterns.

Difficulty: unstructured format of textual data.


Underlying technology

Combination of language engineering (LE), machine learning (ML) and statistical methods:

[Diagram labels: LE, ML-Stats]


Information access

Goals:
  Organize documents into categories.
  Assign new documents to the categories.
  Retrieve documents that match a user query.

Long history of manually-constructed and statistical document category models.

Dominant statistical idea:
TFIDF = term frequency * inverse document frequency
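
The TFIDF product above can be sketched in a few lines of Python. The slide gives only the product itself, so the specific choices here (raw term counts for TF, natural log of N/df for IDF, the toy documents and the function name `tfidf`) are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency (an assumed TF variant)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "web mining finds patterns in web data".split(),
    "text mining finds patterns in text".split(),
    "server logs record web usage".split(),
]
weights = tfidf(docs)
```

A term that occurs in every document gets weight zero under this variant, which is the intended effect: frequent but undiscriminating terms stop dominating the document representation.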


Information access on the Web

Problems with traditional IA methods:
  Large volume of data: document category models are hard to construct and maintain manually.
  Highly dynamic environment: thematic hierarchies change continuously.

Web content mining solutions:
  Automated construction and maintenance of category models.
  Automated construction of ontologies (thematic hierarchies).


Document category modeling

Training documents (pre-classified)
  → Pre-processing: stopword removal (and, the, etc.), stemming (‘played’ → ‘play’), bag-of-words coding
  → Feature selection: statistical selection of characteristic terms (mutual information)
  → Machine Learning: supervised classifier learning
  → Category models (classifiers)
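
The pre-processing and feature-selection steps of this pipeline can be sketched as follows. This is a minimal illustration, not the method used in the talk: the stopword list is tiny, `stem` is a crude suffix stripper standing in for a real stemmer, and the documents and labels are invented.

```python
import math
from collections import Counter

STOPWORDS = {"and", "the", "a", "of", "in"}  # tiny illustrative list

def stem(word):
    # Crude suffix stripping; a stand-in for a real stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    # Bag-of-words coding as a set of stemmed, non-stopword terms.
    return {stem(w) for w in text.lower().split() if w not in STOPWORDS}

def mutual_information(docs, labels, term, category):
    """Mutual information (bits) between term presence and category membership."""
    n = len(docs)
    counts = Counter((term in d, l == category) for d, l in zip(docs, labels))
    mi = 0.0
    for (t, c), n_tc in counts.items():
        n_t = sum(v for (tt, _), v in counts.items() if tt == t)
        n_c = sum(v for (_, cc), v in counts.items() if cc == c)
        mi += (n_tc / n) * math.log2(n_tc * n / (n_t * n_c))
    return mi

texts = [
    "the team played football",
    "players scored in the match",
    "stocks and shares dropped",
    "the market played down fears",
]
labels = ["sports", "sports", "finance", "finance"]
docs = [preprocess(t) for t in texts]
mi_football = mutual_information(docs, labels, "football", "sports")
mi_play = mutual_information(docs, labels, "play", "sports")
```

In this toy corpus ‘play’ occurs equally often in both categories, so its mutual information with ‘sports’ is zero and it would be dropped, while ‘football’ scores higher and would be kept as a characteristic term.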


Document category modeling

Machine Learning methods used:
  Memory-based learning (k-nearest neighbor).
  Decision-tree and decision-rule induction.
  Bayesian learning (naïve Bayes classifiers).
  Support vector machines.
  Boosting (combined usually with decision trees).
  Maximum entropy modeling.
  Neural networks (multi-layered perceptrons).

Problems: high dimensionality, large training sets, overlapping categories.
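
As one concrete instance of the methods listed above, here is a minimal sketch of a multinomial naïve Bayes text classifier. The add-one (Laplace) smoothing, the handling of unseen terms, the class name `NaiveBayes` and the training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.term_counts[label].update(doc)
        self.vocab = {t for c in self.term_counts.values() for t in c}
        return self

    def predict(self, doc):
        n = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.term_counts[label].values())
            score = math.log(count / n)  # log prior
            for term in doc:
                if term in self.vocab:  # terms never seen in training are ignored
                    # Add-one smoothing avoids zero probabilities.
                    p = (self.term_counts[label][term] + 1) / (total + len(self.vocab))
                    score += math.log(p)
            if score > best_score:
                best, best_score = label, score
        return best

train_docs = [
    ["goal", "match", "team"],
    ["team", "coach", "win"],
    ["stock", "market", "shares"],
    ["market", "bank", "rates"],
]
train_labels = ["sports", "sports", "finance", "finance"]
model = NaiveBayes().fit(train_docs, train_labels)
```

Calling `model.predict(["team", "match"])` returns the category whose smoothed term probabilities best explain the document.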


Ontology/Taxonomy construction

Training documents (unclassified)
  → Pre-processing: stopword removal (and, the, etc.), stemming (‘played’ → ‘play’), bag-of-words coding
  → Feature compression: hand-made thesauri (Wordnet), term co-occurrence (LSI)
  → Machine Learning: unsupervised learning (clustering)
  → Thematic hierarchies, including category models (classifiers)


Ontology/Taxonomy construction

Hierarchical clustering is most suitable:
  Agglomerative clustering
  Conceptual clustering (COBWEB)
  Model-based clustering (EM-type: MCLUST)

… but flat clustering can also be adapted:
  K-means and its variants
  Bayesian clustering (Autoclass)
  Neural networks (self-organizing maps)

Feature compression makes the difference!
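
To make the first option concrete, the following is a small sketch of agglomerative clustering over bag-of-words vectors. Cosine similarity and average linkage are assumed choices, the documents are invented, and the quadratic pair search is only suitable for toy inputs.

```python
import math
from collections import Counter

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def agglomerate(docs):
    """Average-link agglomerative clustering; returns the merges in order."""
    vectors = [Counter(d) for d in docs]

    def link(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(docs))]
    merges = []
    while len(clusters) > 1:
        pairs = [(c1, c2) for k, c1 in enumerate(clusters) for c2 in clusters[k + 1:]]
        # Merge the most similar pair of clusters.
        a, b = max(pairs, key=lambda pair: link(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b))
    return merges

docs = [
    "football team match".split(),
    "football team match coach".split(),
    "stock market shares".split(),
    "stock market bank".split(),
]
merges = agglomerate(docs)
```

The sequence of merges is the hierarchy: early merges join near-duplicate documents, and later merges form the broader themes of a candidate taxonomy.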


Fact extraction

Goal:
  Extract interesting facts from Web documents.

Typical application:
  Product comparison services (price, availability, …).

Difficulties:
  Semi-structured data.
  Different database schemata and presentation formats.

Common approach:
  Manual construction of wrappers.


Wrapper induction

Training documents (semi-structured) and database schema (interesting facts)
  → Data pre-processing: abstraction of mark-up structure
  → Machine Learning: structural/sequence learning
  → Fact extraction patterns (wrapper)


Wrapper induction

Machine Learning methods used:
  Grammar (Finite State Automata) induction.
  Hidden Markov Models.
  Programming by demonstration.
  Inductive logic programming.
  Proprietary algorithms (HLRT, Shopbot learner).

Simple heuristics can make the learning task much easier!
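
A very simple heuristic of this kind is to learn the text that always appears immediately to the left and right of the labelled fact. The sketch below does that for a single field; it is a simplified delimiter-based wrapper written for illustration, not the HLRT algorithm or the Shopbot learner named above, and the page snippets and function names are invented.

```python
import os

def common_suffix(strings):
    # Longest common suffix, computed as the common prefix of the reversed strings.
    return os.path.commonprefix([s[::-1] for s in strings])[::-1]

def induce_lr_wrapper(pages, values):
    """Learn the longest left/right delimiters shared by all labelled examples."""
    lefts, rights = [], []
    for page, value in zip(pages, values):
        start = page.index(value)
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    return common_suffix(lefts), os.path.commonprefix(rights)

def apply_wrapper(wrapper, page):
    left, right = wrapper
    start = page.index(left) + len(left)
    return page[start:page.index(right, start)]

pages = [
    "<li>Phone <b>Price:</b> <i>$199</i></li>",
    "<li>Camera <b>Price:</b> <i>$349</i></li>",
]
values = ["$199", "$349"]
wrapper = induce_lr_wrapper(pages, values)
price = apply_wrapper(wrapper, "<li>Laptop <b>Price:</b> <i>$899</i></li>")
```

Two labelled pages are enough here because the mark-up around the price is identical on both; real sites need more examples and more robust delimiters.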


Schema extraction

Training documents (semi-structured)
  → Data pre-processing: declarative, relational, graphical data representation (OEM, XML, …)
  → Machine Learning: structural/relational learning
  → Abstract database schemata and mappings (DataGuides)


Schema extraction

Machine Learning methods used:
  Inductive logic programming.
  Association rule induction.
  Clustering.
  Many proprietary methods!

There is a need for efficient learning from structured and graphical data!


Information extraction

Goals:
  Identify interesting “events” in unstructured text.
  Extract information related to the events and store it in structured templates.

Typical application:
  Information extraction from newsfeeds.

Difficulties:
  Deals with unstructured text.
  Extracts complex events, rather than simple facts.
  Usually requires deep understanding of the text.


A typical extraction system

Input: unstructured text and database schema (event templates)

Stages: Morphology → Syntax → Semantics → Discourse

Processing steps: lemmatization (‘said’ → ‘say’), sentence and word separation, part-of-speech tagging, etc.; shallow syntactic parsing; named-entity recognition; co-reference resolution; sense disambiguation; IE pattern matching.

Output: structured data (filled templates)
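
The last step, IE pattern matching into a template, can be illustrated with a deliberately shallow sketch. It covers only sentence separation and one hand-written pattern; the linguistic stages in between are skipped, and the regular expression, the slot names (`buyer`, `target`, `price`) and the sample text are all hypothetical.

```python
import re

# Hypothetical hand-written pattern for an "acquisition" event template.
ACQUISITION = re.compile(
    r"(?P<buyer>[A-Z][\w&]*(?: [A-Z][\w&]*)*) (?:acquired|bought) "
    r"(?P<target>[A-Z][\w&]*(?: [A-Z][\w&]*)*)"
    r"(?: for (?P<price>\$[\d.]+ (?:million|billion)))?"
)

def split_sentences(text):
    # Naive sentence separation on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def fill_templates(text):
    templates = []
    for sentence in split_sentences(text):
        match = ACQUISITION.search(sentence)
        if match:
            templates.append(match.groupdict())  # one filled template per event
    return templates

text = (
    "Acme Corp acquired Widget Works for $12 million. "
    "The weather was mild. Globex bought Initech."
)
templates = fill_templates(text)
```

Slots that the text does not mention stay empty (`None`), which mirrors partially filled templates in a real system.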


Information extraction

Long history before the birth of the Web.
One of the hardest Language Engineering tasks.
ARPA Message Understanding Conferences have pushed the field forward.
Information overload on the Web has increased the need for IE systems.
IE is achievable for very narrow domains (e.g. mergers and acquisitions).
Manual construction of IE systems is time-consuming.
Machine learning can help solve this problem.


Extraction pattern discovery

Input: unstructured text and database schema (event templates)

Stages: Morphology → Syntax → Semantics → Pattern Discovery

Processing steps: lemmatization (‘said’ → ‘say’), sentence and word separation, part-of-speech tagging, etc.; shallow syntactic parsing; named-entity recognition; co-reference resolution; sense disambiguation; IE pattern discovery.

Output: IE patterns


Extraction pattern discovery

Machine Learning methods used:
  Decision-rule learning.
  Grammar induction.
  Co-training (semi-supervised learning).
  Clustering.

Linguistic resources are also needed.

Difficulties:
  Difficult to produce hand-tagged training data.
  Lack of sufficient background knowledge.


Customization of IE systems

Learning can be used to customize individual modules of an IE system:
  Sentence splitting (rule learning).
  Part-of-speech tagging (Brill tagger).
  Named-entity recognition (HMMs).
  Co-reference resolution (rule learning).
  Word sense disambiguation (rule learning).

Learning speeds up the customization to new domains and new languages!


Multimedia content

Most of the Web’s content is still text.
The quantity of multimedia content (image, sound, video) is increasing.
Some sites are dominated by non-textual content (digital museums, maps).
Some textual content is presented in images (e.g. advertisement banners).
Web content mining should include multimedia content.


Mining multimedia data

Multimedia information retrieval is really about image retrieval, using overly simplistic methods (e.g. color histogram matching).
Multimedia information extraction has not been used on the Web.
Learning can assist in constructing complex multimedia IR and IE systems.
Learning can also be used to combine data in different modalities.
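
For reference, color histogram matching of the kind mentioned above can be as small as the sketch below, which also shows why it is simplistic: it compares color distributions and ignores shape and layout entirely. The bin count, histogram intersection as the similarity measure and the pixel lists are assumptions for illustration.

```python
from collections import Counter

def color_histogram(pixels, bins=4):
    """Normalised histogram over quantised (r, g, b) values in 0..255."""
    step = 256 // bins
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    return {k: v / len(pixels) for k, v in counts.items()}

def histogram_intersection(h1, h2):
    # 1.0 for identical colour distributions, 0.0 for disjoint ones.
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

mostly_red = color_histogram([(250, 10, 10)] * 3 + [(10, 10, 250)])
all_red = color_histogram([(240, 20, 5)] * 4)
all_blue = color_histogram([(5, 5, 240)] * 4)

sim_red = histogram_intersection(mostly_red, all_red)
sim_blue = histogram_intersection(mostly_red, all_blue)
```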


Open issues…

Scalable learning methods.
Dimensionality reduction.
Semi-supervised learning.
Learning from structured and graphical data.
Word/Phrase sense disambiguation.
Customization of IE systems.
Multimedia content mining.


Web Structure Mining


Hyperlink information is useful

Information retrieval can be improved by:
  Identifying authoritative pages.
  Identifying resource index pages.
  Summarizing common references.

Linked pages often contain complementary information (e.g. product offers).

Structural analysis of a Web site facilitates its improvement.


Improved information retrieval

Social network analysis:
  Nodes with large fan-in (authorities) provide high quality information.
  Nodes with large fan-out to authorities (hubs) are good starting points.

Disconnected subgraphs correspond to different social (e.g. research) communities.
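
The slide does not name an algorithm, but the authority/hub distinction can be sketched with HITS-style mutual reinforcement: a page's authority score sums the hub scores of pages linking to it, and its hub score sums the authority scores of pages it links to. The link graph, the iteration count and the function name are invented for this illustration.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority: sum of hub scores over incoming links.
        auth = dict.fromkeys(pages, 0.0)
        for page, targets in links.items():
            for target in targets:
                auth[target] += hub[page]
        # Hub: sum of authority scores over outgoing links.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {
    "index": ["paperA", "paperB", "paperC"],
    "blog": ["paperA", "paperB"],
    "paperA": [],
    "paperB": ["paperA"],
}
hub, auth = hits(links)
top_authority = max(auth, key=auth.get)
top_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming links from good hubs ends up as the top authority, and the index page that points to all of them ends up as the top hub.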


Multi-page fact extraction

Interesting facts often span more than one page, usually hyperlinked.
Fact extraction wrappers need to include patterns for processing hyperlinked pages.
Structure-aware wrapper induction discovers patterns that relate pages.


Web site design and optimization

A Web site can be represented as a graph.
The graph of a site contains different types of links: crosswise, upward, downward, outward.
Substructures of a Web graph can reveal problematic areas: missing links, unwanted connections and loops, etc.
Combination with usage data is interesting.
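
Two of the simplest structural checks on a site graph are sketched below: pages that cannot be reached from the home page (a symptom of missing links) and pages with no outgoing links. The site map, the report keys and the function names are hypothetical.

```python
from collections import deque

def reachable(links, start):
    # Breadth-first traversal over the site's hyperlinks.
    seen, queue = {start}, deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

def site_report(links, home):
    pages = set(links) | {t for targets in links.values() for t in targets}
    return {
        "unreachable": sorted(pages - reachable(links, home)),
        "dead_ends": sorted(p for p in pages if not links.get(p)),
    }

links = {
    "home": ["products", "about"],
    "products": ["item1", "home"],
    "item1": [],
    "about": ["home"],
    "old-offers": ["products"],
}
report = site_report(links, "home")
```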


Open issues…

Web structure mining is an emerging research field.
The work so far focuses on small subgraphs (a node and its adjacent neighbors).
Structure-aware Web (content and usage) mining.
Hyperlink classification.
Handling of dynamically generated pages.


Web Usage Mining


“The Quantity of People Visiting Your Site Is Less Important Than the Quality of Their Experience”

Evan I. Schwartz, Webonomics, Broadway Books, 1997


Personalized information access

[Diagram labels: sources, server, receivers]


Motivation for personalization

Better service for the user:
  Reduction of the information overload.
  More accurate information retrieval and extraction.

Customer relationship management:
  Customer segmentation and targeted advertisement.
  Customer attraction and retention strategy.
  Service improvement (site structure and content).


Underlying technology

User modeling:
  Constructing models that can be used to adapt the system to the user’s requirements.
  Different types of requirement: interests (sports and finance news), knowledge level (novice - expert), preferences (no-frame GUI), etc.
  Different types of model: personal – generic.

Machine learning and statistics facilitate the acquisition of user models.


User Models

User model (type A): [PERSONAL]
  User x -> sports, stock market

User model (type B): [PERSONAL]
  User x, Age 26, Male -> sports, stock market

User community: [GENERIC]
  Users {x,y,z} -> sports, stock market

User stereotype: [GENERIC]
  Users {x,y,z}, Age [20..30], Male -> sports, stock market
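The four kinds of model can be written down as simple data structures. The sketch below is only an illustration: the class names, field names and the `applies_to` helper are assumptions made for this example, not part of the original slides.

```python
from dataclasses import dataclass, field


@dataclass
class PersonalModel:
    """One user mapped to interests; attributes are empty for type A, filled for type B."""
    user_id: str
    interests: set
    attributes: dict = field(default_factory=dict)


@dataclass
class GenericModel:
    """A group of users mapped to interests; constraints turn a community into a stereotype."""
    members: set
    interests: set
    constraints: dict = field(default_factory=dict)

    @property
    def kind(self):
        return "stereotype" if self.constraints else "community"


def applies_to(model, attributes):
    """Check whether a user's attributes satisfy every constraint of a stereotype.

    A community has no constraints, so this check is trivially true for it;
    community membership is decided by the member list instead.
    """
    for name, allowed in model.constraints.items():
        value = attributes.get(name)
        if isinstance(allowed, range):
            if value not in allowed:
                return False
        elif value != allowed:
            return False
    return True


topics = {"sports", "stock market"}
type_a = PersonalModel("x", topics)
type_b = PersonalModel("x", topics, {"age": 26, "gender": "male"})
community = GenericModel({"x", "y", "z"}, topics)
stereotype = GenericModel({"x", "y", "z"}, topics,
                          {"age": range(20, 31), "gender": "male"})
```

With these definitions, `applies_to(stereotype, type_b.attributes)` holds, so a new visitor with matching attributes could be given the stereotype's interests before any personal data are available.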


Learning user models

[Diagram labels: user communities (Community 1, Community 2); users (User 1 to User 5); user models; observation of the users interacting with the system.]


Web usage mining process

Data collection: Collection of usage data by the server and the client.

Data pre-processing: Data cleaning, user identification, session identification.

Pattern discovery: Construction of user models.

Knowledge post-processing: Report generation, visualization, personalization module.


Usage data sources

Server-side data: access logs in Common or Extended Log Format, user queries, cookie tracks, packet sniffers.

Client-side data: Java and JavaScript agents.

Registration forms: personal information supplied by the user.

Demographic information: provided by census databases.
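As an illustration of server-side data, a single access-log entry in Common Log Format can be parsed with a regular expression. This is a minimal sketch: the sample line, the function name `parse_clf_line` and the record keys are made up for the example, and the date parsing assumes English month abbreviations.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # The server logs "-" when no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    parts = record["request"].split()
    record["method"] = parts[0] if len(parts) >= 2 else None
    record["url"] = parts[1] if len(parts) >= 2 else None
    return record


# Hypothetical log entry, for illustration only.
sample = ('192.0.2.7 - - [17/Feb/2001:10:15:32 +0200] '
          '"GET /news/sports.html HTTP/1.0" 200 5120')
entry = parse_clf_line(sample)
```

Note that the record identifies only a host address, not a person, which is why user identification from server logs is hard.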


Problems in data collection

Privacy and security issues:
  The user must be aware of the data collected.
  Cookies and client-side agents are often disabled.

Caching on the client or an intermediate proxy causes data loss on the server side.

Registration forms are a nuisance and they are not reliable sources.

Client-side agents increase response time.


Pre-processing usage data

Cleaning: removing pages that have not been requested explicitly by the user (mainly multimedia files, loaded automatically). Should be domain-specific.

User identification: difficult when server log data are used (only IP information available).

User-session/transaction identification: difficult when the same IP is used by many users. Upper limit on transition time (30 min).
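The cleaning and session-identification steps can be sketched as follows. The suffix list, the record layout (`host`, `time`, `url`) and the function names are assumptions for illustration; the host address stands in for the user, which is exactly the weak point when many users share one IP.

```python
from datetime import datetime, timedelta

# Assumed list of automatically loaded files; in practice this is domain-specific.
MEDIA_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
SESSION_GAP = timedelta(minutes=30)  # upper limit on the time between two page views


def clean(requests):
    """Drop requests for files the user did not ask for explicitly."""
    return [r for r in requests if not r["url"].lower().endswith(MEDIA_SUFFIXES)]


def sessionize(requests):
    """Group page views per host, starting a new session after a long gap."""
    sessions = []
    last_seen = {}  # host -> (time of last request, that host's current session)
    for r in sorted(requests, key=lambda r: r["time"]):
        previous = last_seen.get(r["host"])
        if previous is None or r["time"] - previous[0] > SESSION_GAP:
            session = []
            sessions.append(session)
        else:
            session = previous[1]
        session.append(r["url"])
        last_seen[r["host"]] = (r["time"], session)
    return sessions


def at(hour, minute):
    return datetime(2001, 2, 17, hour, minute)


# Hypothetical log records, for illustration only.
log = [
    {"host": "A", "time": at(10, 0), "url": "/index.html"},
    {"host": "A", "time": at(10, 0), "url": "/logo.gif"},
    {"host": "B", "time": at(10, 2), "url": "/index.html"},
    {"host": "A", "time": at(10, 5), "url": "/sports.html"},
    {"host": "A", "time": at(11, 0), "url": "/finance.html"},
]
sessions = sessionize(clean(log))
```

Here host A ends up with two sessions, because 55 minutes pass between its second and third page view.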


Aggregate data mining

Discovery of interesting aggregate patterns:
  Classification of customers (loyal – casual).
  Market basket analysis.
  Time-series analysis.

Supplementary sales data are needed.

Methods:
  Supervised learning.
  Association rules.
  Statistical and visual analysis.
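Market basket analysis with association rules rests on two measures, support and confidence. The sketch below counts them for item pairs only; the thresholds, the function name `pair_rules` and the sample baskets are assumptions, and a real miner (e.g. Apriori) would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5, min_confidence=0.6):
    """Return rules (lhs, rhs, support, confidence) over pairs of items."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical sessions treated as baskets of visited sections.
baskets = [
    {"sports", "finance"},
    {"sports", "finance"},
    {"sports"},
    {"weather"},
]
rules = {(lhs, rhs): (s, c) for lhs, rhs, s, c in pair_rules(baskets)}
```

In this toy data every visitor of the finance section also visits sports, while only two out of three sports visitors go on to finance, so the two directions of the rule get different confidence.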


Constructing personal models

Applications:
  Personal Web browsing assistant.
  Personalized site structure.
  Personalized interface.

Requires user registration and often user feedback.

Methods: Various types of supervised learning.


Collaborative filtering

Information filtering according to the choices of similar users.

Avoids semantic content analysis.

Cold-start problem with new users.

Methods:
  Primarily memory-based learning (e.g. k-nn). No user models constructed.
  Recently model-based clustering.
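A memory-based recommender in the k-nn style can be sketched in a few lines. The similarity measure (cosine), the value of `k`, the function names and the sample data are all assumptions for this example; no model is built, the stored user data are consulted directly at recommendation time.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse rating dictionaries."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def recommend(target, others, k=2):
    """Rank items unseen by the target, using the k most similar users."""
    ranked = sorted(others.values(), key=lambda r: cosine(target, r), reverse=True)
    scores = {}
    for ratings in ranked[:k]:
        sim = cosine(target, ratings)
        if sim <= 0:
            continue  # a user with nothing in common tells us nothing
        for item, value in ratings.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * value
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical users and the sections they chose (1 = visited).
others = {
    "y": {"sports": 1, "finance": 1, "stocks": 1},
    "z": {"sports": 1, "tennis": 1},
    "w": {"cooking": 1},
}
suggestions = recommend({"sports": 1, "finance": 1}, others)
cold_start = recommend({}, others)
```

A new user with an empty history has zero similarity to everyone, so `cold_start` is empty: this is the cold-start problem in miniature.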


Collaborative filtering

[Plot with axes “Finance news” and “Sports news”, each ranging from 0 to 1.]


Community models

Clustering users into communities.

Methods used:
  Conceptual clustering (COBWEB).
  Graph-based clustering (Cluster mining).
  Statistical clustering (Autoclass).
  Neural Networks (Self-organising Maps).
  Model-based clustering (EM-type).
  BIRCH.

Community models: cluster descriptions.
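To make the idea of a community model concrete, here is a deliberately simple graph-based sketch. It is not an implementation of any of the systems named on the slide: it links users whose interest sets overlap strongly (Jaccard similarity above an assumed threshold), takes connected components as communities, and describes each community by the interests all its members share.

```python
def jaccard(a, b):
    """Overlap between two interest sets, from 0 (disjoint) to 1 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_communities(interests, threshold=0.5):
    """Return a list of (members, shared interests), one pair per community."""
    users = list(interests)
    neighbours = {u: set() for u in users}
    for i, u in enumerate(users):
        for v in users[i + 1:]:
            if jaccard(interests[u], interests[v]) >= threshold:
                neighbours[u].add(v)
                neighbours[v].add(u)

    seen, result = set(), []
    for u in users:
        if u in seen:
            continue
        # Depth-first search collects one connected component.
        stack, members = [u], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(neighbours[node] - seen)
        description = set.intersection(*(interests[m] for m in members))
        result.append((members, description))
    return result


# Hypothetical users, for illustration only.
interests = {
    "x": {"sports", "stock market"},
    "y": {"sports", "stock market"},
    "z": {"sports", "stock market", "tennis"},
    "w": {"cooking"},
}
communities = find_communities(interests)
```

The pair `(members, description)` plays the role of the cluster description: users {x, y, z} -> sports, stock market.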


Community models

[Figure with numeric labels 0.5, 0.1, 0.8, 0.9, 0.9, 0.4.]


Navigational pattern discovery

Identifying navigational patterns, rather than “bag-of-page” models.

Methods:
  Clustering transitions between pages.
  First-order Markov models.
  Probabilistic grammar induction.
  Association-rule sequence mining.
  Path traversal through graphs.

Personal and community navigation models.
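A first-order Markov model of navigation only needs the counts of page-to-page transitions. The sketch below estimates the transition probabilities from sessions and predicts the most likely next page; the function names and the sample sessions are assumptions for illustration.

```python
from collections import defaultdict


def fit_transitions(sessions):
    """Estimate P(next page | current page) from observed sessions."""
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1

    model = {}
    for current, successors in counts.items():
        total = sum(successors.values())
        model[current] = {page: c / total for page, c in successors.items()}
    return model


def predict_next(model, page):
    """Most probable next page, or None if the page was never left."""
    successors = model.get(page)
    return max(successors, key=successors.get) if successors else None


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "sports", "scores"],
    ["home", "sports", "tennis"],
    ["home", "finance"],
    ["home", "sports", "scores"],
]
model = fit_transitions(sessions)
```

Fitting the same function on one user's sessions gives a personal navigation model; fitting it on the sessions of a whole community gives a community model.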


Knowledge post-processing

Identifying interesting patterns: heuristic measures of interestingness.

Report generation: basic statistics, group statistics, path analysis, etc.

Personalization: separate on-line module, using the models generated by data mining.

Site optimization: adaptive Web sites, introducing index pages.


Open issues…

Technical problems in data collection and pre-processing.

Privacy and security issues.

Sequential and graphical data mining.

Efficient mining for large datasets.

Combination with content and structure data.

Further incorporation of user modeling ideas.


Web mining in action: projects in I.I.&T., NCSR “Demokritos”


Background & interests

Our background:
  Language Engineering (Information Extraction).
  User Modeling (for IE and IR systems).
  Image Analysis.
  Machine Learning (neural, statistical, symbolic).

Research statement:
  Reducing the information overload, by facilitating personalized access to information on the Web.


ECRAN

Objective: Customization of information extraction modules to new domains and languages.

Partners: Thomson-CSF (FR), University of Ancona (IT), University of Rome "Tor Vergata" (IT), Smart Information Services GmbH (DE), NCSR "Demokritos" (GR), University of Friburg (CH), University of Sheffield (UK)

Ended: February 1999.


Web mining in ECRAN

Methods for customizing an IE system:
  Supervised learning of named-entity recognizers.
  Supervised learning for word-sense disambiguation.
  Unsupervised learning of extraction patterns.

Personalizing an IE system:
  Adaptive personal user models.
  Supervised learning of user stereotypes.


UMIE


MITOS

Objectives:
  Personalized Information Retrieval and Extraction from Greek financial news articles.
  Discovery of interesting patterns, combining published events and stock exchange data.

Partners: NCSR "Demokritos", Athens University of Economics and Business, University of Peireas, University of Patras, Knowledge S.A., SENA, KAPA-TEL S.A.

Ending: April 2001


Web mining in MITOS

Supervised learning for document classification, used in Information Retrieval.

Supervised learning to construct an IE system for Greek: sentence splitting, part-of-speech tagging, named-entity recognition, extraction pattern discovery.

Supervised learning of personal user models and unsupervised learning of communities.

Pattern discovery in data extracted from text and stock exchange data.


Personalization in MITOS

[Diagram labels: personal models, communities, clustering, collaborative recommendation.]


M-PIRO

Objective: Personalized descriptions of digital objects (primarily museum exhibits).

Partners: Edinburgh University (UK), System Simulation Ltd (UK), NCSR "Demokritos" (GR), University of Athens (GR), Foundation of the Hellenic World (GR), Institute of Research in Science and Technology (IT).

Ending: February 2003


Personalization in M-PIRO

Knowledge-level modeling and adaptation of object descriptions.

Modeling of short-term and long-term visit history.

Community modeling for collaborative recommendation and adaptive object indexing.


M-PIRO

Dynamic text generation

The temple of Athena is approximately 3 centuries older than the Stadium, the previous building that you visited. It was built during the early 5th century BC, on the peninsula south of the Port Theatre. Prehistoric ruins were also found in the surrounding area. You will probably find them interesting.

Image property of the Foundation of Hellenic World.


CROSSMARC

Objective: Personalized, cross-lingual fact extraction for retail product comparison.

Partners: NCSR “Demokritos” (GR), VeltiNet AE (GR), Univ. of Edinburgh (UK), Univ. Roma, Tor Vergata (IT), Informatique CDC (FR), Internet Commerce Network (FR)

Starting: March 2001


Web mining in CROSSMARC

Wrapper induction and schema extraction.

Customization of information extraction modules to new domains and languages.

Personalization through personal and community models.

Web usage mining for Customer Relationship Management.


Conclusions: where do we go from here?


A paradox of the new era

High commercial demand for research products!

Solutions need to be simple and efficient!


Really useful Web mining!

[Diagram labels: usage mining, structure mining, content mining; site optimization, collaborative filtering, authoritative filtering; really useful Web mining.]


Mining multimedia Web data


Scaling up to unexplored sizes

Until recently machine learning research has considered 10,000 examples a large dataset.

On the Web 10,000,000-record databases are not rare.

Web mining algorithms should operate under space and time constraints.

The Web is naturally dynamic.

Web mining algorithms should allow incremental refinement of extracted models.
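Incremental refinement means that a model is updated from each new observation without re-reading the old data. As one small example of the idea, the sketch below keeps a running mean of visits per topic; the class and attribute names are assumptions, and real usage models are of course richer than a vector of means.

```python
class IncrementalProfile:
    """Running mean of visits per topic, updated one session at a time."""

    def __init__(self):
        self.sessions_seen = 0
        self.mean_visits = {}

    def update(self, session_counts):
        """Fold one session into the profile in time proportional to the topics."""
        self.sessions_seen += 1
        n = self.sessions_seen
        for topic in set(self.mean_visits) | set(session_counts):
            old = self.mean_visits.get(topic, 0.0)
            # Standard incremental mean: new = old + (x - old) / n
            self.mean_visits[topic] = old + (session_counts.get(topic, 0) - old) / n


# Hypothetical per-session visit counts, for illustration only.
profile = IncrementalProfile()
profile.update({"sports": 2})
profile.update({"sports": 4, "finance": 2})
profile.update({})
```

Only the current means and a counter are stored, so memory use does not grow with the number of sessions processed.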


Mining the Web graph

The Web is a graph and Web sites are subgraphs.

Web mining algorithms should be aware of the graphical structure.

Data mining algorithms for structured and graphical data can lead to new Web mining applications.
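The view of a site as a subgraph of the Web can be illustrated with a plain adjacency dictionary. In the sketch below the URLs, the function names and the choice of in-degree as a structural measure are assumptions made for the example.

```python
from urllib.parse import urlsplit


def site_subgraph(links, host):
    """Keep only the pages of one host and the links that stay on that host."""
    def on_site(url):
        return urlsplit(url).hostname == host

    return {
        source: [target for target in targets if on_site(target)]
        for source, targets in links.items()
        if on_site(source)
    }


def in_degree(links):
    """Number of incoming links per page, a very simple structural measure."""
    degree = {}
    for source, targets in links.items():
        degree.setdefault(source, 0)
        for target in targets:
            degree[target] = degree.get(target, 0) + 1
    return degree


# Hypothetical link graph: page URL -> list of pages it links to.
links = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/"],
    "http://b.example/": ["http://a.example/news"],
}
site = site_subgraph(links, "a.example")
popularity = in_degree(links)
```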


Respecting the user’s privacy

Data collection and use should be transparent to the user.

Careless use of personal data will (at best) scare users off.

Navigational data is personal, when associated with an individual.

“Unobtrusive personalization” should be exercised with caution.

Technology can help safeguard privacy.