Post on 26-Mar-2015

CLEI 2007 1

Measuring Contribution of HTML Features in Web Document Clustering

Oldemar Rodríguez

School of Mathematics, UCR

and Predisoft

Esteban Meneses

Computing Research Center, ITCR

Motivation

Which HTML feature is the most important to provide good clustering results?

Using symbolic objects to cluster web documents.

15th World Wide Web Conference (2006)

HTML Document Clustering

Find meaningful groups from a web document collection.

Effectively represent web document clusters for further analysis.

HTML Document

Classical Representations

• Different approaches for representing a web document.

<5,22,19,4,...,38>

Vectorial Representation

• Every document is represented by a vector in n-dimensional space.

• Bag of words scheme. Each variable represents the relative weight of a term in the document.
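
As an illustration of the bag of words scheme, here is a minimal Python sketch that turns documents into vectors of relative term frequencies. The tokenizer, the function names and the choice of relative frequency as the weight are assumptions made for this example; the slides do not say which weighting was used.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(documents):
    """Return (vocabulary, vectors); each vector holds relative term frequencies."""
    tokenized = [tokenize(doc) for doc in documents]
    # One dimension per distinct term in the whole collection.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

# Hypothetical two-document collection.
docs = ["web document clustering", "clustering of web pages on the web"]
vocab, vecs = bag_of_words(docs)
```

Each document becomes one point in a space with as many dimensions as the vocabulary has terms, which is why real collections lead to very high-dimensional, sparse vectors.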

Symbolic Objects

• Real-life objects are too complex to be represented by points in a vectorial space. [Bock&Diday, 2000]

• Symbolic objects overcome this limitation by representing concepts rather than individuals.

• In a symbolic data array each variable can have one of many data types: sets, intervals, histograms, trees, graphs, functions, fuzzy data, etc.

Symbolic Data Table

Multivariate Numeric Analysis

Individual  Age  Profession  Wage      Location
3457        36   Lawyer      2,500.00  San José
1251        28   Teacher     1,750.00  Alajuela
3245        39   Doctor      2,400.00  San José
7635        33   Teacher     1,900.00  Alajuela
3245        35   Engineer    1,850.00  Alajuela
5367        27   Engineer    1,900.00  Heredia
6486        34   Manager     1,600.00  Heredia

Multivariate Symbolic Analysis

Concept (Location)  Age       Profession                   Wage
San José            [36, 39]  {Lawyer 50%, Doctor 50%}     [2,400.00 – 2,500.00]
Alajuela            [28, 35]  {Teacher 66%, Engineer 33%}  [1,750.00 – 1,900.00]
Heredia             [27, 34]  {Engineer 50%, Manager 50%}  [1,600.00 – 1,900.00]

From relational databases to symbolic databases: millions of data records are summarized as hundreds of concepts.
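
The step from individual records to one symbolic object per location can be sketched in Python as follows. The records are the ones in the numeric table above; the function name `to_symbolic` and the dictionary layout are hypothetical, chosen only for this illustration.

```python
from collections import Counter

# (individual, age, profession, wage, location), copied from the numeric table.
records = [
    (3457, 36, "Lawyer", 2500.00, "San José"),
    (1251, 28, "Teacher", 1750.00, "Alajuela"),
    (3245, 39, "Doctor", 2400.00, "San José"),
    (7635, 33, "Teacher", 1900.00, "Alajuela"),
    (3245, 35, "Engineer", 1850.00, "Alajuela"),
    (5367, 27, "Engineer", 1900.00, "Heredia"),
    (6486, 34, "Manager", 1600.00, "Heredia"),
]

def to_symbolic(rows):
    """Aggregate individuals into one symbolic object per location."""
    groups = {}
    for _individual, age, profession, wage, location in rows:
        groups.setdefault(location, []).append((age, profession, wage))
    table = {}
    for location, members in groups.items():
        ages = [m[0] for m in members]
        wages = [m[2] for m in members]
        counts = Counter(m[1] for m in members)
        table[location] = {
            "age": (min(ages), max(ages)),        # interval variable
            "profession": {p: c / len(members)    # histogram variable
                           for p, c in counts.items()},
            "wage": (min(wages), max(wages)),     # interval variable
        }
    return table

symbolic_table = to_symbolic(records)
```

Seven rows collapse into three concepts, each described by intervals and a histogram rather than by single values.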

Symbolic Data Base

Relational Data Base: 100% knowledge, 15 Gigabytes

Symbolic Data Base: 90% knowledge, 10.3 Megabytes

Symbolic Representations

• A complex representation that takes into account: term frequency, word order and phrases.
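
One way to picture a symbolic description of an HTML document is as a set of term histograms, one per HTML feature. The sketch below uses Python's standard `html.parser` and the feature names Text, Title, Bold, Anchor and Header that appear in the experiments. The tag-to-feature mapping and the class name `FeatureExtractor` are assumptions for illustration, and only term frequency is covered, not word order or phrases.

```python
from collections import Counter
from html.parser import HTMLParser
import re

# Assumed mapping from HTML tags to feature names (not taken from the slides).
FEATURE_TAGS = {
    "title": "title", "b": "bold", "strong": "bold", "a": "anchor",
    "h1": "header", "h2": "header", "h3": "header",
    "h4": "header", "h5": "header", "h6": "header",
}

class FeatureExtractor(HTMLParser):
    """Collect one term histogram per HTML feature."""

    def __init__(self):
        super().__init__()
        self.open_features = []  # features of the tags that are currently open
        self.histograms = {name: Counter() for name in
                           ("text", "title", "bold", "anchor", "header")}

    def handle_starttag(self, tag, attrs):
        if tag in FEATURE_TAGS:
            self.open_features.append(FEATURE_TAGS[tag])

    def handle_endtag(self, tag):
        feature = FEATURE_TAGS.get(tag)
        if feature is not None and feature in self.open_features:
            self.open_features.remove(feature)

    def handle_data(self, data):
        terms = re.findall(r"[a-z]+", data.lower())
        # Simplification: every term counts as text, including title terms.
        self.histograms["text"].update(terms)
        for feature in set(self.open_features):
            self.histograms[feature].update(terms)

# Hypothetical page used only to exercise the extractor.
page = ("<html><head><title>Web Clustering</title></head><body>"
        "<h1>Results</h1><p>Some <b>bold</b> text and a "
        "<a href='x'>link</a>.</p></body></html>")
extractor = FeatureExtractor()
extractor.feed(page)
extractor.close()
features = extractor.histograms
```

Each histogram can then be treated as one variable of the symbolic object that describes the document.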

The K-Means Clustering Method

[Figure: four scatter plots, each with both axes ranging from 0 to 10]
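
The slides name the method without spelling out the steps, so the following is only a minimal sketch of the standard k-means iteration for points in a vector space, assuming squared Euclidean distance. Taking the first k points as initial centers is a simplification made for this example.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Center of gravity (vector of means) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100):
    """Alternate assignment and update steps until the centers stop moving."""
    centers = [tuple(p) for p in points[:k]]  # simplistic initialization
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda j: squared_distance(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        new_centers = [mean(c) if c else centers[j]
                       for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Hypothetical data: two well-separated groups on a 0 to 10 grid.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = k_means(data, k=2)
```

The update step relies on a vector of means, which is exactly the operation that needs to be redefined when the variables are intervals, histograms or graphs instead of numbers.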

But there are some problems…

Distance Measures

Theorem: Fisher's Equality

• Total inertia = between-class inertia + within-class inertia

[Figure: scatter plot, both axes from 0 to 10]
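
Fisher's equality can be checked numerically for points in a vector space. The sketch below assumes equal weights for all points and squared Euclidean distance, which is the usual setting for this decomposition; the function names and the sample partition are hypothetical.

```python
def centroid(points):
    """Center of gravity of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def inertias(clusters):
    """Return (total, between-class, within-class) inertia of a partition."""
    all_points = [p for cluster in clusters for p in cluster]
    n = len(all_points)
    g = centroid(all_points)  # global center of gravity
    total = sum(sq_dist(p, g) for p in all_points) / n
    between = sum(len(c) * sq_dist(centroid(c), g) for c in clusters) / n
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c) / n
    return total, between, within

# Hypothetical partition into two classes.
partition = [[(1, 1), (1, 2), (2, 1)], [(8, 8), (9, 8), (8, 9)]]
total, between, within = inertias(partition)
```

Because the total inertia does not depend on the partition, minimizing the within-class inertia is equivalent to maximizing the between-class inertia.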

Problems in the symbolic case:

1. Representing a class by its center of gravity, that is, by its vector of means.

2. What is the center of gravity?

Evaluation Criteria

1. Rand Index

2. Mutual Information

3. F-Measure

4. Entropy
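
The slides list the criteria without formulas. As a sketch, two of them can be computed as follows, using common textbook definitions that may differ in detail from the ones used in the experiments: the Rand index as the fraction of document pairs on which two labelings agree, and entropy as the size-weighted average class entropy of the clusters.

```python
from collections import Counter
from itertools import combinations
from math import log2

def rand_index(labels_true, labels_pred):
    """Fraction of pairs that are grouped consistently by both labelings."""
    pairs = list(combinations(range(len(labels_true)), 2))
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)

def clustering_entropy(labels_true, labels_pred):
    """Size-weighted average of the class entropy inside each cluster."""
    n = len(labels_true)
    by_cluster = {}
    for true_label, cluster in zip(labels_true, labels_pred):
        by_cluster.setdefault(cluster, []).append(true_label)
    result = 0.0
    for members in by_cluster.values():
        size = len(members)
        counts = Counter(members)
        h = -sum((c / size) * log2(c / size) for c in counts.values())
        result += (size / n) * h
    return result

# Hypothetical labels: true classes and two candidate clusterings.
truth = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]  # same grouping, different cluster names
mixed = [0, 0, 1, 1, 1, 1]    # one document placed in the wrong cluster
```

A higher Rand index and a lower entropy both indicate a clustering that is closer to the reference classes.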

Experiments

WebKB

Text    0.2894
Title   0.2584
Bold    0.0379
Anchor  0.1689
Header  0.1009
Graph   0.1229
Tree    0.0212

20Newsgroup

Text   0.7035
Graph  0.2515
Tree   0.0449

Conclusions

• Symbolic representations are richer and more flexible than classical representations.

• The text of an HTML document seems to be the most important factor for clustering HTML documents.

Thank you!