document clustering and classification

Prepared by: Mahmoud Rafeek Alfarra Seminar Program Document Clustering and Classification

Upload: mahmoud-alfarra

Post on 14-Apr-2017


Page 1: Document clustering and classification

Prepared by: Mahmoud Rafeek Alfarra

Seminar Program

Document Clustering and Classification

Page 2: Document clustering and classification

Outline

Classification and its techniques

Clustering and its techniques

Document clustering

Comparison

Page 3: Document clustering and classification

Classification: Definition

Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
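The split into training and test sets can be sketched in a few lines of Python. This is only an illustration: the records, the 30% test fraction, and the majority-class "model" are assumptions made for the example and are not taken from the slides.

```python
import random
from collections import Counter

# Hypothetical labeled records: (attributes, class). Not from the slides.
records = [((i,), "Yes" if i % 3 == 0 else "No") for i in range(1, 11)]

def split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and divide them into (training_set, test_set)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training_set):
    """Placeholder learner: always predicts the most frequent training class."""
    labels = [label for _, label in training_set]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(attrs) == label for attrs, label in test_set)
    return correct / len(test_set)

training_set, test_set = split(records)
model = learn_majority_class(training_set)   # build the model on the training set
acc = accuracy(model, test_set)              # validate it on the test set
print(f"accuracy on held-out test set: {acc:.2f}")
```

Any real learner can replace `learn_majority_class`; the point is only that accuracy is measured on records the model did not see while it was being built.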

Page 4: Document clustering and classification

Classification: Definition

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: the training set feeds a learning algorithm that learns a model (induction); the model is then applied to the test set (deduction)]
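As a rough illustration of induction and deduction on the two tables, the sketch below uses a 1-nearest-neighbour rule as the learning algorithm. The slide does not say which algorithm is meant, so this choice, the `distance` function, and its scaling of Attrib3 are all assumptions.

```python
# Training set from the slide: (Attrib1, Attrib2, Attrib3 in thousands, Class).
training_set = [
    ("Yes", "Large", 125, "No"),
    ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),
    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),
    ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),
    ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),
    ("No", "Small", 90, "Yes"),
]

# Test set from the slide: the class is unknown.
test_set = [
    ("No", "Small", 55),
    ("Yes", "Medium", 80),
    ("Yes", "Large", 110),
    ("No", "Small", 95),
    ("No", "Large", 67),
]

def distance(a, b):
    """Assumed distance: categorical mismatches plus a scaled Attrib3 gap."""
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100

def learn_model(training):
    """Induction: here the 'model' simply remembers the training records."""
    def model(record):
        nearest = min(training, key=lambda row: distance(record, row))
        return nearest[3]
    return model

model = learn_model(training_set)
# Deduction: apply the learned model to the unlabeled test records.
predictions = [model(record) for record in test_set]
for tid, label in zip(range(11, 16), predictions):
    print(tid, label)
```

Because the test records have no known class, this sketch can only produce predictions; measuring accuracy would require holding back some of the labeled records instead.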

Page 5: Document clustering and classification

Classification Techniques

Decision Tree based Methods

Rule-based Methods

Memory based reasoning

Neural Networks

Naïve Bayes and Bayesian Belief Networks

Support Vector Machines

Page 6: Document clustering and classification

Artificial Neural Networks (ANN)

X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

[Figure: inputs X1, X2, X3 enter a black box that produces output Y]

Output Y is 1 if at least two of the three inputs are equal to 1.

Page 7: Document clustering and classification

Artificial Neural Networks (ANN)

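[Figure: the black box realized as a single output node Y; each input node X1, X2, X3 is connected to it with weight 0.3, and the threshold is t = 0.4]

$$Y = I(0.3X_1 + 0.3X_2 + 0.3X_3 - 0.4 > 0)$$

$$\text{where } I(z) = \begin{cases} 1 & \text{if } z \text{ is true} \\ 0 & \text{otherwise} \end{cases}$$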
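A quick check in Python suggests that these weights and this threshold reproduce the "at least two of three inputs" table. The function names are illustrative and not part of the slides.

```python
# Truth table from the slides: (X1, X2, X3, Y).
truth_table = [
    (1, 0, 0, 0),
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (1, 1, 1, 1),
    (0, 0, 1, 0),
    (0, 1, 0, 0),
    (0, 1, 1, 1),
    (0, 0, 0, 0),
]

def indicator(condition):
    """I(z): 1 if the condition z is true, 0 otherwise."""
    return 1 if condition else 0

def output_node(x1, x2, x3):
    """Y = I(0.3*X1 + 0.3*X2 + 0.3*X3 - 0.4 > 0)."""
    return indicator(0.3 * x1 + 0.3 * x2 + 0.3 * x3 - 0.4 > 0)

matches = [output_node(x1, x2, x3) == y for x1, x2, x3, y in truth_table]
print(all(matches))
```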

Page 8: Document clustering and classification

Artificial Neural Networks (ANN)

The model is an assembly of interconnected nodes and weighted links.

The output node sums its input values, each weighted by the weight of its link.

The weighted sum is compared against a threshold t.

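[Figure: input nodes X1, X2, X3 connected to the output node Y by links with weights w1, w2, w3; the output node has threshold t]

Perceptron Model

$$Y = I\left(\sum_i w_i X_i - t > 0\right)$$

or

$$Y = \operatorname{sign}\left(\sum_i w_i X_i - t\right)$$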
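The two output conventions of the perceptron model can be written as a pair of small Python functions. This is a minimal sketch: the AND-gate weights in the usage lines are hypothetical, and mapping a sum of exactly zero to -1 in the sign variant is an assumption, since the slide does not define sign(0).

```python
def weighted_sum(weights, x, t):
    """Sum of w_i * X_i over all inputs, minus the threshold t."""
    return sum(w * xi for w, xi in zip(weights, x)) - t

def perceptron_indicator(weights, x, t):
    """Y = I(sum_i w_i X_i - t > 0), giving outputs in {0, 1}."""
    return 1 if weighted_sum(weights, x, t) > 0 else 0

def perceptron_sign(weights, x, t):
    """Y = sign(sum_i w_i X_i - t), giving outputs in {-1, +1}."""
    return 1 if weighted_sum(weights, x, t) > 0 else -1

# Hypothetical example: weights and threshold that behave like a logical AND.
and_weights, and_threshold = (1.0, 1.0), 1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron_indicator(and_weights, x, and_threshold),
          perceptron_sign(and_weights, x, and_threshold))
```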

Page 9: Document clustering and classification

Clustering Definition

Clustering is the division of data into groups of similar objects.

Each group is called a cluster and consists of objects that are similar to one another and dissimilar to objects of other groups.

Page 10: Document clustering and classification

Clustering Definition

[Figure: data points grouped into three clusters labeled C1, C2, and C3]

Page 11: Document clustering and classification

Document clustering

Document clustering is an automatic grouping of text documents into clusters so that documents within a cluster have high similarity in comparison to one another, but are dissimilar to documents in other clusters.

Page 12: Document clustering and classification

The challenge

The problem of document clustering is how to organize a large set of documents on various topics and reach a satisfactory organization. It can be stated as follows:

Given: A huge set of documents on various topics (shared, related, or totally different).

Required: Group the documents into a number of clusters such that the intra-cluster similarity is maximized and the inter-cluster similarity is minimized.
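One way to make the two quantities concrete is to average pairwise cosine similarity inside clusters and across clusters. The sketch below does this for a handful of made-up documents; the documents, the given clustering, the bag-of-words vectors, and the choice of cosine similarity are all assumptions for illustration.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical documents and a hypothetical clustering of them.
docs = {
    "d1": "cat ate cheese",
    "d2": "cat ate fish",
    "d3": "stock market fell",
    "d4": "stock market rose",
}
clusters = [["d1", "d2"], ["d3", "d4"]]

# Bag-of-words term counts for every document.
vectors = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0

# Average similarity of document pairs inside the same cluster.
intra = mean(cosine(vectors[a], vectors[b])
             for c in clusters for a, b in combinations(c, 2))
# Average similarity of document pairs taken from different clusters.
inter = mean(cosine(vectors[a], vectors[b])
             for c1, c2 in combinations(clusters, 2) for a in c1 for b in c2)

print(f"intra-cluster: {intra:.3f}  inter-cluster: {inter:.3f}")
```

A good clustering in the sense of the slide is one for which the first number is high and the second is low.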

Page 13: Document clustering and classification

The challenge

[Figure: three document clusters; intra-cluster similarity is measured between documents inside one cluster, inter-cluster similarity between documents of different clusters]

Inter-Cluster Sim. < Intra-Cluster Sim.

Page 14: Document clustering and classification

Clustering’s Process

[Figure: the clustering process as a four-step pipeline from document samples to knowledge]

1. Document Data Model Representation (input: document samples)

• Document cleaning
• Feature selection or extraction

2. Clustering Algorithm (output: clusters)

• Similarity measure
• Criterion of clustering

3. Cluster Validation

• External indices
• Internal indices

4. Results Interpretation (output: knowledge)
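The first step of the process, turning raw documents into a data model, might look like the following Python sketch. The stop-word list, the regular-expression tokenizer, and the TF-IDF weighting are assumptions chosen for the example; the slide only names document cleaning and feature selection or extraction.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def clean(text):
    """Document cleaning: lowercase, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def tf_idf(documents):
    """Feature extraction: one {term: weight} vector per document."""
    tokenized = [clean(d) for d in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({term: count * math.log(n / df[term])
                        for term, count in tf.items()})
    return vectors

documents = ["The cat ate the cheese.", "A cat chased a mouse."]
vectors = tf_idf(documents)
print(clean(documents[0]))
print(vectors[0])
```

Terms that occur in every document (here "cat") receive weight zero, so the vectors emphasize the terms that distinguish one document from another.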

Page 15: Document clustering and classification

Clustering Techniques

Clustering methods in general can be viewed from different perspectives; the ones most widely applied to the text domain are:

Hierarchical Clustering

Partitioning Clustering

Neural Network based Clustering
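As an example of the partitioning family, here is a minimal k-means sketch in plain Python. The slides do not name a specific partitioning algorithm, so k-means, the naive initialization with the first k points, and the 2-D points standing in for document vectors are all assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=10):
    """Partition points into k clusters; returns (clusters, centroids)."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids

# Hypothetical 2-D vectors forming two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
clusters, centroids = k_means(points, k=2)
print(clusters)
```

Hierarchical methods differ in that they build a nested tree of clusters rather than a single flat partition with a fixed k.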

Page 16: Document clustering and classification

Clustering Techniques

Suffix Tree Clustering algorithm


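D1: cat ate cheese

D2: mouse ate cheese too

D3: cat ate mouse too

If

$$\frac{|B_m \cap B_n|}{|B_m|} > 0.5 \quad \text{and} \quad \frac{|B_m \cap B_n|}{|B_n|} > 0.5$$

Then base clusters B_m and B_n are considered similar and are merged.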
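The overlap test between base clusters can be tried on the three example documents. The sketch below is a simplification: it enumerates contiguous word phrases directly instead of building a suffix tree, so it may yield a few more base clusters than the real algorithm, and the function names are illustrative.

```python
from itertools import combinations

docs = {
    "D1": "cat ate cheese",
    "D2": "mouse ate cheese too",
    "D3": "cat ate mouse too",
}

def phrases(text):
    """All contiguous word sequences of a document (stand-in for a suffix tree)."""
    words = text.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

def base_clusters(docs, min_docs=2):
    """Map each phrase shared by at least min_docs documents to those documents."""
    found = {}
    for doc_id, text in docs.items():
        for phrase in phrases(text):
            found.setdefault(phrase, set()).add(doc_id)
    return {p: ids for p, ids in found.items() if len(ids) >= min_docs}

def similar(b_m, b_n, threshold=0.5):
    """True if the overlap exceeds the threshold relative to both clusters."""
    overlap = len(b_m & b_n)
    return overlap / len(b_m) > threshold and overlap / len(b_n) > threshold

clusters = base_clusters(docs)
for p, q in combinations(sorted(clusters), 2):
    if similar(clusters[p], clusters[q]):
        print(f"merge '{p}' {sorted(clusters[p])} with '{q}' {sorted(clusters[q])}")
```

For instance, the base clusters for "cheese" (D1, D2) and "mouse" (D2, D3) share only one document, which is exactly half of each, so they do not pass the strict 0.5 test.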

Page 17: Document clustering and classification

Clustering Techniques

Document Index Graph for clustering (DIG)

Page 18: Document clustering and classification

Clustering Techniques

Graph-based growing hierarchical SOM

Page 19: Document clustering and classification

Comparison

Page 20: Document clustering and classification

Thanks