1 Clustering Categorical Data The Case of Quran Verses Presented By Muhammad Al-Watban IS 598

Post on 14-Dec-2015


Clustering Categorical Data

The Case of Quran Verses

Presented by Muhammad Al-Watban

IS 598

Outline

• Introduction
• Preprocessing of Quran Verses
• Similarity Measures
• Assessing Cluster Similarities
• Shortcomings of Traditional Clustering Methods with Categorical Data
• ROCK - Major Definitions
• ROCK Clustering Algorithm
• ROCK Example
• Conclusion and Future Work

Introduction

The holy Quran covers a wide range of topics.

The Quran does not treat each topic in a contiguous sequence of verses or suras.

A single verse usually deals with several subjects.

Project goal: to cluster the verses of the Holy Quran based on their subjects.

Preprocessing of Quran Verses

Manual preprocessing of the Quran text is necessary to capture the subjects of the verses in a tabular format.

Verses in the Holy Quran can be viewed as records, and the related subjects as attributes of each record. This is demonstrated by the following table:

The data in the above table is similar to what is known as market-basket data.

Here, we will call it verses-treasures data.

Similarity Measures

Two types of attributes:

1. Continuous attributes:
• The range of attribute values is continuous and ordered.
• Includes attributes with numeric values (e.g. salary).
• Also includes attributes whose allowed values form an ordered, meaningful sequence (e.g. professional ranks, disease severity levels).
• The similarity (or dissimilarity) between objects is computed from the distance between them.
• The most commonly used distance measures are Euclidean distance and Manhattan distance.
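As a minimal sketch of the two distance measures named above, the following Python functions compute them for numeric records. The sample records (a salary in thousands and a rank encoded as an integer) are hypothetical and only serve as illustration.

```python
import math

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical records: (salary in thousands, professional rank encoded 1..5)
a = (40.0, 1)
b = (43.0, 5)

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Both measures rely on subtracting attribute values, which is why they only make sense for ordered, numeric attributes.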

Similarity Measures

2. Categorical attributes:
• Attributes whose underlying domain is not ordered. Examples: colors, blood type.
• If the attribute has only two states (namely 0 and 1), it is called binary; if it has more than two states, it is called nominal.
• There is no easy way to measure a distance between objects.
• We can define dissimilarity based on the simple matching approach:

$$d(i, j) = \frac{p - m}{p}$$

where m is the number of matched attributes and p is the total number of attributes.
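The simple matching dissimilarity, d(i, j) = (p - m)/p, can be sketched in a few lines of Python. The two sample records (color, blood type, a yes/no flag) are hypothetical.

```python
def simple_matching_dissimilarity(x, y):
    """Return (p - m) / p, where p is the number of attributes
    and m is the number of attributes on which x and y agree."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p

# Hypothetical categorical records: (color, blood type, smoker)
i = ("red", "A", "yes")
j = ("red", "O", "no")

# The records agree on 1 of 3 attributes, so the dissimilarity is 2/3.
print(simple_matching_dissimilarity(i, j))
```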

Similarity Measures

Where does the verses-treasures data fit?
• Each verse can be represented by a record with Boolean attributes, where each attribute corresponds to a single subject.
• The attribute corresponding to a subject is T if the verse contains that subject; otherwise, it is F.
• As noted above, Boolean attributes are a special case of categorical attributes.
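The Boolean encoding described here can be sketched as follows. The verse identifiers `P1` and `P2` and their subject sets are illustrative placeholders, not actual preprocessing output.

```python
# Illustrative verse-to-subjects mapping (placeholder identifiers and subjects).
verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
}

# One Boolean attribute (column) per distinct subject, in a fixed order.
subjects = sorted(set().union(*verses.values()))

# Each verse becomes a record of True/False values, one per subject.
table = {vid: [s in subj for s in subjects] for vid, subj in verses.items()}

print(subjects)
for vid, row in table.items():
    print(vid, ["T" if flag else "F" for flag in row])
```

Each row then has the same shape as a market-basket transaction converted to a Boolean point.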

Assessing Cluster Similarities

Many clustering algorithms (such as hierarchical clustering) require computing the distance between clusters (rather than between individual elements).

There are several standard methods:

1. Single linkage: the distance D(r,s) between clusters r and s is defined as the distance between the closest pair of objects, one from each cluster.

Assessing Cluster Similarities

2. Complete linkage: the distance D(r,s) is defined as the distance between the farthest pair of objects, one from each cluster.

3. Average linkage: the distance D(r,s) is defined as the average of the distances between all pairs of objects (i, j), where i belongs to cluster r and j belongs to cluster s.

Assessing Cluster Similarities

4. Centroid linkage: the distance D(r,s) between clusters is defined as the distance between the two cluster centroids.
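The four linkage definitions can be compared side by side with a small sketch. It assumes Euclidean distance between objects, and the two clusters `r` and `s` are hypothetical 2-D points chosen so the results are easy to check by hand.

```python
import math
from itertools import product

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def single_linkage(r, s):
    """Distance between the closest pair, one object from each cluster."""
    return min(euclidean(a, b) for a, b in product(r, s))

def complete_linkage(r, s):
    """Distance between the farthest pair, one object from each cluster."""
    return max(euclidean(a, b) for a, b in product(r, s))

def average_linkage(r, s):
    """Average distance over all cross-cluster pairs."""
    return sum(euclidean(a, b) for a, b in product(r, s)) / (len(r) * len(s))

def centroid_linkage(r, s):
    """Distance between the two cluster centroids."""
    return euclidean(centroid(r), centroid(s))

# Hypothetical clusters
r = [(0.0, 0.0), (0.0, 2.0)]
s = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single_linkage), ("complete", complete_linkage),
                 ("average", average_linkage), ("centroid", centroid_linkage)]:
    print(name, fn(r, s))
```

For these points, single and centroid linkage both give 3, while complete linkage gives the diagonal distance, the square root of 13.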

Shortcomings of Traditional clustering methods with categorical data

Example: consider the following 4 market-basket transactions:

T1 = {1, 2, 3, 4}
T2 = {1, 2, 4}
T3 = {3}
T4 = {4}

Converting these transactions to Boolean points, we get:

P1 = (1, 1, 1, 1)
P2 = (1, 1, 0, 1)
P3 = (0, 0, 1, 0)
P4 = (0, 0, 0, 1)

Using Euclidean distance to measure the closeness between all pairs of points, we find that d(p1,p2) is the smallest distance:

$$d(p_1, p_2) = \left(|1-1|^2 + |1-1|^2 + |1-0|^2 + |1-1|^2\right)^{1/2} = 1$$

Shortcomings of Traditional clustering methods with categorical data

If we use the centroid-based hierarchical algorithm, we merge P1 and P2 and get a new cluster (P12) with (1, 1, 0.5, 1) as its centroid.

Then, using Euclidean distance again, we find:
d(p12,p3) = √3.25 ≈ 1.80
d(p12,p4) = √2.25 = 1.50
d(p3,p4) = √2 ≈ 1.41

So we should merge P3 and P4, since the distance between them is the shortest.

However, T3 and T4 do not have even a single item in common. So using distance metrics as a similarity measure for categorical data is not appropriate.

The solution is ROCK.
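The two merge decisions in this example can be reproduced with a short sketch. It assumes plain Euclidean distance and a centroid computed as the coordinate-wise mean; the cluster name `P12` follows the slide.

```python
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

points = {
    "P1": (1, 1, 1, 1),
    "P2": (1, 1, 0, 1),
    "P3": (0, 0, 1, 0),
    "P4": (0, 0, 0, 1),
}

# Step 1: the closest pair among the original points.
pairwise = {(a, b): euclidean(points[a], points[b])
            for a, b in combinations(points, 2)}
closest = min(pairwise, key=pairwise.get)
print(closest, pairwise[closest])

# Step 2: replace P1 and P2 by their centroid and look for the next closest pair.
p12 = tuple((x + y) / 2 for x, y in zip(points["P1"], points["P2"]))
clusters = {"P12": p12, "P3": points["P3"], "P4": points["P4"]}
after = {(a, b): euclidean(clusters[a], clusters[b])
         for a, b in combinations(clusters, 2)}
next_merge = min(after, key=after.get)
print(next_merge, after[next_merge])
```

The second step selects P3 and P4, even though the underlying transactions T3 and T4 share no item.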

ROCK - Major definitions

• Similarity function
• Neighbors
• Links
• Criterion function
• Goodness measure

Similarity function

Let sim(Pi, Pj) be a similarity function that measures the closeness between points Pi and Pj.

ROCK assumes that the sim function is normalized to return a value between 0 and 1.

For Quran treasures data, a possible definition of the sim function is based on the Jaccard coefficient:

$$\mathrm{sim}(P_i, P_j) = \frac{|P_i \cap P_j|}{|P_i \cup P_j|}$$

Example : similarity function

Suppose two verses (P1 and P2) contain the following subjects:
P1 = {judgment, faith, prayer, fair}
P2 = {fasting, faith, prayer}

sim(P1,P2) = |P1 ∩ P2| / |P1 ∪ P2| = 2 / 5 = 0.40
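The Jaccard coefficient, |Pi ∩ Pj| / |Pi ∪ Pj|, maps directly onto Python sets. The sketch below recomputes the value for the two verses above; treating two empty sets as fully similar is an assumption made here, not something the slides specify.

```python
def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty subject sets count as identical
    return len(a & b) / len(a | b)

p1 = {"judgment", "faith", "prayer", "fair"}
p2 = {"fasting", "faith", "prayer"}

# Two shared subjects out of five distinct subjects overall.
print(jaccard(p1, p2))  # 0.4
```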

Neighbors and Links

One main problem of traditional clustering is that only local properties involving the two points themselves are considered.

Neighbor: if the similarity between two points is at least a given similarity threshold (θ), they are neighbors.

Link: the link for a pair of points is the number of their common neighbors.

The link therefore incorporates global information about the other points in the neighborhood of the two points. The larger the link, the higher the probability that the two points belong to the same cluster.

Example : neighboring and linking

Example: assume that we have three distinct points p1, p2 and p3, where
neighbor(p1) = {p1, p2}
neighbor(p2) = {p1, p2, p3}
neighbor(p3) = {p3, p2}

(Figure: neighboring graph)

To find the number of links between two points, say p1 and p3, we count their common neighbors; hence, the link function between p1 and p3 is:

link(p1,p3) = |neighbor(p1) ∩ neighbor(p3)| = |{p2}| = 1
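Counting common neighbors is a set intersection. The sketch below uses the three neighbor lists from this example; the helper name `link` is chosen here for illustration.

```python
# Neighbor lists from the example (each point is its own neighbor).
neighbors = {
    "p1": {"p1", "p2"},
    "p2": {"p1", "p2", "p3"},
    "p3": {"p3", "p2"},
}

def link(a, b):
    """Number of common neighbors of points a and b."""
    return len(neighbors[a] & neighbors[b])

print(link("p1", "p3"))  # 1, the only common neighbor is p2
print(link("p1", "p2"))  # 2, the common neighbors are p1 and p2
```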

Example : minimum linkages

Suppose we have four points P1, P2, P3, P4 and the similarity threshold (θ) is equal to 1.
Then two points are neighbors if sim(Pi,Pj) >= 1; hence, points are considered neighbors only of identical points (i.e. only of themselves).
To find link(P1,P2):
neighbor(P1) = {P1}
neighbor(P2) = {P2}

link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 0

The following table shows the number of links (common neighbors) between the four points:

We can depict the neighboring graph:

Example : maximum linkages

Suppose we have four points P1, P2, P3, P4 and the similarity threshold (θ) is equal to 0.
Then two points are neighbors if sim(Pi,Pj) >= 0; hence, any pair of points are neighbors.
To find link(P1,P2):
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P2) = {P1, P2, P3, P4}

link(P1,P2) = |neighbor(P1) ∩ neighbor(P2)| = 4
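The two threshold extremes (θ = 1 and θ = 0) can be checked with a short sketch. It assumes the Jaccard coefficient as the similarity function and uses four hypothetical, pairwise distinct item sets, since the slides do not specify the contents of P1 to P4.

```python
def jaccard(a, b):
    return len(a & b) / len(a | b)

def neighbor_sets(points, theta):
    """For each point, the set of points whose similarity to it is >= theta."""
    return {a: {b for b in points if jaccard(points[a], points[b]) >= theta}
            for a in points}

def link(nbrs, a, b):
    """Number of common neighbors of a and b."""
    return len(nbrs[a] & nbrs[b])

# Hypothetical, pairwise distinct points
points = {
    "P1": {"a", "b"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
    "P4": {"d", "e"},
}

strict = neighbor_sets(points, 1.0)  # each point is a neighbor of itself only
loose = neighbor_sets(points, 0.0)   # every point is a neighbor of every point

print(link(strict, "P1", "P2"))  # 0
print(link(loose, "P1", "P2"))   # 4
```

Useful thresholds lie between these extremes, where the link count starts to separate dense groups from loosely related points.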

The following table shows the number of links (common neighbors) between the four points:

We can depict the neighboring graph:

Example: illustrating links

From the previous example, we have:
neighbor(P1) = {P1, P2, P3, P4}
neighbor(P3) = {P1, P2, P3, P4}

link(P1,P3) = |neighbor(P1) ∩ neighbor(P3)| = 4 links

We can depict these four different links (or paths) through these four different neighbors as follows:

Criterion function

To get the best clusters, we have to maximize this criterion function:

$$E_l = \sum_{i=1}^{k} n_i \times \sum_{p_q, p_r \in C_i} \frac{\mathrm{link}(p_q, p_r)}{n_i^{1 + 2f(\theta)}}$$

where C_i denotes cluster i, n_i is the number of points in C_i, k is the number of clusters, and θ is the similarity threshold.

• Suppose that in C_i, each point has roughly n_i^{f(θ)} neighbors.
• A suitable choice for basket data is f(θ) = (1-θ)/(1+θ).

Criterion function

By maximizing this criterion function, we maximize the sum of links between intra-cluster point pairs while at the same time minimizing the sum of links between pairs of points belonging to different clusters (i.e. inter-cluster point pairs).
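To make the criterion function concrete, the sketch below evaluates it for two candidate clusterings of four points. The link counts and θ = 0.3 are illustrative values, and summing over unordered pairs of distinct points is an assumption about how the inner sum is read.

```python
from itertools import combinations

def f(theta):
    """f(theta) = (1 - theta) / (1 + theta), the choice suggested for basket data."""
    return (1 - theta) / (1 + theta)

def criterion(clusters, link, theta):
    """E_l = sum over clusters of n_i * (intra-cluster links) / n_i^(1 + 2 f(theta))."""
    exponent = 1 + 2 * f(theta)
    total = 0.0
    for cluster in clusters:
        n = len(cluster)
        # Assumption: each unordered pair of distinct points is counted once.
        intra = sum(link.get(frozenset(pair), 0)
                    for pair in combinations(sorted(cluster), 2))
        total += n * intra / n ** exponent
    return total

# Illustrative link counts between four points
link = {
    frozenset(("P1", "P2")): 3, frozenset(("P1", "P3")): 3,
    frozenset(("P1", "P4")): 1, frozenset(("P2", "P3")): 3,
    frozenset(("P2", "P4")): 2, frozenset(("P3", "P4")): 1,
}

option_a = [{"P1", "P2", "P3"}, {"P4"}]
option_b = [{"P1", "P2"}, {"P3", "P4"}]

print(criterion(option_a, link, 0.3))
print(criterion(option_b, link, 0.3))
```

With these values the first clustering scores higher, because it keeps the three heavily linked points together.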

Goodness measure

Goodness function:

$$g(C_i, C_j) = \frac{\mathrm{link}[C_i, C_j]}{(n_i + n_j)^{1 + 2f(\theta)} - n_i^{1 + 2f(\theta)} - n_j^{1 + 2f(\theta)}}$$

During clustering, we use this goodness measure in order to maximize the criterion function.

This goodness measure helps to identify the best pair of clusters to be merged during each step of ROCK.
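The goodness measure, g(Ci, Cj) = link[Ci, Cj] / ((ni + nj)^(1+2f(θ)) - ni^(1+2f(θ)) - nj^(1+2f(θ))), can be sketched as a small function. The argument names and the sample link counts are chosen here for illustration, with f(θ) = (1-θ)/(1+θ) assumed.

```python
def f(theta):
    return (1 - theta) / (1 + theta)

def goodness(cross_links, n_i, n_j, theta):
    """Cross-cluster link count, normalized by the expected number of
    cross links for clusters of sizes n_i and n_j."""
    e = 1 + 2 * f(theta)
    return cross_links / ((n_i + n_j) ** e - n_i ** e - n_j ** e)

# Two single-point clusters sharing 3 links, with theta = 0.3
print(goodness(3, 1, 1, 0.3))  # roughly 1.35

# A 2-point cluster and a single point with 6 cross links
print(goodness(6, 2, 1, 0.3))  # roughly 1.31
```

The denominator grows with cluster size, so a large cluster does not win a merge simply because it has more points to link to.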

ROCK Clustering algorithm

Input:
• A set S of data points
• The number k of clusters to be found
• The similarity threshold θ

Output: groups of clustered data

The ROCK algorithm is divided into three major parts:
1. Draw a random sample from the data set
2. Perform a hierarchical agglomerative clustering algorithm
3. Label data on disk

In our case, the data set is not very large, so we will consider the whole data set in the process of forming clusters, i.e. we skip steps 1 and 3.

ROCK Clustering algorithm

1. Draw a random sample from the data set:
• Sampling is used to ensure scalability to very large data sets.
• The initial sample is used to form clusters; the remaining data on disk is then assigned to these clusters.

In our case, we will consider the whole data set in the process of forming clusters.

ROCK Clustering algorithm

2. Perform a hierarchical agglomerative clustering algorithm:

ROCK performs the following steps, which are common to all hierarchical agglomerative clustering algorithms, but with a different definition of the similarity measure:

a. Place each single data point into a separate cluster.
b. Compute the similarity measure for all pairs of clusters.
c. Merge the two clusters with the highest similarity (goodness measure).
d. Verify a stop condition. If it is not met, go to step b.

3. Label data on disk:

Finally, the remaining data points on disk are assigned to the generated clusters.

This is done by selecting a random sample Li from each cluster Ci; each point p is then assigned to the cluster for which it has the strongest linkage with Li.

As noted, we will consider the whole data set in the process of forming clusters.

ROCK Clustering algorithm

Computation of links:

1. Using the similarity threshold θ, we can convert the similarity matrix into an adjacency matrix (A).

2. Then we obtain a matrix indicating the number of links by calculating (A x A), i.e. by multiplying the adjacency matrix A by itself.
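The two steps can be sketched with plain nested lists. The 3 x 3 similarity matrix and the threshold 0.4 are hypothetical values, chosen so that the resulting adjacency matrix matches a chain of three points in which only the outer two are not neighbors.

```python
def adjacency(sim, theta):
    """1 where the similarity is at least theta, otherwise 0."""
    return [[1 if s >= theta else 0 for s in row] for row in sim]

def matmul(a, b):
    """Plain matrix product of two square lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical similarity matrix for three points p1, p2, p3
sim = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.6],
    [0.1, 0.6, 1.0],
]

A = adjacency(sim, 0.4)
links = matmul(A, A)

for row in links:
    print(row)
```

Entry (i, j) of the product counts the points that are neighbors of both i and j, which is exactly the link count; here the entry for p1 and p3 is 1.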

ROCK Example

Suppose we have four verses that contain some subjects, as follows:
P1 = {judgment, faith, prayer, fair}
P2 = {fasting, faith, prayer}
P3 = {fair, fasting, faith}
P4 = {fasting, prayer, pilgrimage}

The similarity threshold is θ = 0.3, and the number of required clusters is 2.

Using the Jaccard coefficient as a similarity measure, we obtain the following similarity table:

ROCK Example

Since we have a similarity threshold equal to 0.3, then we derive the adjacency table:

By multiplying the adjacency table with itself, we derive the following table which shows the number of links (or common neighbors) :

ROCK Example

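We compute the goodness measure for all adjacent points, assuming that f(θ) = (1-θ)/(1+θ):

$$g(P_i, P_j) = \frac{\mathrm{link}[P_i, P_j]}{(n + m)^{1 + 2f(\theta)} - n^{1 + 2f(\theta)} - m^{1 + 2f(\theta)}}$$

where n and m are the sizes of the two clusters being compared (both equal to 1 at this stage).

We obtain the following table:

We have an equal goodness measure for merging (P1,P2), (P2,P3) and (P3,P1).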

ROCK Example

Now, we start the hierarchical algorithm by merging, say, P1 and P2.

A new cluster (let's call it C(P1,P2)) is formed.

It should be noted that some other hierarchical clustering techniques would not start the clustering process by merging P1 and P2, since sim(P1,P2) = 0.4, which is not the highest. ROCK, however, uses the number of links as the similarity measure rather than distance.

ROCK Example

Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors for these clusters:

Then we can obtain the following goodness measures for all adjacent clusters:

ROCK Example

Since the number of required clusters is 2, then we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1,P2,P3} leaving P4 alone in a separate cluster.
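The whole example can be replayed with a compact sketch of the agglomerative step only (no sampling and no labeling). It assumes the Jaccard coefficient, f(θ) = (1-θ)/(1+θ), and a tie-breaking rule that keeps the first best pair found; the function name `rock` and that tie-breaking rule are choices made here, not part of the slides.

```python
from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)  # assumes non-empty subject sets

def f(theta):
    return (1 - theta) / (1 + theta)

def rock(points, k, theta):
    """Merge the pair of clusters with the highest goodness until k remain."""
    names = list(points)
    # Neighbor sets; every point is its own neighbor because sim(p, p) = 1.
    nbrs = {a: {b for b in names if jaccard(points[a], points[b]) >= theta}
            for a in names}
    # link(a, b) = number of common neighbors of a and b.
    link = {frozenset((a, b)): len(nbrs[a] & nbrs[b])
            for a, b in combinations(names, 2)}
    e = 1 + 2 * f(theta)
    clusters = [frozenset([n]) for n in names]
    merges = []
    while len(clusters) > k:
        best, best_g = None, 0.0
        for ci, cj in combinations(clusters, 2):
            cross = sum(link[frozenset((a, b))] for a in ci for b in cj)
            g = cross / ((len(ci) + len(cj)) ** e - len(ci) ** e - len(cj) ** e)
            if g > best_g:  # strict comparison keeps the first of any tied pairs
                best, best_g = (ci, cj), g
        if best is None:  # no remaining pair shares a link: stop early
            break
        ci, cj = best
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
        merges.append((sorted(ci), sorted(cj)))
    return clusters, link, merges

verses = {
    "P1": {"judgment", "faith", "prayer", "fair"},
    "P2": {"fasting", "faith", "prayer"},
    "P3": {"fair", "fasting", "faith"},
    "P4": {"fasting", "prayer", "pilgrimage"},
}

clusters, link, merges = rock(verses, k=2, theta=0.3)
for step, (left, right) in enumerate(merges, start=1):
    print("merge", step, left, "+", right)
print([sorted(c) for c in clusters])
```

Under these assumptions the sketch first merges P1 with P2 and then adds P3, leaving P4 in its own cluster, which agrees with the result described above.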

Conclusion and future work (1/3)

We aim to apply a clustering technique on the verses of the Holy Quran

We should first perform manual preprocessing for the Quran text to capture the subjects of the verses into a tabular format.

Then we can apply a clustering algorithm which clusters each set of similar verses into the same group.

Conclusion and future work (2/3)

Most traditional clustering algorithms use distance-based similarity measures, which are not appropriate for clustering our categorical datasets.

We will apply the general framework of the ROCK algorithm.

The ROCK (RObust Clustering using linKs) algorithm is an agglomerative hierarchical clustering algorithm for clustering categorical data. It introduces a new notion, the link, to measure similarity between data objects.

Conclusion and future work (3/3)

We will use the Java language to implement the ROCK clustering algorithm.

During testing, we will try to form clusters of verses belonging to a single sura, and of verses belonging to many different suras.

Insha Allah, we will achieve success in performing this mission.

Thank You for your attention

I will be glad to answer your questions