Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
A search engine with SOM-based document set representation
Map visualizations in 3D (BEATCA)
[Figure: Processing Flow Diagram – BEATCA. Components: Internet and DB registry feeding a Spider (downloading) into the HT-Base; Indexing + Optimizing producing the VEC-Base; Mapping/Clustering of documents producing the MAP-Base and DocGR-Base; Clustering of cells producing the CellGR-Base; and the Search Engine operating over these bases.]
The preparation of documents is done by an indexer, which turns each document into a vector-space model representation
The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
The 'best' map (with respect to some similarity measure) is used by the query processor in response to the user's query
Document model in search engines
In the so-called vector model, a document is considered a vector in the space spanned by the words it contains.
[Figure: documents as vectors in a space with term axes dog, food, walk. Example documents: "My dog likes this food" and "When walking, I take some food".]
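As an illustration only (this sketch is not from the original slides), the vector-space representation of the two example documents, assuming a naive whitespace tokenizer and raw term counts:

```python
from collections import Counter

def to_vector(doc, vocabulary):
    """Represent a document as raw term counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["dog", "food", "walk"]
docs = ["My dog likes this food", "When walking, I take some food"]

for d in docs:
    print(d, "->", to_vector(d, vocabulary))
# Note: "walking," does not match "walk" here; a real indexer would
# strip punctuation and stem terms before counting.
```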
Clustering document vectors
Document space 2D map
[Figure: an m×r map grid; a strong change of position is shown by a thick arrow.]
An important difference from general clustering: not only must the documents within a cluster be similar, but neighboring clusters must also be similar.
Our problem
- Instability
- Pre-defined major themes needed
Our approach
- Find a coarse clustering into a few themes
Bayesian Networks in Document Clustering
SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
Latent Semantic Indexing based methods appear promising for this purpose.
One of them, PLSA, has been investigated empirically.
A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
A Bayesian Network
[Figure: example Bayesian network over the nodes dog, owner, food, meat, walk, and Chappi® (a dog-food brand).]
Represents a joint probability distribution as a product of conditional probabilities of children given their parents in a directed acyclic graph
- High compression
- Simplification of reasoning
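To make the factorization concrete, a minimal sketch with a toy DAG and made-up probabilities (neither the structure nor the numbers are from the slides):

```python
# Toy binary network: owner -> dog, dog -> food, dog -> walk.
# All structure and numbers below are invented for illustration.
p_owner = {True: 0.3, False: 0.7}
p_dog_given_owner = {True: {True: 0.9, False: 0.1},
                     False: {True: 0.2, False: 0.8}}
p_food_given_dog = {True: {True: 0.8, False: 0.2},
                    False: {True: 0.3, False: 0.7}}
p_walk_given_dog = {True: {True: 0.7, False: 0.3},
                    False: {True: 0.4, False: 0.6}}

def joint(owner, dog, food, walk):
    """P(owner, dog, food, walk) as a product of child-given-parent factors."""
    return (p_owner[owner]
            * p_dog_given_owner[owner][dog]
            * p_food_given_dog[dog][food]
            * p_walk_given_dog[dog][walk])

# Compression: the full joint over 4 binary variables needs 15 free
# parameters; this factorization needs only 1 + 2 + 2 + 2 = 7.
print(joint(True, True, True, False))  # 0.3 * 0.9 * 0.8 * 0.3 = 0.0648
```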
BN application in text processing
- Document classification
- Document clustering
- Query expansion
Hidden variable approaches
- PLSA (Probabilistic Latent Semantic Analysis)
- PHITS (Probabilistic Hyperlink Analysis)
- Combined PLSA/PHITS
These approaches assume a hidden variable expressing the topic of the document. The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
PLSA – concept
Let N be the term-document matrix of word counts, i.e., Nij denotes how often a term ti (single word or phrase) occurs in document dj.
PLSA performs a probabilistic decomposition into factors zk (1 ≤ k ≤ K):
P(ti | dj) = Σk P(ti | zk) P(zk | dj),
with non-negative probabilities and two sets of normalization constraints:
Σi P(ti | zk) = 1 for all k, and Σk P(zk | dj) = 1 for all j.
[Figure: PLSA as a Bayesian network D → Z → T1, T2, ..., Tn; Z is the hidden variable.]
PLSA – concept (continued)
PLSA aims at maximizing the log-likelihood
L := Σi,j Nij log Σk P(ti | zk) P(zk | dj).
The factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
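A compact numpy sketch of the standard PLSA EM updates for the formulas above (illustrative only, not the authors' BEATCA implementation):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """EM for PLSA on a (terms x docs) count matrix N.

    Returns P(t|z) as (terms x K) and P(z|d) as (K x docs).
    Minimal sketch: no smoothing, tempering, or convergence test.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = N.shape
    p_t_z = rng.random((n_terms, K)); p_t_z /= p_t_z.sum(0, keepdims=True)
    p_z_d = rng.random((K, n_docs));  p_z_d /= p_z_d.sum(0, keepdims=True)
    for _ in range(iters):
        # E-step folded into the M-step: P(z|t,d) is proportional
        # to P(t|z) P(z|d), with P(t|d) = sum_k P(t|z_k) P(z_k|d).
        R = N / np.maximum(p_t_z @ p_z_d, 1e-12)   # N_td / P(t|d)
        new_t_z = p_t_z * (R @ p_z_d.T)            # sum_d N_td P(z|t,d)
        new_z_d = p_z_d * (p_t_z.T @ R)            # sum_t N_td P(z|t,d)
        p_t_z = new_t_z / new_t_z.sum(0, keepdims=True)
        p_z_d = new_z_d / new_z_d.sum(0, keepdims=True)
    return p_t_z, p_z_d

# Documents can then be clustered by their dominant factor:
# clusters = plsa_em(N, K=5)[1].argmax(axis=0)
```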
Different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
EM algorithm – step 0
Data (Z unknown):
D  Z  T1  T2  ...  Tn
1  ?  1   0   ...  1
2  ?  0   0   ...  1
3  ?  1   1   ...  1
4  ?  0   1   ...  1
5  ?  1   0   ...  0
...
Z randomly initialized:
D  Z  T1  T2  ...  Tn
1  1  1   0   ...  1
2  2  0   0   ...  1
3  1  1   1   ...  1
4  1  0   1   ...  1
5  2  1   0   ...  0
...
EM algorithm – step 1
A Bayesian network D → Z → T1, T2, ..., Tn is trained on the data with the current Z assignments (table as in step 0).
[Figure: the trained network; Z is the hidden variable.]
EM algorithm – step 2
Z is sampled from the trained BN for each record, according to the probability distribution
P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...
Data (Z resampled):
D  Z  T1  T2  ...  Tn
1  2  1   0   ...  1
2  2  0   0   ...  1
3  1  1   1   ...  1
4  2  0   1   ...  1
5  1  1   0   ...  0
...
[Figure: the network D → Z → T1, T2, ..., Tn; Z is the hidden variable.]
GOTO step 1 until convergence (Z assignment "stable").
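A sketch of this sampling loop (simplified: the document node D is dropped, so the model is a naive-Bayes network Z → T1, ..., Tn over binary terms; the Laplace smoothing and fixed iteration count are assumptions, not from the slides):

```python
import numpy as np

def stochastic_em(T, K, iters=30, alpha=1.0, seed=0):
    """Stochastic EM for a hidden-topic naive Bayes model.

    T: (n_docs x n_terms) binary term-occurrence matrix.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = T.shape
    z = rng.integers(K, size=n_docs)          # step 0: Z randomly initialized
    for _ in range(iters):
        # Step 1: "train the BN" = estimate P(Z) and P(Tj=1 | Z)
        # from the current assignments (alpha = Laplace smoothing).
        p_z = np.array([(z == k).sum() + alpha for k in range(K)])
        p_z /= p_z.sum()
        p_t_z = np.vstack([
            (T[z == k].sum(axis=0) + alpha) / ((z == k).sum() + 2 * alpha)
            for k in range(K)])               # (K x n_terms)
        # Step 2: posterior P(Z=k | t1..tn) for every document, in log space.
        log_post = (np.log(p_z)[:, None]
                    + np.log(p_t_z) @ T.T
                    + np.log(1 - p_t_z) @ (1 - T.T))
        post = np.exp(log_post - log_post.max(axis=0))
        post /= post.sum(axis=0)
        # Sample Z from the posterior; the "sharp" variant discussed
        # below would use post.argmax(axis=0) instead.
        z = np.array([rng.choice(K, p=post[:, d]) for d in range(n_docs)])
    return z
```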
The problem
- Too high a number of adjustable variables
- Pre-defined clusters not identified
- Long computation times
- Instability
Solution
Our suggestion: use the "sharp version" of Naive Bayes – each document is assigned to the "most probable class".
We were successful:
- Up to five classes well clustered
- High speed (with 20,000 documents)
Next step
Naive Bayes assumes that documents and terms are independent.
What if they are in fact dependent?
Our solution: the TAN approach –
- First we create a Bayesian network of terms/documents
- Then we assume there is a hidden variable
Promising results; a deeper study is needed.
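A hedged sketch of the term-network construction: building a Chow-Liu-style maximum-spanning tree over term-term mutual information is our reading of the TAN approach, and all names and details below are assumptions rather than the authors' code:

```python
import numpy as np

def mutual_information(x, y):
    """MI between two binary vectors, estimated from counts."""
    n, mi = len(x), 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = ((x == a) & (y == b)).sum() / n
            if p_ab > 0:
                p_a, p_b = (x == a).sum() / n, (y == b).sum() / n
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def term_tree(T):
    """Maximum-spanning tree over term-term MI (Prim's algorithm).

    T: (n_docs x n_terms) binary matrix. Returns directed tree edges
    (parent, child); in the TAN model the hidden variable Z would
    additionally be a parent of every term node.
    """
    n_terms = T.shape[1]
    mi = np.zeros((n_terms, n_terms))
    for i in range(n_terms):
        for j in range(i + 1, n_terms):
            mi[i, j] = mi[j, i] = mutual_information(T[:, i], T[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n_terms:
        i, j = max(((i, j) for i in in_tree for j in range(n_terms)
                    if j not in in_tree), key=lambda e: mi[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```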
PLSA – a model with term TAN
[Figure: documents D1, D2, ..., Dk and the hidden variable Z; term nodes T1–T6 are additionally connected by tree (TAN) edges.]
PLSA – a model with document TAN
T1
Z
Hidden variable
T2
Ti
D6D5
D4
D1
D2D3