Bayesian Networks in Document Clustering
Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak
Institute of Computer Science, Polish Academy of Sciences, Warsaw
Research partially supported by the KBN research project 4 T11C 026 25 "Maps and intelligent navigation in WWW using Bayesian networks and artificial immune systems"
A search engine with SOM-based document set representation
Map visualizations in 3D (BEATCA)
[Figure: Processing Flow Diagram – BEATCA. Components: Internet and DB registry feeding a Spider (downloading) into the HT-Base; Indexing + Optimizing producing the VEC-Base; Mapping/Clustering of documents producing the MAP-Base and DocGR-Base; Clustering of cells producing the CellGR-Base; and the Search Engine operating over these bases.]
The preparation of documents is done by an indexer, which turns each document into a vector-space model representation
The indexer also identifies frequent phrases in the document set for clustering and labelling purposes
Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded
The map creator is applied, turning the vector-space representation into a form appropriate for on-the-fly map generation
The 'best' map (with respect to some similarity measure) is used by the query processor in response to the user's query
Document model in search engines
In the so-called vector model, a document is considered a vector in the space spanned by the words it contains.
[Figure: documents as vectors in a space with term axes dog, food, walk. Example documents: "My dog likes this food" and "When walking, I take some food".]
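As an illustration only (this sketch is not from the original slides), the vector-space representation of the two example documents, assuming a naive whitespace tokenizer and raw term counts:

```python
from collections import Counter

def to_vector(doc, vocabulary):
    """Represent a document as raw term counts over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

vocabulary = ["dog", "food", "walk"]
docs = ["My dog likes this food", "When walking, I take some food"]

for d in docs:
    print(d, "->", to_vector(d, vocabulary))
# Note: "walking," does not match "walk" here; a real indexer would
# strip punctuation and stem terms before counting.
```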
Clustering document vectors
Document space 2D map
[Figure: an m×r map grid; a strong change of position is shown by a thick arrow.]
An important difference from general clustering: not only must the documents within a cluster be similar, but neighboring clusters must also be similar.
Our problem
- Instability
- Pre-defined major themes needed
Our approach
- Find a coarse clustering into a few themes
Bayesian Networks in Document Clustering
SOM document-map based search engines require initial document clustering in order to present results in a meaningful way.
Latent Semantic Indexing based methods appear promising for this purpose.
One of them, PLSA, has been investigated empirically.
A modification of the original algorithm is proposed, and an extension via TAN-like Bayesian networks is suggested.
A Bayesian Network
[Figure: example Bayesian network over the nodes dog, owner, food, meat, walk, and Chappi® (a dog-food brand).]
Represents a joint probability distribution as a product of conditional probabilities of children given their parents in a directed acyclic graph
- High compression
- Simplification of reasoning
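To make the factorization concrete, a minimal sketch with a toy DAG and made-up probabilities (neither the structure nor the numbers are from the slides):

```python
# Toy binary network: owner -> dog, dog -> food, dog -> walk.
# All structure and numbers below are invented for illustration.
p_owner = {True: 0.3, False: 0.7}
p_dog_given_owner = {True: {True: 0.9, False: 0.1},
                     False: {True: 0.2, False: 0.8}}
p_food_given_dog = {True: {True: 0.8, False: 0.2},
                    False: {True: 0.3, False: 0.7}}
p_walk_given_dog = {True: {True: 0.7, False: 0.3},
                    False: {True: 0.4, False: 0.6}}

def joint(owner, dog, food, walk):
    """P(owner, dog, food, walk) as a product of child-given-parent factors."""
    return (p_owner[owner]
            * p_dog_given_owner[owner][dog]
            * p_food_given_dog[dog][food]
            * p_walk_given_dog[dog][walk])

# Compression: the full joint over 4 binary variables needs 15 free
# parameters; this factorization needs only 1 + 2 + 2 + 2 = 7.
print(joint(True, True, True, False))  # 0.3 * 0.9 * 0.8 * 0.3 = 0.0648
```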
BN application in text processing
- Document classification
- Document clustering
- Query expansion
Hidden variable approaches
- PLSA (Probabilistic Latent Semantic Analysis)
- PHITS (Probabilistic Hyperlink Analysis)
- Combined PLSA/PHITS
These approaches assume a hidden variable expressing the topic of the document. The topic probabilistically influences the appearance of the document (links in PHITS, terms in PLSA).
PLSA – concept
Let N be the term-document matrix of word counts, i.e., Nij denotes how often a term ti (single word or phrase) occurs in document dj.
PLSA performs a probabilistic decomposition into factors zk (1 ≤ k ≤ K):
P(ti | dj) = Σk P(ti | zk) P(zk | dj),
with non-negative probabilities and two sets of normalization constraints:
Σi P(ti | zk) = 1 for all k, and Σk P(zk | dj) = 1 for all j.
[Figure: PLSA as a Bayesian network D → Z → T1, T2, ..., Tn; Z is the hidden variable.]
PLSA – concept (continued)
PLSA aims at maximizing the log-likelihood
L := Σi,j Nij log Σk P(ti | zk) P(zk | dj).
The factors zk can be interpreted as states of a latent mixing variable associated with each observation (i.e., word occurrence).
The Expectation-Maximization (EM) algorithm can be applied to find a local maximum of L.
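A compact numpy sketch of the standard PLSA EM updates for the formulas above (illustrative only, not the authors' BEATCA implementation):

```python
import numpy as np

def plsa_em(N, K, iters=50, seed=0):
    """EM for PLSA on a (terms x docs) count matrix N.

    Returns P(t|z) as (terms x K) and P(z|d) as (K x docs).
    Minimal sketch: no smoothing, tempering, or convergence test.
    """
    rng = np.random.default_rng(seed)
    n_terms, n_docs = N.shape
    p_t_z = rng.random((n_terms, K)); p_t_z /= p_t_z.sum(0, keepdims=True)
    p_z_d = rng.random((K, n_docs));  p_z_d /= p_z_d.sum(0, keepdims=True)
    for _ in range(iters):
        # E-step folded into the M-step: P(z|t,d) is proportional
        # to P(t|z) P(z|d), with P(t|d) = sum_k P(t|z_k) P(z_k|d).
        R = N / np.maximum(p_t_z @ p_z_d, 1e-12)   # N_td / P(t|d)
        new_t_z = p_t_z * (R @ p_z_d.T)            # sum_d N_td P(z|t,d)
        new_z_d = p_z_d * (p_t_z.T @ R)            # sum_t N_td P(z|t,d)
        p_t_z = new_t_z / new_t_z.sum(0, keepdims=True)
        p_z_d = new_z_d / new_z_d.sum(0, keepdims=True)
    return p_t_z, p_z_d

# Documents can then be clustered by their dominant factor:
# clusters = plsa_em(N, K=5)[1].argmax(axis=0)
```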
Different factors usually capture distinct "topics" of a document collection; by clustering documents according to their dominant factors, useful topic-specific document clusters often emerge.
EM algorithm – step 0
Data (Z unknown):
D  Z  T1  T2  ...  Tn
1  ?  1   0   ...  1
2  ?  0   0   ...  1
3  ?  1   1   ...  1
4  ?  0   1   ...  1
5  ?  1   0   ...  0
...
Z randomly initialized:
D  Z  T1  T2  ...  Tn
1  1  1   0   ...  1
2  2  0   0   ...  1
3  1  1   1   ...  1
4  1  0   1   ...  1
5  2  1   0   ...  0
...
EM algorithm – step 1
A Bayesian network D → Z → T1, T2, ..., Tn is trained on the data with the current Z assignments (table as in step 0).
[Figure: the trained network; Z is the hidden variable.]
EM algorithm – step 2
Z is sampled from the trained BN for each record, according to the probability distribution
P(Z=1 | D=d, T1=t1, ..., Tn=tn), P(Z=2 | D=d, T1=t1, ..., Tn=tn), ...
Data (Z resampled):
D  Z  T1  T2  ...  Tn
1  2  1   0   ...  1
2  2  0   0   ...  1
3  1  1   1   ...  1
4  2  0   1   ...  1
5  1  1   0   ...  0
...
[Figure: the network D → Z → T1, T2, ..., Tn; Z is the hidden variable.]
GOTO step 1 until convergence (Z assignment "stable").
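A sketch of this sampling loop (simplified: the document node D is dropped, so the model is a naive-Bayes network Z → T1, ..., Tn over binary terms; the Laplace smoothing and fixed iteration count are assumptions, not from the slides):

```python
import numpy as np

def stochastic_em(T, K, iters=30, alpha=1.0, seed=0):
    """Stochastic EM for a hidden-topic naive Bayes model.

    T: (n_docs x n_terms) binary term-occurrence matrix.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_terms = T.shape
    z = rng.integers(K, size=n_docs)          # step 0: Z randomly initialized
    for _ in range(iters):
        # Step 1: "train the BN" = estimate P(Z) and P(Tj=1 | Z)
        # from the current assignments (alpha = Laplace smoothing).
        p_z = np.array([(z == k).sum() + alpha for k in range(K)])
        p_z /= p_z.sum()
        p_t_z = np.vstack([
            (T[z == k].sum(axis=0) + alpha) / ((z == k).sum() + 2 * alpha)
            for k in range(K)])               # (K x n_terms)
        # Step 2: posterior P(Z=k | t1..tn) for every document, in log space.
        log_post = (np.log(p_z)[:, None]
                    + np.log(p_t_z) @ T.T
                    + np.log(1 - p_t_z) @ (1 - T.T))
        post = np.exp(log_post - log_post.max(axis=0))
        post /= post.sum(axis=0)
        # Sample Z from the posterior; the "sharp" variant discussed
        # below would use post.argmax(axis=0) instead.
        z = np.array([rng.choice(K, p=post[:, d]) for d in range(n_docs)])
    return z
```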
The problem
- Too high a number of adjustable variables
- Pre-defined clusters not identified
- Long computation times
- Instability
Solution
Our suggestion: use the "sharp version" of Naive Bayes – each document is assigned to the "most probable class".
We were successful:
- Up to five classes well clustered
- High speed (with 20,000 documents)
Next step
Naive Bayes assumes that documents and terms are independent.
What if they are in fact dependent?
Our solution: the TAN approach –
- First we create a Bayesian network of terms/documents
- Then we assume there is a hidden variable
Promising results; a deeper study is needed.
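A hedged sketch of the term-network construction: building a Chow-Liu-style maximum-spanning tree over term-term mutual information is our reading of the TAN approach, and all names and details below are assumptions rather than the authors' code:

```python
import numpy as np

def mutual_information(x, y):
    """MI between two binary vectors, estimated from counts."""
    n, mi = len(x), 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = ((x == a) & (y == b)).sum() / n
            if p_ab > 0:
                p_a, p_b = (x == a).sum() / n, (y == b).sum() / n
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def term_tree(T):
    """Maximum-spanning tree over term-term MI (Prim's algorithm).

    T: (n_docs x n_terms) binary matrix. Returns directed tree edges
    (parent, child); in the TAN model the hidden variable Z would
    additionally be a parent of every term node.
    """
    n_terms = T.shape[1]
    mi = np.zeros((n_terms, n_terms))
    for i in range(n_terms):
        for j in range(i + 1, n_terms):
            mi[i, j] = mi[j, i] = mutual_information(T[:, i], T[:, j])
    in_tree, edges = {0}, []
    while len(in_tree) < n_terms:
        i, j = max(((i, j) for i in in_tree for j in range(n_terms)
                    if j not in in_tree), key=lambda e: mi[e])
        edges.append((i, j))
        in_tree.add(j)
    return edges
```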
PLSA – a model with term TAN
[Figure: documents D1, D2, ..., Dk and the hidden variable Z; term nodes T1–T6 are additionally connected by tree (TAN) edges.]
PLSA – a model with document TAN
T1
Z
Hidden variable
T2
Ti
D6D5
D4
D1
D2D3