![Page 1: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/1.jpg)
Self Organization of a Massive Document Collection
Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang
Author : Teuvo Kohonen et al.
![Page 2: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/2.jpg)
Outline
- Motivation
- Objective
- Introduction
- Self-Organizing Map
- Statistical Models of Documents
- Rapid Construction of Large Document Maps
- The Document Map of All Electronic Patent Abstracts
- Conclusion
- Personal opinion
![Page 3: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/3.jpg)
Motivation
To improve the WEBSOM and to organize vast document collections according to textual similarities.
![Page 4: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/4.jpg)
Objective
The main goal has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data.
![Page 5: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/5.jpg)
Introduction
- From Simple Searches to Browsing of Self-Organized Data Collections.
- Scope of This Work.
- WEBSOM
- Dimensionality:
  - Latent semantic indexing (LSI).
  - Clustering of words into semantic categories.
  - By a random projection method.
![Page 6: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/6.jpg)
Self Organizing Map
The original SOM algorithm.
$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)] \tag{1}$$

$$c(x) = \arg\min_i \{\|x - m_i\|\} \tag{2}$$

$$h_{c(x),i}(t) = \alpha(t)\,\exp\!\left(-\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)}\right) \tag{3}$$
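As an illustration of how equations (1) to (3) fit together, here is a minimal sketch of one online SOM update in plain Python. The function names, the list-based data layout, and passing `alpha` and `sigma` as plain numbers (rather than as decreasing functions of `t`) are assumptions made for this example, not details taken from the slides.

```python
import math

def best_matching_unit(x, models):
    """Eq. (2): index of the model vector closest to x (Euclidean distance)."""
    return min(range(len(models)), key=lambda i: math.dist(x, models[i]))

def neighborhood(r_i, r_c, alpha, sigma):
    """Eq. (3): Gaussian neighborhood between grid positions r_i and r_c."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_i, r_c))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def som_step(x, models, grid, alpha, sigma):
    """Eq. (1): move every model toward x, weighted by the neighborhood.

    models: list of model vectors m_i (updated in place)
    grid:   list of grid coordinates r_i, one per model (hypothetical layout)
    Returns the winner index c(x).
    """
    c = best_matching_unit(x, models)
    for i, m in enumerate(models):
        h = neighborhood(grid[i], grid[c], alpha, sigma)
        models[i] = [mk + h * (xk - mk) for mk, xk in zip(m, x)]
    return c
```

In a full training loop, `alpha` and `sigma` would typically shrink as `t` grows, so that late updates only adjust the winner's close neighbors.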
![Page 7: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/7.jpg)
Self Organizing Map
Batch-map SOM: to accelerate the computation of the SOM.

$$\forall i,\quad \mathrm{E}_t\{h_{c(x),i}(t)\,[x(t) - m_i^*(t)]\} = 0 \tag{4}$$

$$m_i^* = \frac{\sum_t h_{c(x),i}\,x(t)}{\sum_t h_{c(x),i}} \tag{5}$$
![Page 8: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/8.jpg)
Self Organizing Map
Let $V_i$ be the set of all $x(t)$ that have $m_i^*$ as their closest model; it is called the Voronoi set. The number of samples $x(t)$ falling into $V_i$ is called $n_i$.

- Step 1) Initialize the $m_i^*$ by any proper method.
- Step 2) Vector quantization:

$$\bar{x}_i = \frac{\sum_{x(t) \in V_i} x(t)}{n_i} \tag{6}$$

- Step 3) Smoothing:

$$m_i^* = \frac{\sum_j n_j h_{ji}\,\bar{x}_j}{\sum_j n_j h_{ji}} \tag{7}$$
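The three steps can be sketched as one batch-map cycle in plain Python. This is only an illustration under stated assumptions: the neighborhood $h_{ji}$ is taken to be a Gaussian on the grid coordinates, empty Voronoi sets are skipped, and the names `batch_map_iteration`, `grid`, and `sigma` are hypothetical.

```python
import math

def batch_map_iteration(data, models, grid, sigma):
    """One batch-map cycle: VQ step (6) followed by smoothing step (7)."""
    dim = len(data[0])
    n = [0] * len(models)                     # n_i: size of Voronoi set V_i
    sums = [[0.0] * dim for _ in models]      # sum of the x(t) falling into V_i
    for x in data:
        c = min(range(len(models)), key=lambda i: math.dist(x, models[i]))
        n[c] += 1
        sums[c] = [s + xk for s, xk in zip(sums[c], x)]
    # Step 2: mean of each non-empty Voronoi set, eq. (6)
    means = [[s / n[j] for s in sums[j]] if n[j] else None
             for j in range(len(models))]
    # Step 3: neighborhood-weighted average of the means, eq. (7)
    new_models = []
    for i in range(len(models)):
        num, den = [0.0] * dim, 0.0
        for j in range(len(models)):
            if n[j] == 0:
                continue
            d2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[j]))
            w = n[j] * math.exp(-d2 / (2.0 * sigma ** 2))  # n_j * h_ji (assumed Gaussian)
            num = [a + w * b for a, b in zip(num, means[j])]
            den += w
        new_models.append([a / den for a in num] if den > 0 else list(models[i]))
    return new_models
```

Because every sample is assigned to a winner before any model changes, the winner search over the data is independent per sample, which is what makes this form convenient to accelerate.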
![Page 9: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/9.jpg)
Statistical Models of Documents
The histograms were formed over word clusters using self-organizing semantic maps. This system was called the WEBSOM.
The overview of the WEBSOM2 system.
![Page 10: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/10.jpg)
![Page 11: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/11.jpg)
Statistical Models of Documents
A. The Primitive Vector-Space Model
- Inverse document frequency (IDF).
- Shannon entropy.

B. Latent Semantic Indexing (LSI)
- Singular-value decomposition (SVD).
![Page 12: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/12.jpg)
Statistical Models of Documents
C. Randomly Projected Histograms
- Original document vector: $n_i \in \Re^n$
- Rectangular random matrix: $R$
- Projections: $x_i \in \Re^m$

$$x_i = R\,n_i \tag{8}$$

D. Histograms on the Word Category Map
- The original version of the WEBSOM.
- The new method is random projection of the word histograms.
![Page 13: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/13.jpg)
Statistical Models of Documents
E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments
- 13742 patents from the whole corpus of 6840568 abstracts.
- Equal number of patents from each of the 21 subsections.
- 1814 words or word forms.
- With full 1344-D histograms as document vectors.
![Page 14: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/14.jpg)
Statistical Models of Documents
F. Construction of Random Projections of Word Histograms by Pointers
- Thresholding (+1 or -1).
- Sparse matrices (1 and 0).
![Page 15: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/15.jpg)
Statistical Models of Documents
- Hash table and pointer.
- The computing time was about 20% of that of the usual matrix-product method.
- The computational complexity of the random projection with pointers is only $O(Nl) + O(n)$.
- In contrast, the complexity of LSI is $O(Nld)$.
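The pointer idea can be illustrated with a short Python sketch. It assumes that each word is hashed (here via a `dict`) to a fixed number of randomly chosen positions in the projected vector, and that each occurrence of the word adds its weight at those positions, which corresponds to multiplying by a sparse 0/1 random matrix without ever forming it. The function names and the optional `weights` argument are hypothetical.

```python
import random

def make_pointers(vocabulary, out_dim, pointers_per_word, seed=0):
    """Assign each word a fixed set of random positions in the projected vector."""
    rng = random.Random(seed)
    return {w: rng.sample(range(out_dim), pointers_per_word) for w in vocabulary}

def project(document_words, pointers, out_dim, weights=None):
    """Build the projected document vector by following the pointers.

    The cost is proportional to the number of word occurrences, not to the
    size of the vocabulary. Words without pointers are ignored.
    """
    x = [0.0] * out_dim
    for w in document_words:
        if w not in pointers:
            continue
        wt = 1.0 if weights is None else weights.get(w, 1.0)
        for k in pointers[w]:
            x[k] += wt
    return x
```

For example, `project(["map", "map", "patent"], make_pointers(["map", "patent"], 500, 5), 500)` would yield a 500-dimensional vector built from 15 additions.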
![Page 16: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/16.jpg)
Rapid Construction of Large Document Maps
A. Fast Distance Computation
- Tabulate the indexes of the nonzero components of each input vector.
- Euclidean distances between sparse vectors.
- We must use low-dimensional models.
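One way to exploit the tabulated nonzero indexes is the standard expansion $\|x - m\|^2 = \|x\|^2 - 2\,x^T m + \|m\|^2$, in which only the nonzero components of $x$ enter the inner product. The sketch below assumes this expansion and a precomputed $\|m\|^2$ per model vector; whether the authors implemented it exactly this way is not stated on the slide, and the function name is hypothetical.

```python
def sparse_sq_distance(x_sparse, m, m_sq_norm):
    """Squared Euclidean distance between a sparse input and a dense model.

    x_sparse:  list of (index, value) pairs for the nonzero components of x
    m:         dense model vector
    m_sq_norm: precomputed squared norm of m
    """
    x_sq = sum(v * v for _, v in x_sparse)
    dot = sum(v * m[k] for k, v in x_sparse)   # touches only nonzero entries of x
    return x_sq - 2.0 * dot + m_sq_norm
```

The per-pair cost then depends on the number of nonzero components of the input rather than on the full dimensionality.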
![Page 17: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/17.jpg)
Rapid Construction of Large Document Maps
B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones
- Increasing the number of nodes of the SOM during its construction.
- The new idea is to estimate good initial values for the model vectors of a very large map on the basis of asymptotic values of the model vectors of a much smaller map.
![Page 18: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/18.jpg)
Rapid Construction of Large Document Maps
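$$m_h'^{(d)} = \alpha_h\,m_i'^{(s)} + \beta_h\,m_j'^{(s)} + (1 - \alpha_h - \beta_h)\,m_k'^{(s)} \tag{9}$$

$$\hat{m}_h^{(d)} = \alpha_h\,m_i^{(s)} + \beta_h\,m_j^{(s)} + (1 - \alpha_h - \beta_h)\,m_k^{(s)} \tag{10}$$

Here the superscript $(d)$ denotes the dense (large) map and $(s)$ the sparse (small) map.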
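A small sketch may help show how formulas (9) and (10) could be used together. It assumes that (9) expresses the lattice position of a dense-map node $h$ as a weighted combination of three sparse-map nodes $i$, $j$, $k$ with weights $\alpha_h$, $\beta_h$, and $1 - \alpha_h - \beta_h$, that these positions are two-dimensional, and that (10) applies the same weights to the trained sparse-map model vectors. The function names are hypothetical.

```python
def interpolation_coefficients(p_h, p_i, p_j, p_k):
    """Solve (9) for (alpha_h, beta_h), given 2-D positions (assumed layout)."""
    a1, a2 = p_i[0] - p_k[0], p_i[1] - p_k[1]
    b1, b2 = p_j[0] - p_k[0], p_j[1] - p_k[1]
    c1, c2 = p_h[0] - p_k[0], p_h[1] - p_k[1]
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("the three sparse nodes must not be collinear")
    alpha = (c1 * b2 - c2 * b1) / det   # Cramer's rule
    beta = (a1 * c2 - a2 * c1) / det
    return alpha, beta

def estimate_dense_model(alpha, beta, m_i, m_j, m_k):
    """Apply (10): combine three sparse-map model vectors with the same weights."""
    gamma = 1.0 - alpha - beta
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(m_i, m_j, m_k)]
```

The resulting vectors only serve as initial values for the large map, which is then refined by further training.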
![Page 19: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/19.jpg)
Rapid Construction of Large Document Maps
C. Rapid Fine-Tuning of the Large Maps
- 1) Addressing Old Winners: This idea is the same as in LAB!
- 2) Initialization of the Pointers: The size of the maps is increased stepwise during learning, using formula (10). The winner is the map unit for which the inner product with the data vector is the largest.

$$x^T m_h^{(d)} = \alpha_h\,x^T m_i^{(s)} + \beta_h\,x^T m_j^{(s)} + (1 - \alpha_h - \beta_h)\,x^T m_k^{(s)} \tag{11}$$
![Page 20: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/20.jpg)
Rapid Construction of Large Document Maps
![Page 21: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/21.jpg)
Rapid Construction of Large Document Maps
- 3) Parallelized Batch Map Algorithm: The winner search can be implemented as a parallel process.
- 4) Saving Memory by Reducing Representation Accuracy: Sufficient accuracy can be maintained during the computation.
![Page 22: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/22.jpg)
Rapid Construction of Large Document Maps
D. Performance Evaluation of the New Methods
- 1) Numerical Comparison with the Traditional SOM Algorithm:
  - Two performance indexes to measure the quality of the maps: average quantization error and classification accuracy.
  - Experiments: two sets of maps.
![Page 23: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/23.jpg)
Rapid Construction of Large Document Maps
- 2) Comparison of the Computational Complexity:
  - $O(dM^2)$ stems from the computation of the small map.
  - $O(dN)$ results from the VQ step (6) of the batch map algorithm.
  - $O(N^2)$ refers to the estimation of the pointers.
  - N: data samples; M: map units; d: dimensionality.
![Page 24: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/24.jpg)
The Document Map of All Electronic Patent Abstracts
A. Preprocessing
- We first extracted the titles and the texts for further processing.
- We removed nontextual information.
- Mathematical symbols and numbers were converted into special dummy symbols.
- The collection contained 733179 different words.
- A set of common words was removed.
- The remaining vocabulary consisted of 43222 words.
- Finally, we omitted the 122524 abstracts in which fewer than five words remained.
![Page 25: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/25.jpg)
The Document Map of All Electronic Patent Abstracts
B. Formation of Statistical Models
- The final dimensionality we selected was 500, and five random pointers were used for each word.
The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.
![Page 26: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/26.jpg)
The Document Map of All Electronic Patent Abstracts
The weight is a measure of the unevenness of the distribution of the word in the subsections.
The weights were calculated as follows: let $P_g(w)$ be the probability of a randomly chosen instance of the word $w$ occurring in subsection $g$, and $N_g$ the number of subsections.

Shannon entropy:

$$H(w) = -\sum_g P_g(w)\,\log P_g(w)$$

Weight:

$$W(w) = H_{\max} - H(w), \qquad H_{\max} = \log N_g$$
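As a worked illustration of the entropy weight, the following sketch computes $W(w)$ from per-subsection occurrence counts of a word. Estimating $P_g(w)$ as the relative frequency of the word in each subsection is an assumption of this example, and the function name is hypothetical.

```python
import math

def entropy_weight(counts_per_subsection):
    """W(w) = H_max - H(w), with P_g(w) estimated from occurrence counts.

    counts_per_subsection: occurrences of the word w in each subsection g
    (one entry per subsection, zeros included).
    """
    n_g = len(counts_per_subsection)
    total = sum(counts_per_subsection)
    if total == 0:
        raise ValueError("the word must occur at least once")
    h = -sum((c / total) * math.log(c / total)
             for c in counts_per_subsection if c > 0)
    return math.log(n_g) - h
```

A word spread evenly over all subsections gets a weight near zero, while a word confined to a single subsection gets the maximum weight $\log N_g$.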
![Page 27: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/27.jpg)
The Document Map of All Electronic Patent Abstracts
C. Formation of the Document Map
- 500-dimensional document vectors.
- The map was enlarged twice sixteenfold and once ninefold.
- Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch map iteration cycles.
![Page 28: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/28.jpg)
The Document Map of All Electronic Patent Abstracts
D. Results
- When each map node was labeled according to the majority of the subsections in the node, the resulting accuracy was 64%.
![Page 29: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/29.jpg)
![Page 30: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/30.jpg)
![Page 31: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/31.jpg)
Conclusion
In this paper the emphasis has been on the upward scalability of the methods to very large text collections.
Contributions:
- A document map larger than our previous one.
- A new method of forming statistical models of documents.
- Several new fast computing methods.
![Page 32: Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al](https://reader036.vdocuments.us/reader036/viewer/2022062719/56649eeb5503460f94bfcd1a/html5/thumbnails/32.jpg)
Personal Opinion
Put SOM into a knowledge domain, e.g. IR or …?