Self Organization of a Massive Document Collection Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang Author : Teuvo Kohonen et al.

Self Organization of a Massive Document Collection

Advisor : Dr. Hsu Graduate : Sheng-Hsuan Wang

Author : Teuvo Kohonen et al.

Outline

Motivation
Objective
Introduction
Self-Organizing Map
Statistical Models of Documents
Rapid Construction of Large Document Maps
The Document Map of All Electronic Patent Abstracts
Conclusion
Personal Opinion

Motivation

To improve the WEBSOM and to organize vast document collections according to textual similarities.

Objective

The main goal has been to scale up the SOM algorithm to be able to deal with large amounts of high-dimensional data.

Introduction

From Simple Searches to Browsing of Self-Organized Data Collections.

Scope of This Work.
  WEBSOM
  Dimensionality
    Latent semantic indexing (LSI).
    Clustering of words into semantic categories.
    By a random projection method.

Self-Organizing Map

The original SOM algorithm.

$$m_i(t+1) = m_i(t) + h_{c(x),i}(t)\,[x(t) - m_i(t)] \tag{1}$$

$$c(x) = \arg\min_i \{\|x - m_i\|\} \tag{2}$$

$$h_{c(x),i}(t) = \alpha(t)\,\exp\!\left(-\frac{\|r_i - r_{c(x)}\|^2}{2\sigma^2(t)}\right) \tag{3}$$
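
The three formulas of the original SOM algorithm, numbered (1) to (3) on this slide, can be read as one update step: find the winner c(x), compute a Gaussian neighborhood around it on the map grid, and move every model vector toward the input. The following is a minimal sketch in plain Python, not the authors' implementation. The function names, the 1-D grid, and the numbers are made up for illustration, and alpha and sigma are passed in as constants rather than as decreasing functions of t.

```python
import math

def winner(models, x):
    """Eq. (2): index c(x) of the model vector closest to x."""
    return min(range(len(models)), key=lambda i: math.dist(models[i], x))

def neighborhood(r_i, r_c, alpha, sigma):
    """Eq. (3): Gaussian neighborhood between grid locations r_i and r_c."""
    d2 = sum((a - b) ** 2 for a, b in zip(r_i, r_c))
    return alpha * math.exp(-d2 / (2.0 * sigma ** 2))

def som_step(models, grid, x, alpha, sigma):
    """Eq. (1): move every model vector toward x, scaled by the neighborhood."""
    c = winner(models, x)
    for i, m in enumerate(models):
        h = neighborhood(grid[i], grid[c], alpha, sigma)
        models[i] = [mk + h * (xk - mk) for mk, xk in zip(m, x)]
    return c

# Three nodes on a 1-D grid with 2-D model vectors (hypothetical numbers).
grid = [(0,), (1,), (2,)]
models = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
c = som_step(models, grid, [1.0, 1.0], alpha=0.5, sigma=1.0)
```

Here the winner is the third node, which already equals the input and therefore does not move, while its grid neighbors are pulled toward the input by amounts that shrink with grid distance.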

Self-Organizing Map

Batch-map SOM: to accelerate the computation of the SOM.

$$\forall i,\quad E_t\{h_{c(x),i}\,[x(t) - m_i^*]\} = 0 \tag{4}$$

$$m_i^* = \frac{\sum_t h_{c(x),i}\,x(t)}{\sum_t h_{c(x),i}} \tag{5}$$

Self-Organizing Map

Let $V_i$ be the set of all $x(t)$ that have $m_i$ as their closest model, called the Voronoi set. The number of samples $x(t)$ falling into $V_i$ is called $n_i$.

Step 1) Initialize the $m_i^*$ by any proper method.

Step 2) Vector quantization:

$$\bar{x}_i = \frac{\sum_{x(t) \in V_i} x(t)}{n_i} \tag{6}$$

Step 3) Smoothing:

$$m_i^* = \frac{\sum_j n_j h_{ji}\,\bar{x}_j}{\sum_j n_j h_{ji}} \tag{7}$$
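
The batch map steps on this slide (vector quantization over the Voronoi sets, then smoothing with the neighborhood function) could be sketched as follows. This is an illustrative sketch only: the Gaussian form of h_ji without a learning-rate factor, the function name, and the toy data are assumptions, not details given on the slide.

```python
import math

def batch_map_cycle(models, grid, data, sigma):
    """One batch-map cycle: Step 2 (eq. 6) followed by Step 3 (eq. 7)."""
    dim = len(models[0])
    # Step 2: assign each sample to its closest model (its Voronoi set V_i),
    # keeping the count n_i and the sum of the samples, i.e. n_i * mean_i.
    counts = [0] * len(models)
    sums = [[0.0] * dim for _ in models]
    for x in data:
        c = min(range(len(models)), key=lambda i: math.dist(models[i], x))
        counts[c] += 1
        sums[c] = [s + xk for s, xk in zip(sums[c], x)]
    # Step 3: neighborhood-weighted average of the Voronoi-set means.
    new_models = []
    for i in range(len(models)):
        num, den = [0.0] * dim, 0.0
        for j in range(len(models)):
            d2 = sum((a - b) ** 2 for a, b in zip(grid[i], grid[j]))
            h = math.exp(-d2 / (2.0 * sigma ** 2))  # assumed Gaussian h_ji
            num = [nk + h * sk for nk, sk in zip(num, sums[j])]
            den += h * counts[j]
        new_models.append([nk / den for nk in num] if den > 0 else list(models[i]))
    return new_models

# Two nodes, four 2-D samples (hypothetical numbers).
data = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
models = batch_map_cycle([[1.0, 1.0], [9.0, 1.0]], [(0,), (1,)], data, sigma=0.1)
```

With a very narrow neighborhood, as in this example, the smoothing step has almost no effect and each model ends up close to the mean of its own Voronoi set; a wider sigma mixes in the means of neighboring nodes.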

Statistical Models of Documents

Originally, the document models were histograms formed over word clusters obtained with self-organizing semantic maps. This system was called the WEBSOM.

[Figure: overview of the WEBSOM2 system.]

Statistical Models of Documents

A. The Primitive Vector-Space Model
  Inverse document frequency (IDF).
  Shannon entropy.

B. Latent Semantic Indexing (LSI)
  Singular-value decomposition (SVD).

Statistical Models of Documents

C. Randomly Projected Histograms
  Original document vector $n_i \in \mathbb{R}^n$.
  Rectangular random matrix $R$.
  Projections $x_i \in \mathbb{R}^m$:

$$x_i = R\,n_i \tag{8}$$

D. Histograms on the Word Category Map
  The original version of the WEBSOM.
  The new method is random projection of the word histograms.
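
The random projection of formula (8) multiplies an n-dimensional word histogram by a rectangular random matrix R to obtain a much shorter m-dimensional vector. A rough sketch in plain Python is given here. The choice of normally distributed, unit-length columns for R is an assumption for illustration, as are the function names and the toy histogram.

```python
import math
import random

def random_projection_columns(n, m, seed=0):
    """Return the n columns of an m-by-n random matrix R.

    Assumption: Gaussian entries, each column scaled to unit length.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(n):
        col = [rng.gauss(0.0, 1.0) for _ in range(m)]
        norm = math.sqrt(sum(v * v for v in col))
        columns.append([v / norm for v in col])
    return columns

def project(columns, histogram):
    """Eq. (8): x = R n, computed as a weighted sum of the columns of R."""
    m = len(columns[0])
    x = [0.0] * m
    for j, count in enumerate(histogram):
        if count:  # zero entries of the histogram contribute nothing
            for k in range(m):
                x[k] += count * columns[j][k]
    return x

# A 6-word vocabulary projected to 3 dimensions (hypothetical sizes).
columns = random_projection_columns(n=6, m=3)
x = project(columns, [2, 0, 0, 1, 0, 0])
```

Writing the product as a sum over the nonzero histogram entries already hints at why sparse histograms make the projection cheap.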

Statistical Models of Documents

E. Validation of the Random Projection Method by Small-Scale Preliminary Experiments
  13742 patents from the whole corpus of 6840568 abstracts.
  Equal number of patents from each of the 21 subsections.
  1814 words or word forms.
  With full 1344-dimensional histograms as document vectors.

Statistical Models of Documents

F. Construction of Random Projections of Word Histograms by Pointers
  Thresholding (+1 or -1).
  Sparse matrices (1 and 0).

Statistical Models of Documents

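Hash table and pointer.

The computing time was about 20% of that of the usual matrix-product method.

The computational complexity of the random projection with pointers is only $O(Nl) + O(n)$, where $N$ is the number of documents, $l$ the average number of different words in each document, and $n$ the original dimensionality.

In contrast, the complexity of the LSI is $O(Nld)$, where $d$ is the resulting dimensionality.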
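
One way to picture the pointer construction: if each column of the random matrix holds only a few ones and is zero elsewhere, the matrix never has to be stored. A hash table maps each word to the positions of its ones (its pointers), and projecting a document means adding the word's weight at those positions. The sketch below follows that reading; the function names, the tiny vocabulary, and the sizes are hypothetical, and the real system may differ in detail.

```python
import random

def make_pointers(vocabulary, m, k, seed=0):
    """Hash table: word -> k distinct random positions in the m-dim vector."""
    rng = random.Random(seed)
    return {word: rng.sample(range(m), k) for word in vocabulary}

def project_with_pointers(words, pointers, m, weights=None):
    """Build the projected document vector by following pointers.

    Each word occurrence adds its weight (1.0 if none is given) at the
    positions its pointers refer to; words outside the vocabulary are skipped.
    """
    x = [0.0] * m
    for word in words:
        if word not in pointers:
            continue
        w = 1.0 if weights is None else weights.get(word, 1.0)
        for p in pointers[word]:
            x[p] += w
    return x

# Hypothetical vocabulary, 20 dimensions, 5 pointers per word.
vocabulary = ["map", "document", "patent", "vector"]
pointers = make_pointers(vocabulary, m=20, k=5)
x = project_with_pointers(["map", "document", "map", "unknown"], pointers, m=20)
```

The work per document is proportional to the number of words it contains times the number of pointers per word, and no multiplication with a full matrix takes place.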

Rapid Construction of Large Document Maps

A. Fast Distance Computation
  Tabulate the indexes of the nonzero components of each input vector.
  Euclidean distances between sparse vectors.
  We must use low-dimensional models.
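
A possible way to exploit the tabulated nonzero indexes is to expand the squared Euclidean distance as ||x||^2 - 2 x.m + ||m||^2, so that only the nonzero components of the input enter the inner product while the squared norms are computed once and reused. The sketch below illustrates this under that assumption; the helper names and the numbers are made up.

```python
def nonzero_table(x):
    """Tabulate (index, value) pairs of the nonzero components of x."""
    return [(k, v) for k, v in enumerate(x) if v != 0.0]

def squared_norm(v):
    return sum(a * a for a in v)

def squared_distance(x_nonzero, x_sq_norm, m, m_sq_norm):
    """||x - m||^2 = ||x||^2 - 2 x.m + ||m||^2, touching only nonzero x_k."""
    dot = sum(v * m[k] for k, v in x_nonzero)
    return x_sq_norm - 2.0 * dot + m_sq_norm

# A sparse input with two nonzero components and a dense model vector.
x = [0.0, 3.0, 0.0, 0.0, 4.0]
m = [1.0, 2.0, 0.0, 1.0, 1.0]
d2 = squared_distance(nonzero_table(x), squared_norm(x), m, squared_norm(m))
```

In a winner search, the norm of each model vector would be cached across inputs, so the per-input cost depends on the number of nonzero components rather than on the full dimensionality.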

Rapid Construction of Large Document Maps

B. Estimation of Larger Maps Based on Carefully Constructed Smaller Ones
  Increasing the number of nodes of the SOM during its construction.
  The new idea is to estimate good initial values for the model vectors of a very large map on the basis of asymptotic values of the model vectors of a much smaller map.

Rapid Construction of Large Document Maps

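$$m'^{(d)}_h = \alpha_h\,m'^{(s)}_i + \beta_h\,m'^{(s)}_j + (1 - \alpha_h - \beta_h)\,m'^{(s)}_k \tag{9}$$

$$\hat{m}^{(d)}_h = \alpha_h\,m^{(s)}_i + \beta_h\,m^{(s)}_j + (1 - \alpha_h - \beta_h)\,m^{(s)}_k \tag{10}$$

Superscripts: (d) = dense lattice, (s) = sparse lattice.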
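
Formulas (9) and (10) can be read as a two-stage interpolation: first find coefficients that express a node of the dense lattice in terms of three nearby nodes of the sparse lattice, then apply the same coefficients to the trained model vectors of those three nodes. The sketch below assumes, for illustration only, that the coefficients are obtained from 2-D positions of the nodes; the function names and numbers are hypothetical.

```python
def barycentric(p, a, b, c):
    """Solve p = alpha*a + beta*b + (1 - alpha - beta)*c for 2-D points.

    This plays the role of eq. (9); using plain 2-D positions is an assumption.
    """
    det = (a[0] - c[0]) * (b[1] - c[1]) - (b[0] - c[0]) * (a[1] - c[1])
    alpha = ((p[0] - c[0]) * (b[1] - c[1]) - (b[0] - c[0]) * (p[1] - c[1])) / det
    beta = ((a[0] - c[0]) * (p[1] - c[1]) - (p[0] - c[0]) * (a[1] - c[1])) / det
    return alpha, beta

def estimate_dense_model(alpha, beta, m_i, m_j, m_k):
    """Eq. (10): interpolate the dense-map model from three sparse-map models."""
    gamma = 1.0 - alpha - beta
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(m_i, m_j, m_k)]

# A dense-lattice node inside the triangle of three sparse-lattice nodes.
alpha, beta = barycentric((0.25, 0.25), (0.0, 0.0), (1.0, 0.0), (0.0, 1.0))
m_hat = estimate_dense_model(alpha, beta,
                             [0.0, 0.0, 0.0], [4.0, 0.0, 8.0], [0.0, 4.0, 8.0])
```

The estimated vectors serve only as initial values; the enlarged map is still fine-tuned afterwards.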

Rapid Construction of Large Document Maps

C. Rapid Fine-Tuning of the Large Maps

1) Addressing Old Winners:
  This idea is the same as in LAB!

2) Initialization of the Pointers:
  The size of the maps is increased stepwise during learning, using formula (10).
  The winner is the map unit for which the inner product with the data vector is the largest.

$$x^T \hat{m}^{(d)}_h = \alpha_h\,x^T m^{(s)}_i + \beta_h\,x^T m^{(s)}_j + (1 - \alpha_h - \beta_h)\,x^T m^{(s)}_k \tag{11}$$
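
Formula (11) says that the inner product of a data vector with an interpolated dense-map unit is the same interpolation of its inner products with the sparse-map units. So the inner products with the small map can be computed once per data vector and reused for every unit of the large map. A sketch of that idea follows; the tuple layout `(i, j, k, alpha, beta)` and all numbers are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def dense_inner_products(x, sparse_models, dense_units):
    """Eq. (11): inner products with dense units from sparse-map inner products.

    dense_units holds one (i, j, k, alpha, beta) tuple per dense-map unit,
    naming the three sparse units and the coefficients it was estimated from.
    """
    sparse_dots = [dot(x, m) for m in sparse_models]  # computed only once
    return [alpha * sparse_dots[i] + beta * sparse_dots[j]
            + (1.0 - alpha - beta) * sparse_dots[k]
            for i, j, k, alpha, beta in dense_units]

def initial_pointer(x, sparse_models, dense_units):
    """Initial winner: the dense unit with the largest inner product."""
    scores = dense_inner_products(x, sparse_models, dense_units)
    return max(range(len(scores)), key=scores.__getitem__)

sparse_models = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dense_units = [(0, 1, 2, 0.5, 0.25), (0, 1, 2, 0.0, 0.0), (0, 1, 2, 1.0, 0.0)]
scores = dense_inner_products([1.0, 2.0], sparse_models, dense_units)
best = initial_pointer([1.0, 2.0], sparse_models, dense_units)
```

The pointer found this way is only a starting guess for the winner on the enlarged map.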

Rapid Construction of Large Document Maps

3) Parallelized Batch Map Algorithm:
  The winner search can be implemented as parallel processes.

4) Saving Memory by Reducing Representation Accuracy:
  Sufficient accuracy can be maintained during the computation.

Rapid Construction of Large Document Maps

D. Performance Evaluation of the New Methods

1) Numerical Comparison with the Traditional SOM Algorithm:
  Two performance indexes to measure the quality of the maps: average quantization error and classification accuracy.
  Experiments: two sets of maps.

Rapid Construction of Large Document Maps

2) Comparison of the Computational Complexity:
  $O(dM^2)$ stems from the computation of the small map.
  $O(dN)$ results from the VQ step (6) of the batch map algorithm.
  $O(N^2)$ refers to the estimation of the pointers.
  N: data samples; M: map units; d: dimensionality.

The Document Map of All Electronic Patent Abstracts

A. Preprocessing
  We first extracted the titles and the texts for further processing.
  We removed nontextual information.
  Mathematical symbols and numbers were converted into special dummy symbols.
  The whole vocabulary contained 733179 different words.
  A set of common words was removed.
  The remaining vocabulary consisted of 43222 words.
  Finally, we omitted the 122524 abstracts in which fewer than five words remained.

The Document Map of All Electronic Patent Abstracts

B. Formation of Statistical Models
  The final dimensionality we selected was 500, and five random pointers were used for each word.
  The words were weighted using the Shannon entropy of their distribution of occurrence among the subsections of the patent classification system.

The Document Map of All Electronic Patent Abstracts

The weight is a measure of the unevenness of the distribution of the word in the subsections.

The weights were calculated as follows: let $P_g(w)$ be the probability of a randomly chosen instance of the word $w$ occurring in subsection $g$, and $N_g$ the number of subsections.

Shannon entropy:

$$H(w) = -\sum_g P_g(w) \log P_g(w)$$

Weight:

$$W(w) = H_{\max} - H(w), \qquad H_{\max} = \log N_g$$
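
As a small worked illustration of the entropy-based weight: a word spread evenly over all subsections has maximal entropy and weight zero, while a word confined to one subsection has entropy zero and the maximal weight log N_g. The sketch below estimates P_g(w) from raw occurrence counts per subsection, which is an assumption made here for illustration; the function name and counts are made up.

```python
import math

def entropy_weight(counts_per_subsection):
    """W(w) = H_max - H(w), with H_max = log(number of subsections).

    counts_per_subsection[g] is how often the word occurs in subsection g;
    P_g(w) is estimated as count / total (an assumption of this sketch).
    """
    n_g = len(counts_per_subsection)
    total = sum(counts_per_subsection)
    entropy = -sum((c / total) * math.log(c / total)
                   for c in counts_per_subsection if c > 0)
    return math.log(n_g) - entropy

w_even = entropy_weight([5, 5, 5, 5])      # evenly spread word
w_peaked = entropy_weight([20, 0, 0, 0])   # word confined to one subsection
w_mixed = entropy_weight([10, 10, 0, 0])   # word split over two subsections
```

Words that discriminate between subsections thus contribute more to the document vector than words that occur everywhere.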

The Document Map of All Electronic Patent Abstracts

C. Formation of the Document Map
  500-dimensional document vectors.
  The map was enlarged twice sixteenfold and once ninefold.
  Each of the enlarged, estimated maps (cf. Section IV-B) was then fine-tuned by five batch map iteration cycles.

The Document Map of All Electronic Patent Abstracts

D. Results
  When each map node was labeled according to the majority of the subsections in the node, the resulting accuracy was 64%.
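
The accuracy figure can be understood as follows: every map node takes the label of the subsection that most of its documents belong to, and a document counts as correct when its own subsection matches the label of its node. A minimal sketch of that computation, with made-up node assignments and labels:

```python
from collections import Counter, defaultdict

def majority_label_accuracy(node_of_doc, label_of_doc):
    """Fraction of documents whose label equals the majority label of their node."""
    per_node = defaultdict(Counter)
    for node, label in zip(node_of_doc, label_of_doc):
        per_node[node][label] += 1
    correct = sum(max(counter.values()) for counter in per_node.values())
    return correct / len(node_of_doc)

# Five documents on two map nodes (hypothetical data).
nodes = [0, 0, 0, 1, 1]
labels = ["A", "A", "B", "C", "C"]
accuracy = majority_label_accuracy(nodes, labels)
```

In this toy case four of the five documents agree with the majority label of their node.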

Conclusion

In this paper the emphasis has been on the upscalability of the methods to very large text collections.

Contributions:
  A document map larger than our previous one.
  A new method of forming statistical models of documents.
  Several new fast computing methods.

Personal Opinion

Apply the SOM to a specific knowledge domain, e.g., IR or …?