
Efficient Visualization of Document Streams

Miha Grčar {[email protected]}, Vid Podpečan, Matjaž Juršič

Prof. Dr. Nada Lavrač

Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia

Discovery Science, Canberra, October 2010

Outline

Motivation
Document corpus visualization pipeline (original algorithm)
Visualization of document streams (our modified algorithm)
Experiments (speed tests)
Conclusions and further work


Motivation: Visualization of Document Corpora


Motivation. Goal: Visualization of Document Streams

Document stream

Outdated documents


Corpus Visualization Pipeline


Document corpus -> Corpus preprocessing -> k-means clustering -> Stress majorization -> Neighborhoods computation -> Least-squares interpolation -> Layout

Paulovich et al. (2006)


Corpus preprocessing

Tokenization
Stop-word removal
Lemmatization
n-grams

Sparse TF-IDF vectors in a high-dimensional space
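The preprocessing steps listed above can be sketched in a few lines. This is only an illustration under stated assumptions: the stop-word list, the regular-expression tokenizer and the helper names (`tokenize`, `terms`, `tfidf_vectors`) are hypothetical, lemmatization is left out, and the weight is taken as TF * log(|D| / DF).

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def terms(text, max_n=2):
    """Return unigrams and n-grams (up to max_n words) after stop-word removal.

    Lemmatization is omitted in this sketch. Stop words are removed before
    n-grams are formed, so an n-gram may span a removed stop word.
    """
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    out = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def tfidf_vectors(corpus):
    """Map each document to a sparse {term: TF * log(|D| / DF)} dictionary."""
    tfs = [Counter(terms(doc)) for doc in corpus]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())          # each document counts once per term
    n_docs = len(corpus)
    return [{t: c * math.log(n_docs / df[t]) for t, c in tf.items()} for tf in tfs]

corpus = [
    "the stream of documents",
    "documents in the corpus",
    "layout of the corpus",
]
vectors = tfidf_vectors(corpus)
```

Each vector stores only the terms that occur in its document, which is what keeps the representation sparse even when the vocabulary is large.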




k-means clustering
Iterative method

Stress majorization
High-dimensional space -> 2D
Iterative method




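Least-squares interpolation

Each point is placed at the centroid of its neighborhood N_p:

$p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$

$p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0), \quad k = |N_p|$

Control points: $c_i = (x_i^*, y_i^*)$

$A X = B$

$X = \left[ (x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_{n-1}, y_{n-1}),\ (x_n, y_n) \right]^T$

$B = \left[ (0, 0),\ (0, 0),\ \ldots,\ (0, 0),\ (0, 0),\ (x_1^*, y_1^*),\ \ldots,\ (x_r^*, y_r^*) \right]^T$

A: the first n rows have 1 on the diagonal and -1/k in the columns of the point's neighbors; the last r rows have a single 1 in the column of the corresponding control point.

Iterative method:

$\arg\min_X \|A X - B\|^2$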
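To make the system concrete, here is a minimal sketch that builds A and B from neighborhoods and control points and minimizes ||AX - B||^2 one coordinate at a time. The slides only say "iterative method"; conjugate gradient for least squares (CGLS) is swapped in here as one possible choice, and `build_system`, `cgls`, the four-point chain and the control-point targets are all hypothetical.

```python
def build_system(neighbors, controls):
    """neighbors[i] lists the neighbor indices of point i (i itself excluded).
    controls maps a point index to its target position (x*, y*).
    Returns dense A with n + r rows and B with n + r (x, y) rows."""
    n = len(neighbors)
    A, B = [], []
    for i, nbrs in enumerate(neighbors):
        row = [0.0] * n
        row[i] = 1.0
        for j in nbrs:
            row[j] = -1.0 / len(nbrs)     # -1/k with k = |N_p|
        A.append(row)
        B.append((0.0, 0.0))
    for i, target in controls.items():
        row = [0.0] * n
        row[i] = 1.0                      # pins point i to its target
        A.append(row)
        B.append(tuple(target))
    return A, B

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def rmatvec(A, y):
    """Compute A^T y."""
    return [sum(A[i][j] * y[i] for i in range(len(A))) for j in range(len(A[0]))]

def cgls(A, b, tol=1e-20, max_iter=1000):
    """Minimize ||A x - b||^2 with conjugate gradient on the normal equations."""
    x = [0.0] * len(A[0])
    r = list(b)                           # residual b - A x for x = 0
    s = rmatvec(A, r)
    p = list(s)
    gamma = sum(v * v for v in s)
    for _ in range(max_iter):
        if gamma < tol:                   # squared norm of A^T r is small enough
            break
        q = matvec(A, p)
        alpha = gamma / sum(v * v for v in q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(A, r)
        gamma_new = sum(v * v for v in s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x

# Four points on a chain; the two end points act as control points.
neighbors = [[1], [0, 2], [1, 3], [2]]
controls = {0: (0.0, 0.0), 3: (3.0, 3.0)}
A, B = build_system(neighbors, controls)
xs = cgls(A, [b[0] for b in B])
ys = cgls(A, [b[1] for b in B])
layout = list(zip(xs, ys))
```

Because the system is overdetermined, the control points are pulled toward their targets rather than fixed exactly on them; the remaining points settle between their neighbors.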

Stream Visualization Pipeline



Document stream -> Buffer (FIFO) -> Preprocessing -> k-means clustering -> Stress majorization

Outdated documents leave the buffer


Preprocessing

TF-IDF weights
TF: the number of times the term occurs in the document
DF: the number of documents in the corpus containing the term
IDF: log(|D| / DF)
Not possible to compute IDF from (infinite) real-time streams
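One way around the unbounded stream is to compute DF only over the documents currently held in the FIFO buffer. The sketch below illustrates that idea under stated assumptions: the class `BufferedTfIdf` and its methods are hypothetical, |D| is taken to be the current buffer size, and documents arrive already tokenized.

```python
import math
from collections import Counter, deque

class BufferedTfIdf:
    """Keeps the TF vectors of the last `capacity` documents and DF counts over them."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()     # FIFO of TF vectors (Counter objects)
        self.df = Counter()       # term -> number of buffered documents containing it

    def add(self, tokens):
        """Insert a tokenized document; return the TF vector it pushed out, if any."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())
        outdated = None
        if len(self.buffer) > self.capacity:
            outdated = self.buffer.popleft()
            for term in outdated:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]     # the term has left the vocabulary
        return outdated

    def tfidf(self, tf):
        """TF-IDF weights of a buffered TF vector, with |D| = current buffer size."""
        n = len(self.buffer)
        return {t: c * math.log(n / self.df[t]) for t, c in tf.items()}

model = BufferedTfIdf(capacity=2)
model.add(["stocks", "fall"])
model.add(["stocks", "rise"])
pushed_out = model.add(["rain", "rain"])      # the first document becomes outdated
weights = model.tfidf(model.buffer[-1])
```

Only the TF vectors are stored; TF-IDF weights are derived on demand from the current DF values, so removing an outdated document costs one decrement per distinct term in it.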


[Figure: TF vectors; vocabulary and DF values; resulting TF-IDF vector]


Warm start!

Remove outdated instances
Add new instances
…


[Figure: the system A X = B before and after an update. Before: X = (x1, y1), (x2, y2), ..., (xn-1, yn-1), (xn, yn). After: the rows of the outdated points (x1, y1) and (x2, y2) are dropped, so X starts at (x3, y3), (x4, y4). In both cases B holds (0, 0) rows followed by the control-point targets (x1*, y1*), ..., (xr*, yr*).]

Remove outdated instances
Add new instances
Warm start!
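The effect of a warm start on an iterative least-squares solver can be illustrated as follows. This is a sketch under stated assumptions, not the authors' solver: `lsq_descent` is a hypothetical steepest-descent routine with exact line search, the system is a four-point chain with the end points as control points, and the update keeps the same neighborhood structure purely to keep the example short.

```python
def lsq_descent(A, b, x0=None, tol=1e-20, max_iter=100000):
    """Minimize ||A x - b||^2 by steepest descent with exact line search.

    x0 is the starting point; None means a cold start from zeros.
    Returns (x, number of iterations performed).
    """
    n = len(A[0])
    x = list(x0) if x0 is not None else [0.0] * n
    for k in range(max_iter):
        r = [sum(a * v for a, v in zip(row, x)) - bi for row, bi in zip(A, b)]
        g = [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(n)]  # A^T r
        gg = sum(v * v for v in g)
        if gg < tol:                      # gradient is small enough: converged
            return x, k
        Ag = [sum(a * v for a, v in zip(row, g)) for row in A]
        step = gg / sum(v * v for v in Ag)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_iter

A = [
    [1.0, -1.0, 0.0, 0.0],
    [-0.5, 1.0, -0.5, 0.0],
    [0.0, -0.5, 1.0, -0.5],
    [0.0, 0.0, -1.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],     # control point: first point
    [0.0, 0.0, 0.0, 1.0],     # control point: last point
]
b_old = [0.0, 0.0, 0.0, 0.0, 0.0, 3.0]
x_old, it_old = lsq_descent(A, b_old)

# Restarting from a converged solution needs no further iterations.
_, it_repeat = lsq_descent(A, b_old, x0=x_old)

# Update: the oldest point leaves, a new one arrives, control targets move to 1 and 4.
b_new = [0.0, 0.0, 0.0, 0.0, 1.0, 4.0]
# Surviving points keep their old coordinates; the new point starts at its neighbor.
guess = x_old[1:] + [x_old[-1]]
x_warm, it_warm = lsq_descent(A, b_new, x0=guess)
x_cold, it_cold = lsq_descent(A, b_new)
```

Both runs reach the same layout. The warm run starts closer to it, which is why a warm start would typically need fewer iterations when only a small batch of instances changes between updates.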

Speed Tests

First 30,000 news items from Reuters Corpus Vol. 1 (“natural” rate: 1.4 news items / minute)

Experimental setting
Maximum rate?
10 news items in a batch (u = 10)
Buffer capacity: nQ = 5,000 news items
100 control points, 30 + 30 neighbors




Speed Tests



Document stream -> Buffer (FIFO) -> Preprocessing -> k-means clustering -> Stress majorization

Processing delay: ~9 sec
Exit delay: ~4 sec
Exit frequency: ~1/4 batches per sec (2.5 docs / sec)
+ 4 sec to form a batch
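The figures on this slide fit together arithmetically, as the short calculation below shows. The variable names are hypothetical, and reading "worst-case latency" as batch-forming time plus processing delay is an interpretation of the slide, not something it states.

```python
batch_size = 10               # u: documents per batch (experimental setting)
exit_interval_s = 4.0         # one batch leaves the pipeline roughly every 4 s
processing_delay_s = 9.0      # time a batch spends inside the pipeline
natural_rate_per_min = 1.4    # "natural" rate of the Reuters news stream

# One batch of 10 documents every 4 s.
throughput = batch_size / exit_interval_s                    # documents per second

# At that rate, collecting a full batch takes 10 / 2.5 seconds.
batch_fill_s = batch_size / throughput

# Assumed reading: the first document of a batch waits for the batch to fill,
# then for the pipeline to process it.
worst_case_latency_s = batch_fill_s + processing_delay_s

# How many times faster than the natural stream rate the pipeline runs.
headroom = throughput * 60 / natural_rate_per_min
```

Under these assumptions the pipeline sustains 2.5 documents per second, roughly two orders of magnitude above the natural rate of the test stream.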



Conclusions and Further Work

Conclusions
  Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)
  Tricks: warm start, pipelining, parallelization

Further work
  Performance at different nQ and u?
  Optimize k-means (done!) and k-NN (easy)
  Find use cases, perform user studies
    Decision making in financial domain (FIRST)
    Press clipping (media monitoring)

