Efficient Visualization of Document Streams
Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič
Prof. Dr. Nada Lavrač
Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia
Discovery Science, Canberra, October 2010
Outline
Motivation
Original algorithm
Document corpus visualization pipeline
Our modified algorithm
Visualization of document streams
Experiments (speed tests)
Conclusions and further work
Canberra, October 2010 DS 2010 2
Motivation: Visualization of Document Corpora
Motivation (Goal): Visualization of Document Streams
[Figure: a document stream feeds the visualization; outdated documents leave it]
Corpus Visualization Pipeline
Document corpus → Corpus preprocessing → k-means clustering → Stress majorization → Neighborhoods computation → Least-squares interpolation → Layout
Paulovich et al. (2006)
Corpus Visualization Pipeline: Corpus preprocessing
Tokenization, stop-word removal, lemmatization, n-grams
Sparse TF-IDF vectors in a high-dimensional space
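The preprocessing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop-word list, the regular expression, and the function name `preprocess` are assumptions, and lemmatization is left out because it needs a language-specific resource.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text, max_n=2):
    """Tokenize, drop stop words, and emit n-grams up to length max_n.

    Lemmatization is deliberately omitted; a real pipeline would map each
    token to its lemma before the n-grams are built.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(" ".join(tokens[i:i + n]))
    return terms


terms = preprocess("The visualization of document streams")
tf = Counter(terms)  # sparse term-frequency vector: term -> count
```

The `Counter` stores only the terms that occur in the document, which is what makes the resulting vector sparse even when the vocabulary is very large.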
Corpus Visualization Pipeline: k-means clustering
Iterative method
Corpus Visualization Pipeline: Stress majorization
High-dimensional → 2D
Iterative method
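To make the "high-dimensional → 2D, iterative method" step concrete, here is a minimal sketch of stress majorization with unit weights, where each iteration applies one Guttman transform. The function names, the toy 3D points, and the random starting layout are assumptions for illustration; the slides do not specify these details.

```python
import math
import random


def pairwise(points):
    """Matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def stress(layout, target):
    """Sum of squared differences between layout and target distances."""
    n = len(layout)
    d = pairwise(layout)
    return sum((d[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def majorize(layout, target, iterations=100):
    """Repeatedly apply the Guttman transform (unit weights) to a 2D layout."""
    n = len(layout)
    for _ in range(iterations):
        d = pairwise(layout)
        new_layout = []
        for i in range(n):
            x = y = 0.0
            for j in range(n):
                if i != j and d[i][j] > 0:
                    ratio = target[i][j] / d[i][j]
                    x += ratio * (layout[i][0] - layout[j][0])
                    y += ratio * (layout[i][1] - layout[j][1])
            new_layout.append((x / n, y / n))
        layout = new_layout
    return layout


# Toy "high-dimensional" points (3D here) standing in for cluster centroids.
high_dim = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
target = pairwise(high_dim)

rng = random.Random(0)
initial = [(rng.random(), rng.random()) for _ in high_dim]
final = majorize(initial, target)
```

Each transform cannot increase the stress, so the starting layout only affects how many iterations are needed and which local minimum is reached.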
Corpus Visualization Pipeline: Neighborhoods computation
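Corpus Visualization Pipeline: Least-squares interpolation
Each point $p_i$ is placed at the mean of its neighbors, where $N_p$ is the point's neighborhood:
$$p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$$
$$p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0), \qquad k = |N_p|$$
Control points are fixed at known 2D positions:
$$c_i = (x_i^*, y_i^*)$$
Together these give the linear system
$$
A X = B, \qquad
X = \begin{bmatrix} (x_1, y_1) \\ (x_2, y_2) \\ \vdots \\ (x_{n-1}, y_{n-1}) \\ (x_n, y_n) \end{bmatrix}, \qquad
B = \begin{bmatrix} (0, 0) \\ \vdots \\ (0, 0) \\ (x_1^*, y_1^*) \\ \vdots \\ (x_r^*, y_r^*) \end{bmatrix}
$$
The first $n$ rows of $A$ hold $1$ on the diagonal and $-1/k$ in the columns of the point's neighbors; the last $r$ rows hold a single $1$ in the column of the corresponding control point.
Iterative method:
$$\arg\min_X \lVert A X - B \rVert^2$$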
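The system above can be assembled and solved on a toy example as sketched below. This is a sketch under stated assumptions: the slides only say "iterative method", so conjugate gradient on the normal equations (CGLS) is swapped in here as one common choice, dense lists replace a sparse matrix for brevity, and all function names and the four-point chain are hypothetical.

```python
def build_system(n, neighbors, controls):
    """Build A ((n + r) x n) and B ((n + r) x 2) as dense lists.

    neighbors: point index -> list of neighbor indices
    controls:  point index -> fixed 2D position (x*, y*)
    """
    A, B = [], []
    for i in range(n):
        row = [0.0] * n
        row[i] = 1.0
        k = len(neighbors[i])
        for j in neighbors[i]:
            row[j] = -1.0 / k
        A.append(row)
        B.append((0.0, 0.0))
    for i, (cx, cy) in controls.items():
        row = [0.0] * n
        row[i] = 1.0
        A.append(row)
        B.append((cx, cy))
    return A, B


def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Product of the transpose of A with r."""
    return [sum(row[j] * ri for row, ri in zip(A, r)) for j in range(len(A[0]))]


def cgls(A, b, x0=None, iters=200, tol=1e-18):
    """Minimize ||A x - b||^2 by conjugate gradient on the normal equations."""
    x = list(x0) if x0 is not None else [0.0] * len(A[0])
    r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
    s = rmatvec(A, r)
    p = s[:]
    gamma = sum(v * v for v in s)
    for _ in range(iters):
        if gamma < tol:
            break
        q = matvec(A, p)
        qq = sum(v * v for v in q)
        if qq == 0.0:
            break
        alpha = gamma / qq
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(A, r)
        gamma_new = sum(v * v for v in s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x


def interpolate(n, neighbors, controls):
    """Solve for the x and y columns of X separately and pair them up."""
    A, B = build_system(n, neighbors, controls)
    xs = cgls(A, [b[0] for b in B])
    ys = cgls(A, [b[1] for b in B])
    return list(zip(xs, ys))


# Toy chain of four points; the two end points are control points.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
controls = {0: (0.0, 0.0), 3: (3.0, 0.0)}
layout = interpolate(4, neighbors, controls)
```

The optional `x0` argument is where a previous solution could be passed in as a starting point instead of zeros.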
Stream Visualization Pipeline
[Diagram: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation; outdated documents leave the buffer]
Stream Visualization Pipeline: Preprocessing
TF-IDF weights
  TF: the number of times the term occurs in the document
  DF: the number of documents in the corpus containing the term
  IDF: log(|D| / DF)
Not possible to compute IDF from (infinite) real-time streams
[Figure: a document's TF vector]
Stream Visualization Pipeline: Preprocessing
[Figure: TF vectors combined with the vocabulary's DF values yield a TF-IDF vector]
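One way to read the figure is that only TF vectors are stored, DF values are kept per vocabulary term, and TF-IDF weights are derived on demand. The sketch below assumes that DF is counted over the documents currently in a FIFO buffer and that |D| is the buffer's current size; the class name `StreamTfIdf` and its methods are hypothetical.

```python
import math
from collections import Counter, deque


class StreamTfIdf:
    """FIFO buffer of TF vectors plus DF counts for the buffered documents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = deque()  # TF vectors (Counter), oldest first
        self.df = Counter()    # term -> number of buffered documents containing it

    def add(self, tokens):
        """Add one tokenized document; evict the oldest one if the buffer is full."""
        if len(self.buffer) == self.capacity:
            outdated = self.buffer.popleft()
            for term in outdated:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]
        tf = Counter(tokens)
        self.buffer.append(tf)
        for term in tf:
            self.df[term] += 1

    def tfidf(self, tf):
        """TF-IDF weights of one buffered TF vector, with IDF = log(|D| / DF)."""
        n_docs = len(self.buffer)
        return {t: c * math.log(n_docs / self.df[t]) for t, c in tf.items()}


model = StreamTfIdf(capacity=2)
model.add(["stream", "visualization"])
model.add(["stream", "stream", "layout"])
model.add(["corpus", "layout"])  # evicts the first document
weights = model.tfidf(model.buffer[0])
```

Because the DF counts change with every insertion and eviction, the TF-IDF weights are computed when needed rather than stored with the document.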
Stream Visualization Pipeline: k-means clustering
Warm start!
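A warm start here plausibly means initializing k-means with the centroids from the previous update instead of starting from scratch. The sketch below illustrates that idea with plain Lloyd iterations on toy 2D points; the function `kmeans`, the naive cold-start initialization, and the data are assumptions for illustration only.

```python
def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm from the given initial centroids.

    Returns (centroids, number of iterations until assignments stopped changing).
    """
    centroids = [tuple(c) for c in centroids]
    assignment = None
    for it in range(1, max_iter + 1):
        new_assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:
            return centroids, it
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, max_iter


old_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
previous_centroids, _ = kmeans(old_points, old_points[:2])

# The buffer moves on: the oldest point leaves, a new one arrives.
new_points = old_points[1:] + [(11, 11)]

# Cold start: naive initialization from the first two points.
cold_centroids, cold_iters = kmeans(new_points, new_points[:2])
# Warm start: reuse the centroids from the previous update.
warm_centroids, warm_iters = kmeans(new_points, previous_centroids)
```

When only a small part of the buffer changes between updates, the previous centroids are already close to the new solution, so fewer iterations are typically needed.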
Stream Visualization Pipeline: Stress majorization
Warm start!
Stream Visualization Pipeline: Neighborhoods computation
Remove outdated instances
Add new instances
…
Stream Visualization Pipeline: Least-squares interpolation
[Figure: the system A X = B before and after a buffer update. The rows and columns that belong to outdated instances, here (x1,y1) and (x2,y2), are removed, so X starts at (x3,y3), (x4,y4), ..., (xn,yn); rows and columns for new instances are added. B still holds (0,0) entries followed by the control-point positions (x1*,y1*), ..., (xr*,yr*).]
Remove outdated instances
Add new instances
Warm start!
Speed Tests
First 30,000 news items from Reuters Corpus Volume 1 (“natural” rate: 1.4 news items / minute)
Experimental setting
  Maximum rate?
  10 news items in a batch (u = 10)
  Buffer capacity: nQ = 5,000 news items
  100 control points, 30 + 30 neighbors
Speed Tests
[Diagram: Document stream, Buffer (FIFO), Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation; outdated documents leave the buffer]
Processing delay: ~9 sec
Exit delay: ~4 sec
Exit frequency: ~1/4 batches per sec (2.5 docs / sec)
+ 4 sec to form a batch
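The figures above fit together arithmetically, as the short calculation below shows. The variable names are illustrative, and reading "4 sec to form a batch" as the time needed to collect 10 documents at the maximum rate is an interpretation rather than something the slide states.

```python
batch_size = 10            # u = 10 documents per batch
exit_frequency = 1 / 4     # batches leaving the pipeline per second
natural_rate = 1.4 / 60    # Reuters "natural" rate, in documents per second

# Documents per second that the pipeline can sustain.
throughput = batch_size * exit_frequency

# Seconds needed to collect one full batch at that rate.
batch_fill_time = batch_size / throughput

# How many times faster than the natural rate the pipeline runs.
headroom = throughput / natural_rate

print(throughput, batch_fill_time, round(headroom, 1))
```

Under these numbers the pipeline keeps up with a stream roughly two orders of magnitude faster than the natural rate of the Reuters news feed.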
Conclusions and Further Work
Conclusions
  Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)
  Tricks: warm start, pipelining, parallelization
Further work
  Performance at different nQ and u?
  Optimize k-means (done!) and k-NN (easy)
  Find use cases, perform user studies
    Decision making in financial domain (FIRST)
    Press clipping (media monitoring)