Efficient Visualization of Document Streams

Miha Grčar {miha.grcar@ijs.si}, Vid Podpečan, Matjaž Juršič, Prof. Dr. Nada Lavrač
Jozef Stefan Institute, Dept. of Knowledge Technologies, Ljubljana, Slovenia
Discovery Science, Canberra, October 2010

Page 1: Efficient Visualization of Document Streams

Efficient Visualization of Document Streams

Miha Grčar {miha.grcar@ijs.si}
Vid Podpečan
Matjaž Juršič

Prof. Dr. Nada Lavrač

Jozef Stefan Institute, Dept. of Knowledge Technologies
Ljubljana, Slovenia

Discovery Science, Canberra, October 2010

Page 2: Efficient Visualization of Document Streams

Outline

Motivation
Original algorithm
  Document corpus visualization pipeline
Our modified algorithm
  Visualization of document streams
Experiments (speed tests)
Conclusions and further work

Canberra, October 2010 DS 2010 2

Page 3: Efficient Visualization of Document Streams

Motivation

Visualization of Document Corpora

Page 4: Efficient Visualization of Document Streams

Motivation

Goal: Visualization of Document Streams

[Figure labels: Document stream; Outdated documents]

Page 5: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

Document corpus → Corpus preprocessing → k-means clustering → Stress majorization → Neighborhoods computation → Least-squares interpolation → Layout

Paulovich et al. (2006)

Page 6: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

Corpus preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Tokenization
Stop-word removal
Lemmatization
n-grams

Sparse TF-IDF vectors in a high-dimensional space
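The preprocessing steps named on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the stop-word list is a tiny placeholder, lemmatization is left out (it needs a language-specific lemmatizer), the vectors are L2-normalized by choice, and the function names `tokenize`, `with_ngrams`, and `tfidf_vectors` are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def with_ngrams(tokens, max_n=2):
    """Return the unigrams plus all n-grams up to max_n, joined by underscores."""
    terms = list(tokens)
    for n in range(2, max_n + 1):
        terms += ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return terms


def tfidf_vectors(docs, max_n=2):
    """Map each document to a sparse, L2-normalized {term: weight} dictionary."""
    term_lists = [with_ngrams(tokenize(d), max_n) for d in docs]
    # DF: number of documents that contain the term at least once.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vec = {t: f * math.log(len(docs) / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        # Keep only non-zero weights so the representation stays sparse.
        vectors.append({t: w / norm for t, w in vec.items() if w > 0})
    return vectors
```

Each document becomes a dictionary that stores only its non-zero weights, which is what keeps the vectors sparse even though the vocabulary (and so the dimensionality) is large.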

Page 7: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

Corpus preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Iterative method

Page 8: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

Corpus preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

High-dimensional → 2D

Iterative method
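Stress majorization places points in 2D so that their pairwise distances approximate given target distances, here the distances measured in the high-dimensional space. The sketch below uses the standard SMACOF update (Guttman transform) with uniform weights. That choice is an assumption, since the slide only says "iterative method", and the function names are hypothetical.

```python
import math


def stress(points, target):
    """Sum of squared differences between layout distances and target distances."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def smacof_step(points, target):
    """One Guttman transform with uniform weights; returns the updated 2D layout."""
    n = len(points)
    new_points = []
    for i in range(n):
        x = y = 0.0
        for j in range(n):
            if i == j:
                continue
            d = math.dist(points[i], points[j])
            ratio = target[i][j] / d if d > 1e-12 else 0.0
            # Row i of B(X) X reduces to sum_j ratio_ij * (x_i - x_j).
            x += ratio * (points[i][0] - points[j][0])
            y += ratio * (points[i][1] - points[j][1])
        new_points.append((x / n, y / n))
    return new_points


def stress_majorization(target, init, max_iter=300, tol=1e-9):
    """Iterate from the layout `init` until the stress stops improving."""
    points = list(init)
    prev = stress(points, target)
    for _ in range(max_iter):
        points = smacof_step(points, target)
        cur = stress(points, target)
        if prev - cur < tol:
            break
        prev = cur
    return points
```

The `init` argument is the starting layout. Each update never increases the stress, so a starting layout that is already close to a good solution needs fewer iterations.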

Page 9: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

Corpus preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Page 10: Efficient Visualization of Document Streams

Corpus Visualization Pipeline

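Corpus preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

$$p_i = \frac{1}{|N_p|} \sum_{r \in N_p} r$$

$$p_i + \sum_{r \in N_p} \left(-\frac{1}{k}\right) r = (0, 0), \quad k = |N_p|$$

$$c_i = (x_i^*, y_i^*)$$

$$A X = B$$

$A$ has one row per point, with $1$ in the point's own column and $-1/k$ in the columns of its neighbors, followed by one row per control point with a single $1$ in that control point's column.

$$
X = \begin{bmatrix} (x_1, y_1) \\ (x_2, y_2) \\ \vdots \\ (x_{n-1}, y_{n-1}) \\ (x_n, y_n) \end{bmatrix},
\qquad
B = \begin{bmatrix} (0, 0) \\ (0, 0) \\ \vdots \\ (0, 0) \\ (0, 0) \\ (x_1^*, y_1^*) \\ \vdots \\ (x_r^*, y_r^*) \end{bmatrix}
$$

Iterative method:

$$\arg\min_X \{\|AX - B\|^2\}$$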
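The system on this slide can be assembled and solved with a short pure-Python sketch. It builds the sparse rows of A and the right-hand side B as described (neighborhood rows, then control-point rows) and minimizes ||AX - B||^2 with conjugate gradients on the least-squares problem (CGLS). CGLS is one possible iterative method and is an assumption here, as the slide does not name the solver. The function names and the optional start vector `x0` are hypothetical.

```python
def build_system(neighbors, controls):
    """neighbors: {i: [j, ...]} for all n points (i itself not in the list);
    controls: {i: (x*, y*)}. Returns sparse rows of A and the rows of B."""
    rows, rhs = [], []
    for i, nbrs in sorted(neighbors.items()):
        row = {i: 1.0}
        for j in nbrs:
            row[j] = -1.0 / len(nbrs)      # -1/k with k = |N_p|
        rows.append(row)
        rhs.append((0.0, 0.0))
    for i, c in sorted(controls.items()):
        rows.append({i: 1.0})              # pins point i towards its control position
        rhs.append(c)
    return rows, rhs


def cgls(rows, b, n, x0=None, max_iter=1000, tol=1e-12):
    """Minimize ||A x - b||^2 for one coordinate, starting from x0 (default: zeros)."""
    x = list(x0) if x0 is not None else [0.0] * n

    def matvec(v):
        return [sum(a * v[j] for j, a in row.items()) for row in rows]

    def rmatvec(u):
        out = [0.0] * n
        for row, ui in zip(rows, u):
            for j, a in row.items():
                out[j] += a * ui
        return out

    r = [bi - ai for bi, ai in zip(b, matvec(x))]
    s = rmatvec(r)
    p = list(s)
    gamma = sum(v * v for v in s)
    for _ in range(max_iter):
        if gamma < tol:
            break
        q = matvec(p)
        alpha = gamma / sum(v * v for v in q)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        s = rmatvec(r)
        gamma_new = sum(v * v for v in s)
        p = [si + (gamma_new / gamma) * pi for si, pi in zip(s, p)]
        gamma = gamma_new
    return x


def interpolate(neighbors, controls):
    """Solve the x and y coordinates separately and return [(x_i, y_i), ...]."""
    rows, rhs = build_system(neighbors, controls)
    n = len(neighbors)
    xs = cgls(rows, [v[0] for v in rhs], n)
    ys = cgls(rows, [v[1] for v in rhs], n)
    return list(zip(xs, ys))
```

Because the system is solved in the least-squares sense, the control-point rows act as soft constraints: a control point is pulled towards its target position but need not land on it exactly.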

Page 11: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Document stream → Buffer (FIFO) → Outdated documents

Pipeline stages: Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation

Page 12: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

TF-IDF weights
  TF: the number of times the term occurs in the document
  DF: the number of documents in the corpus containing the term
  IDF: log(|D| / DF)
Not possible to compute IDF from (infinite) real-time streams

Page 13: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

[Figure labels: TF vectors; Vocabulary DF values; TF-IDF vector]
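One way to read this slide: the buffer stores raw TF vectors, the vocabulary keeps a DF value per term for the documents currently buffered, and TF-IDF vectors are derived from the two on demand. The sketch below illustrates that reading with a FIFO buffer. The class name `StreamTfIdf`, its methods, and the decision to use the buffer size as |D| are assumptions for illustration, not the authors' code.

```python
import math
from collections import Counter, deque


class StreamTfIdf:
    """FIFO buffer of TF vectors with document frequencies kept in sync."""

    def __init__(self, capacity):
        self.capacity = capacity   # maximum number of buffered documents
        self.buffer = deque()      # TF vectors (Counter), oldest first
        self.df = Counter()        # term -> number of buffered docs containing it

    def add(self, tokens):
        """Enqueue one document; return the TF vectors of outdated documents."""
        tf = Counter(tokens)
        self.buffer.append(tf)
        self.df.update(tf.keys())
        outdated = []
        while len(self.buffer) > self.capacity:
            old = self.buffer.popleft()
            for term in old:
                self.df[term] -= 1
                if self.df[term] == 0:
                    del self.df[term]   # term no longer occurs in the buffer
            outdated.append(old)
        return outdated

    def tfidf(self, tf):
        """TF-IDF weights of a TF vector, with |D| taken as the buffer size."""
        n = len(self.buffer)
        return {t: f * math.log(n / self.df[t])
                for t, f in tf.items() if self.df[t] > 0}
```

Updating DF costs time proportional to the number of distinct terms in the documents entering and leaving, so IDF never has to be recomputed over the whole buffer.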

Page 14: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Warm start!
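A warm start means that an iterative method begins from the result of the previous run instead of from scratch. For k-means, that amounts to reusing the previous centroids as the initial centroids after the buffer has changed slightly. The sketch below shows the idea with plain Lloyd's algorithm on dense tuples and Euclidean distance, both chosen for brevity. The pipeline itself works on sparse TF-IDF vectors, and the function name and signature are hypothetical.

```python
import math
import random


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm. Returns (centroids, iterations used).

    `init` is the warm start: a list of centroids from a previous run.
    Without it, k points are sampled at random (cold start).
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    for iteration in range(1, max_iter + 1):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            return centroids, iteration
        centroids = new_centroids
    return centroids, max_iter
```

After a small batch replaces a few old documents, the previous centroids are usually close to the new optimum, so a call such as `kmeans(new_points, k, init=old_centroids)` tends to converge in very few iterations.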

Page 15: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Warm start!

Page 16: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

Remove outdated instances

Add new instances

Page 17: Efficient Visualization of Document Streams

Stream Visualization Pipeline

Preprocessing | k-means clustering | Stress majorization | Neighborhoods | Least-squares interpolation

The system $A X = B$ before and after an update:

$$
X = \begin{bmatrix} (x_1, y_1) \\ (x_2, y_2) \\ \vdots \\ (x_{n-1}, y_{n-1}) \\ (x_n, y_n) \end{bmatrix}
\;\longrightarrow\;
X = \begin{bmatrix} (x_3, y_3) \\ (x_4, y_4) \\ \vdots \\ (x_{n-1}, y_{n-1}) \\ (x_n, y_n) \end{bmatrix},
\qquad
B = \begin{bmatrix} (0, 0) \\ (0, 0) \\ \vdots \\ (0, 0) \\ (0, 0) \\ (x_1^*, y_1^*) \\ \vdots \\ (x_r^*, y_r^*) \end{bmatrix}
$$

Remove outdated instances

Add new instances

Warm start!
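The three labels on this slide (remove outdated instances, add new instances, warm start) suggest that the previous layout is carried over as the starting point for the next least-squares solve. The sketch below shows one way to prepare such a starting layout. Placing each new document at the centroid of its already-placed neighbors is an assumption made for illustration, as are the function name and the data layout; the slide does not say how new instances are initialized.

```python
def warm_start_layout(prev_layout, outdated, new_docs, neighbors):
    """Build the initial 2D layout for the next solve.

    prev_layout: {doc_id: (x, y)} from the previous solve
    outdated:    set of doc_ids that left the buffer
    new_docs:    list of doc_ids that entered the buffer
    neighbors:   {doc_id: [neighbor doc_ids]} for the new documents
    """
    # Remove outdated instances: keep the coordinates of surviving documents.
    layout = {d: xy for d, xy in prev_layout.items() if d not in outdated}
    # Add new instances: start each one at the centroid of its placed neighbors.
    for d in new_docs:
        placed = [layout[n] for n in neighbors.get(d, []) if n in layout]
        if placed:
            layout[d] = (sum(x for x, _ in placed) / len(placed),
                         sum(y for _, y in placed) / len(placed))
        else:
            layout[d] = (0.0, 0.0)   # no placed neighbor yet: fall back to the origin
    return layout
```

The returned coordinates would then be passed to the iterative solver as its initial guess, so that only the changed part of the layout has to move.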

Page 18: Efficient Visualization of Document Streams

Speed Tests

First 30,000 news items from Reuters Corpus Vol. 1 (“natural” rate: 1.4 news items / minute)

Experimental setting
  Maximum rate?
  10 news items in a batch (u = 10)
  Buffer capacity: nQ = 5,000 news items
  100 control points, 30 + 30 neighbors

Page 19: Efficient Visualization of Document Streams

Speed Tests


Page 20: Efficient Visualization of Document Streams

Speed Tests

Document stream → Buffer (FIFO) → Outdated documents

Pipeline stages: Preprocessing, k-means clustering, Stress majorization, Neighborhoods computation, Least-squares interpolation

Processing delay: ~9 sec

Exit delay: ~4 sec

Exit frequency: ~1/4 batches per sec (2.5 docs / sec)

+ 4 sec to form a batch
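The figures on this slide fit together with a little arithmetic, shown below as a short Python calculation. The batch size and the delays are taken from the slides. Reading "processing delay plus batch formation" as the worst-case latency of a single document is an interpretation, and the variable names are made up for this example.

```python
batch_size = 10             # u: documents per batch (experimental setting)
exit_interval_s = 4.0       # one batch leaves the pipeline about every 4 s
processing_delay_s = 9.0    # time for one batch to pass through all stages
natural_rate = 1.4 / 60     # docs per second in the original Reuters stream

# Throughput: one batch of u documents exits every exit_interval_s seconds.
throughput = batch_size / exit_interval_s          # 2.5 docs per second

# At that input rate, collecting a full batch takes u / throughput seconds.
batch_fill_s = batch_size / throughput             # 4.0 s

# The first document of a batch waits for the batch to fill, then is processed.
worst_case_latency_s = batch_fill_s + processing_delay_s   # 13.0 s

# How much faster than the "natural" news rate the pipeline can run.
headroom = throughput / natural_rate

print(f"throughput: {throughput} docs/s")
print(f"worst-case latency: {worst_case_latency_s} s")
print(f"headroom over natural rate: {headroom:.0f}x")
```

The throughput is set by the exit interval, not by the 9-second processing delay, because pipelining lets several batches be in flight at once.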

Page 21: Efficient Visualization of Document Streams

Speed Tests


Page 22: Efficient Visualization of Document Streams

Conclusions and Further Work

Conclusions
  Efficient online distance-preserving document stream visualization technique (2.5 docs / sec, 5 parallel processes)
  Tricks: warm start, pipelining, parallelization
Further work
  Performance at different nQ and u?
  Optimize k-means (done!) and k-NN (easy)
  Find use cases, perform user studies
    Decision making in financial domain (FIRST)
    Press clipping (media monitoring)