Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map Daniel X. Pape Community Architectures for Network Information Systems [email protected] www.canis.uiuc.edu CSNA’98 6/18/98

Page 1: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Daniel X. Pape

Community Architectures for Network Information Systems

[email protected]

CSNA’98 6/18/98

Page 2: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Overview

• Self-Organizing Map (SOM) Algorithm

• U-Matrix Algorithm for SOM Visualization

• SOM Navigation Application

• Document Representation and Collection Examples

• Problems and Optimizations

• Future Work

Page 3: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Basic SOM Algorithm

• Input
  – Number (n) of Feature Vectors (x)
  – format:
      vector name: a, b, c, d
  – examples:
      1: 0.1, 0.2, 0.3, 0.4
      2: 0.2, 0.3, 0.3, 0.2
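
A minimal Python sketch of reading the input format shown above, assuming one vector per line written as `name: v1, v2, ...`. The function name `parse_feature_vectors` is hypothetical and does not come from the original slides.

```python
def parse_feature_vectors(text):
    """Parse lines of the form 'name: a, b, c, d' into {name: [floats]}."""
    vectors = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, values = line.partition(":")
        vectors[name.strip()] = [float(v) for v in values.split(",")]
    return vectors


sample = """
1: 0.1, 0.2, 0.3, 0.4
2: 0.2, 0.3, 0.3, 0.2
"""
vectors = parse_feature_vectors(sample)
```

Every vector must have the same number of elements, since each position is one dimension of the data set.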

Page 4: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Basic SOM Algorithm

• Output
  – Neural network Map of (M) Nodes
  – Each node has an associated Weight Vector (m) of the same dimensionality as the input feature vectors
  – Examples:
      m1: 0.1, 0.2, 0.3, 0.4
      m2: 0.2, 0.3, 0.3, 0.2

Page 5: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Basic SOM Algorithm

• Output (cont.)
  – Nodes laid out in a grid:

Page 6: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Basic SOM Algorithm

• Other Parameters
  – Number of timesteps (T)
  – Learning Rate (eta)

Page 7: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Basic SOM Algorithm

SOM() {
    foreach timestep t {
        foreach feature vector x {
            wnode = find_winning_node(x)
            update_local_neighborhood(wnode, x)
        }
    }
}

find_winning_node(x) {
    foreach node n {
        compute distance of weight vector m_n to feature vector x
    }
    return node with the smallest distance
}

update_local_neighborhood(wnode, x) {
    foreach node n in the grid neighborhood of wnode {
        m_n = m_n + eta * (x - m_n)
    }
}
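
The pseudocode above can be sketched as runnable Python. This is an illustration under stated assumptions, not the author's implementation: the squared Euclidean distance, the square neighborhood of fixed `radius`, the random initialization, and the names `train_som` and `squared_distance` are all assumptions (the slides list only T and eta as parameters).

```python
import random


def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def find_winning_node(weights, x):
    """Return the (row, col) of the node whose weight vector is closest to x."""
    return min(weights, key=lambda pos: squared_distance(weights[pos], x))


def update_local_neighborhood(weights, wnode, x, eta, radius):
    """Move every node within `radius` grid steps of wnode toward x."""
    wr, wc = wnode
    for (r, c), m in weights.items():
        if max(abs(r - wr), abs(c - wc)) <= radius:  # assumed square neighborhood
            for i in range(len(m)):
                m[i] += eta * (x[i] - m[i])


def train_som(vectors, rows, cols, timesteps, eta, radius=1, seed=0):
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Assumption: weight vectors start as random values in [0, 1).
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    for _ in range(timesteps):
        for x in vectors:
            wnode = find_winning_node(weights, x)
            update_local_neighborhood(weights, wnode, x, eta, radius)
    return weights


data = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.3, 0.2]]
som = train_som(data, rows=3, cols=3, timesteps=20, eta=0.5)
```

Practical SOM implementations usually shrink both the learning rate and the neighborhood radius over time; the sketch keeps them constant for brevity.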

Page 8: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

• Provides a simple way to visualize cluster boundaries on the map

• Simple algorithm:
  – for each node in the map, compute the average of the distances between its weight vector and those of its immediate neighbors

• The average distance measures how dissimilar a node is from its neighbors: a small average means the node closely resembles them
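
A minimal sketch of the U-Matrix computation described above. Two choices here are assumptions rather than details from the slides: "immediate neighbors" is taken to mean the four grid neighbors (up, down, left, right), and distance is Euclidean. The name `u_matrix` is hypothetical.

```python
import math


def u_matrix(weights, rows, cols):
    """Average distance from each node's weight vector to its grid neighbors."""
    result = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    dists.append(math.dist(weights[(r, c)], weights[(nr, nc)]))
            result[r][c] = sum(dists) / len(dists) if dists else 0.0
    return result


# Two identical nodes on the left, one distant node on the right.
weights = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 0.0], (0, 2): [3.0, 4.0]}
um = u_matrix(weights, rows=1, cols=3)  # [[0.0, 2.5, 5.0]]
```

The values rise toward the right-hand node, which is where the boundary between the two groups of weight vectors lies.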

Page 9: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

• Interpretation
  – one can encode the U-Matrix measurements as greyscale values in an image, or as altitudes on a terrain
  – the result is a landscape that represents the document space: the valleys, or dark areas, are the clusters of data, and the mountains, or light areas, are the boundaries between the clusters
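
One way to produce the greyscale encoding is a linear rescale of the U-Matrix values to 0..255, so that small distances (clusters) come out dark and large distances (boundaries) come out light. This is a sketch; the linear mapping and the name `to_greyscale` are assumptions.

```python
def to_greyscale(um):
    """Linearly rescale U-Matrix values to integer grey levels 0..255."""
    flat = [v for row in um for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a perfectly flat map
    return [[round(255 * (v - lo) / span) for v in row] for row in um]


grey = to_greyscale([[0.0, 2.5, 5.0]])
```

The same rescaled values could serve as altitudes for the terrain view.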

Page 10: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

• Example:– dataset of random three dimensional points,

arranged in four obvious clusters

Page 11: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

Four (color-coded) clusters of three-dimensional points

Page 12: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

Oblique projection of a terrain derived from the U-Matrix

Page 13: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Visualization

Terrain for a real document collection

Page 14: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Current Labeling Procedure

• Feature vectors are encoded as 0’s and 1’s

• Weight vectors have real values from 0 to 1

• Sort weight vector dimensions by element value
  – dimension with greatest value is “best” noun phrase for that node

• Aggregate nodes with the same “best” noun phrase into groups
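
The labeling steps above can be sketched as follows. The function names are hypothetical, and this sketch groups nodes purely by label; it does not check whether nodes that share a label are adjacent on the map.

```python
from collections import defaultdict


def best_noun_phrase(weight_vector, noun_phrases):
    """Return the noun phrase whose dimension has the largest weight."""
    best_index = max(range(len(weight_vector)), key=weight_vector.__getitem__)
    return noun_phrases[best_index]


def group_nodes_by_label(weights, noun_phrases):
    """Map each 'best' noun phrase to the list of nodes that share it."""
    groups = defaultdict(list)
    for pos, m in sorted(weights.items()):
        groups[best_noun_phrase(m, noun_phrases)].append(pos)
    return dict(groups)


phrases = ["alexander", "king", "battle"]
weights = {(0, 0): [0.9, 0.4, 0.1],
           (0, 1): [0.7, 0.2, 0.3],
           (1, 0): [0.1, 0.2, 0.8]}
groups = group_nodes_by_label(weights, phrases)
```

Taking the maximum directly gives the same "best" phrase as sorting all dimensions and reading off the first one.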

Page 15: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

U-Matrix Navigation

• 3D Space-Flight

• Hierarchical Navigation

Page 16: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Document Data

• Noun phrases extracted

• Set of unique noun phrases computed
  – each noun phrase becomes a dimension of the data set

• Each document represented by a binary vector with a 1 or a 0 denoting the presence or absence of each noun phrase

Page 17: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Document Data

• Example:
  – 10 total noun phrases:
      alexander, king, macedonians, darius, philip, horse, soldiers, battle, army, death
  – each element of the feature vector will be a 1 or a 0:
      • 1: 1, 1, 0, 0, 1, 1, 0, 0, 0, 0
      • 2: 0, 1, 0, 1, 0, 0, 1, 1, 1, 1
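
A short sketch of building these binary vectors from extracted noun phrases. The per-document phrase lists below are read back from the two example vectors; the function name `to_binary_vector` is hypothetical.

```python
def to_binary_vector(document_phrases, vocabulary):
    """Return 1 for each vocabulary phrase present in the document, else 0."""
    present = set(document_phrases)
    return [1 if phrase in present else 0 for phrase in vocabulary]


vocabulary = ["alexander", "king", "macedonians", "darius", "philip",
              "horse", "soldiers", "battle", "army", "death"]

doc1 = to_binary_vector(["alexander", "king", "philip", "horse"], vocabulary)
doc2 = to_binary_vector(
    ["king", "darius", "soldiers", "battle", "army", "death"], vocabulary)
```

The order of `vocabulary` fixes which dimension each noun phrase occupies, so it must be the same for every document.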

Page 18: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Document Collection Examples

Collection   Number of Documents   Number of Noun Phrases   Execution Time
Biosis       1,194                 2,032                    17 days
Ancien-l     6,703                 34,486                   66 days
Compendex    162,338               22,324                   ~3.4 years
Cancerlit    624,674               16,882                   ~12.1 years

Page 19: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Problems

• As document sets get larger, the feature vectors get longer and use more memory

• Execution time grows to unrealistic lengths

Page 20: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Solutions?

• Need algorithm refinements for sparse feature vectors

• Need a faster way to do the find_winning_node() computation

• Need a better way to do the update_local_neighborhood() computation

Page 21: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Sparse Vector Optimization

• Intelligent support for sparse feature vectors
  – saves on memory usage
  – greatly improves speed of the weight vector update computation
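
The slides do not say how the sparse support works. One possible approach, sketched here as an assumption, stores a binary feature vector as the list of its nonzero indices and keeps each weight vector as `scale * values` with a cached squared norm. Both the distance and the update `m = m + eta * (x - m)` then touch only the active dimensions. The class name `SparseAwareNode` is hypothetical.

```python
class SparseAwareNode:
    """Weight vector stored as scale * values, with a cached squared norm."""

    def __init__(self, weights):
        self.values = list(weights)
        self.scale = 1.0
        self.norm_sq = sum(w * w for w in weights)

    def weight(self, i):
        return self.scale * self.values[i]

    def squared_distance(self, active):
        """||x - m||^2 for a binary x whose 1-entries are the unique indices in `active`."""
        dot = sum(self.weight(i) for i in active)
        return self.norm_sq - 2.0 * dot + len(active)

    def update(self, active, eta):
        """Apply m = m + eta * (x - m), touching only the active dimensions.

        Requires 0 <= eta < 1 so that the scale factor never reaches zero.
        """
        dot = sum(self.weight(i) for i in active)
        keep = 1.0 - eta
        self.norm_sq = (keep * keep * self.norm_sq
                        + 2.0 * eta * keep * dot
                        + eta * eta * len(active))
        self.scale *= keep  # shrinks every dimension at once
        for i in active:
            self.values[i] += eta / self.scale


node = SparseAwareNode([0.5, 0.2, 0.8, 0.1])
d = node.squared_distance([0, 2])   # x = [1, 0, 1, 0]
node.update([0, 2], eta=0.5)
```

In a long run the scale factor keeps shrinking, so a real implementation would occasionally fold it back into `values` to avoid floating-point underflow.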

Page 22: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Faster find_winning_node()

• SOM weight vectors become partially ordered very quickly

Page 23: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Faster find_winning_node()

U-Matrix Visualization of an Initial, Unordered SOM

Page 24: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Faster find_winning_node()

Partially Ordered SOM after 5 timesteps

Page 25: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Faster find_winning_node()

• Don’t do a global search for the winner

• Start search from last known winner position

• Pro:
  – usually finds a new winner very quickly

• Con:
  – this new search for a winner can sometimes get stuck in a local minimum
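
A sketch of such a local search, assuming a greedy walk over the four grid neighbors with squared Euclidean distance. The walk strategy and the name `local_find_winning_node` are assumptions; the slides only say to start from the last known winner position.

```python
def local_find_winning_node(weights, x, start, rows, cols):
    """Hill-climb over the grid from `start` until no neighbor is closer to x."""
    def dist(pos):
        return sum((m - xi) ** 2 for m, xi in zip(weights[pos], x))

    current, current_dist = start, dist(start)
    while True:
        r, c = current
        neighbors = [(r + dr, c + dc)
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
        best = min(neighbors, key=dist, default=current)
        if dist(best) >= current_dist:
            return current  # a local (not necessarily global) minimum
        current, current_dist = best, dist(best)


# On an ordered map the walk reaches the true winner.
ordered = {(0, c): [float(c)] for c in range(5)}
winner = local_find_winning_node(ordered, [3.2], start=(0, 0), rows=1, cols=5)

# On an unordered map it can stop early: node (0, 3) is the global best.
unordered = {(0, 0): [2.0], (0, 1): [1.0], (0, 2): [5.0], (0, 3): [0.0]}
stuck = local_find_winning_node(unordered, [0.0], start=(0, 0), rows=1, cols=4)
```

Each step strictly reduces the distance, so the walk always terminates. The second call shows the "Con" above: it stops at (0, 1) because the poorly matching node at (0, 2) blocks the path to the global winner.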

Page 26: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Better Neighborhood Update

• Nodes get told to “update” quite often

• Weight vector is made public only during a find_winning_node() search

• With local find_winning_node() search, a lazy neighborhood weight vector update can be performed

Page 27: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Better Neighborhood Update

• Cache update requests
  – each node will store the winning node and feature vector for each update request

• The node performs the update computations called for by the stored update requests only when asked for its weight vector

• Possible reduction of number of requests by averaging the feature vectors in the cache
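
A minimal sketch of the lazy update idea. As a simplifying assumption, each cached request stores the feature vector together with the learning rate to apply, rather than the winning node itself as described on the slide. The class name `LazyNode` is hypothetical.

```python
class LazyNode:
    """Node that queues update requests and applies them only when read."""

    def __init__(self, weights):
        self._weights = list(weights)
        self._pending = []  # cached (feature_vector, eta) requests

    def request_update(self, x, eta):
        # Only bookkeeping happens here; no weight arithmetic yet.
        self._pending.append((list(x), eta))

    def weight_vector(self):
        """Apply queued updates in arrival order, then return the weights."""
        for x, eta in self._pending:
            for i, xi in enumerate(x):
                self._weights[i] += eta * (xi - self._weights[i])
        self._pending.clear()
        return self._weights


node = LazyNode([0.0, 1.0])
node.request_update([1.0, 1.0], eta=0.5)
node.request_update([1.0, 0.0], eta=0.5)
pending_before_read = len(node._pending)
current = node.weight_vector()
```

A node that the local winner search never visits does no update arithmetic at all, although its cache keeps growing; averaging the cached feature vectors, as suggested above, would bound that growth at the cost of an approximate result.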

Page 28: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

New Execution Times

Collection   Execution Time   Speedup
Biosis       2.3 hours        180x
Ancien-l     10.2 hours       160x
Compendex    ~8.4 days        150x
Cancerlit    ~1 month         150x
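
The speedup figures can be checked against the original execution times from the Document Collection Examples slide. The short calculation below assumes 365-day years and reads "~1 month" as 30 days; both are assumptions, so the results are only approximate.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365   # assumption
DAYS_PER_MONTH = 30   # assumption for "~1 month"

# (original time, optimized time), both in hours
times = {
    "Biosis": (17 * HOURS_PER_DAY, 2.3),
    "Ancien-l": (66 * HOURS_PER_DAY, 10.2),
    "Compendex": (3.4 * DAYS_PER_YEAR * HOURS_PER_DAY, 8.4 * HOURS_PER_DAY),
    "Cancerlit": (12.1 * DAYS_PER_YEAR * HOURS_PER_DAY,
                  DAYS_PER_MONTH * HOURS_PER_DAY),
}
speedups = {name: before / after for name, (before, after) in times.items()}
```

This gives roughly 177x, 155x, 148x, and 147x, consistent with the rounded values in the table.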

Page 29: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Future Work

• Parallelization

• Label Problem

Page 30: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Label Problem

• Current Procedure not very good

• Cluster boundaries

• Term selection

Page 31: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Cluster Boundaries

• Image processing

• Geometric

Page 32: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Cluster Boundaries

• Image processing example:

Page 33: Visualization and Navigation of Document Information Spaces Using a Self-Organizing Map

Term Selection

• Too many unique noun phrases
  – Too many dimensions in the feature vector data

• “Knee” of frequency curve
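
The slides do not say how the knee would be located. One common heuristic, shown here purely as an assumption, sorts the noun phrase frequencies in descending order and picks the point farthest from the straight line joining the first and last points of the curve. The name `knee_index` and the sample frequencies are made up for illustration.

```python
def knee_index(frequencies):
    """Index of the point farthest from the line joining the curve's two ends."""
    curve = sorted(frequencies, reverse=True)
    n = len(curve)
    if n < 3:
        return 0
    x0, y0, x1, y1 = 0, curve[0], n - 1, curve[-1]

    def offset(i):
        # Unnormalized perpendicular distance from (i, curve[i]) to the chord.
        return abs((y1 - y0) * i - (x1 - x0) * curve[i] + x1 * y0 - y1 * x0)

    return max(range(n), key=offset)


# Hypothetical noun phrase frequencies, for illustration only.
freqs = [100, 60, 30, 10, 8, 6, 5, 4, 3, 2]
knee = knee_index(freqs)
```

The returned rank marks where the curve flattens; which side of the knee to keep as feature dimensions is a separate design decision that the slides leave open.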