FODAVA-Lead Research
Dimension Reduction and Data Reduction: Foundations for Interactive Visualization

Haesun Park
Division of Computational Science and Engineering
Georgia Institute of Technology
FODAVA Review Meeting, Dec. 2009
Challenges in Analyzing High Dimensional Massive Data on a Visual Analytics System

• Screen space and visual perception: the low display dimension and the number of available pixels are fundamentally limiting constraints
• High dimensional data: effective dimension reduction
• Large data sets: informative representation of data
• Speed: necessary for real-time, interactive use
  • Scalable algorithms
  • Adaptive algorithms
FODAVA-Lead Research Topics

Development of fundamental theory and algorithms in data representations and transformations to enable visual understanding:

• Dimension Reduction
  • Dimension reduction with prior information / interpretability constraints
  • Manifold learning
• Informative Representation of Large Scale Data
  • Sparse recovery by L1 penalty
  • Clustering, semi-supervised clustering
  • Multi-resolution data approximation
• Fast Algorithms
  • Large-scale optimization / matrix decompositions
  • Adaptive updating algorithms for dynamic, time-varying data and interactive visualization
• Data Fusion
  • Fusion of different types of data from various sources
  • Fusion of data with different uncertainty levels
• Integration with DAVA systems
FODAVA-Lead Presentations
• H. Park – Overview of proposed FODAVA research, Introduction to FODAVA Test-bed, dimension reduction of clustered data for effective representation, application to text, image, and audio data sets
• A. Gray – Nonlinear dimension reduction (manifold learning), fast data analysis algorithms, formulation of problems as large scale optimization problems (SDP)
• V. Koltchinskii – Multiple kernel learning method for fusion of data with heterogeneous types, sparse representation
• R. Monteiro – Convex optimization, SDP, novel approach for dimension reduction, compressed sensing and sparse representation
• J. Stasko – Visual Analytics System demo, interplay between math/comp and interactive visualization
Test Bed for Visual Analytics of High Dimensional Massive Data
• Open source software
• Integrates results from mathematics, statistics, and numerical algorithms/optimization across FODAVA teams
• Easily accessible to a wide community of researchers
• Makes theory/algorithms relevant and readily available to the VA and applications community
• Identifies effective methods for specific problems (evaluation)
[Diagram: FODAVA Fundamental Research feeding the Test Bed, which connects to Applications]
Data Representation & Transformation Tasks
• Classification
• Clustering
• Regression
• Dimension reduction
• Density estimation
• Retrieval of similar items
• Automatic summarization
• …
(Mathematical, Statistical, and Computational Methods)
Modules in Data and Visual Analytics System for High Dimensional Massive Data
[Diagram: pipeline of modules from raw data to analytical reasoning]

• Raw Data: text, image, audio, …
• Vector Representation of Raw Data (data in input space)
• Informative Representation and Transformation: clustering, summarization, regression, multi-resolution data reduction, …; exploits side information such as labels, similarity, density, and missing values
• Visual Representation and Interaction: dimension reduction (2D/3D), temporal trends, uncertainty, anomaly/outlier detection, causal relationships, zoom in/out by dynamic updating, …
• Interactive Analysis and Analytical Reasoning
Modules in FODAVA Test Bed
Research in Data Representations and Transformations (by H. Park’s group)
• 2D/3D Representation of Data with Prior Information (J. Choo, J. Kim, K. Balasubramanian)
  • Clustered data: two-stage dimension reduction
  • Nonnegative data:
    • Nonnegative Matrix Factorization (NMF)
    • Nonnegative Tensor Factorization (NTF)
• Clustering and Classification (J. Kim, D. Kuang)
  • New clustering algorithms based on NMF
  • Semi-supervised clustering based on NMF
• Sparse Representation of Data (J. Kim, V. Koltchinskii, R. Monteiro)
  • Sparse solution for regression
  • Sparse PCA
• FODAVA Testbed Development (J. Choo, J. Kihm, H. Lee)
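The sparse-representation line of work above rests on L1 penalties. As a minimal illustration of the idea (not the group's actual solver), the lasso problem min_x 0.5||Ax − b||² + λ||x||₁ can be solved by proximal gradient descent (ISTA); the function name `lasso_ista` and the toy problem sizes are assumptions made for this sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: the proximal map of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=1000):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by ISTA (proximal gradient)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)            # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy example: only 2 of 10 coefficients are truly nonzero;
# the L1 penalty recovers the sparse solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
b = A @ x_true
x_hat = lasso_ista(A, b, lam=0.1)
```

The soft-thresholding step is what drives small coefficients exactly to zero, which is the mechanism behind both sparse regression and sparse PCA variants.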
Nonnegativity Preserving Dimension Reduction: Nonnegative Matrix Factorization (NMF)
(Paatero & Tapper 94, Lee & Seung NATURE 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim & Park 06 Bioinformatics, Kim & Park 08 SIAM Journal on Matrix Analysis and Applications, …)

Given A >= 0, find W >= 0 and H >= 0 such that A ≈ WH:

    min_{W>=0, H>=0} || A − WH ||_F

• Why nonnegativity constraints?
  • Trades slightly worse approximation for better representation/interpretation
  • Nonnegativity constraints are often physically meaningful
  • Makes interpretation of analysis results possible
• Fastest algorithm for NMF, with theoretical convergence (J. Kim and H. Park, ICDM 08)
• NMF/ANLS: iterate the following with an active-set-type method:
  • fixing W, solve min_{H>=0} || WH − A ||_F
  • fixing H, solve min_{W>=0} || H^T W^T − A^T ||_F
• Sparse NMF can be used as a clustering algorithm
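The ANLS iteration can be sketched directly. The version below solves each nonnegative least squares subproblem column-by-column with SciPy's NNLS solver; it illustrates the alternating scheme, not the faster active-set block method of Kim & Park cited above, and the function name `nmf_anls` and toy sizes are assumptions:

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, n_iter=30, seed=0):
    """NMF by alternating nonnegative least squares (ANLS).

    Each subproblem min_{X>=0} ||B X - C||_F separates over the columns
    of X, so it is solved one column at a time with scipy's NNLS.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Fixing W, solve min_{H>=0} ||W H - A||_F (one NNLS per column of A).
        H = np.column_stack([nnls(W, A[:, j])[0] for j in range(n)])
        # Fixing H, solve min_{W>=0} ||H^T W^T - A^T||_F (one NNLS per row of A).
        W = np.vstack([nnls(H.T, A[i, :])[0] for i in range(m)])
    return W, H

# Toy check: factor an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
A = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf_anls(A, k=3)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Because both factors stay nonnegative throughout, the columns of W remain interpretable as additive parts, which is the property the slide emphasizes.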
2D Representation: Utilize Cluster Structure if Known

[Figure: 2D representation of a 700×1000 data set with 7 clusters, comparing LDA+PCA(2), SVD(2), and PCA(2)]
Optimal Dimension Reducing Transformation

High quality clusters have small trace(Sw) and large trace(Sb).
Want F such that trace(F^T Sw F) is minimized and trace(F^T Sb F) is maximized:

• LDA (Fisher 36, Rao 48), LDA/GSVD (Park et al.): max trace((F^T Sw F)^{-1} (F^T Sb F))
• Orthogonal Centroid (Park et al. 03): max trace(F^T Sb F) with F^T F = I
• PCA (Hotelling 33): max trace(F^T (Sw + Sb) F) with F^T F = I
• LSI (Deerwester et al. 90): max trace(F^T A A^T F) with F^T F = I

These criteria can easily be non-linearized using kernel functions.
The optimal reduced dimension is much larger than 3 in general.
Two-stage Dimension Reduction for 2D Visualization of Clustered Data
• LDA + LDA = Rank-2 LDA
• LDA + PCA
• OCM + PCA
• OCM + Rank-2 PCA on Sb^F = Rank-2 PCA on Sb (IN-SPIRE)
(J. Choo, S. Bohn, H. Park, VAST 09)

[Diagram: the m×n data matrix is reduced by F^T (l×m) to an l×n matrix, then by H^T (2×l) to a 2×n matrix; the scatter matrices Sw, Sb become Sw^F, Sb^F after the first stage and Sw^H, Sb^H after the second. First stage: preserve cluster structure. Second stage: minimize information loss.]
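As a concrete sketch of the two-stage idea, here is one of the OCM + PCA variants listed above (function name and toy data are my assumptions; this is not the exact VAST '09 implementation):

```python
import numpy as np

def ocm_plus_rank2_pca(X, labels):
    """Two-stage 2-D reduction: Orthogonal Centroid first (stage 1,
    preserves cluster structure), then rank-2 PCA on the reduced data
    (stage 2, minimizes information loss)."""
    classes = np.unique(labels)
    # Stage 1 (OCM): orthonormal basis of the centroid space via thin QR,
    # so the reducing transformation F satisfies F^T F = I.
    C = np.column_stack([X[labels == c].mean(axis=0) for c in classes])
    F, _ = np.linalg.qr(C)          # m x k
    Z = X @ F                       # n x k data in centroid space
    # Stage 2: rank-2 PCA of the reduced data.
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:2].T            # n x 2 coordinates for plotting

# Toy clustered data: 4 clusters in 50 dimensions.
rng = np.random.default_rng(0)
centers = 5.0 * rng.standard_normal((4, 50))
X = np.vstack([c + rng.standard_normal((30, 50)) for c in centers])
labels = np.repeat(np.arange(4), 30)
Y = ocm_plus_rank2_pca(X, labels)
```

The first stage works entirely in the k-dimensional centroid space, which is why the expensive part of the computation scales with the number of clusters rather than the original dimension m.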
2D Visualization: Newsgroups
2D visualization of Newsgroups data (21347 dimensions, 770 items, 11 clusters)

[Figure: four panels, Rank-2 LDA, LDA + PCA, OCM + PCA, and Rank-2 PCA on Sb. Cluster legend: g: talk.politics.guns, p: talk.politics.misc, c: soc.religion.christian, r: talk.religion.misc, p: comp.sys.ibm.pc.hardware, a: comp.sys.mac.hardware, y: sci.crypt, d: sci.med, e: sci.electronics, f: misc.forsale, b: rec.sport.baseball]
2D Visualization of Clustered Text, Image, Audio Data
[Figure: 2D visualizations of three data sets. Medline data (text): LDA+PCA vs. PCA, with cluster legend h: heart attack, c: colon cancer, o: oral cancer, d: diabetes, t: tooth decay. Facial data (image) and spoken letters (audio): Rank-2 LDA vs. PCA]

Weizmann Face Data
• (352 × 512 pixels each) × (28 persons × 52 images each)
• Significant variations in angle, illumination, and facial expression
Visual Facial Recognizer: A Test Bed Application
Problem: no data analytic algorithm alone is perfect. For example, recognition accuracy on this task:

    PCA          60%
    LDA          75%
    TensorFaces  69%
    h-LDA        81%

• Visually reduce the human's search space → efficiently utilize human visual recognition (e.g., test bed visualization of the Weizmann images using Rank-2 LDA)
Summary / Future Research

• Informative 2D/3D representation of data
  • Clustered data: two-stage dimension reduction methods effective for a wide range of problems
  • Interpretable dimension reduction for nonnegative data: NMF
• New clustering algorithms based on NMF
• Semi-supervised clustering based on NMF
• Extension to tensors for time-series data
• Customized fast algorithms for 2D/3D reduction needed
• Dynamic updating methods for efficient and interactive visualization
• Sparse methods with L1 regularization
  • Sparse solution for regression
  • Sparse PCA
• FODAVA test bed development