FODAVA-Lead Research
Dimension Reduction and Data Reduction: Foundations for Interactive Visualization

Haesun Park
Division of Computational Science and Engineering
Georgia Institute of Technology
FODAVA Review Meeting, Dec. 2009
Challenges in Analyzing High Dimensional Massive Data on a Visual Analytics System

• Screen space and visual perception: the low display dimension and the number of available pixels are fundamentally limiting constraints
• High dimensional data: effective dimension reduction
• Large data sets: informative representation of data
• Speed: necessary for real-time, interactive use
  • Scalable algorithms
  • Adaptive algorithms
FODAVA-Lead Research Topics

Development of fundamental theory and algorithms in data representations and transformations to enable visual understanding:

• Dimension Reduction
  • Dimension reduction with prior information / interpretability constraints
  • Manifold learning
• Informative Representation of Large Scale Data
  • Sparse recovery by L1 penalty
  • Clustering, semi-supervised clustering
  • Multi-resolution data approximation
• Fast Algorithms
  • Large-scale optimization / matrix decompositions
  • Adaptive updating algorithms for dynamic, time-varying data and interactive visualization
• Data Fusion
  • Fusion of different types of data from various sources
  • Fusion of data with different uncertainty levels
• Integration with DAVA systems
FODAVA-Lead Presentations
• H. Park – Overview of proposed FODAVA research, Introduction to FODAVA Test-bed, dimension reduction of clustered data for effective representation, application to text, image, and audio data sets
• A. Gray – Nonlinear dimension reduction (manifold learning), fast data analysis algorithms, formulation of problems as large scale optimization problems (SDP)
• V. Koltchinskii – Multiple kernel learning method for fusion of data with heterogeneous types, sparse representation
• R. Monteiro – Convex optimization, SDP, novel approach for dimension reduction, compressed sensing and sparse representation
• J. Stasko – Visual Analytics System demo, interplay between math/comp and interactive visualization
Test Bed for Visual Analytics of High Dimensional Massive Data
• Open source software
• Integrates results from mathematics, statistics, and numerical algorithms/optimization across FODAVA teams
• Easily accessible to a wide community of researchers
• Makes theory/algorithms relevant and readily available to the VA and applications community
• Identifies effective methods for specific problems (evaluation)
[Diagram: FODAVA Fundamental Research feeding the Test Bed, which connects to Applications]
Data Representation & Transformation Tasks
• Classification
• Clustering
• Regression
• Dimension reduction
• Density estimation
• Retrieval of similar items
• Automatic summarization
• …
(Mathematical, Statistical, and Computational Methods)
Modules in Data and Visual Analytics System for High Dimensional Massive Data
[Diagram: pipeline of modules from raw data to analytical reasoning]

• Raw Data: text, image, audio, …
• Vector Representation of Raw Data (data in input space)
• Informative Representation and Transformation: clustering, summarization, regression, multi-resolution data reduction, …; exploits side information such as labels, similarity, density, and missing values
• Visual Representation and Interaction: dimension reduction (2D/3D), temporal trends, uncertainty, anomaly/outlier detection, causal relationships, zoom in/out by dynamic updating, …
• Interactive Analysis and Analytical Reasoning
Modules in FODAVA Test Bed
Research in Data Representations and Transformations (by H. Park’s group)
• 2D/3D Representation of Data with Prior Information (J. Choo, J. Kim, K. Balasubramanian)
  • Clustered data: two-stage dimension reduction
  • Nonnegative data:
    • Nonnegative Matrix Factorization (NMF)
    • Nonnegative Tensor Factorization (NTF)
• Clustering and Classification (J. Kim, D. Kuang)
  • New clustering algorithms based on NMF
  • Semi-supervised clustering based on NMF
• Sparse Representation of Data (J. Kim, V. Koltchinskii, R. Monteiro)
  • Sparse solution for regression
  • Sparse PCA
• FODAVA Testbed Development (J. Choo, J. Kihm, H. Lee)
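The sparse-representation line of work above rests on L1 penalties. As a minimal illustration of the idea (not the group's actual solver), the lasso problem min_x 0.5||Ax − b||² + λ||x||₁ can be solved by proximal gradient descent (ISTA); the function name `lasso_ista` and the toy problem sizes are assumptions made for this sketch:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-thresholding operator: the proximal map of the L1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(A, b, lam, n_iter=1000):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by ISTA (proximal gradient)."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)            # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)
    return x

# Toy example: only 2 of 10 coefficients are truly nonzero;
# the L1 penalty recovers the sparse solution.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))
x_true = np.zeros(10)
x_true[[2, 7]] = [1.5, -2.0]
b = A @ x_true
x_hat = lasso_ista(A, b, lam=0.1)
```

The soft-thresholding step is what drives small coefficients exactly to zero, which is the mechanism behind both sparse regression and sparse PCA variants.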
Nonnegativity Preserving Dimension Reduction: Nonnegative Matrix Factorization (NMF)
(Paatero & Tapper 94, Lee & Seung NATURE 99, Pauca et al. SIAM DM 04, Hoyer 04, Lin 05, Berry 06, Kim & Park 06 Bioinformatics, Kim & Park 08 SIAM Journal on Matrix Analysis and Applications, …)

Given A >= 0, find W >= 0 and H >= 0 such that A ≈ WH:

    min_{W>=0, H>=0} || A − WH ||_F

• Why nonnegativity constraints?
  • Trades slightly worse approximation for better representation/interpretation
  • Nonnegativity constraints are often physically meaningful
  • Makes interpretation of analysis results possible
• Fastest algorithm for NMF, with theoretical convergence (J. Kim and H. Park, ICDM 08)
• NMF/ANLS: iterate the following with an active-set-type method:
  • fixing W, solve min_{H>=0} || WH − A ||_F
  • fixing H, solve min_{W>=0} || H^T W^T − A^T ||_F
• Sparse NMF can be used as a clustering algorithm
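The ANLS iteration can be sketched directly. The version below solves each nonnegative least squares subproblem column-by-column with SciPy's NNLS solver; it illustrates the alternating scheme, not the faster active-set block method of Kim & Park cited above, and the function name `nmf_anls` and toy sizes are assumptions:

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, n_iter=30, seed=0):
    """NMF by alternating nonnegative least squares (ANLS).

    Each subproblem min_{X>=0} ||B X - C||_F separates over the columns
    of X, so it is solved one column at a time with scipy's NNLS.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, k)) + 0.1
    H = rng.random((k, n))
    for _ in range(n_iter):
        # Fixing W, solve min_{H>=0} ||W H - A||_F (one NNLS per column of A).
        H = np.column_stack([nnls(W, A[:, j])[0] for j in range(n)])
        # Fixing H, solve min_{W>=0} ||H^T W^T - A^T||_F (one NNLS per row of A).
        W = np.vstack([nnls(H.T, A[i, :])[0] for i in range(m)])
    return W, H

# Toy check: factor an exactly rank-3 nonnegative matrix.
rng = np.random.default_rng(1)
A = rng.random((20, 3)) @ rng.random((3, 15))
W, H = nmf_anls(A, k=3)
rel_err = np.linalg.norm(A - W @ H) / np.linalg.norm(A)
```

Because both factors stay nonnegative throughout, the columns of W remain interpretable as additive parts, which is the property the slide emphasizes.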
2D Representation: Utilize Cluster Structure if Known

[Figure: 2D representation of a 700×1000 data set with 7 clusters, comparing LDA+PCA(2), SVD(2), and PCA(2)]
Optimal Dimension Reducing Transformation

High quality clusters have small trace(Sw) and large trace(Sb).
Want F such that trace(F^T Sw F) is minimized and trace(F^T Sb F) is maximized:

• LDA (Fisher 36, Rao 48), LDA/GSVD (Park et al.): max trace((F^T Sw F)^{-1} (F^T Sb F))
• Orthogonal Centroid (Park et al. 03): max trace(F^T Sb F) with F^T F = I
• PCA (Hotelling 33): max trace(F^T (Sw + Sb) F) with F^T F = I
• LSI (Deerwester et al. 90): max trace(F^T A A^T F) with F^T F = I

These criteria can easily be non-linearized using kernel functions.
The optimal reduced dimension is much larger than 3 in general.
Two-stage Dimension Reduction for 2D Visualization of Clustered Data
• LDA + LDA = Rank-2 LDA
• LDA + PCA
• OCM + PCA
• OCM + Rank-2 PCA on Sb^F = Rank-2 PCA on Sb (IN-SPIRE)
(J. Choo, S. Bohn, H. Park, VAST 09)

[Diagram: the m×n data matrix is reduced by F^T (l×m) to an l×n matrix, then by H^T (2×l) to a 2×n matrix; the scatter matrices Sw, Sb become Sw^F, Sb^F after the first stage and Sw^H, Sb^H after the second. First stage: preserve cluster structure. Second stage: minimize information loss.]
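As a concrete sketch of the two-stage idea, here is one of the OCM + PCA variants listed above (function name and toy data are my assumptions; this is not the exact VAST '09 implementation):

```python
import numpy as np

def ocm_plus_rank2_pca(X, labels):
    """Two-stage 2-D reduction: Orthogonal Centroid first (stage 1,
    preserves cluster structure), then rank-2 PCA on the reduced data
    (stage 2, minimizes information loss)."""
    classes = np.unique(labels)
    # Stage 1 (OCM): orthonormal basis of the centroid space via thin QR,
    # so the reducing transformation F satisfies F^T F = I.
    C = np.column_stack([X[labels == c].mean(axis=0) for c in classes])
    F, _ = np.linalg.qr(C)          # m x k
    Z = X @ F                       # n x k data in centroid space
    # Stage 2: rank-2 PCA of the reduced data.
    Zc = Z - Z.mean(axis=0)
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return Zc @ Vt[:2].T            # n x 2 coordinates for plotting

# Toy clustered data: 4 clusters in 50 dimensions.
rng = np.random.default_rng(0)
centers = 5.0 * rng.standard_normal((4, 50))
X = np.vstack([c + rng.standard_normal((30, 50)) for c in centers])
labels = np.repeat(np.arange(4), 30)
Y = ocm_plus_rank2_pca(X, labels)
```

The first stage works entirely in the k-dimensional centroid space, which is why the expensive part of the computation scales with the number of clusters rather than the original dimension m.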
2D Visualization: Newsgroups
2D visualization of Newsgroups data (21347 dimensions, 770 items, 11 clusters)

[Figure: four panels, Rank-2 LDA, LDA + PCA, OCM + PCA, and Rank-2 PCA on Sb. Cluster legend: g: talk.politics.guns, p: talk.politics.misc, c: soc.religion.christian, r: talk.religion.misc, p: comp.sys.ibm.pc.hardware, a: comp.sys.mac.hardware, y: sci.crypt, d: sci.med, e: sci.electronics, f: misc.forsale, b: rec.sport.baseball]
2D Visualization of Clustered Text, Image, Audio Data
[Figure: 2D visualizations of three data sets. Medline data (text): LDA+PCA vs. PCA, with cluster legend h: heart attack, c: colon cancer, o: oral cancer, d: diabetes, t: tooth decay. Facial data (image) and spoken letters (audio): Rank-2 LDA vs. PCA]

Weizmann Face Data
• (352 × 512 pixels each) × (28 persons × 52 images each)
• Significant variations in angle, illumination, and facial expression
Visual Facial Recognizer: A Test Bed Application
Problem: no data analytic algorithm alone is perfect. For example, recognition accuracy on this task:

    PCA          60%
    LDA          75%
    TensorFaces  69%
    h-LDA        81%

• Visually reduce the human's search space → efficiently utilize human visual recognition (e.g., test bed visualization of the Weizmann images using Rank-2 LDA)
Summary / Future Research

• Informative 2D/3D representation of data
  • Clustered data: two-stage dimension reduction methods effective for a wide range of problems
  • Interpretable dimension reduction for nonnegative data: NMF
• New clustering algorithms based on NMF
• Semi-supervised clustering based on NMF
• Extension to tensors for time-series data
• Customized fast algorithms for 2D/3D reduction needed
• Dynamic updating methods for efficient and interactive visualization
• Sparse methods with L1 regularization
  • Sparse solution for regression
  • Sparse PCA
• FODAVA test bed development