Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
Mehmet Koyuturk1, Ananth Grama1, and Naren Ramakrishnan2
1. Dept. of Computer Sciences, Purdue University
{koyuturk, ayg} @cs.purdue.edu
2. Dept. of Computer Sciences, Virginia Tech
Motivation
- Handling large discrete-valued datasets
- Extracting relations between data items
- Summarizing data in an error-bounded fashion
- Clustering of data items
- Finding concise representations for clustered data
Background
- Singular Value Decomposition (SVD) [Berry et al., 1995]
- Decompose matrix into A = UΣV^T
  - U and V orthogonal matrices, Σ diagonal with singular values
- Used for Latent Semantic Indexing (LSI) in Information Retrieval
- Truncate decomposition to compress data
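The truncation step can be sketched in a few lines of NumPy (an illustrative example, not from the talk; the matrix here is made up):

```python
import numpy as np

# A small binary term-document-style matrix (made-up data).
A = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]], dtype=float)

# Full SVD: A = U @ diag(s) @ Vt, with s holding the singular values
# in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to rank k, keeping only the k largest singular values.
# A_k is the best rank-k approximation of A in the Frobenius norm.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```

Note that A_k is generally dense and real-valued even though A is binary, which is the motivation for discrete decompositions such as SDD.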
Background
- Semi-Discrete Decomposition (SDD) [Kolda and O'Leary, 1998]
- Restrict entries of U and V to {-1, 0, 1}
- Requires a very small amount of storage
- Can perform as well as SVD in LSI using less than one-tenth the storage
- Effective in finding outlier clusters
  - Works well for datasets containing a large number of small clusters
Rank-1 Approximation

A ≈ xy^T, where x is the presence vector and y is the pattern vector.

[Figure: two examples. A binary matrix with rows 011, 011, 011 is represented exactly by pattern vector y = (0 1 1) and presence vector x = (1 1 1)^T. A matrix A with rows 10101, 11000, 10100, 10110 is approximated by xy^T with pattern vector y = (1 0 1 0 0) and presence vector x = (1 0 1 1)^T, giving the rank-1 matrix with rows 10100, 00000, 10100, 10100.]
Rank-1 Approximation
Problem: Given a discrete matrix A (m x n), find discrete vectors x (m x 1) and y (n x 1) to minimize

||A - xy^T||_F^2 = |{(i,j) : (A - xy^T)_ij ≠ 0}| = number of non-zeros in the error matrix

Heuristic: Fix y, set s = Ay / ||y||_2^2, and solve for x to maximize (x^T s)^2 / ||x||_2^2. Iteratively solve for x and y until no improvement is possible.
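A minimal sketch of this alternating heuristic for a binary matrix (the function names, the initialization choice, and the prefix search are my own; for binary x and non-negative s, the maximizer of (x^T s)^2 / ||x||_2^2 sets ones exactly on some prefix of s sorted in decreasing order, so scanning prefixes suffices):

```python
import numpy as np

def best_binary_vector(s):
    """Maximize (x^T s)^2 / ||x||_2^2 over binary x != 0.
    With s >= 0, the optimum is 1 on the k largest entries of s
    for some k, so it suffices to scan prefixes of the sorted s."""
    order = np.argsort(-s)
    best_val, best_k, dot = -1.0, 1, 0.0
    for k, i in enumerate(order, start=1):
        dot += s[i]
        val = dot * dot / k          # (x^T s)^2 / ||x||^2 with k ones
        if val > best_val:
            best_val, best_k = val, k
    x = np.zeros(len(s), dtype=int)
    x[order[:best_k]] = 1
    return x

def rank_one_approx(A, max_iter=50):
    """Alternating heuristic for A ~ x y^T with binary x, y (a sketch)."""
    y = np.zeros(A.shape[1], dtype=int)
    y[np.argmax(A.sum(axis=0))] = 1      # 'Maximum' initialization scheme
    for _ in range(max_iter):
        x = best_binary_vector(A @ y)        # fix y, solve for x
        y_new = best_binary_vector(A.T @ x)  # fix x, solve for y
        if np.array_equal(y_new, y):
            break                            # no improvement possible
        y = y_new
    return x, y
```

On the 4x5 example matrix from the previous slide, this recovers the dominant pattern (1 0 1 0 0).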
Initialization of pattern vector
- Crucial to escape from local optima
- Must require at most O(nz(A)) time, so as not to dominate the overall run-time
Some possible schemes:
- AllOnes: Set all entries to 1. Poor.
- Threshold: Set only the entries whose corresponding columns have more non-zeros than a threshold. Can lead to bad local optima.
- Maximum: Set only the entry that corresponds to the column with the maximum number of non-zeros. Risky, since that column may be shared by many patterns.
- Partition: Partition the rows of the matrix based on a column, then apply the threshold scheme taking into account only one of the parts. Best among these.
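The first three schemes can be sketched as follows (a hypothetical helper, not from the talk; the Partition scheme is omitted for brevity):

```python
import numpy as np

def init_pattern(A, scheme="threshold", threshold=2):
    """Initialize the pattern vector y for rank-1 approximation.
    Each scheme needs only the per-column non-zero counts, which
    can be computed in O(nz(A)) time (illustrative sketch)."""
    counts = A.sum(axis=0)                 # non-zeros per column
    if scheme == "allones":                # poor: ignores structure
        return np.ones(A.shape[1], dtype=int)
    if scheme == "maximum":                # only the densest column
        y = np.zeros(A.shape[1], dtype=int)
        y[np.argmax(counts)] = 1
        return y
    if scheme == "threshold":              # all sufficiently dense columns
        return (counts > threshold).astype(int)
    raise ValueError(f"unknown scheme: {scheme}")
```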
Recursive Algorithm
- At any step, given a rank-one approximation A ≈ xy^T, split A into A1 and A0 based on rows:
  - if x(i) = 1, row i goes to A1
  - if x(i) = 0, row i goes to A0
- Stop when:
  - the hamming radius of A1 (the maximum of the hamming distances of A1's rows to the pattern vector) is less than some threshold
  - all rows of A are present in A1
- (If A1 does not satisfy the hamming radius condition, it can be split further based on hamming distances.)
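The recursion can be sketched end-to-end as follows (a self-contained illustration: rank_one repeats the earlier alternating heuristic in compact form, and the no-progress guard is my own addition to guarantee termination):

```python
import numpy as np

def _best_binary(s):
    # Maximize (x^T s)^2 / ||x||^2 over binary x: with s >= 0, the
    # optimum sets ones on a prefix of s sorted in decreasing order.
    order = np.argsort(-s)
    best_val, best_k, dot = -1.0, 1, 0.0
    for k, i in enumerate(order, start=1):
        dot += s[i]
        if dot * dot / k > best_val:
            best_val, best_k = dot * dot / k, k
    x = np.zeros(len(s), dtype=int)
    x[order[:best_k]] = 1
    return x

def rank_one(A, max_iter=50):
    """Alternating rank-1 heuristic for binary A (compact sketch)."""
    y = np.zeros(A.shape[1], dtype=int)
    y[np.argmax(A.sum(axis=0))] = 1        # 'Maximum' initialization
    for _ in range(max_iter):
        x = _best_binary(A @ y)
        y_new = _best_binary(A.T @ x)
        if np.array_equal(y_new, y):
            break
        y = y_new
    return x, y

def discover_patterns(A, radius=1):
    """Recursively split A by the presence vector; emit a pattern
    vector whenever some rows fall within the hamming radius of it."""
    if A.shape[0] == 0:
        return []
    x, y = rank_one(A)
    A1, A0 = A[x == 1], A[x == 0]
    dist = np.abs(A1 - y).sum(axis=1)      # hamming distances to y
    close, far = A1[dist <= radius], A1[dist > radius]
    if len(close) == 0 and len(A0) == 0:
        return [y]                         # no progress: stop here
    out = [y] if len(close) else []
    return out + discover_patterns(far, radius) + discover_patterns(A0, radius)
```

On the example matrix with rows 10101, 11000, 10100, 10110, this recovers the two underlying patterns 10100 and 11000.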
Effectiveness of Analysis
Run-time Scalability
- Rank-1 approximation requires O(nz(A)) time
- Total run-time at each level of the recursion tree cannot exceed this, since the total number of non-zeros at each level is at most nz(A)
- Run-time is linear in nz(A)

[Plots: run-time vs. number of columns, number of rows, and number of non-zeros]
Conclusions and Ongoing Work
Proposed algorithm is:
- Scalable to extremely high dimensions
- Effective in discovering dominant patterns
- Hierarchical in nature, allowing multi-resolution analysis
Currently working on:
- Real-world applications of the proposed method
- Effective initialization schemes
- Parallel implementation