Algebraic Techniques for Analysis of Large Discrete-Valued Datasets. Mehmet Koyuturk¹, Ananth Grama¹, and Naren Ramakrishnan². 1. Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu. 2. Dept. of Computer Sciences, Virginia Tech, [email protected]


Post on 06-Jan-2016



Page 1: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Mehmet Koyuturk¹, Ananth Grama¹, and Naren Ramakrishnan²

1. Dept. of Computer Sciences, Purdue University, {koyuturk, ayg}@cs.purdue.edu

2. Dept. of Computer Sciences, Virginia Tech, [email protected]

Page 2: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Motivation

- Handling large discrete-valued datasets
- Extracting relations between data items
- Summarizing data in an error-bounded fashion
- Clustering of data items
- Finding concise representations for clustered data

Page 3: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Background

Singular Value Decomposition (SVD) [Berry et al., 1995]

- Decompose matrix into $A = U \Sigma V^T$
- $U$ and $V$ are orthogonal matrices; $\Sigma$ is diagonal and holds the singular values
- Used for Latent Semantic Indexing (LSI) in Information Retrieval
- Truncate the decomposition to compress data
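As a rough illustration of truncation, the sketch below computes an SVD with NumPy and keeps only the leading `k` singular triplets. The matrix `A` and the choice `k = 1` are made-up inputs for illustration, not data from the talk.

```python
import numpy as np

# Small binary matrix used purely as an illustrative input (an assumption).
A = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]], dtype=float)

# Thin SVD: A = U @ diag(sigma) @ Vt, singular values in descending order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to the k largest singular values to compress the data.
k = 1
A_k = U[:, :k] @ np.diag(sigma[:k]) @ Vt[:k, :]

# Frobenius-norm error of the truncation; it equals the norm of the
# discarded singular values.
trunc_error = np.linalg.norm(A - A_k)
```

Note that `A_k` is generally real-valued even when `A` is binary, which is one reason discrete decompositions are of interest for discrete-valued data.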

Page 4: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Background

Semi-Discrete Decomposition (SDD) [Kolda and O’Leary, 1998]

- Restrict entries of $U$ and $V$ to {-1, 0, 1}
- Requires a very small amount of storage
- Can perform as well as SVD in LSI using less than one-tenth the storage
- Effective in finding outlier clusters
  - works well for datasets containing a large number of small clusters

Page 5: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Rank-1 Approximation

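$A \approx x y^T$, where $x$ is the presence vector and $y$ is the pattern vector.

Exact rank-1 example:

$$\begin{bmatrix} 0 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}$$

Approximate example:

$$A = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \end{bmatrix} \approx \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \end{bmatrix}$$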
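To make the quality of a rank-1 approximation concrete, the following sketch rebuilds $x y^T$ for the 4x5 example on this slide and counts the entries where it differs from $A$. The variable names (`approx`, `error`) are illustrative only.

```python
import numpy as np

# The 4x5 matrix A from the slide's approximate example.
A = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]])

x = np.array([1, 0, 1, 1])     # presence vector: which rows contain the pattern
y = np.array([1, 0, 1, 0, 0])  # pattern vector: which columns form the pattern

# Outer product x y^T: row i equals y if x[i] == 1, otherwise all zeros.
approx = np.outer(x, y)

# Number of entries where A and x y^T disagree (non-zeros of the error matrix).
error = int(np.count_nonzero(A != approx))
```

For this example the two matrices disagree in 4 of the 20 entries: one in row 1, two in row 2, and one in row 4.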

Page 6: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Rank-1 Approximation

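Problem: Given a discrete matrix $A_{m \times n}$, find discrete vectors $x_{m \times 1}$ and $y_{n \times 1}$ to minimize

$$\|A - x y^T\|_F^2 = \left|\{a_{ij} \in (A - x y^T) : |a_{ij}| = 1\}\right|$$

= number of non-zeros in the error matrix

Heuristic: Fix $y$, set $s = \dfrac{A y}{\|y\|_2^2}$, and solve for $x$ to maximize $\dfrac{(x^T s)^2}{\|x\|_2^2}$

Iteratively solve for $x$ and $y$ until no improvement is possible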
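A minimal sketch of this alternating heuristic follows, under several assumptions that are not stated on the slide: the subproblem is solved by sorting $s$ and keeping the best-scoring top-$k$ entries (a common approach for this kind of objective), $y$ is updated symmetrically using $A^T$, the initial pattern vector is supplied by the caller, and "no improvement" is read as "the error count did not decrease". The names `solve_binary` and `rank_one` are hypothetical.

```python
import numpy as np

def solve_binary(s):
    """Binary v maximizing (v^T s)^2 / ||v||_2^2 (assumed method: the optimum
    keeps the k largest entries of s for some k, so try every k)."""
    order = np.argsort(-s, kind="stable")           # indices, largest s first
    scores = np.cumsum(s[order]) ** 2 / np.arange(1, len(s) + 1)
    v = np.zeros(len(s), dtype=int)
    v[order[: int(np.argmax(scores)) + 1]] = 1      # always keeps >= 1 entry
    return v

def rank_one(A, y0, max_iter=100):
    """Alternate between x and y until the error count stops decreasing."""
    x = np.zeros(A.shape[0], dtype=int)
    y = np.asarray(y0, dtype=int).copy()
    err = int(np.count_nonzero(A))                  # error of the zero approximation
    for _ in range(max_iter):
        if not y.any():
            break
        # For binary vectors, v.sum() equals ||v||_2^2.
        x_new = solve_binary(A @ y / y.sum())
        y_new = solve_binary(A.T @ x_new / x_new.sum())
        new_err = int(np.count_nonzero(A - np.outer(x_new, y_new)))
        if new_err >= err:
            break
        x, y, err = x_new, y_new, new_err
    return x, y, err

A = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]])
x, y, err = rank_one(A, np.ones(A.shape[1], dtype=int))
```

The sorting step costs more than linear time, so this sketch is meant to show the iteration, not to match the running time claimed in the talk.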

Page 7: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Initialization of pattern vector

- Crucial to escape from local optima
- Must require at most O(nz(A)) time, so as not to dominate the run-time of the rank-1 approximation
- Some possible schemes
  - AllOnes: Set all entries to 1. Poor.
  - Threshold: Set only the entries whose corresponding columns have more non-zeros than a threshold. Can lead to bad local optima.
  - Maximum: Set only the entry that corresponds to the column with the maximum number of non-zeros. Risky: that column may be shared by many patterns.
  - Partition: Partition the rows of the matrix based on a column, then apply the Threshold scheme taking into account only one of the parts. Best among these.
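The four schemes can be sketched as follows. The function names, the threshold parameter `t`, the partitioning column `col`, and the choice to keep the rows that contain a 1 in that column are all assumptions made for illustration; the slide does not specify them.

```python
import numpy as np

def init_all_ones(A):
    """AllOnes: every entry of the pattern vector is 1."""
    return np.ones(A.shape[1], dtype=int)

def init_threshold(A, t):
    """Threshold: set entries whose column has more than t non-zeros."""
    return (A.sum(axis=0) > t).astype(int)

def init_maximum(A):
    """Maximum: set only the entry of the column with the most non-zeros."""
    y = np.zeros(A.shape[1], dtype=int)
    y[int(np.argmax(A.sum(axis=0)))] = 1
    return y

def init_partition(A, col, t):
    """Partition: keep the rows with a 1 in column `col` (assumed choice of
    part), then apply the Threshold scheme to that part only."""
    part = A[A[:, col] == 1]
    return (part.sum(axis=0) > t).astype(int)

A = np.array([[1, 0, 1, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 0, 1, 1, 0]])
y_thr = init_threshold(A, t=2)
y_max = init_maximum(A)
y_par = init_partition(A, col=2, t=2)
```

Each scheme only needs column counts over a binary matrix, which is consistent with the O(nz(A)) budget when the matrix is stored sparsely; the dense NumPy arrays here are used for brevity.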

Page 8: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Recursive Algorithm

- At any step, given the rank-one approximation $A \approx x y^T$, split A into A1 and A0 based on rows:
  - if x(i)=1, row i goes to A1
  - if x(i)=0, row i goes to A0
- Stop when
  - the Hamming radius of A1 (the maximum of the Hamming distances of the rows of A1 to the pattern vector) is less than some threshold
  - all rows of A are present in A1

(if A1 does not satisfy the Hamming radius condition, A1 can be split based on Hamming distances)
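The recursion can be sketched as below. This is an interpretation under stated assumptions, not the authors' implementation: both stopping conditions are required together, the Hamming-distance split separates rows within the threshold from the rest, a group that cannot be split this way is emitted as-is, and the rank-1 step uses AllOnes initialization with a fixed-point stopping rule for brevity. The names `top_k_binary`, `rank_one`, and `decompose` are hypothetical.

```python
import numpy as np

def top_k_binary(s):
    """Binary v maximizing (v^T s)^2 / ||v||_2^2 by trying every top-k set."""
    order = np.argsort(-s, kind="stable")
    scores = np.cumsum(s[order]) ** 2 / np.arange(1, len(s) + 1)
    v = np.zeros(len(s), dtype=int)
    v[order[: int(np.argmax(scores)) + 1]] = 1
    return v

def rank_one(A, max_iter=50):
    """Alternating heuristic from AllOnes; stops at a fixed point (assumption)."""
    x = np.ones(A.shape[0], dtype=int)
    y = np.ones(A.shape[1], dtype=int)
    for _ in range(max_iter):
        x_new = top_k_binary(A @ y / y.sum())
        y_new = top_k_binary(A.T @ x_new / x_new.sum())
        if np.array_equal(x_new, x) and np.array_equal(y_new, y):
            break
        x, y = x_new, y_new
    return x, y

def decompose(A, rows, threshold, out):
    """Recursively split A; append (pattern vector, original row ids) to out."""
    x, y = rank_one(A)
    in_a1 = x == 1
    if not in_a1.all():
        # Rows with x(i)=1 go to A1, rows with x(i)=0 go to A0.
        decompose(A[in_a1], rows[in_a1], threshold, out)
        decompose(A[~in_a1], rows[~in_a1], threshold, out)
        return
    # All rows are in A1: check the Hamming radius around the pattern vector.
    dist = (A != y).sum(axis=1)
    near = dist < threshold
    if near.all() or not near.any():
        out.append((y, rows))       # radius satisfied, or no usable split
        return
    decompose(A[near], rows[near], threshold, out)
    decompose(A[~near], rows[~near], threshold, out)

A = np.array([[1, 1, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 1, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 0, 1]])
patterns = []
decompose(A, np.arange(A.shape[0]), threshold=1, out=patterns)
```

On this toy matrix the recursion separates the first three rows from the last two and reports one pattern vector per group. Every recursive call receives strictly fewer rows than its parent, so the recursion terminates.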

Page 9: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Recursive Algorithm

Page 10: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Effectiveness of Analysis

Page 11: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Effectiveness of Analysis

Page 12: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Run-time Scalability

- Rank-1 approximation requires O(nz(A)) time
- Total run-time at each level in the recursive tree cannot exceed this, since the total # of nonzeros at each level is at most nz(A)
- Run-time is linear in nz(A)

[Plots: runtime vs. # columns, runtime vs. # rows, runtime vs. # nonzeros]
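One way to see where the O(nz(A)) bound can come from: with a sparse row-wise layout, each product $Ay$ visits every stored non-zero exactly once. The sketch below uses a hand-built compressed sparse row (CSR) layout for a binary matrix; the layout and the name `csr_matvec` are assumptions for illustration, since the slides do not describe the storage format.

```python
def csr_matvec(indptr, indices, y):
    """Compute s = A y for a binary matrix in CSR form.

    indptr[i]:indptr[i+1] delimits the column indices of row i's non-zeros,
    so the inner loop body runs exactly nz(A) times in total.
    """
    s = [0] * (len(indptr) - 1)
    for i in range(len(indptr) - 1):
        for k in range(indptr[i], indptr[i + 1]):
            s[i] += y[indices[k]]
    return s

# CSR form of the 4x5 example matrix with rows 10101, 11000, 10100, 10110.
indptr = [0, 3, 5, 7, 10]
indices = [0, 2, 4, 0, 1, 0, 2, 0, 2, 3]
y = [1, 0, 1, 0, 0]
s = csr_matvec(indptr, indices, y)
```

Here `s[i]` counts how many of row i's non-zeros fall on columns selected by the pattern vector.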

Page 13: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets

Conclusions and Ongoing Work

Proposed algorithm is

- Scalable to extremely high dimensions
- Effective in discovering dominant patterns
- Hierarchical in nature, allowing multi-resolution analysis

Currently working on

- Real-world applications of the proposed method
- Effective initialization schemes
- Parallel implementation