Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Problem Definition
[Figure: a customers × products matrix, reordered into customer groups × product groups]
Simultaneously group customers and products, or documents and words, or users and preferences …
Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large graphs
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
Related Work
K-means and variants: dimensionality curse; choosing the number of clusters
“Frequent itemsets”: user must specify “support”
Information Retrieval: choosing the number of “concepts”
Graph Partitioning: number of partitions; measure of imbalance between clusters
What makes a cross-association “good”?
[Figure: two candidate cross-associations of the same matrix, each drawn as row groups × column groups]
Why is this better?
Better Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
Good Compression: a few, homogeneous blocks
Good Compression implies Better Clustering
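Main Idea
Good Compression implies Better Clustering
[Figure: binary matrix partitioned into row groups × column groups; block i holds $n_i^1$ ones and $n_i^0$ zeros]
$p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Code Cost: $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$
Description Cost: $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$)
Total Encoding Cost = Code Cost + Description Cost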
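The code cost term can be evaluated directly from a binary matrix and a pair of group assignments. The following is a minimal sketch, not the authors' implementation: the function names `entropy` and `code_cost`, the list-of-lists matrix layout, and the toy matrix `M` are all assumptions made for illustration.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks i of (n_i^1 + n_i^0) * H(p_i^1).

    matrix is a list of rows of 0/1 values; row_groups[r] and
    col_groups[c] give the group label of row r and column c.
    """
    ones = {}  # block -> n_i^1
    size = {}  # block -> n_i^1 + n_i^0
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            block = (row_groups[r], col_groups[c])
            ones[block] = ones.get(block, 0) + v
            size[block] = size.get(block, 0) + 1
    return sum(size[b] * entropy(ones[b] / size[b]) for b in size)

# Hypothetical toy matrix with two obvious blocks of ones.
M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

print(code_cost(M, [0, 0, 0, 0], [0, 0, 0, 0]))  # one mixed block: 16.0 bits
print(code_cost(M, [0, 0, 1, 1], [0, 0, 1, 1]))  # four homogeneous blocks: 0.0 bits
```

Homogeneous blocks (all ones or all zeros) contribute nothing to the code cost, which is why a grouping that matches the structure of the matrix compresses well.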
Examples
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]
One row group, one column group: Code Cost high, Description Cost low
m row groups, n column groups: Code Cost low, Description Cost high
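The trade-off between the two extremes can be made concrete on a toy matrix. This sketch rests on a loud assumption: the description cost below is a simplified stand-in (about log2(block size + 1) bits to state the number of ones in each block), not the exact description cost used by the authors, and it ignores the cost of transmitting the group assignments. The names `costs` and `M` are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def costs(matrix, row_groups, col_groups):
    """Return (code_cost, simplified_description_cost) in bits."""
    ones, size = {}, {}
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            block = (row_groups[r], col_groups[c])
            ones[block] = ones.get(block, 0) + v
            size[block] = size.get(block, 0) + 1
    code = sum(size[b] * entropy(ones[b] / size[b]) for b in size)
    # ASSUMPTION: stating n_i^1 for a block of s cells takes ceil(log2(s + 1)) bits.
    desc = sum(math.ceil(math.log2(size[b] + 1)) for b in size)
    return code, desc

M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

one_group = costs(M, [0, 0, 0, 0], [0, 0, 0, 0])   # high code cost, low description cost
every_cell = costs(M, [0, 1, 2, 3], [0, 1, 2, 3])  # zero code cost, high description cost
two_by_two = costs(M, [0, 0, 1, 1], [0, 0, 1, 1])  # zero code cost, moderate description cost

for name, (code, desc) in [("one group", one_group),
                           ("every cell", every_cell),
                           ("2 x 2 groups", two_by_two)]:
    print(f"{name}: code={code}, description={desc}, total={code + desc}")
```

Under this simplified cost, the grouping that matches the two blocks of ones has the lowest total, while both extremes pay more in one of the two terms.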
What makes a cross-association “good”?
[Figure: the same two candidate cross-associations, each drawn as row groups × column groups]
Why is this better?
Total Encoding Cost = $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$ [Code Cost] $+ \sum_i$ (cost of describing $n_i^1$ and $n_i^0$) [Description Cost]
Code Cost low, Description Cost low
Algorithms
[Figure: the matrix after each step of the search]
k=1, l=2
k=2, l=2
k=2, l=3
k=3, l=3
k=3, l=4
k=4, l=4
k=4, l=5
Final: k = 5 row groups, l = 5 col groups
Algorithms
1. Start with initial matrix
2. Find good groups for fixed k and l
3. Choose better values for k and l
4. Final cross-associations (in the example: k = 5, l = 5)
Steps 2 and 3 each lower the encoding cost.
Fixed k and l
[Figure: matrix drawn as row groups × column groups]
Swaps:
for each row:
    swap it to the row group which minimizes the code cost
Ditto for column swaps
… and repeat …
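The swap step for fixed k and l can be sketched as follows. This is a deliberately naive illustration, assuming a small list-of-lists matrix: it recomputes the full code cost for every candidate move, which a scalable implementation would avoid. The names `swap_rows`, `refine`, and `max_iters` are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (block size) * H(density of ones in the block)."""
    ones, size = {}, {}
    for r, row in enumerate(matrix):
        for c, v in enumerate(row):
            block = (row_groups[r], col_groups[c])
            ones[block] = ones.get(block, 0) + v
            size[block] = size.get(block, 0) + 1
    return sum(size[b] * entropy(ones[b] / size[b]) for b in size)

def swap_rows(matrix, row_groups, col_groups, k):
    """One pass: move each row to the row group that gives the lowest code cost."""
    row_groups = list(row_groups)
    for r in range(len(matrix)):
        best_g = row_groups[r]
        best_cost = code_cost(matrix, row_groups, col_groups)
        for g in range(k):
            row_groups[r] = g
            cost = code_cost(matrix, row_groups, col_groups)
            if cost < best_cost - 1e-12:
                best_g, best_cost = g, cost
        row_groups[r] = best_g
    return row_groups

def refine(matrix, row_groups, col_groups, k, l, max_iters=20):
    """Alternate row swaps and column swaps until the code cost stops dropping."""
    transposed = [list(col) for col in zip(*matrix)]
    cost = code_cost(matrix, row_groups, col_groups)
    for _ in range(max_iters):
        row_groups = swap_rows(matrix, row_groups, col_groups, k)
        # Column swaps are row swaps on the transposed matrix.
        col_groups = swap_rows(transposed, col_groups, row_groups, l)
        new_cost = code_cost(matrix, row_groups, col_groups)
        if new_cost >= cost - 1e-12:
            break
        cost = new_cost
    return row_groups, col_groups

M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

# Start from a deliberately poor row grouping.
rows, cols = refine(M, [0, 1, 0, 1], [0, 0, 1, 1], k=2, l=2)
print(rows, cols, code_cost(M, rows, cols))
```

Because a row only moves when the move strictly lowers the code cost, the cost never increases from one pass to the next.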
Choosing k and l
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to the new row group, and set k=k+1
Similar for column groups too.
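The three split steps can be sketched for row groups as follows. Two choices here are assumptions of this sketch rather than details taken from the slides: "entropy per row" is computed as the code cost of the group's blocks divided by its number of rows, and rows are tested one at a time against the group as it currently stands, never emptying it. The names `entropy_per_row` and `split_rows` are hypothetical.

```python
import math

def entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def entropy_per_row(matrix, rows, col_groups, l):
    """Code cost of one row group's blocks, divided by its number of rows."""
    if not rows:
        return 0.0
    ones, size = [0] * l, [0] * l
    for r in rows:
        for c, v in enumerate(matrix[r]):
            ones[col_groups[c]] += v
            size[col_groups[c]] += 1
    cost = sum(size[j] * entropy(ones[j] / size[j]) for j in range(l) if size[j])
    return cost / len(rows)

def split_rows(matrix, row_groups, col_groups, k, l):
    """Try to split the worst row group; return (new_row_groups, new_k)."""
    members = [[r for r, g in enumerate(row_groups) if g == grp] for grp in range(k)]
    per_row = [entropy_per_row(matrix, rows, col_groups, l) for rows in members]
    # Step 1: the row group with the maximum entropy per row.
    worst = max(range(k), key=lambda g: per_row[g])
    remaining, base = list(members[worst]), per_row[worst]
    new_groups, moved = list(row_groups), False
    for r in members[worst]:
        if len(remaining) == 1:
            break  # ASSUMPTION: never empty the original group
        without = [x for x in remaining if x != r]
        candidate = entropy_per_row(matrix, without, col_groups, l)
        # Step 2: keep the removal only if it lowers the entropy per row.
        if candidate < base - 1e-12:
            remaining, base = without, candidate
            new_groups[r] = k  # Step 3: send the row to the new group (label k)
            moved = True
    return (new_groups, k + 1) if moved else (list(row_groups), k)

M = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]

new_rows, new_k = split_rows(M, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
print(new_rows, new_k)
```

Column groups can be split with the same function by passing the transposed matrix and exchanging the roles of the row and column assignments.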
Algorithms
1. Start with initial matrix
2. Find good groups for fixed k and l (Swaps)
3. Choose better values for k and l (Splits)
4. Final cross-associations (in the example: k = 5, l = 5)
Steps 2 and 3 each lower the encoding cost.
Experiments
“Customer-Product” graph with Zipfian sizes, no noise: k = 5 row groups, l = 5 col groups
Experiments
“Caveman” graph with Zipfian cave sizes, noise=10%: k = 6 row groups, l = 8 col groups
Experiments
“White Noise” graph: k = 2 row groups, l = 3 col groups
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
[Figure: rows are documents, columns are words]
Experiments
“GRANTS” graph of documents & words: k=41, l=28
[Figure: rows are NSF grant proposals, columns are words in abstract]
Experiments
“Who-trusts-whom” graph from epinions.com: k=18, l=16
[Figure: rows and columns are both Epinions.com users]
Experiments
“Clickstream” graph of users and websites: k=15, l=13
[Figure: rows are users, columns are webpages]
Experiments
[Figure: time (secs) versus number of non-zeros, plotted for splits and swaps]
Linear in the number of “ones”: Scalable
Conclusions
Desiderata:
Simultaneously discover row and column groups
Fully Automatic: No “magic numbers”
Scalable to large graphs
Experiments
“Caveman” graph with Zipfian cave sizes, no noise: k = 5 row groups, l = 5 col groups
Aim
Given any binary matrix, a “good” cross-association will have low cost.
But how can we find such a cross-association?
[Figure: example matrix with k = 5 row groups and l = 5 col groups]
Main Idea
Good Compression implies Better Clustering
Total Encoding Cost = $\sum_i \text{size}_i \cdot H(p_i)$ [Code Cost] + Cost of describing cross-associations [Description Cost]
Minimize the total cost
Main Idea
Good Compression implies Better Clustering
How well does a cross-association compress the matrix?
1. Encode the matrix in a lossless fashion
2. Compute the encoding cost
3. Low encoding cost implies good compression, which implies good clustering