Fully Automatic Cross-Associations
Deepayan Chakrabarti (CMU), Spiros Papadimitriou (CMU), Dharmendra Modha (IBM), Christos Faloutsos (CMU and IBM)
Problem Definition
[Figure: a matrix with Customers as rows and Products as columns, rearranged into Customer Groups and Product Groups]
Simultaneously group customers and products, or documents and words, or users and preferences …
Problem Definition
Desiderata:
1. Simultaneously discover row and column groups
2. Fully Automatic: No “magic numbers”
3. Scalable to large graphs
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering:
1. Lossy compression.
2. Approximates the original matrix, while trying to minimize KL-divergence.
3. The number of row and column groups must be given by the user.
Cross-Associations:
1. Lossless compression.
2. Always provides complete information about the matrix, for any number of row and column groups.
3. The number of row and column groups is chosen automatically using the MDL principle.
Related Work
K-means and variants: dimensionality curse; choosing the number of clusters
“Frequent itemsets”: user must specify “support”
Information Retrieval: choosing the number of “concepts”
Graph Partitioning: number of partitions; measure of imbalance between clusters
What makes a cross-association “good”?
[Figure: two cross-associations of the same matrix, side by side ("versus"); rows are arranged into row groups and columns into column groups]
Why is this better?
Better Clustering:
1. Similar nodes are grouped together
2. As few groups as necessary
A few, homogeneous blocks
Good Compression implies Better Clustering
Main Idea
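Good Compression implies Better Clustering
[Figure: a binary matrix arranged into row groups and column groups]
$p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
$$\underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \big(\text{cost of describing } n_i^1 \text{ and } n_i^0\big)}_{\text{Description Cost}}$$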
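The code cost term, the sum over blocks of (n_i^1 + n_i^0) · H(p_i^1), can be sketched in a few lines. This is an illustrative sketch only: it assumes n_i^1 and n_i^0 are the numbers of ones and zeros in block i (which is how the formula for p_i^1 reads), that H is the binary Shannon entropy in bits, and that a block is identified by a (row group, column group) pair. The names `entropy_bits`, `code_cost`, `row_group`, and `col_group` are hypothetical and not taken from the slides.

```python
import math


def entropy_bits(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def code_cost(matrix, row_group, col_group):
    """Sum over blocks of (n1 + n0) * H(n1 / (n1 + n0)).

    matrix    : list of rows, each a list of 0/1 values
    row_group : row_group[r] is the group label of row r (assumed layout)
    col_group : col_group[c] is the group label of column c (assumed layout)
    """
    ones = {}
    zeros = {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            if value:
                ones[block] = ones.get(block, 0) + 1
            else:
                zeros[block] = zeros.get(block, 0) + 1
    total = 0.0
    for block in set(ones) | set(zeros):
        n1 = ones.get(block, 0)
        n0 = zeros.get(block, 0)
        total += (n1 + n0) * entropy_bits(n1 / (n1 + n0))
    return total


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Everything in one block: half ones, half zeros, so 1 bit per cell.
mixed = code_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0])
# Two row groups and two column groups: every block is homogeneous.
clean = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])
print(mixed, clean)
```

Homogeneous blocks contribute nothing to the code cost, which is why a grouping with a few, homogeneous blocks compresses well.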
Examples
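One row group, one column group: Code Cost high, Description Cost low
m row groups, n column groups: Code Cost low, Description Cost high
$$\text{Total Encoding Cost} = \underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \big(\text{cost of describing } n_i^1 \text{ and } n_i^0\big)}_{\text{Description Cost}}$$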
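The trade-off between the two extremes can be checked on a toy matrix. The sketch below is a hedged illustration, not the slides' exact encoding: the code cost follows the formula above, but the description cost is a simplified stand-in that charges log2(n + 1) bits per block of n cells to state how many ones it holds. That stand-in, and the names `total_cost`, `row_group`, and `col_group`, are assumptions made for this example.

```python
import math
from collections import Counter


def total_cost(matrix, row_group, col_group):
    """Return (code_cost, description_cost) in bits for one grouping."""
    ones = Counter()
    size = Counter()
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            size[block] += 1
            ones[block] += value
    code = 0.0
    desc = 0.0
    for block, n in size.items():
        p = ones[block] / n
        if 0.0 < p < 1.0:
            code += n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
        # ASSUMPTION: simplified description cost, log2(n + 1) bits per block.
        desc += math.log2(n + 1)
    return code, desc


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
one_group = total_cost(matrix, [0, 0, 0, 0], [0, 0, 0, 0])   # one block
every_own = total_cost(matrix, [0, 1, 2, 3], [0, 1, 2, 3])   # 16 one-cell blocks
two_by_two = total_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1])  # 4 homogeneous blocks

for name, (code, desc) in [("one group", one_group),
                           ("every row/column alone", every_own),
                           ("two by two", two_by_two)]:
    print(f"{name}: code={code:.2f} description={desc:.2f} total={code + desc:.2f}")
```

Under this stand-in, the single-block grouping pays for it in code cost, the one-group-per-row-and-column grouping pays in description cost, and the two-by-two grouping has the lowest total of the three.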
What makes a cross-association “good”?
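[Figure: two cross-associations of the same matrix, side by side ("versus"); rows are arranged into row groups and columns into column groups]
Why is this better?
Code Cost low, Description Cost low
$$\text{Total Encoding Cost} = \underbrace{\sum_i (n_i^1 + n_i^0)\, H(p_i^1)}_{\text{Code Cost}} \;+\; \underbrace{\sum_i \big(\text{cost of describing } n_i^1 \text{ and } n_i^0\big)}_{\text{Description Cost}}$$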
Algorithms
[Figure: the matrix at successive values of k and l]
k=1, l=2
k=2, l=2
k=2, l=3
k=3, l=3
k=3, l=4
k=4, l=4
k=4, l=5
k = 5 row groups, l = 5 col groups
Algorithms
[Flowchart: Start with initial matrix → Find good groups for fixed k and l → Choose better values for k and l → Final cross-associations (k = 5, l = 5)]
Lower the encoding cost
Fixed k and l
[Figure: binary matrix arranged into row groups and column groups]
Swaps:
for each row:
    swap it to the row group which minimizes the code cost
Ditto for column swaps
… and repeat …
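One pass of row swaps can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: block densities are computed once from the current grouping, each row is then charged the bits needed to encode it with each row group's densities, and it is assigned to the cheapest group. Ties go to the lowest-numbered group and empty row groups are skipped; both are choices made for this example. The names `row_swap_pass` and `bits` are hypothetical.

```python
import math


def bits(count, p):
    """Bits needed to encode `count` symbols that each have probability p."""
    if count == 0:
        return 0.0
    if p <= 0.0:
        return math.inf
    return -count * math.log2(p)


def row_swap_pass(matrix, row_group, col_group, k, l):
    """Reassign every row to the row group that encodes it most cheaply."""
    # Ones and cell counts per (row group, column group) block.
    ones = [[0] * l for _ in range(k)]
    size = [[0] * l for _ in range(k)]
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_group[r]][col_group[c]] += value
            size[row_group[r]][col_group[c]] += 1

    new_group = []
    for r, row in enumerate(matrix):
        # Ones and cell counts of this row inside each column group.
        row_ones = [0] * l
        row_size = [0] * l
        for c, value in enumerate(row):
            row_ones[col_group[c]] += value
            row_size[col_group[c]] += 1

        best, best_cost = row_group[r], math.inf
        for i in range(k):
            if not any(size[i]):
                continue  # ASSUMPTION: never move a row into an empty group
            cost = 0.0
            for j in range(l):
                p = ones[i][j] / size[i][j] if size[i][j] else 0.0
                cost += bits(row_ones[j], p)
                cost += bits(row_size[j] - row_ones[j], 1.0 - p)
            if cost < best_cost:
                best, best_cost = i, cost
        new_group.append(best)
    return new_group


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Row 2 starts in the wrong group; one pass moves it next to row 3.
swapped = row_swap_pass(matrix, [0, 0, 0, 1], [0, 0, 1, 1], k=2, l=2)
print(swapped)
```

Column swaps would be the same pass applied to the transposed matrix with the roles of the two groupings exchanged; alternating the two passes until the assignment stops changing corresponds to the "and repeat" step.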
Choosing k and l
[Figure: example matrix with k = 5, l = 5]
Split:
1. Find the row group R with the maximum entropy per row
2. Choose the rows in R whose removal reduces the entropy per row in R
3. Send these rows to the new row group, and set k=k+1
Similar for column groups too.
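The three split steps for row groups can be sketched as below. This is an illustration under assumptions rather than a definitive implementation: "entropy per row" is taken to mean the code cost of a row group divided by its number of rows, each row is tested against the original group R (not against R after earlier removals), and the split is abandoned if it would move no rows or all of them. The names `entropy_per_row` and `split_row_group` are hypothetical.

```python
import math


def entropy_bits(p):
    """Binary Shannon entropy in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def entropy_per_row(rows, col_group, l):
    """Code cost of one row group divided by its number of rows."""
    if not rows:
        return 0.0
    ones = [0] * l
    size = [0] * l
    for row in rows:
        for c, value in enumerate(row):
            ones[col_group[c]] += value
            size[col_group[c]] += 1
    cost = sum(n * entropy_bits(o / n) for o, n in zip(ones, size) if n)
    return cost / len(rows)


def split_row_group(matrix, row_group, col_group, k, l):
    """Return (new_row_group, new_k) after attempting one split."""
    members = [[r for r, g in enumerate(row_group) if g == i] for i in range(k)]
    per_row = [entropy_per_row([matrix[r] for r in m], col_group, l)
               for m in members]
    # Step 1: the row group R with the maximum entropy per row.
    target = max(range(k), key=lambda i: per_row[i])
    base = per_row[target]
    # Step 2: rows whose removal lowers the entropy per row of R.
    moved = []
    for r in members[target]:
        rest = [matrix[q] for q in members[target] if q != r]
        if rest and entropy_per_row(rest, col_group, l) < base:
            moved.append(r)
    # ASSUMPTION: skip the split if it would be empty or merely relabel R.
    if not moved or len(moved) == len(members[target]):
        return list(row_group), k
    # Step 3: send these rows to the new row group k and set k = k + 1.
    new_group = list(row_group)
    for r in moved:
        new_group[r] = k
    return new_group, k + 1


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
new_group, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
print(new_group, new_k)
```

In the toy matrix only the last row differs from the rest, so it is the only row whose removal lowers the entropy per row, and it alone moves to the new group.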
Algorithms
[Flowchart: Start with initial matrix → Find good groups for fixed k and l (Swaps) → Choose better values for k and l (Splits) → Final cross-associations (k = 5, l = 5)]
Lower the encoding cost
Experiments
“Customer-Product” graph with Zipfian sizes, no noise
[Figure: k = 5 row groups, l = 5 col groups]
Experiments
“Caveman” graph with Zipfian cave sizes, noise=10%
[Figure: k = 6 row groups, l = 8 col groups]
Experiments
“White Noise” graph
[Figure: k = 2 row groups, l = 3 col groups]
Experiments
“CLASSIC” graph of documents & words: k=15, l=19
[Figure: rows are Documents, columns are Words]
Experiments
“GRANTS” graph of documents & words: k=41, l=28
[Figure: rows are NSF Grant Proposals, columns are Words in abstract]
Experiments
“Who-trusts-whom” graph from epinions.com: k=18, l=16
[Figure: rows are Epinions.com users, columns are Epinions.com users]
Experiments
“Clickstream” graph of users and websites: k=15, l=13
[Figure: rows are Users, columns are Webpages]
Experiments
[Plot: Time (secs) against Number of non-zeros, with one curve for Splits and one for Swaps]
Running time is linear in the number of “ones”: scalable
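One ingredient consistent with this scaling is that the per-block counts used by the code cost can be accumulated in a single pass over the nonzero entries alone. The sketch below illustrates that bookkeeping only; it is not the slides' implementation, and the coordinate-list input and the names `block_ones` and `block_sizes` are assumptions for this example.

```python
def block_ones(nonzeros, row_group, col_group, k, l):
    """Count ones per block with one pass over the (row, col) nonzero list."""
    ones = [[0] * l for _ in range(k)]
    for r, c in nonzeros:
        ones[row_group[r]][col_group[c]] += 1
    return ones


def block_sizes(row_group, col_group, k, l):
    """Cells per block: rows in group i times columns in group j."""
    rows = [0] * k
    cols = [0] * l
    for g in row_group:
        rows[g] += 1
    for g in col_group:
        cols[g] += 1
    return [[rows[i] * cols[j] for j in range(l)] for i in range(k)]


# A 4 x 4 matrix stored only as the coordinates of its ones.
nonzeros = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (2, 3), (3, 2), (3, 3)]
row_group = [0, 0, 1, 1]
col_group = [0, 0, 1, 1]

ones = block_ones(nonzeros, row_group, col_group, k=2, l=2)
sizes = block_sizes(row_group, col_group, k=2, l=2)
# Zeros per block follow by subtraction, without touching any zero cell.
zeros = [[sizes[i][j] - ones[i][j] for j in range(2)] for i in range(2)]
print(ones, zeros)
```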
Conclusions
Desiderata:
Simultaneously discover row and column groups
Fully Automatic: No “magic numbers”
Scalable to large graphs
Fixed k and l
[Flowchart: Start with initial matrix → Find good groups for fixed k and l (swaps) → Choose better values for k and l → Final cross-associations (k = 5, l = 5)]
Lower the encoding cost
Experiments
“Caveman” graph with Zipfian cave sizes, no noise
[Figure: k = 5 row groups, l = 5 col groups]
Aim
Given any binary matrix, a “good” cross-association will have low cost
But how can we find such a cross-association?
[Figure: k = 5 row groups, l = 5 col groups]
Main Idea
$$\text{Total Encoding Cost} = \underbrace{\sum_i \text{size}_i \, H(p_i)}_{\text{Code Cost}} \;+\; \underbrace{\text{Cost of describing cross-associations}}_{\text{Description Cost}}$$
Minimize the total cost
Good Compression implies Better Clustering
Main Idea
How well does a cross-association compress the matrix?
1. Encode the matrix in a lossless fashion
2. Compute the encoding cost
3. Low encoding cost ⇒ good compression ⇒ good clustering
Good Compression implies Better Clustering