Title: Fully Automatic Cross-Associations
1Fully Automatic Cross-Associations
- Deepayan Chakrabarti (CMU)
- Spiros Papadimitriou (CMU)
- Dharmendra Modha (IBM)
- Christos Faloutsos (CMU and IBM)
2Problem Definition
Simultaneously group customers and products, or documents and words, or users and preferences
3Problem Definition
- Desiderata
- Simultaneously discover row and column groups
- Fully Automatic: No magic numbers
- Scalable to large graphs
4Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering
- Lossy compression.
- Approximates the original matrix, while trying to minimize KL-divergence.
- The number of row and column groups must be given by the user.
Cross-Associations
- Lossless compression.
- Always provides complete information about the matrix, for any number of row and column groups.
- The number of row and column groups is chosen automatically using the MDL principle.
5Related Work
- K-means and variants: dimensionality curse; choosing the number of clusters
- Frequent itemsets: user must specify support
- Information Retrieval: choosing the number of concepts
- Graph Partitioning: number of partitions; measure of imbalance between clusters
6What makes a cross-association good?
Why is this better?
- Similar nodes are grouped together
- As few groups as necessary
A few, homogeneous blocks
Good Compression implies Better Clustering
7Main Idea
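Good Compression implies Better Clustering
Binary Matrix
For block i of the binary matrix, with $n_i^1$ ones and $n_i^0$ zeros:
$p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Total cost $= \sum_i (n_i^1 + n_i^0)\, H(p_i^1) \;+\; \sum_i \left(\text{cost of describing } n_i^1 \text{ and } n_i^0\right)$
- Code Cost: $\sum_i (n_i^1 + n_i^0)\, H(p_i^1)$
- Description Cost: $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$)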
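The code-cost term above can be sketched in a few lines of Python. This is only an illustration, assuming the matrix is a list of 0/1 rows and that group membership is given as one label per row and per column; the names `binary_entropy`, `code_cost`, `row_group` and `col_group` are hypothetical and do not come from the slides. The description cost is left out because the slides do not spell out its exact form.

```python
import math

def binary_entropy(p):
    """Shannon entropy H(p) in bits of a 0/1 source; H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_group, col_group):
    """Sum over blocks i of (n_i^1 + n_i^0) * H(p_i^1).

    matrix: list of rows holding 0/1 values.
    row_group[r], col_group[c]: group label of row r / column c
    (an assumed encoding of the cross-association).
    """
    ones = {}   # block (row group, column group) -> number of ones
    cells = {}  # block -> number of cells, i.e. n_i^1 + n_i^0
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            ones[block] = ones.get(block, 0) + value
            cells[block] = cells.get(block, 0) + 1
    return sum(n * binary_entropy(ones[b] / n) for b, n in cells.items())

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
print(code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1]))  # homogeneous blocks -> 0.0
print(code_cost(matrix, [0, 1, 0, 1], [0, 1, 0, 1]))  # mixed blocks -> 16.0
```

Grouping the rows and columns so that every block is all ones or all zeros drives the code cost to zero, while a grouping that mixes them pays one bit per cell.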
8Examples
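[Figure: two example cross-associations, each annotated with its code cost and description cost as "high" or "low"; one example uses m row groups and n column groups]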
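The trade-off between the two cost terms can be checked numerically at the two extremes: a single group for everything, and one group per row and per column. The sketch below is an assumption-laden illustration, not the authors' code: it reports the code cost in bits together with the number of blocks, using the block count only as a crude stand-in for how much there is to describe, since the slides do not give the exact description-cost formula. The function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits; zero for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost_and_blocks(matrix, row_group, col_group):
    """Return (code cost in bits, number of non-empty blocks)."""
    ones, cells = {}, {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            ones[block] = ones.get(block, 0) + value
            cells[block] = cells.get(block, 0) + 1
    bits = sum(n * entropy_bits(ones[b] / n) for b, n in cells.items())
    return bits, len(cells)

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Extreme 1: one row group and one column group.
print(code_cost_and_blocks(matrix, [0, 0, 0, 0], [0, 0, 0, 0]))  # (16.0, 1)
# Extreme 2: every row and every column is its own group.
print(code_cost_and_blocks(matrix, [0, 1, 2, 3], [0, 1, 2, 3]))  # (0.0, 16)
```

With one group there is a single block to describe but every cell costs a full bit; with singleton groups the cells cost nothing to encode but there are as many blocks to describe as there are cells.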
9What makes a cross-association good?
Why is this better?
[Figure: example cross-association with both cost labels reading "low"]
10Algorithms
l = 5 col groups
k = 5 row groups
11Algorithms
- Start with initial matrix
- Repeat, lowering the encoding cost:
  - Find good groups for fixed k and l
  - Choose better values for k and l
- Final cross-associations
13Fixed k and l
Swaps: for each row, swap it to the row group which minimizes the code cost
14Fixed k and l
Ditto for column swaps and repeat
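One pass of row swaps might look like the following. This is a deliberately brute-force sketch, assuming labels per row and per column and recomputing the whole code cost for every trial move; it is meant to show the idea on a toy matrix, not to reproduce the efficient procedure used in the talk. The names `code_cost` and `row_swap_pass` are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits; zero for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_group, col_group):
    """Sum over blocks of (cells in block) * H(fraction of ones in block)."""
    ones, cells = {}, {}
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            block = (row_group[r], col_group[c])
            ones[block] = ones.get(block, 0) + value
            cells[block] = cells.get(block, 0) + 1
    return sum(n * entropy_bits(ones[b] / n) for b, n in cells.items())

def row_swap_pass(matrix, row_group, col_group, k):
    """Move each row, in turn, to the row group (0..k-1) giving the lowest code cost."""
    row_group = list(row_group)  # work on a copy
    for r in range(len(matrix)):
        best_group = row_group[r]
        best_cost = code_cost(matrix, row_group, col_group)
        for g in range(k):
            trial = row_group[:r] + [g] + row_group[r + 1:]
            cost = code_cost(matrix, trial, col_group)
            if cost < best_cost:  # strict improvement only, so ties keep the row in place
                best_group, best_cost = g, cost
        row_group[r] = best_group
    return row_group

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
col_group = [0, 0, 1, 1]
new_rows = row_swap_pass(matrix, [0, 1, 0, 1], col_group, k=2)
print(new_rows)                                  # [1, 1, 0, 0]
print(code_cost(matrix, new_rows, col_group))    # 0.0
```

The column pass is the mirror image (apply the same function to the transposed matrix with the roles of the two label lists exchanged), and the two passes can be alternated until the cost stops dropping.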
16Choosing k and l
Split:
- Find the row group R with the maximum entropy per row
- Choose the rows in R whose removal reduces the entropy per row in R
- Send these rows to the new row group, and set k = k + 1
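The three split steps above can be sketched as follows. Entropy per row is taken here to be the code cost of the blocks in a row group divided by the number of rows in that group, which is an assumption about the exact definition; the guards that leave the grouping unchanged when no row, or every row, would move are also additions for the sake of a safe toy example. The names `entropy_per_row` and `split_row_group` are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits; zero for a pure block."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_per_row(matrix, rows, col_group):
    """Code cost of the blocks spanned by `rows`, divided by the number of rows."""
    ones, cells = {}, {}
    for r in rows:
        for c, value in enumerate(matrix[r]):
            g = col_group[c]
            ones[g] = ones.get(g, 0) + value
            cells[g] = cells.get(g, 0) + 1
    bits = sum(n * entropy_bits(ones[g] / n) for g, n in cells.items())
    return bits / len(rows)

def split_row_group(matrix, row_group, col_group, k):
    """Return (new row labels, new k) after one split attempt."""
    members = {g: [r for r, lab in enumerate(row_group) if lab == g] for g in range(k)}
    candidates = [g for g in range(k) if members[g]]
    # Step 1: the row group R with the maximum entropy per row.
    target = max(candidates, key=lambda g: entropy_per_row(matrix, members[g], col_group))
    rows = members[target]
    if len(rows) < 2:
        return list(row_group), k  # a single row cannot be split
    base = entropy_per_row(matrix, rows, col_group)
    # Step 2: rows whose removal lowers the entropy per row of R.
    movers = [r for r in rows
              if entropy_per_row(matrix, [x for x in rows if x != r], col_group) < base]
    if not movers or len(movers) == len(rows):
        return list(row_group), k  # assumed guard: nothing useful to split off
    # Step 3: send them to the new group with label k, and set k = k + 1.
    new_group = list(row_group)
    for r in movers:
        new_group[r] = k
    return new_group, k + 1

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
new_rows, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], k=1)
print(new_rows, new_k)  # [0, 0, 0, 1] 2
```

In the toy matrix the last row is the odd one out: removing it leaves three identical rows with zero entropy, so it alone is sent to the new group.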
17Choosing k and l
Split: similar for column groups too.
18Algorithms
- Start with initial matrix
- Repeat, lowering the encoding cost:
  - Find good groups for fixed k and l (Swaps)
  - Choose better values for k and l (Splits)
- Final cross-associations
19Experiments
l = 5 col groups
k = 5 row groups
Customer-Product graph with Zipfian sizes, no noise
20Experiments
l = 8 col groups
k = 6 row groups
Caveman graph with Zipfian cave sizes, noise = 10%
21Experiments
l = 3 col groups
k = 2 row groups
White Noise graph
22Experiments
CLASSIC graph of documents and words: k = 15, l = 19
[Figure axes: Documents; Words]
23Experiments
GRANTS graph of documents and words: k = 41, l = 28
[Figure axes: NSF Grant Proposals; Words in abstract]
24Experiments
Who-trusts-whom graph from epinions.com: k = 18, l = 16
[Figure axes: Epinions.com user; Epinions.com user]
25Experiments
Clickstream graph of users and websites: k = 15, l = 13
[Figure axes: Users; Webpages]
26Experiments
[Plot: time (secs) against number of non-zeros, with curves for splits and swaps]
Linear in the number of ones: scalable
27Conclusions
- Desiderata
- Simultaneously discover row and column groups
- Fully Automatic: No magic numbers
- Scalable to large graphs
28Fixed k and l
[Figure: algorithm overview repeated, with the "Find good groups for fixed k and l" step labelled "swaps"]
29Experiments
l = 5 col groups
k = 5 row groups
Caveman graph with Zipfian cave sizes, no noise
30Aim
l = 5 col groups
k = 5 row groups
Given any binary matrix, a good cross-association will have low cost.
But how can we find such a cross-association?
31Main Idea
Good Compression implies Better Clustering
Minimize the total cost
32Main Idea
Good Compression implies Better Clustering
- How well does a cross-association compress the matrix?
- Encode the matrix in a lossless fashion
- Compute the encoding cost
- Low encoding cost ⇒ good compression ⇒ good clustering