Fully Automatic Cross-Associations - PowerPoint PPT Presentation

Slides: 33
Provided by: Deep2
Learn more at: http://www.cs.cmu.edu
Transcript and Presenter's Notes

1
Fully Automatic Cross-Associations
  • Deepayan Chakrabarti (CMU)
  • Spiros Papadimitriou (CMU)
  • Dharmendra Modha (IBM)
  • Christos Faloutsos (CMU and IBM)

2
Problem Definition
Simultaneously group customers and products, or documents and words, or users and preferences
3
Problem Definition
  • Desiderata
  • Simultaneously discover row and column groups
  • Fully automatic: no "magic numbers"
  • Scalable to large graphs

4
Cross-Associations ≠ Co-clustering!
Information-theoretic co-clustering
  • Lossy compression.
  • Approximates the original matrix, while trying to minimize KL-divergence.
  • The number of row and column groups must be given by the user.
Cross-Associations
  • Lossless compression.
  • Always provides complete information about the matrix, for any number of row and column groups.
  • The number of row and column groups is chosen automatically using the MDL principle.
5
Related Work
  • K-means and variants: dimensionality curse; choosing the number of clusters
  • Frequent itemsets: user must specify support
  • Information Retrieval: choosing the number of concepts
  • Graph Partitioning: number of partitions; measure of imbalance between clusters
6
What makes a cross-association good?
Why is this better?
  1. Similar nodes are grouped together
  2. As few groups as necessary

A few, homogeneous blocks
Good Compression implies Better Clustering
7
Main Idea
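Good Compression implies Better Clustering
Binary Matrix
Block i has $n_i^1$ ones and $n_i^0$ zeros, with $p_i^1 = n_i^1 / (n_i^1 + n_i^0)$
Total cost = Code Cost + Description Cost
Code Cost: $\sum_i (n_i^1 + n_i^0) \, H(p_i^1)$
Description Cost: $\sum_i$ (cost of describing $n_i^1$ and $n_i^0$)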
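The code cost term can be made concrete with a minimal sketch. It assumes the matrix is a list of 0/1 rows and that a cross-association is given as two label lists; the names `binary_entropy`, `code_cost`, `row_groups` and `col_groups` are illustrative and do not come from the slides. The description cost is left out.

```python
import math

def binary_entropy(p):
    """Binary Shannon entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups, k, l):
    """Sum over all k*l blocks of (n1 + n0) * H(p1), in bits.

    row_groups[r] is the group index of row r; col_groups[c] likewise for columns.
    """
    ones = [[0] * l for _ in range(k)]   # n1 per block
    size = [[0] * l for _ in range(k)]   # n1 + n0 per block
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            ones[row_groups[r]][col_groups[c]] += value
            size[row_groups[r]][col_groups[c]] += 1
    total = 0.0
    for i in range(k):
        for j in range(l):
            if size[i][j] > 0:
                total += size[i][j] * binary_entropy(ones[i][j] / size[i][j])
    return total

# A 4x4 binary matrix made of two clean diagonal blocks.
matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Grouping that matches the structure: every block is all ones or all zeros.
good = code_cost(matrix, [0, 0, 1, 1], [0, 0, 1, 1], k=2, l=2)
# Grouping that mixes the structure: every block is half ones, half zeros.
bad = code_cost(matrix, [0, 1, 0, 1], [0, 1, 0, 1], k=2, l=2)
print(good, bad)
```

Homogeneous blocks cost nothing to encode, while half-full blocks cost one bit per cell, which is the sense in which good compression tracks good clustering.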
8
Examples
  • One row group, one column group: code cost high, description cost low
  • m row groups, n column groups: code cost low, description cost high
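The two extremes can be checked numerically. The sketch below is illustrative only: the description cost is assumed to be the number of bits needed to write each block's count of ones as an integer between 0 and the block size, which is a simplification chosen for this example rather than the full cost used by the authors. The names `entropy_bits` and `costs` are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def costs(blocks):
    """blocks: list of (ones, zeros) pairs, one pair per block.

    Returns (code_cost, description_cost) in bits.
    """
    code = sum((n1 + n0) * entropy_bits(n1 / (n1 + n0)) for n1, n0 in blocks)
    # Assumption: describing a block means writing n1 as an integer in [0, n1 + n0].
    desc = sum(math.ceil(math.log2(n1 + n0 + 1)) for n1, n0 in blocks)
    return code, desc

matrix = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
cells = [value for row in matrix for value in row]

# One row group, one column group: the whole matrix is a single block.
one_group = [(sum(cells), len(cells) - sum(cells))]
# m row groups, n column groups: every cell is its own block.
every_cell = [(value, 1 - value) for value in cells]

print(costs(one_group))   # code cost dominates
print(costs(every_cell))  # description cost dominates
```

Under these assumptions the single block pays 16 bits of code cost and 5 bits of description, while the cell-per-block grouping pays no code cost but 16 bits of description.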
9
What makes a cross-association good?
Why is this better?
Code cost: low
Description cost: low
10
Algorithms
l = 5 col groups
k = 5 row groups
11
Algorithms
  1. Start with initial matrix
  2. Find good groups for fixed k and l
  3. Choose better values for k and l
  4. Repeat steps 2 and 3 to lower the encoding cost
  5. Final cross-associations
12
Fixed k and l
13
Fixed k and l
Swaps: for each row, swap it to the row group which minimizes the code cost
14
Fixed k and l
Ditto for column swaps, and repeat the row and column passes
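The swap step for fixed k and l can be sketched as follows. This is a brute-force illustration, not the authors' procedure: it recomputes the whole code cost for every candidate move, which is simple to verify but far slower than a scalable implementation would be. All function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def code_cost(matrix, row_groups, col_groups):
    """Sum over blocks of (block size) * H(fraction of ones), in bits."""
    blocks = {}  # (row group, column group) -> (ones, cells)
    for r, row in enumerate(matrix):
        for c, value in enumerate(row):
            key = (row_groups[r], col_groups[c])
            ones, cells = blocks.get(key, (0, 0))
            blocks[key] = (ones + value, cells + 1)
    return sum(cells * entropy_bits(ones / cells) for ones, cells in blocks.values())

def swap_pass(matrix, row_groups, col_groups, k):
    """Move each row, in turn, to the row group that gives the lowest code cost."""
    groups = list(row_groups)
    for r in range(len(matrix)):
        # Try the current group first so that ties leave the row where it is.
        candidates = [groups[r]] + [g for g in range(k) if g != groups[r]]

        def cost_if_moved(g, r=r):
            return code_cost(matrix, groups[:r] + [g] + groups[r + 1:], col_groups)

        groups[r] = min(candidates, key=cost_if_moved)
    return groups

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def swap_until_stable(matrix, row_groups, col_groups, k, l):
    """Alternate row and column passes until the code cost stops decreasing."""
    cost = code_cost(matrix, row_groups, col_groups)
    while True:
        row_groups = swap_pass(matrix, row_groups, col_groups, k)
        # Column swaps are row swaps on the transposed matrix.
        col_groups = swap_pass(transpose(matrix), col_groups, row_groups, l)
        new_cost = code_cost(matrix, row_groups, col_groups)
        if new_cost >= cost - 1e-12:
            return row_groups, col_groups, new_cost
        cost = new_cost

matrix = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
# Start from a row grouping that mixes the two kinds of rows.
rows, cols, final_cost = swap_until_stable(matrix, [0, 0, 1, 1], [0, 0, 1, 1], k=2, l=2)
print(rows, cols, final_cost)
```

Each move is accepted only if it does not raise the code cost, so the cost never increases from one pass to the next and the loop terminates.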
15
Choosing k and l
16
Choosing k and l
  • Split
  • Find the row group R with the maximum entropy per
    row
  • Choose the rows in R whose removal reduces the
    entropy per row in R
  • Send these rows to the new row group, and set k = k + 1
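The split step listed above can be sketched as follows. This is a minimal illustration under stated assumptions: "entropy per row" is taken to be the code cost of the group's blocks divided by the number of rows in the group, and each row is tested against the original group rather than sequentially. The function names are hypothetical.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits, with H(0) = H(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def entropy_per_row(matrix, rows, col_groups, l):
    """Code cost of the blocks spanned by `rows`, divided by len(rows)."""
    ones = [0] * l
    size = [0] * l
    for r in rows:
        for c, value in enumerate(matrix[r]):
            ones[col_groups[c]] += value
            size[col_groups[c]] += 1
    bits = sum(s * entropy_bits(o / s) for o, s in zip(ones, size) if s > 0)
    return bits / len(rows)

def split_row_group(matrix, row_groups, col_groups, k, l):
    """Split the row group with the highest entropy per row; return (labels, new k)."""
    members = [[r for r, g in enumerate(row_groups) if g == i] for i in range(k)]

    def score(i):
        # Empty groups can never be chosen for splitting.
        return entropy_per_row(matrix, members[i], col_groups, l) if members[i] else -1.0

    worst = max(range(k), key=score)
    base = score(worst)
    new_groups = list(row_groups)
    moved = False
    if len(members[worst]) > 1:
        for r in members[worst]:
            rest = [x for x in members[worst] if x != r]
            # Move the row if the group is cheaper per row without it.
            if entropy_per_row(matrix, rest, col_groups, l) < base:
                new_groups[r] = k  # label of the new row group
                moved = True
    return new_groups, (k + 1 if moved else k)

matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
# Start with a single row group and two column groups.
new_groups, new_k = split_row_group(matrix, [0, 0, 0, 0], [0, 0, 1, 1], k=1, l=2)
print(new_groups, new_k)
```

In this example only the last row is moved: removing it leaves three identical rows with zero entropy, whereas removing any of the other rows makes the remaining group more mixed.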

17
Choosing k and l
Split: similar for column groups too.
18
Algorithms
  1. Start with initial matrix
  2. Find good groups for fixed k and l (swaps)
  3. Choose better values for k and l (splits)
  4. Repeat steps 2 and 3 to lower the encoding cost
  5. Final cross-associations
19
Experiments
l = 5 col groups
k = 5 row groups
Customer-Product graph with Zipfian sizes, no noise
20
Experiments
l = 8 col groups
k = 6 row groups
Caveman graph with Zipfian cave sizes, noise = 10%
21
Experiments
l = 3 col groups
k = 2 row groups
White Noise graph
22
Experiments
CLASSIC graph of documents and words: k = 15, l = 19
(Axes: documents, words)
23
Experiments
GRANTS graph of documents and words: k = 41, l = 28
(Axes: NSF grant proposals, words in abstract)
24
Experiments
Who-trusts-whom graph from epinions.com: k = 18, l = 16
(Both axes: Epinions.com users)
25
Experiments
Clickstream graph of users and websites: k = 15, l = 13
(Axes: users, webpages)
26
Experiments
Plot: time (secs) versus number of non-zeros, for splits and swaps
Linear in the number of ones: scalable
27
Conclusions
  • Desiderata
  • Simultaneously discover row and column groups
  • Fully automatic: no "magic numbers"
  • Scalable to large graphs

28
Fixed k and l
Swaps: find good groups for fixed k and l, lowering the encoding cost
29
Experiments
l = 5 col groups
k = 5 row groups
Caveman graph with Zipfian cave sizes, no noise
30
Aim
l = 5 col groups
k = 5 row groups
Given any binary matrix, a good cross-association will have low cost.
But how can we find such a cross-association?
31
Main Idea
Good Compression implies Better Clustering
Minimize the total cost
32
Main Idea
Good Compression implies Better Clustering
  • How well does a cross-association compress the matrix?
  • Encode the matrix in a lossless fashion
  • Compute the encoding cost
  • Low encoding cost ⇒ good compression ⇒ good clustering