Correlation Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Correlation Clustering
  • Shuchi Chawla
  • Carnegie Mellon University
  • Joint work with
  • Nikhil Bansal and Avrim Blum

2
Natural Language Processing
Co-reference Analysis
  • In order to understand the article automatically,
    we need to figure out which entities are one and
    the same
  • Is "his" in the second line the same person as
    "The secretary" in the first line?

3
Other real-world clustering problems
  • Web Document Clustering
  • Given a bunch of documents, classify them into
    salient topics
  • Computer Vision
  • Distinguish boundaries between different objects
    and the background in a picture
  • Research Communities
  • Given data on research papers, divide
    researchers into communities by co-authorship
  • Authorship (Citeseer/DBLP)
  • Given authors of documents, figure out which
    authors are really the same person

4
Traditional Approaches to Clustering
  • Approximation algorithms
  • k-means, k-median, k-min sum
  • Matrix methods
  • Spectral Clustering
  • AI techniques
  • EM, single-linkage, classification algorithms

5
Issues with traditional approaches
  • Dependence on underlying metric
  • Objective functions are meaningful only on a
    metric, e.g. k-means
  • Some algorithms work only for specific metrics
    (such as Euclidean)
  • Problem
  • No well-defined similarity metric;
    inconsistencies in beliefs

6
Issues with traditional approaches
  • Fixed number of clusters/known topics
  • Meaningless without prespecified number of
    clusters
  • e.g., for k-means or k-median, if k is
    unspecified, it is best to put each item in its
    own cluster
  • Problem
  • Number of clusters is usually unknown
  • No predefined topics; desirable to figure them
    out as part of the algorithm

7
Issues with traditional approaches
  • No clean notion of quality of clustering
  • Approximations do not directly translate to how
    many items have been grouped wrongly
  • Reliance on generative model
  • eg. Data arising from a mixture of Gaussians
  • Typically don't work well in the case of fuzzy
    boundaries
  • Problem
  • Fuzzy boundaries: how to cluster may depend
    on the given set of objects

8
Cohen, McCallum & Richman's idea
  • Learn a similarity function based on context
  • f(x,y) = amount of similarity between x and y
  • Not necessarily a metric!
  • Use labeled data to train up this function
  • Classify all pairs with the learned function
  • Find the clustering that agrees most with the
    function
  • The problem is divided into two separate phases
  • We deal with the second phase

9
Cohen, McCallum & Richman's idea
Learn a similarity measure based on context

[Figure: a graph of mentions (Mr. Rumsfeld, his, The secretary, he,
Saddam Hussein) connected by strong-similarity and
strong-dissimilarity edges]
10
A good clustering
[Figure: the same mention graph, partitioned into consistent clusters]
  • Consistent clustering
  • positive edges inside clusters
  • negative edges between clusters

11
A good clustering
[Figure: the same mention graph with a clustering; edges that disagree
with it are marked as inconsistencies, or mistakes]
  • Consistent clustering
  • positive edges inside clusters
  • negative edges between clusters
12
A good clustering
[Figure: the same mention graph, now with no clustering that is
consistent with every edge; any clustering makes some mistakes]

Goal: Find the most consistent clustering
13
Compared to traditional approaches
  • Do not have to specify k
  • Number of clusters can range from 1 to n
  • No condition on weights; they can be arbitrary
  • Clean notion of quality of clustering: number of
    examples where the clustering differs from f
  • If a good (perfect) clustering exists, it is easy
    to find

14
From a Machine Learning perspective
  • Noise Removal
  • There is some true classification function f
  • But there are a few errors in the data
  • We want to find the true function
  • Agnostic Learning
  • There is no inherent clustering
  • Try to find the best representation using a
    hypothesis with limited expressivity

15
Correlation Clustering
  • Given a graph with positive (similar) and
    negative (dissimilar) edges, find the most
    consistent clustering
  • NP-hard [Bansal, Blum, Chawla, FOCS '02]
  • Two natural objectives (see the sketch below)
  • Maximize agreements
  • (# of +ve edges inside clusters) + (# of −ve
    edges between clusters)
  • Minimize disagreements
  • (# of +ve edges between clusters) + (# of −ve
    edges inside clusters)
  • Equivalent at optimality, but different in terms
    of approximation
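Both objectives are easy to evaluate for a candidate clustering. Below is a minimal Python sketch (the function name and the edge-set representation are ours, not from the talk); agreements and disagreements always sum to n(n−1)/2, which is why the two objectives agree at optimality yet behave differently for approximation.

from itertools import combinations

def count_objectives(n, positive_pairs, cluster_of):
    """Score a clustering against learned +/- labels.

    n              -- number of vertices, labeled 0..n-1
    positive_pairs -- set of frozenset({u, v}) pairs labeled similar;
                      every other pair is implicitly dissimilar
    cluster_of     -- dict mapping each vertex to a cluster id
    """
    agree = disagree = 0
    for u, v in combinations(range(n), 2):
        same = cluster_of[u] == cluster_of[v]
        similar = frozenset((u, v)) in positive_pairs
        if same == similar:
            agree += 1     # +ve edge inside, or -ve edge between clusters
        else:
            disagree += 1  # +ve edge between, or -ve edge inside clusters
    return agree, disagree

# A triangle with + edges (0,1), (1,2) and a - edge (0,2): any
# clustering makes at least one mistake on it.
pos = {frozenset((0, 1)), frozenset((1, 2))}
print(count_objectives(3, pos, {0: 0, 1: 0, 2: 0}))  # -> (2, 1)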

16
Overview of results
  • Minimizing disagreements
  • Unweighted complete graph: O(1) [Bansal Blum
    Chawla '02]
  • 4 [Charikar et al '03]
  • Weighted general graph: O(log n) [Charikar et al
    '03; Demaine et al '03; Emanuel et al '03]
  • APX-hardness for the weighted case [Bansal Blum
    Chawla '02]
  • constant lower bounds for both cases [Charikar
    et al '03]
  • Maximizing agreements
  • Unweighted complete graph: PTAS [Bansal Blum
    Chawla '02]
  • Weighted general graphs: 0.7664 [Charikar et al
    '03]; 0.7666 [Swamy '04]
  • constant lower bound for the weighted case
    [Charikar et al '03]

This talk: the unweighted complete graph results [Bansal Blum Chawla '02]
17
Minimizing Disagreements [Bansal, Blum, Chawla,
FOCS '02]
  • Goal: approximately minimize the number of mistakes
  • Assumption: the graph is unweighted and complete
  • A lower bound on OPT: erroneous triangles

[Figure: a triangle with two + edges and one − edge, an "erroneous
triangle"]

Any clustering disagrees with at least one of
these edges.
If there are several edge-disjoint erroneous Δs, then
any clustering makes a mistake on each one.
D_OPT ≥ maximum fractional packing of erroneous
triangles (written as an LP below)
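In LP form (notation ours, with T ranging over the set of erroneous triangles and x_T the fractional weight placed on T):

\[
D_{\mathrm{OPT}} \;\ge\; \max\Big\{ \sum_{T} x_T \;:\; \sum_{T \ni e} x_T \le 1 \ \text{for every edge } e, \quad x_T \ge 0 \Big\}
\]

Every clustering makes at least one mistake on each erroneous triangle, and a single mistake on edge e is shared by triangles of total weight \(\sum_{T \ni e} x_T \le 1\), so any feasible packing is a lower bound on D_OPT.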
18
Using the lower bound: δ-clean clusters
  • Relating erroneous triangles to mistakes
  • In special cases, we can charge off
    disagreements to erroneous triangles
  • clean clusters
  • each vertex has few disagreements incident on it
  • few is relative to the size of the cluster
  • # of disagreements ≈ # of erroneous triangles

[Figure: a cluster with a good vertex (few incident disagreements)
and a bad vertex]
Clean cluster ⇔ all vertices are good
19
Using the lower bound: δ-clean clusters
  • Relating erroneous triangles to mistakes
  • In special cases, we can charge off
    disagreements to erroneous triangles
  • δ-clean clusters (formal statement below)
  • each vertex in cluster C has fewer than δ|C|
    positive and δ|C| negative mistakes
  • δ ≤ ¼ ⇒ # of disagreements ≈ # of erroneous
    triangles
  • A high density of positive edges
  • We can easily spot them in the graph
  • Possible solution: find a δ-clean clustering, and
    charge disagreements to erroneous triangles
  • Caveat: it may not exist
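Formally (notation ours; N^+(v) denotes the positive neighbors of v, with v ∈ N^+(v) by convention), a cluster C is δ-clean if every v ∈ C satisfies

\[
|N^+(v) \cap C| \;\ge\; (1-\delta)\,|C|
\qquad \text{and} \qquad
|N^+(v) \setminus C| \;\le\; \delta\,|C|.
\]

The first condition bounds the negative mistakes inside C; the second bounds the positive mistakes leaving C.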

20
Using the lower bound: δ-clean clusters
  • Caveat: a δ-clean clustering may not exist
  • An almost-δ-clean clustering
  • All clusters are either δ-clean or contain a
    single node
  • An almost-δ-clean clustering always exists
    trivially
  • We show
  • ∃ an almost-δ-clean clustering, OPT(δ), that is
    almost as good as OPT
  • Its nice structure helps us find it easily.
21
OPT(δ): clean or singleton
Optimal Clustering

[Figure: the optimal clustering, with its bad vertices marked]

Imaginary procedure:
Few (≤ δ fraction) bad nodes ⇒ remove them
from the cluster
New cluster is O(δ)-clean; few new mistakes
(↑ by at most a 1/δ factor)
22
OPT(δ): clean or singleton
Optimal Clustering

[Figure: the optimal clustering, with its bad vertices marked]

Imaginary procedure:
OPT(δ): all clusters are δ-clean
or singletons
Many (≥ δ fraction) bad nodes ⇒ break up the
cluster
New singleton clusters: mistakes ↑ by at most a
1/δ² factor
Few new mistakes
23
Our algorithm
  • Goal: find nearly clean clusters (sketched in
    code below)
  • Pick an arbitrary vertex v; let C = the positive
    (+ve) neighbors of v
  • Remove any bad vertices from C
  • Add vertices that are good w.r.t. C
  • Output C and recurse on the remaining graph
  • If C is empty for all choices of v, output the
    remaining vertices as singletons
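A minimal Python sketch of this loop, reusing the goodness test from the δ-clean definition above (the constant delta and the use of a single threshold for both the removal and addition steps are simplifications; the actual FOCS '02 algorithm chooses these more carefully):

def cluster(vertices, is_positive, delta=0.25):
    """Greedy clean-cluster loop: grow a cluster around an arbitrary
    vertex, prune bad vertices, absorb good ones, then recurse."""
    remaining = set(vertices)
    clusters = []

    def pos_nbrs(v, S):
        # positive neighbors of v within S (v counts as its own neighbor)
        return {u for u in S if u == v or is_positive(u, v)}

    def is_good(v, C):
        # few negative mistakes inside C, few positive mistakes leaving C
        return (len(pos_nbrs(v, C)) >= (1 - delta) * len(C)
                and len(pos_nbrs(v, remaining - C)) <= delta * len(C))

    while remaining:
        chosen = None
        for v in remaining:
            C = pos_nbrs(v, remaining)            # v's positive neighborhood
            C = {u for u in C if is_good(u, C)}                 # remove bad
            C |= {u for u in remaining - C if is_good(u, C)}    # add good
            if C:
                chosen = C
                break
        if chosen is None:                        # no cluster for any v:
            clusters.extend({u} for u in remaining)   # output singletons
            break
        clusters.append(chosen)
        remaining -= chosen                       # recurse on the rest
    return clusters

# Erroneous-triangle example: + edges (0,1), (1,2), - edge (0,2).
pos = {frozenset((0, 1)), frozenset((1, 2))}
print(cluster(range(3), lambda u, v: frozenset((u, v)) in pos))
# e.g. [{0}, {1, 2}], depending on which vertex is picked first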

24
Finding clean clusters

[Figure: OPT(δ)'s clusters alongside ALG's O(δ)-clean clusters]

Charging off mistakes:
1. Mistakes among clean clusters: charge to
erroneous Δs
2. Mistakes among singletons: no more than the
corresponding mistakes in OPT(δ)

⇒ constant factor approximation
25
Maximizing Agreements
  • Easy to obtain a 2-approximation (sketch below)
  • If #(positive edges) > #(negative edges),
    put everything in one cluster
  • Otherwise, n singleton clusters
  • Get at least half the edges correct
  • Max possible score = total number of edges
  • ⇒ 2-approximation
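A minimal sketch of this baseline (function name ours); it returns a cluster assignment in the same format used by count_objectives above:

def two_approx_max_agree(n, positive_pairs):
    """One big cluster gets every + edge right; n singletons get every
    - edge right. The better of the two is correct on at least half of
    all n(n-1)/2 pairs, and OPT can never exceed n(n-1)/2, so this is
    a 2-approximation for maximizing agreements."""
    total = n * (n - 1) // 2
    if len(positive_pairs) > total - len(positive_pairs):
        return {v: 0 for v in range(n)}   # everything in one cluster
    return {v: v for v in range(n)}       # n singleton clusters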

26
Maximizing Agreements
  • Max possible score ≈ ½n²
  • Goal: obtain an additive approximation of εn²
  • Standard approach (toy sketch below)
  • Draw a small sample
  • Guess the partition of the sample
  • Compute the partition of the remainder
  • Running time: doubly exponential in 1/ε, or
    singly exponential with a bad exponent
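A toy Python rendition of the sample-and-guess idea (the parameter choices, the function name, and the greedy extension step are illustrative assumptions, not the actual PTAS or its analysis):

from itertools import product
import random

def additive_max_agree(n, positive_pairs, eps=0.5, seed=0):
    """Try every partition of a small random sample into k clusters;
    extend each guess greedily to the remaining vertices and keep the
    best result. Runtime is exponential in the sample size."""
    rng = random.Random(seed)
    k = max(1, round(1 / eps))       # number of clusters considered
    s = min(n, k * k)                # sample size (illustrative choice)
    sample = rng.sample(range(n), s)
    sample_set = set(sample)
    rest = [v for v in range(n) if v not in sample_set]

    def sim(u, v):                   # +1 for a positive pair, -1 otherwise
        return 1 if frozenset((u, v)) in positive_pairs else -1

    best_score, best = -1, None
    for labels in product(range(k), repeat=s):   # guess sample's partition
        assign = dict(zip(sample, labels))
        for v in rest:               # place v where it agrees most
            assign[v] = max(range(k), key=lambda c: sum(
                sim(u, v) for u in sample if assign[u] == c))
        score = sum(1 for u in range(n) for w in range(u + 1, n)
                    if (assign[u] == assign[w]) == (sim(u, w) == 1))
        if score > best_score:
            best_score, best = score, dict(assign)
    return best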

27
Experimental Results [Wellner & McCallum '03]

                         Dataset 1   Dataset 2   Dataset 3
Best-previous-match          90.98       88.83       70.41
Single-link-threshold        91.65       88.90       60.83
Correlation clustering       93.96       91.59       73.42
%age error reduction            28          24          10
  over previous best

(Numbers are %age accuracy of classification)
28
Future Directions
  • Better combinatorial approximation
  • A good iterative approximation
    on a few changes to the graph, quickly recompute
    a good clustering
  • Maximizing Correlation
  • (number of agreements) − (number of disagreements)
  • log-approximation known; can we get a constant
    factor approximation?

29
Questions?
30
Future Directions
  • Clustering with small clusters
  • Given that all clusters in OPT have size at most
    k, find a good approximation
  • Is this NP-hard?
  • Different from finding the best clustering with
    small clusters, without a guarantee on OPT
  • Clustering with few clusters
  • Given that OPT has at most k clusters, find an
    approximation
  • Maximizing Correlation
  • (number of agreements) − (number of disagreements)
  • Can we get a constant factor approximation?

31
Lower Bounding Idea: Erroneous Triangles
If there are several edge-disjoint erroneous Δs, then
any clustering makes a mistake on each one.

[Figure: five vertices 1-5 forming two edge-disjoint erroneous
triangles, (1,2,3) and (1,4,5); the clustering shown makes 3 mistakes]

D_OPT ≥ maximum fractional packing of erroneous
triangles
32
Open Problems
  • Clustering with small clusters
  • In most applications, clusters are very small
  • Given that all clusters in OPT have size at most
    k, find a good approximation
  • Different from finding the best clustering with
    small clusters, without a guarantee on OPT
  • Optimal solution for unweighted graphs?
    A possible approach
  • Any two vertices in the same cluster in OPT are
    neighbors or share a common neighbor.
  • We can find a list of O(n^{2k}) clusters such
    that all of OPT's clusters are in this list
  • When k is small, there are only polynomially many
    choices to pick from

33
Open Problems
  • Clustering with few clusters
  • Given that OPT has at most k clusters, find an
    approximation
  • Consensus clustering
  • Given a set of k clusterings, find the best
    consensus clustering
  • easy 2-approximation; can we get a PTAS?
  • Maximizing Correlation
  • (number of agreements) − (number of disagreements)
  • bad case: # of disagreements is a constant
    fraction of the total weight
  • Charikar & Wirth obtained a constant factor
    approximation
  • Can we get a PTAS in unweighted graphs?

34
Overview of results
  • Unweighted (complete) graphs
  • Min Disagree: O(1) [Bansal Blum Chawla '02]; 4
    [Charikar Guruswami Wirth '03]; APX-hard [CGW '03]
  • Max Agree: PTAS [Bansal Blum Chawla '02]
  • Weighted graphs
  • Min Disagree: O(log n) [Charikar Guruswami Wirth
    '03; Emanuel Fiat '03; Immorlica Demaine '03];
    hardness 29/28 [CGW '03]
  • Max Agree: 1.3048 [CGW '03]; 1.3044 [Swamy '04];
    hardness 1.0087 [CGW '03]
35
Typical characteristics
  • No well-defined similarity metric;
    inconsistencies in beliefs
  • Number of clusters is unknown
  • No predefined topics
  • desirable to figure them out as part of the
    algorithm
  • Fuzzy boundaries: how to cluster may depend on
    the given set of objects