1
DB Seminar Series: The Semi-supervised Clustering
Problem
  • By Kevin Yip
  • 2 Jan 2004
  • (Happy New Year!)

2
First of all, an example
  • Where are the 2 clusters?

3
First of all, an example
  • Where are the 2 clusters?

What if I tell you that these two are in the same
cluster
and these two are in the same cluster?
4
First of all, an example
  • Where are the 2 clusters?

5
Observations
  • The clustering accuracy was greatly improved
    (from 50% accuracy to 100%).
  • All objects were unlabeled => classification not
    possible.
  • Only a small amount of knowledge was added (the
    relationships between 2 pairs of objects, out of
    the total 105 pairs).
  • => Semi-supervised clustering

6
Observations
  • Is it more natural to form 4 clusters instead?
  • True, but then how many clusters are there?
  • It is most natural to set k to the number of
    known classes.

7
Outline
  • The problem (no formal definitions)
  • Why not clustering? Classification?
  • Potential applications of semi-supervised
    clustering
  • Supervision: what to input?
  • When to input?
  • How to alter the clustering process?
  • Some examples
  • Future Work

8
The problem
  • Input
  • A set of unlabeled objects, each described by a
    set of attributes (numeric and/or categorical)
  • A small amount of domain knowledge
  • Output
  • A partitioning of the objects into k clusters
    (possibly with some discarded as outliers)
  • Objective
  • Maximum intra-cluster similarity
  • Minimum inter-cluster similarity
  • High consistency between the partitioning and the
    domain knowledge

9
Why not clustering?
  • The clusters produced may not be the ones
    required.
  • Algorithms that fit the cluster model may not be
    available.
  • Sometimes there are multiple possible groupings.
  • There is no way to utilize the domain knowledge
    that is accessible (active learning vs. passive
    validation).

(Guha et al., 1998)
10
Why not classification?
  • Sometimes there is insufficient labeled data
  • Objects are not labeled.
  • The number of labeled objects is statistically
    insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all
    cases (e.g. they all come from one side of a
    class).
  • The underlying class model may not fit the data
    (e.g. pattern-based similarity).

11
Potential applications
  • Bioinformatics (gene and protein clustering)
  • Document hierarchy construction
  • News/email categorization
  • Image categorization
  • Road lane detection
  • XML document clustering?
  • => Cluster properties are not well known, but some
    domain knowledge is available

12
What to input?
  • A small amount of labeled data
  • Instance-level constraints (see the representation
    sketch below)
  • Must-link: the two objects must be put in the same
    cluster.
  • Cannot-link: the two objects must not be put in
    the same cluster.
  • Objects that are similar/dissimilar to each
    other.
  • First-order logic language
  • Good cluster, bad cluster
  • General comments (e.g. the clustering is too fine)
  • Others
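For concreteness, the sketches later in this transcript assume that instance-level constraints are given simply as lists of index pairs over the dataset; this representation (and the Python/NumPy setting) is my own choice, not part of the original slides.

# Instance-level constraints as index pairs (representation assumed by later sketches).
must_link = [(0, 7), (3, 12)]    # pairs of objects that must end up in the same cluster
cannot_link = [(0, 3)]           # pairs of objects that must end up in different clusters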

13
When to input?
  • At the beginning of clustering
  • During cluster validation, to guide the next
    round of clustering
  • When the algorithm requests
  • The user may give a "don't know" response.

14
How to alter the clustering process?
  • Guide the formation of seed clusters.
  • Force/recommend some objects to be put in the
    same cluster/different clusters.
  • Modify the objective function.
  • Modify the similarity function.
  • Modify the distance matrix.

15
Examples
16
Wagstaff et al. 2001
  • Input knowledge: must-link and cannot-link
    constraints
  • Algorithm: constrained K-means (COP-KMeans); a
    minimal sketch follows below
  • 1. Generate cluster seeds.
  • 2. Assign each object in turn to the closest seed
    such that no constraint is violated (otherwise,
    report an error and halt).
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.
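A minimal sketch of the procedure above, assuming Euclidean distances and the pair-list constraint representation from slide 12; the helper names are mine, and this is an illustration rather than the authors' implementation.

# COP-KMeans sketch (illustrative; not the authors' code).
import numpy as np

def violates(i, c, labels, must_link, cannot_link):
    """True if assigning object i to cluster c would break a constraint."""
    for a, b in must_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cannot_link:
        j = b if a == i else a if b == i else None
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must_link, cannot_link, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]        # 1. generate cluster seeds
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        labels = np.full(len(X), -1)
        for i in range(len(X)):                              # 2. constrained assignment
            for c in np.argsort(np.linalg.norm(X[i] - centers, axis=1)):
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c
                    break
            else:                                            # no legal cluster: report and halt
                raise RuntimeError(f"constraints cannot be satisfied for object {i}")
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])  # 3. new seeds
        if np.allclose(new_centers, centers):                # 4. repeat until convergence
            break
        centers = new_centers
    return labels, centers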

17
Wagstaff et al. 2001
  • Some results
  • Soybean (47 x 35, 4 classes)
  • At most 47 x 46 / 2 = 1081 constraints
  • 100% coverage can be achieved with 43 must-links.

18
Wagstaff et al. 2001
  • Some results
  • Mushroom (50 x 21, 2 classes)

19
Wagstaff et al. 2001
  • Some results
  • Tic-tac-toe (100 x 9, 2 classes)

20
Basu et al. 2002
  • Input knowledge: a small set of labeled objects
    for each class
  • Algorithm: Seeded-KMeans (seeding sketch below)
  • 1. Use the centroid of each set of labeled objects
    as an initial seed.
  • 2. Assign each object to the closest seed.
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.

21
Basu et al. 2002
  • Algorithm: Constrained-KMeans
  • 1. Use the centroid of each set of labeled objects
    as an initial seed.
  • 2. Assign each object in turn to the closest seed,
    such that all labeled objects of each class are
    kept in the same cluster.
  • 3. Use the centroid of each cluster as the new
    seed.
  • 4. Repeat steps 2 and 3 until convergence.
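Below is a minimal sketch of the seeding step shared by the two algorithms (names are my own). Seeded-KMeans then proceeds as ordinary K-means from these seeds; Constrained-KMeans additionally keeps each labeled object in the cluster seeded from its class during reassignment.

# Seed computation for Seeded-/Constrained-KMeans (illustrative names).
import numpy as np

def initial_seeds(X, labeled_sets):
    """labeled_sets: one array of object indices per known class.
    Returns one initial centroid per class."""
    return np.array([X[idx].mean(axis=0) for idx in labeled_sets])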

22
Basu et al. 2002
  • Some results
  • Same-3 Newsgroups (300 x 21631, 3 classes)

23
Basu et al. 2002
  • Some results
  • Different-3 Newsgroups (300 x 21631, 3 classes)

24
Klein et al. 2002
  • In the previous two approaches, unconstrained
    objects are not much affected if the amount of
    input is small.
  • Is it possible to propagate the influence to
    neighboring objects?

25
Klein et al. 2002
  • An illustration

26
Klein et al. 2002
  • Idea: modify the distance matrix such that
    must-link objects have zero distance and
    cannot-link objects have large distances.
  • Two problems:
  • The effects are still restricted to the constrained
    objects.
  • The resulting distance matrix may not satisfy the
    triangle inequality.
  • Solutions:
  • Propagate the must-link influence by running an
    all-pairs shortest-paths algorithm (see the sketch
    below).
  • Propagate the cannot-link influence implicitly
    during clustering.
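A sketch of the must-link propagation just described: must-linked pairs are given distance zero, and an all-pairs shortest-paths pass (Floyd-Warshall here; the concrete algorithm is my assumption) restores the triangle inequality while pulling the neighbourhoods of linked objects closer together.

# Must-link propagation via all-pairs shortest paths (Floyd-Warshall), as a sketch.
import numpy as np

def propagate_must_links(D, must_link):
    """D: symmetric distance matrix; must_link: iterable of (i, j) index pairs."""
    D = D.copy()
    for i, j in must_link:
        D[i, j] = D[j, i] = 0.0                     # collapse must-linked pairs
    for k in range(len(D)):                         # shortest paths restore metricity and
        D = np.minimum(D, D[:, [k]] + D[[k], :])    # spread the effect to nearby objects
    return D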

27
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
28
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
29
Klein et al. 2002
  • Must-link propagation

(Figure: objects A, B, C, D illustrating must-link propagation)
30
Klein et al. 2002
  • Cannot-link propagation
  • Similarly, if there is a cannot-link constraint
    on a pair of objects, the distance between the
    objects is set to a certain large value.
  • But it is harder to fix the triangle inequality.
  • If A and B must link, they will be given the same
    coordinates.
  • But if A and B cannot link, what coordinates
    should be given to them?
  • In general, it is desirable for the new distance
    matrix to be as similar as possible to the original
    one, but this poses a difficult optimization
    problem.

31
Klein et al. 2002
  • Cannot-link propagation
  • Solution: do not propagate explicitly; instead,
    choose an algorithm that implicitly produces the
    same effect during the clustering process.
  • E.g. the complete-link hierarchical algorithm
  • Distance between two clusters: the longest distance
    between two objects, one taken from each cluster
  • If A is cannot-linked with B, then when A merges
    with C, C is effectively cannot-linked with B as
    well (see the sketch below).
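A small sketch of why complete linkage achieves the implicit propagation: when cluster A, which carries a huge distance to B, merges with C, the merged cluster keeps that huge distance, so C is effectively cannot-linked with B as well. Function and variable names are illustrative.

# Complete-link merging keeps a large cannot-link distance alive (sketch).
import numpy as np

def complete_link_merge(D, a, b):
    """Merge clusters a and b under complete linkage; returns the reduced matrix."""
    d_new = np.maximum(D[a], D[b])                  # longest distance to every other cluster
    D = D.copy()
    D[a], D[:, a] = d_new, d_new
    D[a, a] = 0.0
    return np.delete(np.delete(D, b, axis=0), b, axis=1)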

32
Klein et al. 2002
  • Active learning
  • Up to now, we have assumed all constraints are
    provided at the beginning.
  • Is it possible for the algorithm to request useful
    constraints?
  • Advantages
  • Fewer constraints are needed.
  • The constraints obtained are the most useful ones.
  • Disadvantages
  • Queries must be answered on the fly.
  • Not all queries can be answered.

33
Klein et al. 2002
  • Active learning
  • Allow the algorithm to ask at most m questions.
  • Observations:
  • In hierarchical clustering, errors usually occur
    at later stages.
  • In complete-link clustering, the merged-cluster
    distance is monotonically non-decreasing.
  • Idea: estimate a distance threshold such that all
    merges involving clusters at or beyond that
    distance are guided by the m questions.

34
Klein et al. 2002
  • Experiments
  • Algorithms
  • CCL (distance matrix learning)
  • CCL-Active (CCL + active learning)
  • COP-KMeans (forced relationships)

35
Klein et al. 2002
  • Results on synthetic data (held-out acc.)

36
Klein et al. 2002
  • Results on synthetic data (held-out acc.)

37
Klein et al. 2002
  • Results on real data (held-out acc.)
  • Iris (150 x 4, 3 classes)
  • Crabs (200 x 5, 2 classes)
  • Soybean (562 x 35, 15 classes)

38
Klein et al. 2002
  • Effect of constraints

39
Xing et al. 2003
  • Is it possible to propagate the effects of
    constraints by transforming the distance function
    instead of the distance matrix?
  • Mahalanobis distance: d(x, y) =
    sqrt((x - y)^T A (x - y))
  • When A = I (the identity matrix), d(x, y) is the
    Euclidean distance between x and y.
  • When A is a diagonal matrix, d(x, y) is the
    weighted Euclidean distance between x and y.
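A one-function sketch of the parameterized distance used here (the function name d_A is mine); with A = I it reduces to the Euclidean distance, and with a diagonal A to a weighted Euclidean distance.

# Parameterized (Mahalanobis-style) distance d_A(x, y) = sqrt((x - y)^T A (x - y)).
import numpy as np

def d_A(x, y, A):
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ A @ diff))

# d_A(x, y, np.eye(len(x))) is the ordinary Euclidean distance between x and y.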

40
Xing et al. 2003
  • Input knowledge: a set S of object pairs that are
    similar.
  • Objective: find a matrix A that minimizes the sum
    of ||x - y||_A^2 over the pairs (x, y) in S,
    subject to the constraints that
  • the sum of ||x - y||_A over a set D of dissimilar
    pairs (pairs not known to be similar) is at least 1
    (to avoid the trivial solution A = 0)
  • A is positive semi-definite (to ensure the triangle
    inequality and d(x, y) >= 0 for all x, y)
  • (Is it also needed to minimize the deviation of
    the resulting distance matrix from the original
    one?)

41
Xing et al. 2003
  • Case 1: A is a diagonal matrix
  • It can be shown that the problem is equivalent to
    minimizing the function g(A) = sum over S of
    ||x - y||_A^2 - log(sum over D of ||x - y||_A),
    which can be solved by the Newton-Raphson method
    (see the sketch below).
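A sketch of the diagonal-case objective as reconstructed above (the slide's own formula was lost); the pair representation as difference vectors and the function name are my assumptions.

# Diagonal-case objective g(a) = sum_S ||x - y||_a^2 - log(sum_D ||x - y||_a)  (sketch).
import numpy as np

def g_diag(a, S_diffs, D_diffs):
    """a: diagonal entries of A; S_diffs/D_diffs: arrays of (x - y) vectors for the
    similar and dissimilar pairs."""
    sim = np.sum((S_diffs ** 2) @ a)                # sum of squared a-distances over S
    dis = np.sum(np.sqrt((D_diffs ** 2) @ a))       # sum of a-distances over D
    return sim - np.log(dis)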

42
Xing et al. 2003
  • Case 2: A is not a diagonal matrix
  • Newton's method becomes very expensive (O(n^6)).
  • Use a gradient descent method to optimize an
    equivalent formulation of the problem.
  • In both cases, metric learning is performed
    before the clustering process.

43
Xing et al. 2003
  • Visualization of the power of metric learning
    (with 100% input)

44
Xing et al. 2003
  • Experiments
  • Algorithms
  • KMeans
  • KMeans + learned diagonal A
  • KMeans + learned full A
  • COP-KMeans
  • COP-KMeans + learned diagonal A
  • COP-KMeans + learned full A
  • Input size:
  • Little domain knowledge: number of resulting
    connected components (Kc) = 90% of dataset size
  • Much domain knowledge: Kc = 70% of dataset size

45
Xing et al. 2003
  • Some results on real data

46
Basu et al. 2003
  • Main contributions
  • Combine metric learning and constraint
    enforcement.
  • Soft constraints: constraint violations incur
    penalties instead of halting the algorithm.
  • Introduce a new active learning method.

47
Basu et al. 2003
  • The generalized K-Means model
  • Maximizing the complete-data log-likelihood is
    equivalent to minimizing an objective function that
    sums, over all objects, the distortion between the
    object and the representative of its cluster.
  • Can be optimized (locally) by an EM algorithm.

48
Basu et al. 2003
  • Adding constraint violation penalties (see the
    sketch below)
  • M / C: the sets of must-link / cannot-link
    constraints
  • Per-constraint weights based on external knowledge
  • 1[true] = 1, 1[false] = 0 (the indicator function)
  • FM, FC: weight functions based on the distance
    between the objects (should FM(x, y) be larger or
    smaller when d(x, y) is larger?)
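A sketch of a constraint-penalized objective in the spirit of this slide: the usual K-means distortion plus a penalty for each violated must-link or cannot-link. For brevity the weights are constants here rather than the slide's external-knowledge and distance-based weight functions.

# Constraint-penalized K-means objective (simplified sketch with constant weights).
import numpy as np

def penalized_objective(X, labels, centroids, M, C, w=1.0, w_bar=1.0):
    distortion = sum(np.sum((X[labels == c] - mu) ** 2)
                     for c, mu in enumerate(centroids))
    ml_penalty = sum(w for i, j in M if labels[i] != labels[j])       # violated must-links
    cl_penalty = sum(w_bar for i, j in C if labels[i] == labels[j])   # violated cannot-links
    return distortion + ml_penalty + cl_penalty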

49
Basu et al. 2003
  • A simplified objective function
  • Algorithm (PCKMeans):
  • 1. From M and C, identify the pre-specified
    connected components (see the seeding sketch
    below).
  • 2. Use the centroids of the largest connected
    components as the initial seeds. If there are not
    enough seeds, use objects that are cannot-linked
    to all connected components. If still not enough,
    use random seeds.
  • 3. Perform KMeans with the modified objective
    function.
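A sketch of the seeding from must-link connected components, using SciPy's graph utilities (an implementation choice of mine); the fallbacks for too few components described above are omitted.

# PCKMeans seeding from must-link connected components (sketch; SciPy assumed).
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components

def pckmeans_seeds(X, must_link, k):
    n = len(X)
    rows, cols = zip(*must_link) if must_link else ((), ())
    graph = coo_matrix((np.ones(len(rows)), (list(rows), list(cols))), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    largest = np.argsort(np.bincount(comp))[::-1][:k]       # largest components first
    return np.array([X[comp == c].mean(axis=0) for c in largest])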

50
Basu et al. 2003
  • A new active learning algorithm: ask questions as
    early as possible
  • Explore step: form k initial neighborhoods.
    Observation: k-means centers are usually remote
    from each other. Procedure (see the sketch below):
  • 1. Draw the point that is farthest from all
    previously drawn points.
  • 2. Ask at most k questions to determine which
    neighborhood the point should be assigned to.
  • 3. Repeat until k neighborhoods are formed (or no
    more questions are allowed).
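A sketch of the Explore step, assuming Euclidean distances and a user oracle query(i, j) that answers "must" or "cannot" for a pair of object indices (both assumptions of mine).

# Explore step: farthest-first traversal with at most k questions per drawn point (sketch).
import numpy as np

def explore(X, k, query):
    neighborhoods = [[0]]                                   # start from an arbitrary object
    while len(neighborhoods) < k:
        placed = [i for nb in neighborhoods for i in nb]
        dists = np.linalg.norm(X[:, None] - X[placed][None], axis=2).min(axis=1)
        x = int(np.argmax(dists))                           # farthest from all drawn points
        for nb in neighborhoods:                            # at most k questions
            if query(x, nb[0]) == "must":
                nb.append(x)
                break
        else:
            neighborhoods.append([x])                       # cannot-linked to all: new seed
    return neighborhoods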

51
Basu et al. 2003
  • Consolidate step: form k clusters. Observation:
    each object is close to its corresponding seed.
    Procedure (see the sketch below):
  • 1. Use the centroids of the k neighborhoods as the
    initial seeds (all objects in a neighborhood are
    immediately assigned to that cluster).
  • 2. Pick an unassigned object x.
  • 3. Starting from the closest seed, ask for the
    relationship between x and an object in the
    cluster until the response is a must-link.
  • 4. Repeat to assign all objects (if questions run
    out, revert to normal KMeans).
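A matching sketch of the Consolidate step with the same assumed query oracle; the question budget and the fallback to plain K-means are not modelled.

# Consolidate step: query each remaining object against neighborhoods, closest first (sketch).
import numpy as np

def consolidate(X, neighborhoods, query):
    centroids = np.array([X[nb].mean(axis=0) for nb in neighborhoods])
    labels = np.full(len(X), -1)
    for c, nb in enumerate(neighborhoods):
        labels[nb] = c                                      # neighborhood members stay put
    for i in np.where(labels == -1)[0]:
        for c in np.argsort(np.linalg.norm(X[i] - centroids, axis=1)):
            if query(i, neighborhoods[c][0]) == "must":     # ask, starting from closest seed
                labels[i] = c
                break
    return labels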

52
Basu et al. 2003
  • Some experiment results
  • News-all20 (2000 x 16089, 20 classes) (left)
  • News-diff3 (300 x 3251, 3 classes) (right)

53
Discussions and Future Work
  • Special clustering problems
  • How can input knowledge help dimension selection
    in projected clustering / pattern-based
    clustering?
  • Is it possible to refine the produced clusters
    when new knowledge becomes available?
  • Other uses of input knowledge
  • Can input knowledge help determine algorithm
    parameter values / stopping conditions?

54
Discussions and Future Work
  • Clustering algorithm
  • What kind of algorithms could potentially benefit
    most from knowledge inputs?
  • Should we work on the constrained objects first,
    or mix them with the other objects from the
    beginning? (Induction vs. transduction)
  • Performance concerns
  • How is the speed/space performance affected by
    the introduction of external knowledge?
  • How will the paradigm be affected if there is a
    limited amount of memory?

55
Discussions and Future Work
  • Active learning
  • What to ask? Questions that are difficult for
    algorithms but easy for users.
  • When to ask? When there is a major consequence.
  • Noise
  • How to handle contradictory constraints /
    constraints that may not be 100% correct?
  • How much should be charged for a constraint
    violation?
  • What is the significance of knowledge inputs when
    the data is noisy?

56
Discussions and Future Work
  • Debate questions (from ICML 2003 Workshop
    homepage)
  • Unlabeled data is only useful when there are a
    large number of redundant features.
  • Why doesn't The No Free Lunch Theorem apply when
    working with unlabeled data?
  • Unlabeled data has to come from the same
    underlying distribution as the labeled data.
  • Can unlabeled data be used in temporal domains?
  • Feature engineering is more important than
    algorithm design for semi-supervised learning.
  • All the interesting problems in semi-supervised
    learning have been identified.
  • Active learning is an interesting "academic"
    problem.

57
Discussions and Future Work
  • Debate questions (from ICML 2003 Workshop
    homepage)
  • Active learning research without user interface
    design is only solving half the problem.
  • Using Unlabeled data in Data Mining is no
    different than using it in Machine Learning.
  • Massive data sets pose problems when using
    current semi-supervised algorithms.
  • Off-the-shelf data mining software incorporating
    labeled and unlabeled data is a fantasy.
  • Unlabeled data is only useful when the classes
    are well separated.

58
References
  • Ayhan Demiriz, K. P. Bennett and M. J. Embrechts,
    Semi-supervised Clustering using Genetic
    Algorithms, ANNIE 1999.
  • Luis Talavera and Javier Bejar, Integrating
    Declarative Knowledge in Hierarchical Clustering
    Tasks, ISIDA 1999.
  • David Cohn, Rich Caruana and Andrew McCallum,
    Semi-supervised Clustering with User Feedback,
    Unpublished Work 2000.
  • Kiri Wagstaff and Claire Cardie, Clustering with
    Instance-level Constraints, ICML 2000.
  • Kiri Wagstaff, Claire Cardie, Seth Rogers and
    Stefan Schroedl, Constrained K-means Clustering
    with Background Knowledge, ICML 2001.

59
References
  • Sugato Basu, Arindam Banerjee and Raymond Mooney,
    Semi-supervised Clustering by Seeding, ICML 2002.
  • Dan Klein, Sepandar D. Kamvar and Christopher D.
    Manning, From Instance-level Constraints to
    Space-level Constraints: Making the Most of Prior
    Knowledge in Data Clustering, ICML 2002.
  • Eric P. Xing, Andrew Y. Ng, Michael I. Jordan and
    Stuart Russell, Distance Metric Learning, with
    Application to Clustering with Side-information,
    Advances in Neural Information Processing Systems
    15, 2003.

60
References
  • Sugato Basu, Mikhail Bilenko and Raymond J.
    Mooney, Comparing and Unifying Search-Based and
    Similarity-Based Approaches to Semi-Supervised
    Clustering, ICML-2003 Workshop on the Continuum
    from Labeled to Unlabeled Data in Machine
    Learning and Data Mining.
  • Sugato Basu, Arindam Banerjee and Raymond J.
    Mooney, Active Semi-Supervision for Pairwise
    Constrained Clustering, not yet published, 2003.