Transcript and Presenter's Notes

Title: Unsupervised learning: Statistical and computational perspectives


1
Unsupervised learning: Statistical and computational perspectives
Werner Stuetzle
Professor and Chair, Statistics; Adjunct Professor, Computer Science and Engineering
University of Washington, Seattle
Supported by NSF grant DMS-9803226 and NSA grant 62-1942. Work performed while on sabbatical at AT&T Labs - Research.
2
  • 1. Introduction
  • Given: Collection of n objects, characterized by feature vectors x1, ..., xn.
  • General goal of unsupervised learning:
  • Detect presence of distinct groups
  • Assign objects to groups
  • Note: Important to distinguish between unsupervised learning and compact partitioning
  • Unsupervised learning: Identify distinct groups
  • Compact partitioning: Partition collection of objects into compact strata

3
  • The prototypical compact partitioning method: K-means clustering
  • Let Pk = {P1, ..., Pk} be a partition of the observations into k groups.
  • Measure the badness of a partition by the sum of squared distances of observations from their group means
  • Find the optimal partition (for example with the Lloyd algorithm; see the sketch below)
  • Note:
  • K-means clustering can be successful at finding groups if
  • we picked the correct k,
  • groups are roughly spherical, and
  • approximately of the same size
  • For the remainder of the talk, we will focus on unsupervised learning
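A minimal sketch of the Lloyd algorithm (an illustration, not code from the talk; assumes NumPy, and the function name lloyd_kmeans is made up):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd algorithm: alternate assignments and mean updates."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initialize with k observations
    for _ in range(n_iter):
        # Assign each observation to the nearest group mean.
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each group mean (keep the old mean if a group went empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    # Badness of the partition: sum of squared distances from the group means.
    badness = ((X - means[labels]) ** 2).sum()
    return labels, means, badness
```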

4
  • 2. Approaches to Unsupervised Learning
  • Regard feature vectors x1, ..., xn as a sample from some density p(x)
  • Parametric approach (Cheeseman, McLachlan, Raftery)
  • Based on the premise that each group g is represented by a density pg that is a member of some parametric family => p(x) is a mixture
  • Estimate the parameters of the group densities, the mixing proportions, and the number of groups from the sample.
  • Nonparametric approach (Wishart, Hartigan)
  • Based on the premise that distinct groups manifest themselves as multiple modes of p(x)
  • Estimate modes from the sample
  • Will pursue the nonparametric approach

5
3. Describing the modal structure of a density
Consider feature vectors x1, ..., xn as a sample from some density p(x). Define the level set L(c; p) as the subset of feature space for which the density p(x) is greater than c.
Note: Level sets with multiple connected components indicate multi-modality. There might not be a single level set that reveals all the modes (see the illustration below).
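A tiny numeric illustration of that last point (an example constructed here, not from the talk; assumes NumPy and SciPy): for the three-mode density below, no single level c yields three connected components, but different levels reveal different pairs of modes.

```python
import numpy as np
from scipy.stats import norm

# Three modes: two tall modes close together, one low broad mode far away.
def p(x):
    return (0.35 * norm.pdf(x, -1.2, 0.6) +
            0.35 * norm.pdf(x,  1.2, 0.6) +
            0.30 * norm.pdf(x,  8.0, 3.0))

grid = np.linspace(-6.0, 18.0, 5000)
dens = p(grid)

def n_components(c):
    """Connected components of L(c; p) on the grid: maximal runs with p(x) > c."""
    above = dens > c
    return int(above[0]) + int(np.sum(above[1:] & ~above[:-1]))

for c in (0.03, 0.05, 0.10):
    print(f"c = {c:.2f}: {n_components(c)} connected component(s)")
# Low levels merge the two tall modes; high levels miss the low mode entirely.
```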
6
  • The cluster tree of a density
  • The modal structure of a density is described by its cluster tree.
  • Each node N of the cluster tree
  • represents a subset D(N) of feature space
  • is associated with a density level c(N)
  • Root node
  • represents the entire feature space
  • is associated with density level c(N) = 0
  • The tree is defined recursively. To determine the descendants of node N:
  • Find the lowest level c for which the intersection of D(N) with L(c; p) has two connected components
  • If there is no such c, then N is a leaf of the tree; leaves of the tree <=> modes
  • Otherwise, create daughter nodes representing the connected components, with associated level c

7
Goal: Estimate the cluster tree of the underlying density p(x) from the sample feature vectors x1, ..., xn.
First step: Estimate p(x) by a density estimate p̂(x) (see below).
Second step: Compute the cluster tree of p̂ (maybe approximately).
8
4. Density estimation
Consider feature vectors x1, ..., xn as a sample from some density p(x). Goal: Estimate p(x).
Simplest idea: Let S(x, r) denote a sphere in feature space with radius r, centered at x. Assuming the density is roughly constant over S(x, r), the expected number of sample points in S(x, r) is
k ≈ n · Volume(S(x, r)) · p(x),  giving  p̂(x) = k / (n · Volume(S(x, r))).
Kernel estimate: Fix the radius r; k = number of sample feature vectors in S(x, r).
K-near-neighbor estimate: Fix the count k; r = smallest radius for which S(x, r) contains k sample feature vectors.
Many refinements have been suggested.
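A rough sketch of both estimates in their simplest form (an illustration, not code from the talk; assumes NumPy, and the function names are made up):

```python
import numpy as np
from math import gamma, pi

def ball_volume(r, d):
    """Volume of a d-dimensional sphere (ball) of radius r."""
    return (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d

def kernel_estimate(x, X, r):
    """Fix the radius r; k = number of sample points inside S(x, r)."""
    n, d = X.shape
    k = int(np.sum(np.linalg.norm(X - x, axis=1) <= r))
    return k / (n * ball_volume(r, d))

def knn_estimate(x, X, k):
    """Fix the count k; r = smallest radius for which S(x, r) contains k points."""
    n, d = X.shape
    r = np.sort(np.linalg.norm(X - x, axis=1))[k - 1]
    return k / (n * ball_volume(r, d))
```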
9
Example - kernel density estimate in 2-d
  • Swept under the rug:
  • Choice of sphere radius r (for kernel estimate) or count k (for near-neighbor estimate) --- critical! There are automatic methods.
  • Down-weight observations depending on distance from the query point
  • Adaptive estimation --- vary radius r depending on density
  • Other types of estimates, etc. (extensive literature)

10
  • Computational complexity
  • Computing a kernel or near-neighbor estimate at a query point x requires finding nearest neighbors of x in the sample x1, ..., xn.
  • Can find the k nearest neighbors of x in O(log n) time using spatial partitioning schemes such as k-d trees, after O(n log n) pre-processing (see the sketch below)
  • However:
  • Spatial partitioning is most effective if n is large relative to d.
  • Theoretical analysis shows that the number of nearest neighbors should increase with n and decrease with the dimensionality d: k ~ n^(4 / (d + 4)). Relevance?
  • In low dimensions (d < 4) can use histogram or average shifted histogram density estimates based on regular binning.
  • Evaluation for a query point in constant time, after O(n) pre-processing
  • High dimensionality may present a problem
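A sketch of the near-neighbor estimate with a k-d tree for the neighbor search (an illustration, not the talk's code; assumes SciPy's cKDTree, and the function name is made up):

```python
import numpy as np
from math import gamma, pi
from scipy.spatial import cKDTree

def knn_density_kdtree(queries, X, k):
    """k-near-neighbor density estimates at each query point (queries is (m, d))."""
    n, d = X.shape
    tree = cKDTree(X)                       # roughly n log n pre-processing
    dists, _ = tree.query(queries, k=k)     # roughly log n per query point
    r = dists[:, -1] if k > 1 else dists    # distance to the k-th nearest neighbor
    volume = (pi ** (d / 2) / gamma(d / 2 + 1)) * r ** d
    return k / (n * volume)
```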

11
  • 5. Recursive algorithms for constructing a cluster tree
  • For most density estimates p̂(x), computing level sets and finding their connected components is a daunting problem --- especially in high dimensions.
  • Idea: Compute the sample cluster tree instead
  • Each node N of the sample cluster tree
  • represents a subset X(N) of the sample
  • is associated with a density level c(N)
  • Root node
  • represents the entire sample
  • is associated with density level c(N) = 0

12
  • To determine the descendants of node N (see the sketch after this list):
  • Find the lowest level c for which the intersection of X(N) with L(c; p̂) falls into two connected components. Note: The intersection of X(N) with L(c; p̂) consists of those feature vectors in node N for which the estimated density p̂(xi) > c. (@)
  • If there is no such c, then N is a leaf of the tree
  • Otherwise, create daughter nodes representing the connected components, with associated level c.
  • Note:
  • (@) is the critical step. Will in general have to rely on a heuristic.
  • Daughters of a node N do not define a partition of X(N). Assigning low-density observations in X(N) to one of the daughters is a supervised learning problem
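A schematic sketch of this recursion (an illustration, not the talk's implementation; the node representation, function names, and the abstract split_heuristic are all made up, and the heuristics themselves are discussed on the next slide):

```python
import numpy as np

def build_node(indices, level, density, split_heuristic, candidate_levels):
    """One node of the sample cluster tree: the points X(N) and the level c(N).

    `indices` is an integer array of sample indices, `density[i]` is the estimated
    density p_hat(x_i), `candidate_levels` is an increasing sequence of levels c to
    try, and `split_heuristic(subset)` returns the connected components of a subset
    of high-density points as a list of index arrays.
    """
    node = {"points": indices, "level": level, "children": []}
    for c in candidate_levels:
        # Intersection of X(N) with L(c; p_hat): points of X(N) with p_hat(x_i) > c.
        high = indices[density[indices] > c]
        if len(high) < 2:
            break                                  # nothing left to split: N is a leaf
        components = split_heuristic(high)         # the critical step (@), a heuristic
        if len(components) >= 2:
            higher = [lv for lv in candidate_levels if lv > c]
            node["children"] = [build_node(comp, c, density, split_heuristic, higher)
                                for comp in components]
            break
    return node

# Root node: the entire sample, at density level 0.
# root = build_node(np.arange(len(density)), 0.0, density, split_heuristic, sorted(set(density)))
```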

13
Illustration
14
  • Critical step:
  • Find the lowest level c for which the observations in X(N) with estimated density p̂(xi) > c fall into two connected components of the level set L(c; p̂)
  • Heuristic 1 (goes with the k-near-neighbor density estimate):
  • Select the feature vectors xi in X(N) with p̂(xi) > c
  • Generate a graph connecting each feature vector to its k nearest neighbors
  • Check whether the graph has 1 or 2 connected components
  • Heuristic 2 (goes with the kernel density estimate):
  • Select the feature vectors xi in X(N) with p̂(xi) > c
  • Generate a graph connecting feature vectors with distance < r
  • Check whether the graph has 1 or 2 connected components
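A possible implementation of Heuristic 2 (an illustration, not the talk's code; assumes SciPy, and the function name is made up):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse.csgraph import connected_components

def radius_graph_components(points, r):
    """Connect points at distance < r and return the connected components.

    `points` is the (m, d) array of selected high-density feature vectors; the
    result is a list of positional index arrays, one per connected component.
    """
    tree = cKDTree(points)
    graph = tree.sparse_distance_matrix(tree, max_distance=r)   # sparse adjacency
    n_comp, labels = connected_components(graph.tocsr(), directed=False)
    return [np.flatnonzero(labels == j) for j in range(n_comp)]
```

Checking whether the result has one or two components decides the split; wrapped so that the positional indices are mapped back to sample indices, a function like this could also play the role of split_heuristic in the earlier sketch.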

15
  • Related work
  • Looking for the connected components of a level set --- One-level Mode Analysis --- was first suggested by David Wishart (1969).
  • Wishart's paper appeared in an obscure place --- the Proceedings of the Colloquium in Numerical Taxonomy, St. Andrews, 1968. Nobody in CS cites Wishart.
  • The idea has been re-invented multiple times --- sharpening (Tukey & Tukey), DBSCAN (Ester et al.). The methods differ in their heuristics for finding the connected components of a level set.
  • Wishart also realized that looking at a single level set might not be enough to detect all the modes => Hierarchical Mode Analysis. He did not think of it as estimating the cluster tree. The algorithm is awkward --- based on iterative merging instead of recursive partitioning. The OPTICS method of Ankerst et al. also considers level sets for different levels.

16
6. Constructing the cluster tree of the 1-near-neighbor density estimate
The 1-near-neighbor density estimate is defined by p̂(x) = 1 / dist^d(x, X) (up to the normalizing constant of the near-neighbor estimate with k = 1).
Advantage of the 1-near-neighbor estimate: Connected components of level sets of p̂ can be found exactly by analyzing the minimal spanning tree of the sample (see the sketch below).
Disadvantage of the 1-near-neighbor estimate: Not a very good density estimate --- noisy, with singularities at the observed feature vectors xi. (Not necessarily fatal --- we don't care about the density per se.)
Noise and singularities produce spurious nodes => specify a minimum cluster size.
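A sketch of one reading of this idea (not the talk's exact algorithm; assumes NumPy and SciPy). The level set of the 1-near-neighbor estimate at level c is a union of balls of radius r(c) around the sample points, and two sample points lie in the same connected component exactly when they are linked by a chain of sample points whose consecutive gaps are below 2·r(c); cutting minimal-spanning-tree edges of length at least 2·r(c) recovers those components.

```python
import numpy as np
from math import gamma, pi
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def one_nn_density(X):
    """1-near-neighbor density estimate at the sample points; the distance to the
    nearest *other* point is used to avoid the singularity at x_i itself."""
    n, d = X.shape
    D = squareform(pdist(X))
    np.fill_diagonal(D, np.inf)
    nn_dist = D.min(axis=1)
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)          # volume of the unit d-ball
    return 1.0 / (n * v_d * nn_dist ** d)

def level_set_components(X, c):
    """Component labels of the sample points in the level set L(c; p_hat),
    obtained by cutting long edges of the minimal spanning tree."""
    n, d = X.shape
    v_d = pi ** (d / 2) / gamma(d / 2 + 1)
    r = (1.0 / (n * v_d * c)) ** (1.0 / d)          # ball radius at level c
    mst = minimum_spanning_tree(csr_matrix(squareform(pdist(X))))
    mst.data[mst.data >= 2 * r] = 0                 # balls this far apart cannot overlap
    mst.eliminate_zeros()
    _, labels = connected_components(mst, directed=False)
    return labels
```

The dense distance matrix here is for clarity only; it does not achieve the pre-processing cost quoted on the next slide.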
17
  • Computationally attractive:
  • Computing and pre-processing the minimal spanning tree: O(n log n).
  • Deciding whether a cluster with m observations should be split: O(m).
  • We have implemented this method and run a number of experiments on simulated data and data sets from machine learning.
  • Competitive with other methods that make implicit assumptions about the shape of groups (like k-means, average linkage, ...)
  • A lot better when the assumptions made by those methods are violated.

18
  • 7. Summary and future work
  • The term "clustering" is ambiguous --- need to distinguish between compact partitioning and unsupervised learning.
  • Goal of unsupervised learning: detect the presence of distinct groups.
  • Assumption: groups = modes --- connected components of level sets --- of the feature density.
  • This definition accommodates elongated and non-linear groups.
  • The modal structure of a density is described by its cluster tree.
  • The cluster tree is defined recursively --- this suggests recursive partitioning.
  • Potentially many variations on the basic algorithm, differing in
  • (1) the estimate of the feature density, and (2) the heuristic for deciding when to split a node
  • Attractive choice: the 1-near-neighbor density estimate. Level sets and their connected components can be found exactly by analyzing the minimal spanning tree of the sample

19
  • Future work
  • A principled method for deciding on the number of groups --- hard!
  • Sampling or aggregation methods for dealing with large data sets
  • Visualization: Link the cluster tree with other displays such as histograms, scatterplots, etc., to understand the location and shape of clusters in feature space
  • Quantitative evaluation and comparison of methods

20
(No Transcript)
21
4. Finding the cluster tree of the estimated density
For most density estimates p̂(x), computing level sets and finding their connected components is a daunting problem --- especially in high dimensions. Idea: Compute the sample cluster tree instead.
Density cluster tree:
  • Each node N
  • represents a subset D(N) of feature space
  • is associated with a density level c(N)
  • Root node
  • represents the entire feature space
  • is associated with density level c(N) = 0
Sample cluster tree:
  • Each node N
  • represents a subset X(N) of the sample
  • is associated with a density level c(N)
  • Root node
  • represents the entire sample
  • is associated with density level c(N) = 0

22
Density cluster tree:
  • To determine the descendants of node N
  • Find the lowest level b for which the intersection of D(N) with L(b; p) has two connected components
  • If there is no such b, then N is a leaf of the tree
  • Otherwise, create daughter nodes representing the connected components, with associated level b
Sample cluster tree:
  • To determine the descendants of node N
  • Find the lowest level b for which the intersection of X(N) with L(b; p̂) falls into two connected components (@)
  • If there is no such b, then N is a leaf of the tree
  • Otherwise, create daughter nodes representing the subsets of X(N), with associated level b
  • (@) The critical step:
  • Easy to compute the intersection of X(N) with the level set L(b; p̂): it is the subset of the observations in X(N) for which p̂(xi) > b
  • Hard to decide whether they fall into one or two connected components --- usually need a heuristic

23
(No Transcript)