DB Seminar Series: The Subspace Clustering Problem
1
DB Seminar Series: The Subspace Clustering Problem
  • By Kevin Yip
  • (17 May 2002)

2
Presentation Outline
  • Problem definition
  • Different approaches
  • Focus: the projective clustering approach

3
Problem Definition Traditional Clustering
  • Traditional clustering problem: to divide data
    points into disjoint groups such that the value
    of an objective function is optimized.
  • Objective function: minimize intra-cluster
    distance and maximize inter-cluster distance.
  • Distance function: defined over all dimensions,
    numeric or categorical.

4
Problem Definition Traditional Clustering
  • Example problem: clustering points in 2-D
    space. Distance function: Euclidean distance
    (d = number of dimensions, 2 in this case).
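For reference, the Euclidean distance between points p and q over all d dimensions is the standard one:

```latex
\mathrm{dist}(p, q) = \sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}
```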

5
Problem Definition Traditional Clustering
  • Example (source: CURE, SIGMOD 1998)

6
Problem Definition Distance Function Problem
  • Observation: distance measures defined over all
    dimensions are sometimes inappropriate.
  • Example (source: DOC, SIGMOD 2002)
  • C1: (x1, x2)
  • C2: (x2, x3)
  • C3: (x1, x3)

7
Problem Definition Distance Function Problem
  • As the number of noise dimensions increases, the
    distance functions become less and less accurate.
  • => For each cluster, in addition to the set of
    data points, we also need to find the set of
    related dimensions (bounded attributes).

8
Problem Definition The Subspace Clustering
Problem
  • Formal definition: given a dataset of N data
    points and d dimensions, we want to divide the
    points into k disjoint clusters, each relating to
    a subset of dimensions, such that an objective
    function is optimized.
  • Objective function: usually intra-cluster
    distance, where each cluster uses its own set of
    dimensions in the distance calculation.

9
Problem Definition The Subspace Clustering
Problem
  • Observation: normal distance functions
    (Manhattan, Euclidean, etc.) give a smaller value
    if fewer dimensions are involved.
  • => 1. Use a normalized distance function.
    => 2. Should also try to maximize the number of
    dimensions.
  • Example (DOC): score(C, D) = |C| · (1/β)^|D|,
    where |C| = number of points in the cluster,
    |D| = number of relating attributes, and β is a
    constant.
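A minimal sketch of how such a score trades off cluster size against subspace dimensionality, assuming the reconstruction above; the function name and the example β value are illustrative, not from the DOC paper:

```python
def doc_score(num_points: int, num_dims: int, beta: float = 0.25) -> float:
    """DOC-style quality score: |C| * (1/beta)^|D|.

    Larger clusters and more bounded dimensions both raise the score,
    counteracting the bias of normalized distances toward
    low-dimensional subspaces.
    """
    return num_points * (1.0 / beta) ** num_dims

# Example: a 50-point cluster on 3 dimensions vs. an 80-point cluster on 2.
print(doc_score(50, 3))  # 3200.0
print(doc_score(80, 2))  # 1280.0
```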

10
Different Approaches Overview
  • Grid-based dimension selection
  • Association rule hypergraph partitioning
  • Context-specific Bayesian clustering
  • Projective clustering (Focus)

11
Different Approaches Grid-Based Dimension
Selection
  • CLIQUE (98), ENCLUS (99), MAFIA (99), etc.
  • Basic idea (see the sketch after this list):
  • A cluster is a region with high density.
  • Divide the domain of each dimension into units.
  • For each dimension, find all dense units (units
    with many points).
  • Merge neighboring dense units into clusters.
  • After finding all 1-d clusters, find 2-d dense
    units.
  • Repeat with higher dimensions.
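A minimal sketch of the bottom-up dense-unit search only (the final merging of neighboring dense units into clusters is not shown), assuming equal-width units over attributes pre-scaled to [0, 1); function names and thresholds are illustrative, not from CLIQUE's actual implementation:

```python
from itertools import combinations

def find_dense_units(points, num_bins=10, threshold=3):
    """Bottom-up grid search (CLIQUE-style sketch): a 'unit' is a set of
    (dimension, bin) pairs, and it is dense if enough points fall in it."""
    d = len(points[0])
    # Assume every attribute is already scaled to [0, 1); bin each point.
    binned = [tuple(min(int(v * num_bins), num_bins - 1) for v in p)
              for p in points]

    def support(unit):
        # Number of points whose bins match the unit on every involved dimension.
        return sum(all(p[dim] == b for dim, b in unit) for p in binned)

    # 1-d dense units.
    level = [frozenset([(dim, b)]) for dim in range(d) for b in range(num_bins)]
    level = [u for u in level if support(u) >= threshold]
    dense = [level]

    # Join dense k-d units that differ in exactly one dimension into (k+1)-d
    # candidates; keep only candidates that are themselves dense (monotonicity).
    while level:
        k = len(next(iter(level)))
        candidates = set()
        for a, b in combinations(level, 2):
            merged = a | b
            if len(merged) == k + 1 and len({dim for dim, _ in merged}) == k + 1:
                candidates.add(merged)
        level = [u for u in candidates if support(u) >= threshold]
        if level:
            dense.append(level)
    return dense

# Example: two tight groups of three points each in a 2-D dataset.
pts = [(0.12, 0.41), (0.15, 0.44), (0.13, 0.42),
       (0.71, 0.45), (0.72, 0.43), (0.74, 0.46)]
print(find_dense_units(pts, num_bins=10, threshold=3))
```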

12
Different Approaches Grid-Based Dimension
Selection
  • A 2-D dataset for illustration

13
Different Approaches Grid-Based Dimension
Selection
  • Divide the domain of each dimension into
    sub-units.

14
Different Approaches Grid-Based Dimension
Selection
  • Find all dense units (units with many points).
  • (Assume a density threshold of 3 points.)

15
Different Approaches Grid-Based Dimension
Selection
  • Merge neighboring dense units into clusters.

16
Different Approaches Grid-Based Dimension
Selection
  • Find 2-d dense units. Merge neighboring dense
    units, if any.

17
Different Approaches Grid-Based Dimension
Selection
  • Repeat with higher dimensions.

18
Different Approaches Grid-Based Dimension
Selection
  • Results: 1-d: <d1: (2,5)>, <d1: (6,8)>,
    <d2: (1,3)>, <d2: (4,6)>.
  • 2-d: <d1,d2: (4,5),(4,5)>, <d1,d2: (7,8),(4,5)>.

19
Different Approaches Grid-Based Dimension
Selection
  • Problems with the grid-based dimension selection
    approach:
  • Non-disjoint clusters.
  • Exponential dependency on the number of
    dimensions.

20
Different Approaches - Association Rule
Hypergraph Partitioning
  • 1997
  • Cluster related items (attribute values) using
    association rules, and cluster related
    transactions (data points) using clusters of
    items.

21
Different Approaches Association Rule
Hypergraph Partitioning
  • Procedures
  • Find all frequent itemsets in the dataset.
  • Construct a hypergraph with each item as a
    vertex, and each hyperedge corresponding to a
    frequent itemset. (If {A, B, C} is a frequent
    itemset, there is a hyperedge connecting the
    vertices of A, B, and C.)

22
Different Approaches Association Rule
Hypergraph Partitioning
  • Procedures
  • Each hyperedge is assigned a weight equal to a
    function of the confidences of all the
    association rules between the connected items.
    (If there are association rules
    A => {B,C} (conf. 0.8), {A,B} => C (conf. 0.4),
    {A,C} => B (conf. 0.6), B => {A,C} (conf. 0.4),
    {B,C} => A (conf. 0.8) and C => {A,B} (conf. 0.6),
    then the weight of the hyperedge {A,B,C} can be
    the average of the confidences, i.e. 0.6.)

23
Different Approaches Association Rule
Hypergraph Partitioning
  • Procedures
  • Use a hypergraph partitioning algorithm (e.g.
    HMETIS, 97) to divide the hypergraph into k
    partitions, so that the sum of the weights of the
    hyperedges that straddle partitions is minimized.
    Each partition forms a cluster with a different
    subset of items.
  • Assign each transaction to a cluster, based on a
    scoring function (e.g. percentage of matched
    items); see the sketch below.
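A minimal sketch of these last two steps, assuming the item clusters have already been produced by the partitioner; the scoring function is simply the percentage of a transaction's items that fall in the cluster, as on the slides:

```python
def average_confidence(rule_confidences):
    """Weight of a hyperedge: average confidence of all association
    rules among its items (the example on slide 22 gives 0.6)."""
    return sum(rule_confidences) / len(rule_confidences)

def assign_transaction(transaction, item_clusters):
    """Assign a transaction (a set of items) to the item cluster with the
    highest percentage of matched items."""
    def score(cluster):
        return len(transaction & cluster) / len(transaction)
    return max(range(len(item_clusters)), key=lambda i: score(item_clusters[i]))

# Hyperedge weight from the confidences on slide 22.
print(average_confidence([0.8, 0.4, 0.6, 0.4, 0.8, 0.6]))  # 0.6 (up to rounding)

# Two item clusters and one transaction.
clusters = [frozenset({"A", "B", "C"}), frozenset({"D", "E"})]
print(assign_transaction({"A", "C", "E"}, clusters))  # 0 (2/3 matched vs. 1/3)
```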

24
Different Approaches Association Rule
Hypergraph Partitioning
  • Problems with the association rule hypergraph
    partitioning approach
  • In real clusters, an item can be related to
    multiple clusters.
  • May not be applicable to numeric attributes.

25
Different Approaches Context-Specific Bayesian
Clustering
  • Naïve-Bayesian classification: given a training
    set with classes Ci (i = 1..k), a data point with
    attribute values x1, x2, ..., xd is classified by
    P(C=Ci | x1, x2, ..., xd)
    = P(x1, x2, ..., xd | C=Ci) P(C=Ci) / P(x1, x2, ..., xd)
    ∝ P(x1, x2, ..., xd | C=Ci) P(C=Ci)
    = P(x1|C=Ci) P(x2|C=Ci) ... P(xd|C=Ci) P(C=Ci)

26
Different Approaches Context-Specific Bayesian
Clustering
  • A RECOMB 2001 paper
  • Context-specific independence (CSI) model: each
    attribute Xi depends only on the classes in a set Li.
  • E.g. if k = 5 and L1 = {1, 4}, then
    P(X1|C=C2) = P(X1|C=C3) = P(X1|C=C5) = P(X1|C=Cdef)

27
Different Approaches Context-Specific Bayesian
Clustering
  • A CSI model M contains: k, the number of
    classes; G, the set of attributes that depend on
    some classes; and Li, the local structures of the
    attributes.
  • Parameters for a CSI model, θM: P(C=Ci) and
    P(Xi | Li(Cj))

28
Different Approaches Context-Specific Bayesian
Clustering
  • Recall P(C=Ci | x1, x2, ..., xd)
    ∝ P(x1|C=Ci) P(x2|C=Ci) ... P(xd|C=Ci) P(C=Ci);
    in the CSI model, this becomes
    P(x1 | L1(Ci)) P(x2 | L2(Ci)) ... P(xd | Ld(Ci)) P(C=Ci)
  • So, for a dataset (without class labels), if we
    can guess a CSI model and its parameters, then we
    can assign each data point to a class =>
    clustering.

29
Different Approaches Context-Specific Bayesian
Clustering
  • Searching for the best model and parameters:
  • Define a score to rank the current model and
    parameters (BIC(M, θM) or CS(M, θM)).
  • Randomly pick a model and a set of parameters and
    calculate the score.
  • Try modifying the model (e.g. add an attribute to
    a local structure), then recalculate the score.
  • If the score is better, keep it and try modifying
    a parameter.

30
Different Approaches Context-Specific Bayesian
Clustering
  • Repeat until a stopping criterion is reached
    (e.g. using simulated annealing).
  • (M1, θM1) -> (M2, θM1) -> (M2, θM2) -> (M3, θM2) -> ...
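A minimal sketch of this alternating model/parameter search, assuming hypothetical score, modify_model and modify_parameters helpers supplied by the caller; only the accept-if-better control flow from the slides is shown, not the paper's BIC/CS scores or its annealing schedule:

```python
import random

def search(initial_model, initial_params, score, modify_model,
           modify_parameters, max_iters=1000):
    """Greedy alternating search over (model, parameters) pairs."""
    model, params = initial_model, initial_params
    best = score(model, params)
    for _ in range(max_iters):
        # Alternate randomly between changing the model and its parameters.
        if random.random() < 0.5:
            cand_model, cand_params = modify_model(model), params
        else:
            cand_model, cand_params = model, modify_parameters(params)
        s = score(cand_model, cand_params)
        if s > best:  # keep the change only if the score improves
            model, params, best = cand_model, cand_params, s
    return model, params, best
```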

31
Different Approaches Context-Specific Bayesian
Clustering
  • The scoring functions (just to have a taste)

32
Different Approaches Context-Specific Bayesian
Clustering
  • Problems with the context-specific Bayesian
    clustering approach:
  • Cluster quality and execution time are not
    guaranteed.
  • Easily gets stuck in a local optimum.

33
Focus The Projective Clustering Approach
  • PROCLUS (99), ORCLUS (00), etc.
  • K-medoid partitional clustering.
  • Basic idea: use a set of sample points to
    determine the relating dimensions for each
    cluster, assign points to the clusters according
    to the dimension sets, throw away some bad
    medoids, and repeat.

34
Focus The Projective Clustering Approach
  • Algorithm (3 phases)
  • Initialization phase
  • Input k: target number of clusters.
  • Input l: average number of dimensions in a
    cluster.
  • Draw A·k samples randomly from the dataset, where
    A is a constant.
  • Use the max-min algorithm to draw B·k points from
    the sample, where B is a constant < A. Call this
    set of points M. (A sketch of max-min selection
    follows.)
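A minimal sketch of greedy max-min (farthest-point) selection, assuming points are numeric tuples and using Manhattan distance; the helper names are illustrative:

```python
def manhattan(p, q):
    """Manhattan (L1) distance over all dimensions."""
    return sum(abs(a - b) for a, b in zip(p, q))

def max_min_select(sample, num_points):
    """Greedily pick points that are far from those already chosen:
    each new point maximizes its distance to the nearest chosen point."""
    chosen = [sample[0]]  # start from an arbitrary point
    while len(chosen) < num_points:
        best = max((p for p in sample if p not in chosen),
                   key=lambda p: min(manhattan(p, c) for c in chosen))
        chosen.append(best)
    return chosen

# Example: pick 3 well-spread points from a small sample.
sample = [(0, 0), (0.1, 0.2), (5, 5), (5.2, 4.9), (0, 5)]
print(max_min_select(sample, 3))  # [(0, 0), (5.2, 4.9), (0, 5)]
```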

35
Focus The Projective Clustering Approach
  • Iterative Phase
  • Draw k medoids from M.
  • For each medoid mi, calculate the Manhattan
    distance di (involving all dimensions) to the
    nearest medoid.
  • Find all points in the whole dataset that are
    within a distance di from mi.

36
Focus The Projective Clustering Approach
  • Finding the set of surrounding points for a
    medoid

(figure: surrounding points of medoids A, B and C)
37
Focus The Projective Clustering Approach
  • The average distance between the points and the
    medoid along each dimension is calculated.
  • Among all k·d dimensions, select k·l of them with
    exceptionally small average distances. An extra
    restriction is that each medoid must pick at
    least 2 dimensions.
  • Whether the average distance from the medoid
    along a particular dimension is "exceptionally
    small" in a cluster is determined by its standard
    score (z-score), as shown below.
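A sketch of the standard score used here, assuming X_{i,j} is the average distance along dimension j in cluster i, Y_i is the mean of the X_{i,j} over all dimensions, and sigma_i their sample standard deviation (this matches the worked example on slide 39):

```latex
Z_{i,j} = \frac{X_{i,j} - Y_i}{\sigma_i}
```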

38
Focus The Projective Clustering Approach
  • Scoring dimensions

(figure: dimension scoring for the clusters around medoids A, B and C)
39
Focus The Projective Clustering Approach
  • Example
  • In cluster C1, the average distances from the
    medoid along dimension D1 = 10, along D2 = 15 and
    along D3 = 13. In cluster C2, the average
    distances are 7, 6 and 12.
  • Mean(C1) = (10 + 15 + 13) / 3 = 12.67.
  • S.D.(C1) = 2.52.
  • Z(C1D1) = (10 - 12.67) / 2.52 = -1.06.
  • Similarly, Z(C1D2) = 0.93, Z(C1D3) = 0.13,
    Z(C2D1) = -0.41, Z(C2D2) = -0.73, Z(C2D3) = 1.14.
  • So the order in which the dimensions are picked
    is C1D1 -> C2D2 -> C2D1 -> C1D3 -> C1D2 -> C2D3.
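A minimal check of this arithmetic (using the sample standard deviation, as the slide's numbers imply):

```python
from statistics import mean, stdev

def z_scores(avg_dists):
    """Standard score of each per-dimension average distance
    within one cluster (sample standard deviation)."""
    m, s = mean(avg_dists), stdev(avg_dists)
    return [(x - m) / s for x in avg_dists]

print([round(z, 2) for z in z_scores([10, 15, 13])])  # [-1.06, 0.93, 0.13]
print([round(z, 2) for z in z_scores([7, 6, 12])])    # [-0.41, -0.73, 1.14]
```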

40
Focus The Projective Clustering Approach
  • Iterative Phase (cont'd)
  • Now each medoid has a related set of dimensions.
    Assign every point in the whole dataset to the
    medoid closest to it, using a normalized distance
    function involving only the selected dimensions
    (a sketch follows).
  • Calculate the overall score of the clustering.
    Record the cluster definitions (relating
    attributes and assigned points) if the score is
    the new best one.
  • Throw away medoids with too few points. Replace
    them with some points remaining in M.
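A minimal sketch of such a normalized distance, assuming a PROCLUS-style segmental Manhattan distance (the Manhattan distance over the selected dimensions, divided by the number of selected dimensions); the function names are illustrative:

```python
def segmental_distance(point, medoid, dims):
    """Manhattan distance restricted to the medoid's selected dimensions,
    normalized by the number of those dimensions so that medoids with
    different numbers of dimensions are comparable."""
    return sum(abs(point[j] - medoid[j]) for j in dims) / len(dims)

def assign(point, medoids, dim_sets):
    """Index of the medoid closest to the point under its own dimension set."""
    return min(range(len(medoids)),
               key=lambda i: segmental_distance(point, medoids[i], dim_sets[i]))
```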

41
Focus The Projective Clustering Approach
  • Refinement Phase
  • After determining the best set of medoids, use
    the assigned points to re-determine the sets of
    dimensions, and reassign all points.
  • If the distance between a point and its medoid is
    longer than the distance between the medoid and
    its closest medoid, the point is marked as an
    outlier.

42
Focus The Projective Clustering Approach
  • Experiment
  • Dataset: synthetic, 100,000 points, 20
    dimensions.
  • Set 1: 5 clusters, each with 7 dimensions.
  • Set 2: 5 clusters, with 2-7 dimensions.
  • Machine: 233-MHz IBM RS/6000, 128 MB RAM, running
    AIX; dataset stored on a 2 GB SCSI drive.
  • Comparison: CLIQUE (grid-based)

43
Focus The Projective Clustering Approach
  • Result accuracy (set 1)

44
Focus The Projective Clustering Approach
  • Result accuracy (set 1)

45
Focus The Projective Clustering Approach
  • Result accuracy (set 1)

46
Focus The Projective Clustering Approach
  • Result accuracy (set 2)

47
Focus The Projective Clustering Approach
  • Result accuracy (set 2)

48
Focus The Projective Clustering Approach
  • Scalability (with dataset size)

49
Focus The Projective Clustering Approach
  • Scalability (with average dimensionality)

50
Focus The Projective Clustering Approach
  • Scalability (with space dimensionality)

51
Focus The Projective Clustering Approach
  • Problems with the projective clustering approach:
  • Need to know l, the average number of dimensions.
  • A cluster with a very small number of selected
    dimensions will absorb the points of other
    clusters.
  • Using a distance measure over the whole dimension
    space to select the sets of dimensions may not be
    accurate, especially when the number of noise
    attributes is large.

52
Summary
  • The subspace clustering problem: given a dataset
    of N data points and d dimensions, we want to
    divide the points into k disjoint clusters, each
    relating to a subset of dimensions, such that an
    objective function is optimized.
  • Grid-based dimension selection
  • Association rule hypergraph partitioning
  • Context-specific Bayesian clustering
  • Projective clustering

53
References
  • Grid-based dimension selection
  • Automatic Subspace Clustering of High
    Dimensional Data for Data Mining Applications
    (SIGMOD 1998)
  • Entropy-based Subspace Clustering for Mining
    Numerical Data (SIGKDD 1999)
  • MAFIA: Efficient and Scalable Subspace
    Clustering for Very Large Data Sets (Technical
    Report 9906-010, Northwestern University 1999)
  • Association rule hypergraph partitioning
  • Clustering Based On Association Rule
    Hypergraphs (Clustering Workshop 1997)

54
References
  • Multilevel Hypergraph Partitioning: Application
    in VLSI Domain (DAC 1997)
  • Context-specific Bayesian clustering
  • Context-Specific Bayesian Clustering for Gene
    Expression Data (RECOMB 2001)
  • Projective clustering
  • Fast Algorithms for Projected Clustering
    (SIGMOD 1999)
  • Finding Generalized Projected Clusters in High
    Dimensional Spaces (SIGMOD 2000)
  • A Monte Carlo Algorithm for Fast Projective
    Clustering (SIGMOD 2002)