An Efficient Method for Projected Clustering

1
An Efficient Method for Projected Clustering
  • Hongyin Cui, Jiang Ye
  • School of Computing Science
  • Simon Fraser University

2
Introduction
  • Clustering is a widely used technique for data
    mining, indexing and classification.
  • Most clustering algorithms do not work
    efficiently or effectively in high dimensional
    spaces because of the inherent sparsity of the
    data.

3
Clusters may exist in different subspaces
comprised of different combinations of attributes
4
Related Work
  • CLIQUE: density-based and grid-based
  • It partitions each dimension into the same number
    of equal-length intervals
  • It partitions an m-dimensional data space into
    non-overlapping rectangular units
  • A unit is dense if the fraction of total data
    points contained in the unit exceeds the input
    model parameter
  • A cluster is a maximal set of connected dense
    units within a subspace
  • A bottom-up greedy algorithm
  • exponential dependency on the number of dimensions
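
The dense-unit test above can be sketched in Python. This is only an illustration of the grid and density check for one given subspace, not CLIQUE's bottom-up search. The names `dense_units`, `xi` (intervals per dimension) and `tau` (density threshold), and the assumption that every attribute lies in [lo, hi], are illustrative choices.

```python
from collections import Counter

def dense_units(points, dims, xi, tau, lo=0.0, hi=1.0):
    """Return the grid units of subspace `dims` holding more than a
    fraction `tau` of all points. Each dimension is cut into `xi`
    equal-length intervals over [lo, hi] (an assumed value range)."""
    width = (hi - lo) / xi

    def cell(v):
        # Interval index; clamp so that v == hi falls in the last interval.
        return min(int((v - lo) / width), xi - 1)

    counts = Counter(tuple(cell(p[d]) for d in dims) for p in points)
    n = len(points)
    return {unit for unit, c in counts.items() if c / n > tau}

points = [(0.10, 0.90), (0.15, 0.20), (0.12, 0.50), (0.80, 0.85)]
print(dense_units(points, (0,), xi=4, tau=0.5))    # {(0,)}
print(dense_units(points, (0, 1), xi=4, tau=0.5))  # set()
```

The second call shows the sparsity effect: three points share an interval on attribute 0, but no unit of the 2-dimensional grid is dense.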

5
Related Work (Cont.)
  • PROCLUS
  • Find the best set of medoids by a hill climbing
    process.
  • Search not just in the space of the possible
    medoids but also in the space of possible
    dimensions associated with each medoid.
  • So it uses a locality analysis and its result may
    be a local optimum.

6
Related Work (Cont.)
  • DOC
  • An α-dense projective cluster is a pair (C, D),
    where
  • C is a subset of the data set S, and D is a subset
    of the full dimension set [d]
  • C must be sufficiently large, i.e. |C| ≥ α|S|
  • ∀i ∈ D, max_{p∈C} p_i - min_{q∈C} q_i ≤ w
  • ∀i ∈ [d] - D, max_{p∈C} p_i - min_{q∈C} q_i > w
  • It repeatedly
  • chooses p ∈ S and X ⊆ S via random sampling
  • computes the corresponding cluster (C, D)
  • It reports the best cluster found
  • The result is an approximation of the optimal
    projective cluster.

7
Problem Definition
  • Key observations
  • Often many records in a database share similar
    values for several attributes.
  • Identifying and grouping together records that
    share similar values for some attributes can both
    gain useful insight into the data (projected
    clusters), and obtain a more parsimonious
    representation of the data.

8
Problem Definition (cont.)
  • The user can define discretization criteria by
    specifying an interval width w_i for each
    attribute i, or by using a global interval width w
    for all attributes.
  • We say a group of records share a similar value
    on attribute i if they have the same discretized
    value on i (i.e. they fall in the same interval).
  • For example

Name          Position  Points  Played Mins  Penalty Mins
Blake         Defense   43      395          34
Borque        Defense   80      430          22
Gullimore     Defense   3       30           18
Gretzky       Centre    89      458          26
Konstantinov  Defense   10      560          120
May           Winger    35      290          180
Odjick        Winger    9       115          245
Tkachuk       Center    82      475          160
Wotton        Defense   5       38           6

Figure 1: A fragment of the NHL Players Statistic Table (1996)
9
Problem Definition (cont.)
  • In Figure 1, suppose the discretization intervals
    imposed on the attributes are
  • Position: already discrete
  • w_Points = 10, w_PlayedMins = 60, w_PenaltyMins = 20
  • We find the projected clusters
  • ({Borque, Gretzky, Tkachuk}, {Points, Played Mins}):
    played and scored a lot
  • ({Gullimore, Wotton}, {Position, Points,
    Played Mins, Penalty Mins}): same position; played,
    scored and were penalized sparingly
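
The example above can be checked with a short Python sketch. It assumes that an interval of width w starts at 0, so a value v is discretized to floor(v / w); the slides do not state where intervals start. The names `discretize` and `shared_attributes` are illustrative.

```python
# Rows of Figure 1: (name, position, points, played_mins, penalty_mins)
ROWS = [
    ("Blake", "Defense", 43, 395, 34),
    ("Borque", "Defense", 80, 430, 22),
    ("Gullimore", "Defense", 3, 30, 18),
    ("Gretzky", "Centre", 89, 458, 26),
    ("Konstantinov", "Defense", 10, 560, 120),
    ("May", "Winger", 35, 290, 180),
    ("Odjick", "Winger", 9, 115, 245),
    ("Tkachuk", "Center", 82, 475, 160),
    ("Wotton", "Defense", 5, 38, 6),
]
ATTRS = ("Position", "Points", "Played Mins", "Penalty Mins")
WIDTHS = {"Points": 10, "Played Mins": 60, "Penalty Mins": 20}

def discretize(row):
    """Map one row to (name, {attribute: discretized value}).
    Assumption: interval index is v // w; Position is kept as-is."""
    name, *values = row
    return name, {a: (v // WIDTHS[a] if a in WIDTHS else v)
                  for a, v in zip(ATTRS, values)}

def shared_attributes(names, table):
    """Attributes on which all named players have the same discretized value."""
    rows = [table[n] for n in names]
    return {a for a in ATTRS if len({r[a] for r in rows}) == 1}

TABLE = dict(discretize(r) for r in ROWS)
print(sorted(shared_attributes(["Borque", "Gretzky", "Tkachuk"], TABLE)))
# ['Played Mins', 'Points']
print(sorted(shared_attributes(["Gullimore", "Wotton"], TABLE)))
# ['Penalty Mins', 'Played Mins', 'Points', 'Position']
```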
10
Problem Definition (cont.)
  • Definition
  • Let p = (p_1, …, p_d) be a point in R^d, let [d]
    denote the set of the d dimensions, and let
    w_i > 0 for 1 ≤ i ≤ d.
  • ∀i ∈ [d], dimension i is partitioned and p_i is
    discretized by w_i.
  • Let S be a set of points in R^d. For any
    0 ≤ α ≤ 1, a projected cluster in S is a pair
    (C, D), C ⊆ S, D ⊆ [d], such that
  • |C| ≥ α|S|
  • ∀j ∈ D, all points in C share the same
    discretized value on attribute j (they fall in the
    same interval)
  • No D′ ⊋ D also satisfies the above two conditions
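
A small Python check of the three conditions, as a sketch only. It assumes the points are already discretized, calls the size threshold `alpha`, and reads the last condition as "D is the largest attribute set shared by the fixed group C". The function name `is_projected_cluster` is illustrative.

```python
def is_projected_cluster(C, D, S, alpha):
    """S: dict point_id -> tuple of discretized values.
    C: set of point ids, D: set of dimension indices."""
    # Condition 1: C is large enough.
    if not C or len(C) < alpha * len(S):
        return False
    n_dims = len(next(iter(S.values())))
    # Dimensions on which every point of C has the same discretized value.
    shared = {j for j in range(n_dims) if len({S[p][j] for p in C}) == 1}
    # Condition 2 needs D to be a subset of `shared`; condition 3
    # (no proper superset also qualifies) then forces D == shared.
    return set(D) == shared

S = {"a": (1, 2, 3), "b": (1, 2, 4), "c": (5, 2, 3), "d": (6, 7, 8)}
print(is_projected_cluster({"a", "b"}, {0, 1}, S, alpha=0.5))   # True
print(is_projected_cluster({"a", "b"}, {0}, S, alpha=0.5))      # False: D is not maximal
print(is_projected_cluster({"a", "b"}, {0, 1}, S, alpha=0.75))  # False: C is too small
```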

11
FIPCLUS: Mining projected clusters via frequent
closed itemsets
  • Basic Steps
  • Step 1: Discretize p on each attribute.
  • Step 2: Create a transaction database.
  • Step 3: Mine frequent closed itemsets with the
    CLOSET algorithm; each itemset identifies one
    subspace.
  • Step 4: Find the corresponding group of points
    for each subspace by scanning DB once.
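
Step 4 can be sketched as a single pass over the transaction database. The representation (one frozenset of (attribute, value) items per point) and the name `assign_points` are assumptions made for illustration; the item values come from Figure 1 under the assumption that a value v is discretized to floor(v / w).

```python
def assign_points(transactions, subspaces):
    """transactions: dict point_id -> frozenset of items.
    subspaces: iterable of frozensets (the mined closed itemsets).
    Returns dict itemset -> list of point ids containing it."""
    groups = {itemset: [] for itemset in subspaces}
    for point_id, items in transactions.items():  # the single scan of DB
        for itemset in groups:
            if itemset <= items:
                groups[itemset].append(point_id)
    return groups

DB = {
    "Borque":  frozenset({("Points", 8), ("Played Mins", 7), ("Penalty Mins", 1)}),
    "Gretzky": frozenset({("Points", 8), ("Played Mins", 7), ("Penalty Mins", 1)}),
    "Tkachuk": frozenset({("Points", 8), ("Played Mins", 7), ("Penalty Mins", 8)}),
    "May":     frozenset({("Points", 3), ("Played Mins", 4), ("Penalty Mins", 9)}),
}
subspace = frozenset({("Points", 8), ("Played Mins", 7)})
print(assign_points(DB, [subspace])[subspace])
# ['Borque', 'Gretzky', 'Tkachuk']
```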

12
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • Step 1
  • ∀i ∈ [d], partition dimension i and discretize
    p_i by w_i.
  • Or discretize p_i using user-specified criteria.
  • Or skip Step 1 if the user provides data that is
    already discretized.

13
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • Step 2
  • ∀i ∈ [d], enumerate the discretized values of
    attribute i and number each with a different
    integer j; the numbers are consecutive.
  • E.g. Position = {defense, center, winger}; then
    defense = 1, center = 2 and winger = 3
  • Substitute each discretized value in [d] with a
    unique integer, idj.
  • The original database is thereby transformed into
    a transaction database.
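
A minimal sketch of Step 2 in Python. It collapses the two numbering stages into one: each (attribute index, discretized value) pair gets the next unused integer id the first time it is seen. The name `to_transactions` and the sample rows are hypothetical.

```python
def to_transactions(rows):
    """rows: list of tuples of discretized values, one tuple per record.
    Returns (transactions, item_of), where item_of maps
    (attribute index, value) -> unique integer item id."""
    item_of = {}
    transactions = []
    for row in rows:
        items = []
        for i, value in enumerate(row):
            key = (i, value)
            if key not in item_of:
                item_of[key] = len(item_of) + 1  # ids start at 1
            items.append(item_of[key])
        transactions.append(frozenset(items))
    return transactions, item_of

rows = [("defense", 4, 6), ("defense", 8, 7), ("center", 8, 7)]
transactions, item_of = to_transactions(rows)
print(transactions)
print(item_of[("center" and 0, "center")])  # id of Position = center
```

The same value in two different attributes gets two different ids, so an item always identifies both the attribute and the interval.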

14
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • Step 3: based on CLOSET (Jian Pei, Jiawei Han)
  • Definition (Frequent closed itemset): An itemset
    X is a closed itemset if there exists no itemset
    X′ such that (1) X′ is a proper superset of X,
    and (2) every transaction containing X also
    contains X′. A closed itemset X is frequent if
    its support passes the given support threshold.
  • CLOSET is based on the FP-tree and works without
    candidate generation.
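
The definition can be made concrete with a brute-force Python sketch. This is not CLOSET: it enumerates every itemset, so it is only usable on tiny inputs. It relies on the fact that X is closed exactly when X equals the intersection of all transactions containing X. The function name and the sample database are illustrative.

```python
from itertools import combinations

def frequent_closed_itemsets(transactions, min_sup):
    """Return dict closed itemset -> support, by exhaustive enumeration."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            x = frozenset(combo)
            covering = [t for t in transactions if x <= t]
            if not covering or len(covering) < min_sup:
                continue
            # Closed: no extra item appears in every covering transaction.
            if frozenset.intersection(*covering) == x:
                result[x] = len(covering)
    return result

TDB = [frozenset("acdef"), frozenset("abe"), frozenset("cef"),
       frozenset("acdf"), frozenset("cef")]
for itemset, sup in sorted(frequent_closed_itemsets(TDB, 2).items(),
                           key=lambda kv: sorted(kv[0])):
    print("".join(sorted(itemset)), sup)
# a 3, acdf 2, ae 2, cef 3, cf 4, e 4 (one per line)
```

For instance, {c} is frequent but not closed, because every transaction containing c also contains f.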

15
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • CLOSET
  • Input: transaction database TDB and support
    threshold min_sup
  • Output: the complete set of frequent closed
    itemsets
  • Method
  • Initialization. Let FCI be the set of frequent
    closed itemsets. Initialize FCI ← ∅.
  • Find frequent items. Scan transaction database
    TDB and compute the frequent item list f_list.
  • Mine frequent closed itemsets recursively. Call
    CLOSET(∅, TDB, f_list, FCI).

16
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • CLOSET(X, DB, f_list, FCI)
  • Parameters
  • X: the frequent itemset.
  • DB: the X-conditional database, which is the
    subset of transactions in TDB containing X.
  • f_list: the frequent item list of DB.
  • FCI: the set of frequent closed itemsets already
    found.

17
FIPCLUS: Mining projected clusters via frequent
closed itemsets (cont.)
  • CLOSET(X, DB, f_list, FCI)
  • Extract the set Y of items appearing in every
    transaction of DB; insert X ∪ Y into FCI if it is
    not a subset of some itemset in FCI with the same
    support.
  • Build the FP-tree for DB; items in Y are excluded.
  • Directly extract frequent closed itemsets from
    the FP-tree.
  • ∀i ∈ rest of f_list, form the conditional database
    DB_i and compute the local frequent item list
    f_list_i.
  • ∀i ∈ rest of f_list, call CLOSET(iX, DB_i,
    f_list_i, FCI) if iX is not a subset of any
    frequent closed itemset in FCI with the same
    support.
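
The recursion above can be sketched in Python under simplifying assumptions: conditional databases are plain lists of frozensets instead of FP-trees, the "directly extract from FP-tree" shortcut is omitted, and the local f_list is recomputed inside each call rather than passed in. The helper names `subsumed` and `mine` are hypothetical.

```python
from collections import Counter

def subsumed(itemset, support, fci):
    """True if itemset is contained in a found closed itemset of equal support."""
    return any(itemset <= z and s == support for z, s in fci.items())

def closet(prefix, db, min_sup, fci):
    """prefix: frozenset X; db: X-conditional database with X's items removed;
    fci: dict closed itemset -> support, updated in place."""
    support = len(db)
    if support < min_sup:
        return
    counts = Counter(item for t in db for item in t)
    # Y: items appearing in every transaction of db.
    common = frozenset(i for i, c in counts.items() if c == support)
    closed = prefix | common
    if closed and not subsumed(closed, support, fci):
        fci[closed] = support
    # Local f_list: remaining frequent items, most frequent first.
    f_list = sorted((i for i, c in counts.items()
                     if c >= min_sup and i not in common),
                    key=lambda i: (-counts[i], i))
    # Process the least frequent item first, as in pattern-growth mining.
    for pos in range(len(f_list) - 1, -1, -1):
        item = f_list[pos]
        new_prefix = closed | {item}
        if subsumed(new_prefix, counts[item], fci):
            continue
        allowed = set(f_list[:pos])  # only items earlier in f_list
        cond_db = [t & allowed for t in db if item in t]
        closet(new_prefix, cond_db, min_sup, fci)

def mine(tdb, min_sup):
    fci = {}
    closet(frozenset(), list(tdb), min_sup, fci)
    return fci

TDB = [frozenset("acdef"), frozenset("abe"), frozenset("cef"),
       frozenset("acdf"), frozenset("cef")]
for itemset, sup in mine(TDB, 2).items():
    print("".join(sorted(itemset)), sup)
```

On this sample database with min_sup = 2 the sketch finds acdf:2, a:3, ae:2, cf:4, cef:3 and e:4; branches such as {a, f} are skipped because they are contained in acdf with the same support.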

18
Evaluation Comparison
  • Definition - more flexible and meaningful.
  • No assumption on the distribution of C in D.
  • Different interval w_i on each dimension, or
    flexible discretization criteria.
  • CLIQUE
  • Partitions each dimension into ξ intervals; not
    flexible.
  • Hard to determine the density threshold for each
    unit.
  • PROCLUS
  • Distance-based: has the flaws of all
    distance-based methods.
  • DOC
  • Very similar definition
  • But it uses a global interval width w for every
    dimension; not flexible.

19
Evaluation Comparison
  • Algorithm
  • Solves the clustering problem via mining frequent
    itemsets
  • More efficient, scalable and faster on large
    databases.
  • Runtime complexity is O(N), where N = |DB|;
    typically 4 or 5 scans of DB
  • CLIQUE
  • Bottom-up construction generates a huge number of
    candidates, each of which needs one scan of DB
    → not efficient
  • PROCLUS
  • Finds the best set of medoids by a hill climbing
    process.
  • Uses a locality analysis, and its result may be a
    local optimum.
  • Runtime complexity is O(N·k·l + N·k·d), where k is
    the number of clusters, l the average
    dimensionality of subspaces, and d the full
    dimensionality → less efficient

20
Evaluation Comparison
  • DOC
  • Finds an approximation of the clusters via random
    sampling.
  • Not complete, and quality cannot be guaranteed.
  • Runtime complexity is O(N·d^(c+1)), where c is a
    constant, d the full dimensionality, and
    N = |DB| → less efficient

21
Conclusion
  • We proposed FIPCLUS, which
  • Efficiently mines projected clusters via
    frequent closed itemsets.
  • Applies a compressed FP-tree structure for mining
    frequent closed itemsets without candidate
    generation.
  • Generates a much smaller set of frequent itemsets
    and leads to fewer and more interesting projected
    clusters.

22
Weakness and Future Work
  • Weakness
  • FIPCLUS may generate some overlapping clusters.
  • E.g. for (C1, D1) and (C2, D2),
  • C1 = {a, b, c}, D1 = {d1, d2, d3, d4};
    C2 = {a, b, c, e, f}, D2 = {d1, d2}
  • Future work
  • Modify FIPCLUS to mine the maximal frequent
    itemsets to address the above weakness.
  • E.g. in the above example, it only outputs (C1, D1).
  • It is actually a tradeoff, since maximal frequent
    itemsets may lose some interesting clusters and
    information.
  • Evaluate its effectiveness.
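
The proposed change can be illustrated with a small Python filter that keeps only maximal itemsets, i.e. those with no proper superset in the result. This is a sketch of the idea, not the modified FIPCLUS; the name `maximal_itemsets` is hypothetical.

```python
def maximal_itemsets(itemsets):
    """Keep only itemsets that are not a proper subset of another one."""
    sets = [frozenset(s) for s in itemsets]
    return [s for s in sets if not any(s < other for other in sets)]

D1 = {"d1", "d2", "d3", "d4"}
D2 = {"d1", "d2"}
print([sorted(s) for s in maximal_itemsets([D1, D2])])
# [['d1', 'd2', 'd3', 'd4']]
```

D2 is dropped together with its larger group C2, which is the loss of information mentioned in the tradeoff above.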

23
References
  • [1] R. Agrawal, J. Gehrke, D. Gunopulos, P.
    Raghavan, "Automatic Subspace Clustering of High
    Dimensional Data for Data Mining Applications"
  • [2] C. Procopiuc, M. Jones, P. Agarwal,
    T. M. Murali, "A Monte Carlo Algorithm for Fast
    Projective Clustering"
  • [3] C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu,
    J. Park, "Fast Algorithms for Projected Clustering"
  • [4] J. Pei, J. Han, R. Mao, "CLOSET: An
    Efficient Algorithm for Mining Frequent Closed
    Itemsets"
  • [5] K. Yip, D. Cheung, M. Ng, "A Highly-usable
    Projected Clustering Algorithm for Gene
    Expression Profiles"
  • [6] H. V. Jagadish, J. Madar, R. Ng, "Semantic
    Compression and Pattern Extraction with Fascicles"