1
Parameter-Free Spatial Data Mining Using MDL.
  • S. Papadimitriou, A. Gionis, P. Tsaparas, R.A.
    Väisänen, H. Mannila, and C. Faloutsos.
  • International Conference on Data Mining 2005

2
Problems
  • Finding patterns of spatial correlation and
    feature co-occurrence.
  • Automatically, that is, parameter-free.
  • Simultaneously.
  • For example:
  • Spatial locations on a grid.
  • Features correspond to species present in specific cells.
  • Each (cell, species) pair is 0 or 1, depending on whether that species is present in that cell.
  • Feature co-occurrence: cohabitation of species.
  • Spatial correlation: natural habitats for species.

3
Motivation
  • Many applications
  • Biodiversity Data: as we just demonstrated.
  • Geographical Data: presence of facilities on city blocks.
  • Environmental Data: occurrence of events (storms, drought, fire, etc.) in various locations.
  • Historical and Linguistic Data: occurrence of words in different languages/countries, historical events in a set of locations.
  • Existing methods either
  • Detect one pattern, but not both, or
  • Require user-input parameters.

4
Background
  • Minimum Description Length (MDL)
  • Let L(D|M) denote the code length required to
    represent data D given (using) model M. Let L(M)
    be the complexity required to describe the model
    itself.
  • The total code length is then
  • L(D, M) = L(D|M) + L(M)
  • This was used in SLIQ and is the intuitive notion
    behind the connection between data mining and
    data compression.
  • The best model minimizes L(D, M), resulting in
    optimal compression.
  • Choosing the best model is a problem in its own
    right.
  • This will be explored further in the next paper I
    present.
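
As a concrete illustration of the two-part score (not from the paper), here is a minimal Python sketch that scores a 0/1 sequence under a one-parameter Bernoulli model, assuming the common (1/2)·log2(n) bits-per-parameter convention for L(M); names and model choice are illustrative only.

```python
import math

def entropy_bits(p):
    """Binary entropy H(p) in bits; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def two_part_codelength(data, num_params, bits_per_param):
    """L(D, M) = L(D|M) + L(M) for a Bernoulli model of a 0/1 sequence.

    L(D|M): n * H(p_hat), bits for the data under the fitted model.
    L(M):   num_params * bits_per_param, a simple stand-in for model cost.
    """
    n = len(data)
    p_hat = sum(data) / n
    return n * entropy_bits(p_hat) + num_params * bits_per_param

# A biased 0/1 sequence: the one-parameter model wins over a fixed
# fair-coin code (0 parameters, 1 bit per symbol) despite its model cost.
data = [1] * 90 + [0] * 10
print(two_part_codelength(data, num_params=1, bits_per_param=0.5 * math.log2(len(data))))
print(len(data) * 1.0)  # fair-coin baseline: 100.0 bits
```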

5
Background
  • Quadtree Compression
  • Quadtrees
  • Used to index and reason about contiguous
    variable size grid regions (among other
    applications, mostly spatial).
  • Used for 2D data; the k-dimensional analogue is a kD-tree.
  • Full Quadtree: all nodes have either 0 or 4
    children.
  • Thus, all internal nodes correspond to a
    partitioning of a rectangular region into 4
    subregions.
  • Each quadtree's structure corresponds to a unique
    partitioning.
  • Transmission
  • If we only care about the structure (spatial
    partitioning), we can transmit a 0 for internal
    nodes and a 1 for leaves in depth-first order.
  • If we transmit the values as well, the cost is
    the number of leaves times the entropy of the
    leaf value distribution.
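
A minimal sketch of the structure-only transmission described above; the QNode helper class and function names are illustrative, not from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class QNode:
    """Node of a full quadtree: leaves carry a value, internal nodes have exactly 4 children."""
    children: Optional[List["QNode"]] = None  # None => leaf
    value: int = 0                            # leaf value, e.g. 0/1 occupancy

def encode_structure(node: QNode) -> str:
    """Depth-first structure code: '0' for an internal node, '1' for a leaf."""
    if node.children is None:
        return "1"
    return "0" + "".join(encode_structure(child) for child in node.children)

# A root split once into four leaves costs 5 structure bits.
root = QNode(children=[QNode(value=v) for v in (1, 0, 0, 1)])
print(encode_structure(root))  # -> "01111"
```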

6
Example
7
Quadtree Encoding
  • Let T be a quadtree with m leaf nodes, of which
    m_p have value p.
  • The total codelength is the structure cost (one
    bit per node, sent depth-first) plus the value
    cost, Σ_p m_p log2(m / m_p).
  • If we know the distribution of the leaf values,
    we can calculate this in constant time.
  • Updating the tree requires O(log n) time in the
    worst case, as part of the tree may require
    pruning.
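
One possible reading of the codelength above, as a sketch: it assumes one structure bit per node (a full quadtree with m leaves has (m − 1)/3 internal nodes) and entropy-codes the leaf values. The function name is mine.

```python
import math
from collections import Counter

def quadtree_codelength(leaf_values):
    """Codelength of a full quadtree with m leaves, one symbol per leaf.

    Structure: one bit per node in depth-first order; a full quadtree with
    m leaves has (m - 1) / 3 internal nodes, so (4m - 1) / 3 bits in total.
    Values: m * H(leaf value distribution) = sum_p m_p * log2(m / m_p).
    """
    m = len(leaf_values)
    structure_bits = (4 * m - 1) / 3
    value_bits = sum(mp * math.log2(m / mp) for mp in Counter(leaf_values).values())
    return structure_bits + value_bits

# 4 leaves, two of each value: 5 structure bits + 4 value bits = 9 bits.
print(quadtree_codelength([1, 0, 0, 1]))
```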

8
Binary Matrices / Bi-groupings
  • Bi-grouping
  • Simultaneous grouping of m rows and n columns
    into k and l disjoint row and column groups.
  • Let D denote an m x n binary matrix.
  • The cost of transmitting D is given as follows.
  • Recall the MDL principle: L(D) = L(D|M) + L(M).
  • Let Qx, Qy be a bi-grouping.
  • Lemma (we will skip the proof)
  • The codelength for transmitting an m-to-k mapping
    Qx, where m_p symbols are mapped to the value p,
    is approximately Σ_p m_p log2(m / m_p).
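
A small sketch of this mapping codelength, under the entropy-style reading of the lemma given above; the function name is mine.

```python
import math
from collections import Counter

def mapping_codelength(assignment):
    """Approximate bits to transmit an m-to-k mapping in which m_p of the m
    symbols are assigned to group p: sum_p m_p * log2(m / m_p), i.e. m times
    the entropy of the group-size distribution."""
    m = len(assignment)
    return sum(mp * math.log2(m / mp) for mp in Counter(assignment).values())

# 6 rows assigned to 2 groups (4 in group 0, 2 in group 1): about 5.51 bits.
print(mapping_codelength([0, 0, 0, 0, 1, 1]))
```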

9
Methodology
  • Exploiting spatial locality
  • Bi-grouping as presented is nonspatial!
  • To make it spatial, assign a non-uniform prior to
    possible groupings.
  • That is, adjacent cells are more likely to belong
    to the same group.
  • Row groups correspond to spatial groupings (e.g.,
    neighborhoods, habitats).
  • Row groupings should demonstrate spatial
    coherence.
  • Column groups correspond to families (e.g.,
    mountain birds, sea birds).
  • Intuition
  • Alternately group rows and columns iteratively
    until the total cost L(D) stops decreasing.
  • Finding the global optimum is very expensive.
  • So our approach will use a greedy search for
    local optima.

10
Algorithms
  • INNER
  • Groups the matrix, given the number of row and
    column groups.
  • Start with an arbitrary bi-grouping of matrix D
    into k row groups and l column groups.
  • do
  • For each row i from 1 to m, reassign i to the row
    group p, 1 ≤ p ≤ k, for which the cost gain is
    maximized.
  • Repeat for the columns, producing the next
    bi-grouping.
  • t ← t + 2
  • while (L(D) is decreasing)
  • A rough, non-spatial Python sketch of this loop
    is shown below.
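
A rough, non-spatial sketch of the INNER loop (assumed names, simplified cost): it uses only the per-block data cost, recomputes it from scratch for every candidate move, and ignores both the quadtree spatial prior and the paper's efficient incremental updates.

```python
import math
import numpy as np

def block_bits(ones, total):
    """Bits to encode a 0/1 block of `total` cells containing `ones` ones: total * H(ones/total)."""
    if total == 0 or ones == 0 or ones == total:
        return 0.0
    p = ones / total
    return -total * (p * math.log2(p) + (1 - p) * math.log2(1 - p))

def data_cost(D, row_groups, col_groups, k, l):
    """L(D | M): sum of per-block codelengths under the current bi-grouping."""
    cost = 0.0
    for p in range(k):
        rows = row_groups == p
        for q in range(l):
            block = D[np.ix_(rows, col_groups == q)]
            cost += block_bits(int(block.sum()), block.size)
    return cost

def inner(D, k, l, max_iters=20, seed=0):
    """Alternately reassign each row, then each column, to the group that
    lowers the total data cost; stop when the cost no longer decreases."""
    m, n = D.shape
    rng = np.random.default_rng(seed)
    row_groups = rng.integers(0, k, size=m)   # arbitrary initial bi-grouping
    col_groups = rng.integers(0, l, size=n)
    best = data_cost(D, row_groups, col_groups, k, l)
    for _ in range(max_iters):
        # Sweep over rows, then over columns.
        for groups, count, axis in ((row_groups, k, 0), (col_groups, l, 1)):
            for i in range(D.shape[axis]):
                costs = []
                for g in range(count):        # try every group for row/column i
                    groups[i] = g
                    costs.append(data_cost(D, row_groups, col_groups, k, l))
                groups[i] = int(np.argmin(costs))
        cost = data_cost(D, row_groups, col_groups, k, l)
        if cost >= best:
            break
        best = cost
    return row_groups, col_groups
```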

11
Algorithms
  • OUTER
  • Finds the number of row and column groups.
  • Start with k0 = l0 = 1.
  • Split the row group p with the maximum per-row
    entropy, holding the columns fixed.
  • Move each row in p to a new group k + 1 iff doing
    so would decrease the per-row entropy of p,
    resulting in a new row grouping.
  • Assign the bi-grouping to the result of running
    INNER on this split.
  • If the total cost does not decrease, return the
    previous bi-grouping.
  • Otherwise, increment t and repeat.
  • Finally, perform this again for the columns.
  • A coarse Python sketch is shown below.
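
A coarse sketch of the OUTER search, reusing inner(), data_cost(), and mapping_codelength() from the earlier sketches. It replaces the entropy-guided split described above with a simple restart of INNER at k + 1 groups, and handles only the row side; this is an assumed simplification, not the paper's procedure.

```python
def total_cost(D, row_groups, col_groups, k, l):
    """L(D, M): block data cost plus the cost of transmitting both group mappings."""
    return (data_cost(D, row_groups, col_groups, k, l)
            + mapping_codelength(list(row_groups))
            + mapping_codelength(list(col_groups)))

def outer(D, max_groups=16):
    """Grow the number of row groups one at a time, re-running INNER after each
    candidate split and keeping it only if the total cost drops."""
    k, l = 1, 1
    row_groups, col_groups = inner(D, k, l)
    best = total_cost(D, row_groups, col_groups, k, l)
    while k < max_groups:
        cand_rows, cand_cols = inner(D, k + 1, l)
        cand = total_cost(D, cand_rows, cand_cols, k + 1, l)
        if cand >= best:
            break                      # no improvement: stop splitting row groups
        k, row_groups, col_groups, best = k + 1, cand_rows, cand_cols, cand
    # The same loop would then be repeated for the column groups (omitted).
    return k, l, row_groups, col_groups
```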

12
Complexity
  • INNER is linear with respect to nonzero elements
    in D.
  • Let nnz denote the number of those elements.
  • Let k be the number of row groupings and l be the
    number of column groupings.
  • Row swaps are performed in the quadtree and take
    O(log m) time each, where m is the number of
    cells.
  • Let T be the iterations required to minimize the
    cost.
  • O(nnz · (k + l + log m) · T)
  • OUTER, though quadratic with respect to (k + l),
    is linear with respect to the dominating term
    nnz.
  • Let n be the number of row splits.
  • O((k + l)² · nnz + (k + l) · n · log m)

13
Experiments
  • NoisyRegions
  • Three features (species) on a 32x32 grid.
  • So D has 32x32 = 1024 rows.
  • And 3 columns.
  • 3% of the cells, chosen at random, have a wrong
    species, also randomly chosen.
  • The spatial and non-spatial groupings are shown
    to the right.
  • Recall Bi-grouping is not spatial by default.
  • Spatial grouping reduces the total codelength.
  • The approach is not quite perfect due to the
    heuristic nature of the algorithm.
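
A hypothetical reconstruction of the NoisyRegions input: only the 1024 x 3 matrix shape and the 3% noise level follow the slide; the band-shaped regions are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
side, n_species = 32, 3

# True pattern: three horizontal bands, one species per band (illustrative only).
row_of_cell = np.repeat(np.arange(side), side)      # grid row of each of the 1024 cells
true_species = (row_of_cell * n_species) // side
D = np.zeros((side * side, n_species), dtype=int)
D[np.arange(side * side), true_species] = 1

# Flip 3% of the cells, chosen at random, to a randomly chosen wrong species.
noisy = rng.choice(side * side, size=int(0.03 * side * side), replace=False)
for cell in noisy:
    wrong = (true_species[cell] + rng.integers(1, n_species)) % n_species
    D[cell] = 0
    D[cell, wrong] = 1

print(D.shape, D.sum(axis=0))   # (1024, 3) and the per-species cell counts
```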

14
Experiments
  • Birds
  • 219 Finnish bird species over 3813 10x10km
    habitats.
  • Species are the features, habitats are cells.
  • So our matrix is 3813x219.
  • The spatial grouping is clearly more coherent.
  • Spatial grouping reveals Boreal zones:
  • South Boreal: light blue and green.
  • Mid Boreal: yellow.
  • North Boreal: red.
  • Outliers are (correctly) grouped alone.
  • Species with specialized habitats.
  • Or those reintroduced into the wild.

15
Other approaches
  • Clustering
  • k-means
  • Variants using different estimates of central
    tendency: k-medoids, k-harmonic means, spherical
    k-means, ...
  • Variants determining k based on some criteria:
    X-means, G-means, ...
  • BIRCH
  • CURE
  • DENCLUE
  • LIMBO
  • Also information-theoretic.
  • These approaches are either lossy, parametric, or
    not easily adaptable to spatial data.

16
Room for improvement
  • Complexity
  • O(n log m) cost for reevaluating the quadtree
    codelength.
  • O(log m) worst-case time for each
    reevaluation/row swap, times n swaps.
  • However, the average-case complexity is probably
    much better.
  • If we know something about the data distribution,
    we might be able to reduce this.
  • Faster convergence
  • Fewer iterations, reducing the scaling factor T.
  • Rather than stopping only when there is no
    decrease in cost, perhaps stop when we fall below
    a threshold? (Introduces a parameter)
  • Accuracy
  • The search will only find local optima, leading
    to errors.
  • We can employ some approaches used in annealing
    or genetic algorithms to attempt to find the
    global optimum.
  • Randomly restarting in the search space, for
    example.
  • Stochastic gradient descent is similar to what
    we're already doing, actually.

17
Conclusion
  • Simultaneous and automatic grouping of spatial
    correlation and feature co-habitation.
  • Easy to exploit spatial locality.
  • Parameter-free.
  • Utilizes MDL
  • Minimizes the sum of the model cost and the data
    cost given the model.
  • Efficient.
  • Almost linear with the number of entries in the
    matrix.

18
References
  • S. Papadimitriou, A. Gionis, P. Tsaparas, R.A.
    Väisänen, H. Mannila, and C. Faloutsos,
    "Parameter-Free Spatial Data Mining Using MDL",
    ICDM, Houston, TX, U.S.A., November 27-30, 2005.
  • M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: A
    Fast Scalable Classifier for Data Mining", in
    Proceedings of the 5th International Conference
    on Extending Database Technology, Avignon,
    France, March 1996.

19
Thanks!
  • Any questions?