1
High-dimensional Indexing based on Dimensionality Reduction
  • Students: Qing Chen, Heng Tao Shen, Sun Ji Chun
  • Advisor: Professor Beng Chin Ooi

2
Outline
  • Introduction
  • Global Dimensionality Reduction
  • Local Dimensionality Reduction
  • Indexing Reduced-Dim Space
  • Effects of Dimensionality Reduction
  • Behaviors of Distance Metrics
  • Conclusion and Future Work

3
Introduction
  • High-Dim Applications
  • Multimedia, time-series, scientific, market
    basket, etc.
  • Various Trees Proposed
  • R-tree, R*-tree, R+-tree, X-tree, sKd-tree, SS-tree,
    M-tree, KDB-tree, TV-tree, Buddy-tree, Grid File,
    Hybrid tree, iDistance, etc.
  • Dimensionality Curse
  • Efficiency drops quickly as dimensionality increases.

4
Introduction
  • Dimensionality Reduction Techniques
  • GDR
  • LDR
  • High-Dim Indexing on RDS
  • Existing Indexing on single RDS
  • Global Indexing on multiple RDS
  • Side Effects of DR
  • Different Behaviors of Distance Metrics
  • Conclusion and Future Work

5
GDR
  • Perform Reduction on the whole dataset.

6
GDR
  • Improving query accuracy by performing principal
    component analysis (PCA)
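GDR via PCA can be sketched as follows — a minimal NumPy illustration, not the authors' implementation; the function `gdr_pca` and its parameters are hypothetical names:

```python
import numpy as np

def gdr_pca(data, k):
    """Global Dimensionality Reduction (sketch): project the whole
    dataset onto its first k principal components."""
    mean = data.mean(axis=0)
    centered = data - mean
    # Eigen-decomposition of the covariance matrix
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # eigh returns eigenvalues in ascending order; keep the k largest
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]    # d x k projection matrix
    reduced = centered @ components   # n x k reduced-dim points
    return reduced, components, mean
```

Because the whole dataset shares one projection, GDR is cheap, but (as the following slides note) it only works well when the correlation is global.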

7
GDR
  • Using Aggregate Data for Reduction in Dynamic
    Spaces [8].

8
GDR
  • Works for globally correlated data.
  • GDR may cause significant information loss on real data.

9
LDR [5]
  • Find locally correlated data clusters
  • Perform dimensionality reduction on the
    clusters individually

10
LDR - Definitions
  • Cluster and subspace
  • Reconstruction Distance
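The reconstruction distance of a point P with respect to a cluster's subspace S is the distance between P and its projection onto S, i.e., the error introduced by dropping the discarded dimensions. A minimal sketch (the name `recon_dist` is illustrative):

```python
import numpy as np

def recon_dist(p, centroid, components):
    """Reconstruction distance of point p w.r.t. a cluster subspace:
    the Euclidean distance between p and its projection onto the
    subspace spanned by `components` (d x k, orthonormal columns),
    centred at `centroid`."""
    centered = p - centroid
    # Project into the subspace and map back into the full d-dim space
    projected = components @ (components.T @ centered)
    return float(np.linalg.norm(centered - projected))
```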

11
LDR - Constraints on cluster
  • Reconstruction distance bound, i.e., MaxReconDist
  • Dimensionality bound, i.e., MaxDim
  • Size bound, i.e., MinSize

12
LDR - Clustering Algo
  • Construct spatial clusters
  • Determine the max number of clusters, M
  • Determine the cluster range, ε
  • Choose a set of well-scattered points as the
    centroids (C) of the spatial clusters
  • Assign each data point P to its closest centroid if
    Distance(P, C_closest) < ε
  • Update the centroids of the clusters

13
LDR - Clustering Algo (cont)
  • Compute principal components (PCs)
  • Perform PCA individually on each cluster
  • Compute the mean of each cluster's points, i.e., E_i
  • Determine subspace dimensionality
  • Progressively check each point against
    MaxReconDist and MaxDim
  • Decide the optimal dimensionality for each
    cluster

14
LDR - Clustering Algo (cont)
  • Recluster points
  • Insert each point into a suitable cluster or
    the outlier set O
  • i.e., ReconDist(P, S) < MaxReconDist

15
LDR - Clustering Algo (cont)
  • Finally, apply the size bound to eliminate
    clusters with too small a population, redistributing
    their points to other clusters or to the outlier set O.
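Putting the steps above together, a simplified sketch of the clustering pipeline (spatial clustering, per-cluster PCA, subspace-dimensionality selection, size bound). All names are illustrative, and the centroid-update and full reclustering passes of [5] are omitted for brevity:

```python
import numpy as np

def ldr_cluster(data, centroids, eps, max_recon_dist, max_dim, min_size):
    """Illustrative sketch of LDR clustering, not the authors' code."""
    # 1. Spatial clusters: assign each point to its closest centroid
    #    if it lies within the cluster range eps; otherwise outlier.
    members = [[] for _ in centroids]
    outliers = []
    for p in data:
        d = np.linalg.norm(centroids - p, axis=1)
        i = int(np.argmin(d))
        (members[i] if d[i] < eps else outliers).append(p)

    clusters = []
    for pts in members:
        pts = np.array(pts)
        if len(pts) == 0:
            continue
        # 2. Per-cluster PCA: cluster mean E_i and principal components
        mean = pts.mean(axis=0)
        _, _, vt = np.linalg.svd(pts - mean, full_matrices=False)
        # 3. Smallest subspace dim <= max_dim such that every member's
        #    reconstruction distance stays below max_recon_dist
        for k in range(1, max_dim + 1):
            W = vt[:k].T
            resid = (pts - mean) - (pts - mean) @ W @ W.T
            if np.linalg.norm(resid, axis=1).max() < max_recon_dist:
                break
        clusters.append((mean, W, pts))

    # 4. Size bound: clusters that are too small are dissolved
    final = []
    for mean, W, pts in clusters:
        if len(pts) >= min_size:
            final.append((mean, W, pts))
        else:
            outliers.extend(pts)
    return final, outliers
```

On two locally correlated (elongated) point sets, this finds one 1-dimensional subspace per cluster, which a single global PCA could not do without large reconstruction error.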

16
LDR - Compare to GDR
  • LDR improves retrieval efficiency and
    effectiveness by capturing more detail of the local
    data.
  • But it incurs a higher computational cost during
    the reduction step.

17
LDR
  • LDR cannot discover all the possible correlated
    clusters.

18
Indexing RDS
  • GDR
  • One RDS only
  • Applying existing multi-dim indexing structures,
    e.g. R-tree, M-tree
  • LDR
  • Several RDS in different axis systems
  • Global Indexing Structure

19
Global Indexing
Each RDS corresponds to one tree.
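With LDR, each RDS can be probed during a global kNN search, pruning with the fact that distance in a reduced space (projection onto orthonormal components) lower-bounds the true distance. A hedged sketch in which a sorted scan stands in for each per-RDS tree; `knn_global` and the cluster tuple layout are illustrative, not the paper's structure:

```python
import heapq
import numpy as np

def knn_global(query, clusters, k):
    """Exact kNN over several reduced-dim spaces. Each cluster is
    (mean, W, points) with W an orthonormal d x m projection; the
    reduced-space distance is a lower bound on the true distance,
    so full distances are computed only for surviving candidates."""
    best = []  # max-heap via negated distances
    for mean, W, pts in clusters:
        q_red = (query - mean) @ W
        p_red = (pts - mean) @ W
        lb = np.linalg.norm(p_red - q_red, axis=1)  # lower bounds
        for i in np.argsort(lb):
            if len(best) == k and lb[i] >= -best[0][0]:
                break  # no remaining point in this RDS can improve
            d = float(np.linalg.norm(pts[i] - query))
            if len(best) < k:
                heapq.heappush(best, (-d, i))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, i))
    return sorted(-nd for nd, _ in best)
```

The pruning is safe because ||Wᵀ(p − q)|| ≤ ||p − q|| whenever W has orthonormal columns, so once the lower bound exceeds the current k-th best distance the scan of that RDS can stop.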
20
Side Effects of DR
  • Information loss → lower precision
  • Possible improvement?
  • Text domain
  • DR → qualitative improvement
  • Least information loss → highest precision →
    highest qualitative improvement

21
Side Effects of DR
  • Latent Semantic Indexing (LSI), based on the
    SVD A = UΣVᵀ [9,10,11]

22
Side Effects of DR
  • DR effectively improves the data representation by
    modelling the data in terms of concepts
    rather than words.
  • The directions with the greatest variance capture the
    semantic aspects of the data.

23
Side Effects of DR
  • Dependency among attributes results in poor
    measurements when using L-norm metrics.
  • Dimensions with the largest eigenvalues have the highest
    quality [2].
  • So what else do we have to consider?

Inter-correlations
24
Mahalanobis Distance
  • Normalized Mahalanobis Distance
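The Mahalanobis distance weights each direction by the cluster's covariance, d(x, μ) = √((x − μ)ᵀ Σ⁻¹ (x − μ)); a minimal sketch (the exact normalization used on this slide is not recoverable from the transcript, so it is omitted):

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of x from a distribution with the given
    mean and covariance; unlike an L-norm, it accounts for variance
    and inter-attribute correlation."""
    diff = np.asarray(x) - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```

With the identity covariance it reduces to the Euclidean distance; with unequal variances, points at the same Euclidean distance get different Mahalanobis distances, which is exactly what produces the elliptical cluster boundaries discussed on the next slides.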

25
Mahalanobis vs. L-norm
26
Mahalanobis vs. L-norm
  • Takes the local shape into consideration by computing
    variances and covariances.
  • Tends to group points into elliptical clusters,
    which define a multi-dim space whose boundaries
    determine the range of the degree of correlation
    suitable for dimensionality reduction.
  • Defines the standard-deviation boundary of the
    cluster.

27
Incremental Ellipse
  • Aims to discover all the possible correlated
    clusters with different sizes, densities and
    elongations.

28
Behaviors of Distance Metrics in High-dim Space
  • Is KNN meaningful in high-dim space? [1]
  • Furthest-neighbor / nearest-neighbor ratio is almost 1 →
    poor discrimination [4]
  • One criterion: relative contrast

29
Behaviors of Distance Metrics in High-dim Space
  • Relative contrast at different dimensionalities
    for different metrics

30
Behaviors of Distance Metrics in High-dim Space
  • Relative Contrast of L-norm Metrics

31
Behaviors of Distance Metrics in High-dim Space
  • For higher dimensionality, the relative contrast
    provided by a norm with a smaller parameter is more
    likely to dominate one with a larger
    parameter.
  • So L-norm metrics with a smaller parameter are a
    better choice for KNN searching in high-dim
    space.
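The relative contrast criterion of [1], (D_max − D_min) / D_min, can be measured empirically; a small sketch (`relative_contrast` is an illustrative name):

```python
import numpy as np

def relative_contrast(data, query, p):
    """Relative contrast (D_max - D_min) / D_min of a query point
    under the L_p norm. It shrinks as dimensionality grows, and
    shrinks faster for norms with a larger parameter p."""
    d = np.sum(np.abs(data - query) ** p, axis=1) ** (1.0 / p)
    return float((d.max() - d.min()) / d.min())
```

Running it on uniform data typically shows the contrast collapsing from low to high dimensionality, with L1 retaining somewhat more contrast than L2 at high dimensionality, matching the claim above.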

32
Conclusion
  • Two Dimensionality Reduction Methods
  • GDR
  • LDR
  • Indexing Methods
  • Existing Structure
  • Global Indexing Structure
  • Side Effects of DR
  • Qualitative Improvement
  • Both intra-variance and inter-variance
  • Different behaviors for different metrics
  • A smaller norm parameter k achieves higher quality

33
Future work
  • Propose a new tree for true high-dimensional
    indexing without reduction, for datasets without
    correlations?
  • (Beneath iDistance, further prune the search
    sphere using the LB-tree?)
  • Reduce the dimensionality of data points that are
    combinations of multiple features, such as images
    (shape, color, texture, etc.).

34
References
  • [1] Charu C. Aggarwal, Alexander Hinneburg, Daniel A. Keim: On the Surprising Behavior of Distance Metrics in High Dimensional Spaces. ICDT 2001: 420-434
  • [2] Charu C. Aggarwal: On the Effects of Dimensionality Reduction on High Dimensional Similarity Search. PODS 2001
  • [3] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000: 506-515
  • [4] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft: When Is "Nearest Neighbor" Meaningful? ICDT 1999
  • [5] K. Chakrabarti and S. Mehrotra: Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces. VLDB 2000: 89-100
  • [6] R. Weber, H. Schek, and S. Blott: A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces. VLDB 1998: 194-205
  • [7] C. Yu, B. C. Ooi, K.-L. Tan, and H. V. Jagadish: Indexing the Distance: An Efficient Method to KNN Processing. VLDB 2001

35
References
  • [8] K. V. R. Kanth, D. Agrawal, and A. K. Singh: Dimensionality Reduction for Similarity Searching in Dynamic Databases. SIGMOD 1998
  • [9] Jon M. Kleinberg, Andrew Tomkins: Applications of Linear Algebra in Information Retrieval and Hypertext Analysis. PODS 1999: 185-193
  • [10] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala: Latent Semantic Indexing: A Probabilistic Analysis. PODS 1998: 159-168
  • [11] Chris H. Q. Ding: A Similarity-Based Probability Model for Latent Semantic Indexing. SIGIR 1999: 59-65
  • [12] Alexander Hinneburg, Charu C. Aggarwal, Daniel A. Keim: What Is the Nearest Neighbor in High Dimensional Spaces? VLDB 2000