Transcript and Presenter's Notes

Title: Indexing Multidimensional Feature Spaces

1
Indexing Multidimensional Feature Spaces
  • Overview of Multidimensional Index Structure
  • Hybrid Tree, Chakrabarti et al., ICDE 1999
  • Local Dimensionality Reduction, Chakrabarti et
    al., VLDB 2000

2
Queries over Feature Spaces
  • Consider a d-dimensional feature space
  • e.g., color histogram, texture, ...
  • Nature of Queries
  • range queries: objects that reside within the
    region specified in the query
  • k-nearest neighbor queries: objects that are
    closest to a query object based on a distance
    metric
  • approx. nearest neighbor queries: the retrieved
    object is within (1 + epsilon) of the real nearest
    neighbor
  • all-pair (similarity join) queries: retrieve all
    pairs of objects within an epsilon threshold
  • A search algorithm may include
  • false positives: objects that do not meet the
    query condition but are retrieved anyway. We
    aim to minimize false positives
  • false negatives: objects that meet the query
    condition but are not returned. Approaches
    usually avoid false negatives

3
Approach: Utilize Single-Dimensional Indexes
  • Index each attribute independently
  • Project the query range onto each attribute and
    determine the matching pointers
  • Intersect the pointer sets (see the sketch below)
  • Go to the database and retrieve the objects in the
    intersection

May result in very high I/O cost
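To make the approach concrete, here is a minimal Python sketch (illustrative, not from the slides) in which a sorted list of (value, id) pairs stands in for each single-dimensional index: the query range is projected onto each attribute, and the resulting pointer sets are intersected before touching the database.

```python
from bisect import bisect_left, bisect_right

def range_lookup(index, low, high):
    """Return the set of object ids whose value falls in [low, high]."""
    keys = [v for v, _ in index]            # index is sorted by value
    return {oid for _, oid in index[bisect_left(keys, low):
                                    bisect_right(keys, high)]}

# One stand-in "index" per attribute, each sorted by attribute value.
idx_x = [(0.2, 'a'), (0.5, 'b'), (0.7, 'c'), (0.9, 'd')]
idx_y = [(0.1, 'b'), (0.4, 'c'), (0.6, 'a'), (0.8, 'd')]

# Project the query region onto each attribute, then intersect pointers.
candidates = range_lookup(idx_x, 0.4, 0.8) & range_lookup(idx_y, 0.3, 0.7)
print(candidates)  # only these objects are fetched from the database
```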
4
Multiple Key Index
  • Index on one attribute provides pointers to an
    index on the other
  • Cannot support partial match queries on the second
    attribute
  • Performance of range search is not much better
    than the independent-attribute approach
  • The secondary indices may be of different sizes
    -- specifically, some of them may be very small

[Figure: an index on the first attribute whose leaves point to
indexes on the second attribute]
5
R-tree Data Structure
  • Extension of B-tree to multidimensional space.
  • Paginated, balanced, guaranteed storage
    utilization.
  • Can support both point data and data with spatial
    extent
  • Groups objects into possibly overlapping clusters
    (rectangles in our case)
  • Search for range query proceeds along all paths
    that overlap with the query.

6
R-tree Insert (Object E)
  • Step I1
  • ChooseLeaf L to insert E /* find position to
    insert */
  • Step I2
  • If L has room, install E
  • Else SplitNode(L)
  • Step I3
  • AdjustTree /* propagate changes */
  • Step I4
  • If the node split propagated to the root, adjust
    the height of the tree

7
ChooseLeaf
  • Step CL1
  • Set N to be root
  • Step CL2
  • If N is a leaf return N
  • Step CL3
  • If N is not a leaf, let F be the entry whose
    rectangle needs least enlargement to include the
    object
  • break ties by choosing the smaller rectangle
  • Step CL4
  • Set N to be the child node pointed to by entry F
  • goto Step CL2 (see the sketch below)
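A minimal Python sketch of ChooseLeaf. The Node/Entry attributes (node.is_leaf, node.entries, entry.rect as ((xlo, ylo), (xhi, yhi)), entry.child) are hypothetical names used for illustration; the area() and enlarge() helpers are reused by the later split and search sketches.

```python
def area(rect):
    """Area of an axis-aligned rectangle ((xlo, ylo), (xhi, yhi))."""
    (xlo, ylo), (xhi, yhi) = rect
    return (xhi - xlo) * (yhi - ylo)

def enlarge(r1, r2):
    """Smallest rectangle covering both r1 and r2."""
    (alo, ahi), (blo, bhi) = r1, r2
    return (tuple(min(a, b) for a, b in zip(alo, blo)),
            tuple(max(a, b) for a, b in zip(ahi, bhi)))

def choose_leaf(node, obj_rect):
    """Descend from the root (CL1) until a leaf is reached (CL2)."""
    while not node.is_leaf:
        def cost(entry):
            growth = area(enlarge(entry.rect, obj_rect)) - area(entry.rect)
            return (growth, area(entry.rect))   # CL3: least enlargement,
        best = min(node.entries, key=cost)      # ties -> smaller rectangle
        node = best.child                       # CL4: descend, goto CL2
    return node
```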

8
Split Node
  • Given a node, split it into two nodes that are
    each at least half full
  • Multiple objectives
  • minimize overlap
  • minimize covered area
  • R-tree minimizes covered area
  • What is the optimal criterion?

Minimize covered area
Minimize overlap
9
Minimizing Covered Area
  • Group objects into 2 parts such that the covered
    area is minimized
  • NP-hard!
  • Hence use heuristics
  • Two heuristics explored
  • quadratic and linear

10
Basic Split Strategy
  • /* Divide the set of M+1 entries into 2 groups G1
    and G2 */
  • PickSeeds for G1 and G2
  • Invoke PickNext to assign objects to groups one at
    a time until either all objects are assigned or
    one of the groups becomes half full.
  • If one group becomes half full, assign the rest of
    the objects to the other group.

11
Quadratic Split
  • PickSeeds
  • for each pair of entries E1 and E2, compose a
    rectangle J including E1.rect and E2.rect
  • let d = area(J) - area(E1.rect) - area(E2.rect)
    /* d is the wasted space */
  • choose the most wasteful pair, the one with the
    largest d, as seeds for groups G1 and G2
  • PickNext /* select the next entry to put in a group */
  • determine the cost of putting each entry in
    groups G1 and G2
  • for each unassigned entry calculate
  • d1 = area increase required in the covering
    rectangle of group G1 to include the entry
  • d2 = area increase required in the covering
    rectangle of group G2 to include the entry
  • select the entry with the greatest preference for
    a group, i.e., any entry with the maximum difference
    between d1 and d2 (see the sketch below)
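A sketch of PickSeeds and PickNext for the quadratic split, reusing area() and enlarge() from the ChooseLeaf sketch above; rects is a plain list of rectangles standing in for node entries.

```python
from itertools import combinations

def waste(r1, r2):
    """Dead area d if r1 and r2 were covered by one rectangle J."""
    return area(enlarge(r1, r2)) - area(r1) - area(r2)

def pick_seeds(rects):
    """PickSeeds: the most wasteful pair seeds groups G1 and G2."""
    return max(combinations(range(len(rects)), 2),
               key=lambda pair: waste(rects[pair[0]], rects[pair[1]]))

def pick_next(rects, unassigned, cover1, cover2):
    """PickNext: the entry with the strongest preference for a group,
    i.e. the maximum |d1 - d2| of required area increases."""
    def preference(i):
        d1 = area(enlarge(cover1, rects[i])) - area(cover1)
        d2 = area(enlarge(cover2, rects[i])) - area(cover2)
        return abs(d1 - d2)
    return max(unassigned, key=preference)
```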

12
Linear Split
  • PickSeeds
  • find extreme rectangles along each dimension:
    the entry with the highest low side and the
    entry with the lowest high side
  • record the separation
  • normalize the separation by the width of the
    extent along that dimension
  • choose as seeds the pair with the greatest
    normalized separation along any dimension
  • PickNext
  • randomly choose the entry to assign

13
R-tree Search (Range Search on range S)
  • Start from root
  • If node T is not leaf
  • check entries E in T to determine if E.rectangle
    overlaps S
  • for all overlapping entries invoke search
    recursively
  • If T is a leaf
  • check each entry to see if it satisfies the
    range query (see the sketch below)
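A recursive sketch of the range search, over the same hypothetical Node/Entry shapes used in the ChooseLeaf sketch.

```python
def overlaps(r, s):
    """True if axis-aligned rectangles r and s intersect."""
    (rlo, rhi), (slo, shi) = r, s
    return all(a <= d and c <= b
               for a, b, c, d in zip(rlo, rhi, slo, shi))

def range_search(node, query):
    """Follow every subtree whose rectangle overlaps query range S."""
    results = []
    if node.is_leaf:
        results.extend(e for e in node.entries
                       if overlaps(e.rect, query))
    else:
        for e in node.entries:
            if overlaps(e.rect, query):
                results.extend(range_search(e.child, query))
    return results
```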

14
R-tree Delete
  • Step D1
  • find the object and delete entry
  • Step D2
  • Condense Tree
  • Step D3
  • if the root is left with only one child, shorten
    the tree height

15
Condense Tree
  • If a node is underfull
  • delete its entry from the parent and add its
    entries to a set Q
  • Adjust the bounding rectangle of the parent
  • Do the above recursively for all levels
  • Reinsert all the orphaned entries
  • insert entries at the same level from which they
    were deleted

16
Other Multidimensional Data Structures
  • Many generalizations of R-tree
  • different splitting criteria
  • different shapes of clusters (e.g., d-dimensional
    spheres)
  • adding redundancy to reduce search cost
  • store objects in multiple rectangles instead of
    a single rectangle to reduce the cost of retrieval.
    But now insert has to store objects in many
    clusters. This strategy also increases overlap,
    causing search performance to deteriorate.
  • Space Partitioning Data Structures
  • unlike R-tree which group objects into possibly
    overlapping clusters, these methods attempt to
    partition space into non-overlapping regions.
  • E.g., KD tree, quad tree, grid files, KD-Btree,
    HB-tree, hybrid tree.
  • Space-filling curves
  • superimpose an ordering on multidimensional space
    that preserves proximity in that space
    (Z-ordering, Hilbert ordering)
  • use a B-tree as an index on that ordering

17
KD-tree
  • A main memory data structure based on binary
    search trees
  • can be adapted to block model of storage
    (KD-Btree)
  • Levels rotate among the dimensions, partitioning
    the space based on a value for that dimension
  • KD-tree is not necessarily balanced.

18
KD-Tree Example
[Figure: example KD-tree over points in the plane; levels alternate
between splits on x (e.g., x=5 at the root, then x=8, x=7, x=3) and
splits on y (e.g., y=6, y=5, y=2)]
19
KD-Tree Operations
  • Search
  • straightforward. Just descend down the tree like
    binary search trees.
  • Insertion
  • look up the record to be inserted, reaching the
    appropriate leaf
  • if there is room in the leaf block, insert it there
  • else, find a suitable value for the appropriate
    dimension and split the leaf block (see the sketch
    below)
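A minimal main-memory KD-tree sketch with one point per node (leaf blocks and the block split are omitted); class and function names are illustrative.

```python
class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = self.right = None

def kd_insert(node, point, depth=0, k=2):
    """Descend like a BST; the discriminating dimension rotates per level."""
    if node is None:
        return KDNode(point)
    axis = depth % k
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, depth + 1, k)
    else:
        node.right = kd_insert(node.right, point, depth + 1, k)
    return node

def kd_search(node, point, depth=0, k=2):
    """Exact-match search: a straightforward descent."""
    if node is None or node.point == point:
        return node
    axis = depth % k
    child = node.left if point[axis] < node.point[axis] else node.right
    return kd_search(child, point, depth + 1, k)

root = None
for p in [(5, 4), (2, 6), (8, 1), (7, 7)]:
    root = kd_insert(root, p)
print(kd_search(root, (8, 1)).point)   # (8, 1)
```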

20
Adapting KD Tree to Block Model
  • Similar to B-tree, tree nodes split many ways
    instead of two ways
  • Risk
  • insertion becomes quite complex and expensive.
  • No storage utilization guarantee since when a
    higher level node splits, the split has to be
    propagated all the way to leaf level resulting in
    many empty blocks.
  • Pack many interior nodes (forming a subtree) into
    a block.
  • Risk
  • it may not be feasible to group nodes at lower
    level into a block productively.
  • Many interesting papers on how to optimally pack
    nodes into blocks have been published recently.

21
Quad Tree
  • Nodes split along all dimensions simultaneously
  • Division fixed by quadrants
  • As with KD-tree we cannot make quadtree levels
    uniform

22
Quad Tree Example
[Figure: example quad tree; each internal node splits space into NW,
NE, SW, SE quadrants, with splits around x=3, x=5, x=7, x=8]
23
Quad Tree Operations
  • Insert
  • Find Leaf node to which point belongs
  • If room, put it there
  • Else, make the leaf an interior node and give it
    leaves for each quadrant. Split the points among
    the new leaves.
  • Search
  • straightforward: just descend into the appropriate
    subtree

24
Grid Files
  • Space-partitioning strategy, but different from a
    tree
  • Select dividers along each dimension and partition
    space into cells
  • Unlike the KD-tree, dividers cut all the way
    across
  • Each cell corresponds to 1 disk page
  • Many cells can point to the same page
  • The cell directory is potentially exponential in
    the number of dimensions

25
Grid File Implementation
  • Maintain linear scales for each dimension that
    contain split positions for the dimension
  • Cell directory implemented as a multidimensional
    array /* can be large and may not fit in memory */

26
Grid File Search
  • Exact-match search: at most 2 I/Os, assuming the
    linear scales fit in memory
  • first use the linear scales to determine the index
    into the cell directory
  • access the cell directory to retrieve the bucket
    address (may cost 1 I/O if the cell directory does
    not fit in memory)
  • access the appropriate bucket (1 I/O)
  • Range queries
  • use the linear scales to determine the indices
    into the cell directory
  • access the cell directory to retrieve the bucket
    addresses of the buckets to visit
  • access the buckets (see the sketch below)
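A sketch of the exact-match path, with dicts standing in for the disk-resident cell directory and bucket pages; the layout is hypothetical and for illustration only.

```python
from bisect import bisect_right

def exact_match(scales, directory, buckets, point):
    """Linear scales -> cell coordinates (no I/O), then at most 2 I/Os."""
    cell = tuple(bisect_right(s, x) for s, x in zip(scales, point))
    addr = directory[cell]                            # up to 1 I/O
    return [r for r in buckets[addr] if r == point]   # 1 I/O

# 2-d example: split positions x=5 and y=3 give a 2x2 cell directory;
# two cells share bucket page 'p0'.
scales = [[5], [3]]
directory = {(0, 0): 'p0', (0, 1): 'p0', (1, 0): 'p1', (1, 1): 'p2'}
buckets = {'p0': [(2, 2), (4, 7)], 'p1': [(8, 1)], 'p2': [(9, 9)]}
print(exact_match(scales, directory, buckets, (4, 7)))  # [(4, 7)]
```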

27
Grid File Insert
  • Determine the bucket into which insertion must
    occur.
  • If space in bucket, insert.
  • Else, split bucket
  • how do we choose a good dimension to split?
  • If the bucket split causes the cell directory to
    split, do so and adjust the linear scales
  • /* notice that a cell directory split creates
    p^(d-1) new entries in the cell directory, where p
    is the number of cells per dimension */
  • inserting these new entries potentially
    requires a complete reorganization of the cell
    directory -- expensive!

28
Grid File Insert
  • Inserting a new split position requires the
    cell directory to grow by one column: in d-dim
    space, it creates p^(d-1) new entries

29
Space Filling Curve
  • Assumption
  • finite precision in representing each coordinate.

[Figure: 4x4 grid with coordinates 00, 01, 10, 11 along each axis,
containing regions A, B, and C]
Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
Z(B) = 11 = 3 (common prefix to all its blocks)
Z(C1) = 0010 = 2; Z(C2) = 1000 = 8
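The shuffle is just bit interleaving (a Morton code); a small sketch reproducing Z(A) from the figure.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y, most significant bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)   # next bit of x
        z = (z << 1) | ((y >> i) & 1)   # next bit of y
    return z

# Point A with (x_A, y_A) = (00, 11):
print(format(z_value(0b00, 0b11, 2), '04b'))  # 0101, i.e. 5
```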
30
Deriving Z-Values for a Region
  • Obtain a quad-tree decomposition of an object by
    recursively dividing it into blocks until blocks
    are homogeneous.

[Figure: quad-tree decomposition of an object on a grid whose
quadrants are labeled 00, 01, 10, 11]
The object's representation is {0001, 0011, 01}
31
Disk Based Storage
  • For disk storage, represent object based on its
    Z-value
  • Use a B-tree index.
  • Range Query
  • translate query range to Z values
  • search B-tree with Z-values of data regions for
    matches

32
Nearest Neighbor Search
  • Retrieve the nearest neighbor of query point Q
  • Simple strategy
  • convert the nearest neighbor search to a range
    search
  • guess a range around Q that contains at least one
    object, say O
  • if the current guess does not include any
    answers, increase the range size until an object
    is found
  • compute the distance d between Q and O
  • re-execute the range query with distance d
    around Q
  • compute the distance of Q from each retrieved
    object. The object at minimum distance is the
    nearest neighbor! Why?
  • Issues: how to guess the range; the retrieval may
    be sub-optimal if an incorrect range is guessed.
    This becomes a problem in high dimensional spaces.
    (See the sketch below.)
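A sketch of this simple strategy over an in-memory point set (a real system would run the range queries against a multidimensional index instead of scanning a list); initial_r is an arbitrary first guess.

```python
import math

def nn_via_range(points, q, initial_r=1.0):
    """Nearest neighbor via repeated range searches around Q."""
    r = initial_r
    while True:                               # grow the guess until an
        hits = [p for p in points             # object O is found
                if math.dist(p, q) <= r]
        if hits:
            break
        r *= 2
    d = min(math.dist(p, q) for p in hits)    # distance d to some object O
    # Re-execute with range d: the true NN is at distance <= d, so it is
    # in this result set; taking the minimum over it is exact.
    final = [p for p in points if math.dist(p, q) <= d]
    return min(final, key=lambda p: math.dist(p, q))

pts = [(1, 1), (4, 5), (9, 2)]
print(nn_via_range(pts, (3, 4)))   # (4, 5)
```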

33
Nearest Neighbor Search using Range Searches
[Figure: an initial range search around Q, then a revised range
search whose radius is the distance between Q and the first found
object A]
An optimal strategy that performs the minimum possible number of
I/Os is possible using priority queues.
34
Alternative Strategy to Evaluating K-NN
  • Let Q be the query point
  • Traverse nodes in the data structure in increasing
    order of MINDIST(Q, N), where
  • MINDIST(Q, N) = dist(Q, N), if N is an object
  • MINDIST(Q, N) = minimum distance between Q and any
    object in N, if N is an interior node

[Figure: query point Q with MINDIST(Q, A), MINDIST(Q, B), and
MINDIST(Q, C) to bounding rectangles A, B, and C]
35
MINDIST Between Rectangle and Point
[Figure: MINDIST between a point Q and a rectangle: 0 if Q lies
inside; otherwise the distance from Q to the nearest face, edge, or
corner]
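A sketch combining both slides: MINDIST from a point to a rectangle, and a best-first traversal using a priority queue ordered by MINDIST. The node objects are assumed to carry .rect, .is_object, and .children (for a point object, rect degenerates to the point itself, so MINDIST equals dist); these attribute names are illustrative.

```python
import heapq
import math

def mindist(q, rect):
    """MINDIST(Q, N): 0 if Q is inside the rectangle, else the distance
    to the nearest face/edge/corner."""
    lo, hi = rect
    return math.sqrt(sum(max(l - qi, 0.0, qi - h) ** 2
                         for qi, l, h in zip(q, lo, hi)))

def best_first_nn(root, q):
    """Pop nodes in MINDIST order; the first object popped is the NN,
    since each unexplored node's MINDIST lower-bounds its contents."""
    heap = [(mindist(q, root.rect), id(root), root)]
    while heap:
        d, _, n = heapq.heappop(heap)
        if n.is_object:
            return n, d
        for child in n.children:
            heapq.heappush(heap, (mindist(q, child.rect), id(child), child))
```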
36
Generalized Search Trees
  • Motivation
  • disparate applications require different data
    structures and access methods.
  • Requires separate code for each data structure to
    be integrated with the database code
  • too much effort.
  • Vendors will not spend time and energy unless the
    application is very important or the data structure
    has general applicability
  • Generalized search trees abstract the notion of a
    data structure into a template
  • Basic observation: most tree data structures are
    similar, and a lot of the bookkeeping and
    implementation details are the same
  • Different data structures can be seen as
    refinements of the basic GiST structure;
    refinements are specified by registering a set of
    functions per data structure with the GiST

37
GiST supports extensibility both in terms of data
types and queries
  • GiST is like a template - it defines its
    interface in terms of ADT rather than physical
    elements (like nodes, pointers etc.)
  • An access method (AM) implementor customizes GiST
    by defining their own ADT class, i.e., you just
    define the ADT class and you have your access
    method implemented!
  • No concern about search/insertion/deletion or
    structural modifications like node splits

38
Integrating Multidimensional Index Structures as
AMs in DBMS
Generalized Search Trees (GiSTs)
[Figure: a GiST whose entries hold arbitrary predicates, e.g.
x > 5 and y > 4, x > 4 and y <= 3, x + y <= 12, x + y > 12, x <= 3,
x <= 6, x > 6, y <= 5, y > 5, with data nodes containing points]
39
Problems with Existing Approaches for Feature
Indexing
  • Very high dimensionality of feature spaces --
    e.g., shape may define a 100-d space
  • Traditional multidim. data structures perform
    worse than linear scan at such high
    dimensionality (the dimensionality curse)
  • Arbitrary distance functions -- e.g., distance
    functions may change across iterations of
    relevance feedback
  • Traditional multidim. data structures support a
    fixed distance measure -- usually Euclidean (L2)
    or Lmax
  • No support for multi-point queries -- as in query
    expansion
  • executing k-NN for each query point and merging
    results to generate the k-NN of a multi-point
    query is very expensive
  • No support for refinement
  • queries in later iterations do not diverge
    greatly from queries in previous iterations;
    effort spent in previous iterations should be
    exploited for evaluating k-NN in future iterations

40
High Dimensional Feature Indexing
  • Multidim. Data Structures
  • design data structures that scale to high dim.
    spaces
  • Existing proposals perform worse than linear scan
    over > 10-dim. spaces [Weber et al., VLDB 1998]
  • Fundamental limitation: a dimensionality beyond
    which linear scan wins over indexing! (approx.
    6-10)
  • Dimensionality Reduction
  • transform points in high dim. space to low dim.
    space
  • works well when data correlated into a few
    dimensions only
  • difficult to manage in dynamic environments

41
Classification of Multidimensional Index
Structures
  • Data Partitioning (DP)
  • Bounding Region (BR) Based e.g., R-tree, X-tree,
    SS-tree, SR-tree, M-tree
  • All k dim. used to represent partitioning
  • Poor scalability to dimensionality due to high
    degree of overlap and low fanout at high
    dimensions
  • seq. scan wins for > 10-d
  • Space Partitioning(SP)
  • Based on disjoint partitioning of space e.g.,
    KDB-tree, hB-tree, LSDh-tree, VP tree, MVP tree
  • no overlap and fanout independent of dimensions
  • Poor scalability to dimensionality due to either
    poor storage utilization or redundant information
    storage requirements.

42
Classification of Multidimensional Data Structures
[Figure: taxonomy of data-partitioning (BR-based) vs.
space-partitioning index structures]
43
Hybrid Tree Space Partitioning (SP) instead of
Data Partitioning (DP)
[Figure: hybrid tree node layout. The non-leaf nodes of the hybrid
tree are organized as a kd-tree (e.g., split on dim1 at pos 3, then
dim2 at pos 3 and pos 2), whose leaves R1-R4 point to child nodes
A-D containing data points]
44
Splitting of Non-Leaf Nodes (Easy case)
[Figure: easy-case split. The node's internal kd-tree is cut at its
root (dim1, pos 4); regions A-F and the kd-subtrees (dim2 pos 3,
dim2 pos 4, dim1 pos 2, dim1 pos 6) are distributed between the two
new nodes]
A clean split is possible without violating node utilization.
45
Splitting of Non-Leaf Nodes (Difficult case)
When a clean split is not possible without violating node
utilization, there are three options:
  • always clean split: leads to downward cascading
    splits (empty nodes)
  • complex splits: space overhead, the tree becomes
    large
  • allow overlap: avoid overlap by relaxing node
    utilization where possible, otherwise minimize it
    (the Hybrid Tree approach)
46
Splitting of Non-Leaf Nodes (Difficult case)
[Figure: difficult-case split. The kd-tree is cut along dim1 with an
overlapping split interval; each kd-tree node records a low and a
high split position (e.g., dim2 pos (3,4), dim1 pos (4,4), dim2
pos (3,3), dim2 pos (4,4), dim1 pos (2,2), dim1 pos (6,6), dim2
pos (5,5), dim2 pos (2,2), dim1 pos (7,7)) bounding regions A
through I]
47
Choosing Split dimension and position EDA
(Expected Disk Accesses) Analysis
Consider a range (cube) query with side length r along each
dimension. A node whose bounding region (BR) has side w_j along
dimension j is accessed iff the query point falls within the node BR
expanded by r/2 on each side along each dimension (the Minkowski
sum). Assuming a (0,1)^d space and uniformly distributed queries,
the probability of a range query accessing the node is the product
over j of (w_j + r). Splitting the node adds the probability that
the query accesses both resulting nodes (the increase in EDA).
Choose the split dimension and position that minimize this increase
in EDA.
[Figure: node BR of width w, expanded by r/2 per side to width
w + r]
48
Choosing Split Dimension and Position (based on EDA analysis)
  • Data node splitting
  • split dimension: split along the maximum-spread
    dimension
  • split position: split as close to the middle as
    possible (without violating node utilization)
  • Index node splitting
  • split dimension: argmin_j of the integral of
    P(r) (w_j + r)/(s_j + r) dr
  • depends on the distribution of query sizes
  • argmin_j (w_j + R)/(s_j + R) when all queries
    are cubes with side length R
  • split position: avoid overlap if possible, else
    introduce as little overlap as possible without
    violating utilization constraints (see the sketch
    below)
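A sketch of the two pieces that follow directly from the slides: the Minkowski-sum access probability for a cube query of side r, and the data-node split rule (max-spread dimension, position near the middle). The index-node criterion with the query-size integral is omitted; function names are illustrative.

```python
def access_prob(widths, r):
    """P(cube query of side r hits a node with BR side lengths widths),
    assuming a (0,1)^d space: expand the BR by r/2 per side, i.e.
    multiply out (w_j + r)."""
    p = 1.0
    for w in widths:
        p *= min(1.0, w + r)
    return p

def data_node_split(points):
    """Data node: max-spread dimension, split as close to the middle
    as possible (utilization constraints omitted here)."""
    dims = range(len(points[0]))
    spread = lambda j: (max(p[j] for p in points) -
                        min(p[j] for p in points))
    j = max(dims, key=spread)
    pos = sorted(p[j] for p in points)[len(points) // 2]
    return j, pos
```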

49
Dead Space Elimination
[Figure: the same four data regions A-D indexed three ways: space
partitioning (hybrid tree) without dead space elimination, where
partitions R1-R4 include dead space; data partitioning (R-tree),
with no dead space; and space partitioning (hybrid tree) with dead
space elimination]
50
Dead Space Elimination
  • Live space encoding using 3-bit precision
    (ELS_PRECISION = 3)
  • Encoded Live Space (ELS) BR = (001, 001, 101, 111)
  • Bits required = 2 * num_dims * ELS_PRECISION
  • Compression ratio = ELS_PRECISION/32
  • Only applied to leaf nodes

[Figure: the live-space BR quantized onto an 8x8 grid with
coordinates 000-111 along each axis]
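A rough sketch of one plausible live-space encoding at 3-bit precision; the exact bit layout in the hybrid tree paper may differ, so treat this as illustrative.

```python
import math

def encode_live_space(node_br, live_br, precision=3):
    """Quantize the live-space BR onto a 2^precision grid inside the
    node BR, rounding outward so no live point is cut off.
    Storage: 2 * num_dims * precision bits."""
    cells = 1 << precision
    (nlo, nhi), (llo, lhi) = node_br, live_br
    code = []
    for d in range(len(nlo)):
        step = (nhi[d] - nlo[d]) / cells
        lo = int((llo[d] - nlo[d]) / step)                   # round down
        hi = min(cells - 1,
                 math.ceil((lhi[d] - nlo[d]) / step) - 1)    # round up
        code.append((lo, hi))
    return code

# Unit-square node BR with live space [0.2, 0.8] x [0.2, 1.0]:
print(encode_live_space(((0, 0), (1, 1)), ((0.2, 0.2), (0.8, 1.0))))
```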
51
Tree operations
  • Search
  • point, range, NN, and other distance-based
    searches, as in DP techniques
  • reason: the BR representation can be derived from
    the kd-tree representation
  • exploit the kd-tree organization (pruning) for
    fast intra-node search
  • Insertion
  • recursively choose the space partition that
    contains the point
  • break ties arbitrarily
  • no volume computation (which would cause floating
    point exceptions at high dims)
  • Deletion
  • details in the thesis

52
Mapping of kd-tree representation to Bounding
Rectangle (BR) representation
Search algorithms developed for R-tree can be
used directly
[Figure: the kd-tree from the earlier split example (dim2 pos (3,4),
dim1 pos (4,4), dim1 pos (2,2), dim1 pos (6,6), dim2 pos (5,5),
dim2 pos (2,2), dim1 pos (7,7)) mapped to the bounding rectangles of
regions A through I]
54
Other Queries (Lp metrics and weights)
[Figure: query shapes for range and k-NN queries under different
distance functions: Euclidean distance (circle), weighted Euclidean
(axis-aligned ellipse), weighted Manhattan (diamond)]
55
Advantages of Hybrid Tree
  • More scalable to high dimensionalities than DP
    techniques (R-tree-like index structures)
  • fanout independent of dimensionality: high fanout
    even at high dims
  • faster intra-node search due to the kd-tree-based
    organization
  • no overlap at the lowest level, low overlap at
    higher levels
  • More scalable than SP techniques
  • guaranteed node utilization
  • no costly cascading splits
  • EDA-optimal choice of splits
  • Supports arbitrary distance functions


56
Experiments
  • Effect of ELS encoding
  • Test scalability of hybrid tree to high
    dimensionalities
  • Compare performance of hybrid tree with SR-tree
    (data partitioning), hB-tree (space partitioning)
    and sequential scan
  • Data Sets
  • Fourier Data set (16-d Fourier vectors, 1.2
    million)
  • Color Histograms for COREL images (64-d color
    histograms from 70K images)

57
Experimental Results
[Figure: I/O cost comparison; the cost factor between sequential and
random I/O is accounted for]
58
Summary of Results
  • Hybrid Tree scales well to high dimensionalities
  • Outperforms linear scan even at 64-d (mainly due
    to significantly lower CPU cost)
  • Order of magnitude better than SR-tree (DP) and
    hB-tree (SP) both in terms of I/O and CPU costs
    at all dimensionalities
  • Performance gap increases with the increase in
    dimensionality
  • Efficiently supports arbitrary distance functions

59
Exploiting Correlation in Data
  • Dimensionality curse persists
  • To achieve further scalability, dimensionality
    reduction (DR) is commonly used in conjunction
    with index structures
  • Exploit correlations in high dimensional data

[Figure: expected query cost vs. dimensionality graph (hand drawn)]
60
Dimensionality Reduction
  • First perform Principal Component Analysis (PCA),
    then build index on reduced space
  • Distances in reduced space lower bound distances
    in original space
  • Range queries
  • map the query point, run the range query with the
    same range, and eliminate false positives
  • k-NN queries (a bit more complex)
  • DR increases efficiency, not the quality of
    answers

[Figure: points projected onto the first principal component (PC);
a range query of radius r keeps the same radius r in the reduced
space]
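A numpy sketch of GDR: project onto the top-k principal components and answer a range query with the same radius. Because projection onto an orthonormal basis can only shrink distances, the candidate set has no false negatives; false positives are removed against the original vectors. Illustrative code, not the authors'.

```python
import numpy as np

def pca_reduce(X, k):
    """Return the k-dim projection of X plus the PCs and mean."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    pcs = vt[:k]                                # rows = top-k PCs
    return (X - mean) @ pcs.T, pcs, mean

rng = np.random.default_rng(0)
X = rng.random((1000, 64))
Xr, pcs, mean = pca_reduce(X, 8)

q, r = X[0], 0.9
qr = (q - mean) @ pcs.T
# Reduced-space distances lower-bound original ones: no false negatives.
cand = np.where(np.linalg.norm(Xr - qr, axis=1) <= r)[0]
hits = [i for i in cand if np.linalg.norm(X[i] - q) <= r]
```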
61
Global Dimensionality Reduction (GDR)
[Figure: globally correlated data with a single first principal
component (PC) vs. data split into clusters, each with its own
first PC]
  • works well only when data is globally correlated
  • otherwise too many false positives result in high
    query cost
  • solution: find local correlations instead of a
    single global correlation

62
Local Dimensionality Reduction (LDR)
[Figure: on locally correlated data, GDR fits one global first PC
poorly, while LDR fits a separate first PC per cluster]
63
Overview of LDR Technique
  • Identify Correlated Clusters in dataset
  • Definition of correlated clusters
  • Bounding loss of information
  • Clustering Algorithm
  • Indexing the Clusters
  • Index Structure
  • Point Search, Range search and k-NN search
  • Insertion and deletion

64
Correlated Cluster
[Figure: a correlated cluster. The first PC is the retained
dimension and the second PC is eliminated; the cluster centroid is
the projection of the mean of all points on the eliminated
dimension]
A correlated cluster is a set of locally correlated points:
<PCs, subspace dimensionality, centroid, points>
65
Reconstruction Distance
[Figure: ReconDist(Q, S) is the distance between point Q and its
projection onto the retained dimension (first PC) of cluster S,
measured along the eliminated dimension (second PC)]
66
Reconstruction Distance Bound
[Figure: all points of the cluster lie within MaxReconDist of the
retained subspace on either side]
ReconDist(P, S) <= MaxReconDist, for all P in S
67
Other constraints
  • Dimensionality bound: a cluster must not retain
    more dimensions than necessary, and the subspace
    dimensionality must be <= MaxDim
  • Size bound: the number of points in the cluster
    must be >= MinSize

68
Clustering Algorithm Step 1 Construct Spatial
Clusters
  • Choose a set of well-scattered points as
    centroids (a piercing set) from a random sample
  • Group each point P in the dataset with its
    closest centroid C if Dist(P, C) <= epsilon

69
Clustering Algorithm Step 2 Choose PCs for each
cluster
  • Compute PCs

70
Clustering AlgorithmStep 3 Compute Subspace
Dimensionality
  • Assign each point to the cluster that needs the
    fewest dimensions to accommodate it
  • The subspace dimensionality of each cluster is the
    minimum number of dimensions that must be retained
    to keep most of its points

71
Clustering Algorithm Step 4 Recluster points
  • Assign each point P to the first cluster S such
    that ReconDist(P, S) <= MaxReconDist
  • assigning to the first such cluster, rather than
    splitting points among all qualifying clusters,
    overcomes the splitting problem

[Figure: reclustering may leave some clusters empty]
72
Clustering algorithmStep 5 Map points
  • Eliminate small clusters
  • Map each point to the subspace of its cluster
    (also store its reconstruction distance)

[Figure: points mapped from the original space to each cluster's
subspace]
73
Clustering algorithmStep 6 Iterate
  • Iterate for more clusters as long as new clusters
    are being found among the outliers
  • Overall complexity: 3 passes, O(N * D^2 * K)
    (see the sketch below)
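A sketch of ReconDist and the reclustering step. Each cluster is assumed to be a tuple (centroid, pcs, k) with pcs a full orthonormal matrix whose rows are the PCs and k the subspace dimensionality; for simplicity the cluster mean serves as the reference point, whereas the slides define the centroid as a projection of the mean, so treat this as illustrative.

```python
import numpy as np

def recon_dist(p, centroid, pcs, k):
    """Distance from p to its projection onto the cluster's k-dim
    retained subspace (spanned by the top-k PCs through centroid)."""
    y = (p - centroid) @ pcs.T       # coordinates in the PC basis
    y[k:] = 0.0                      # drop the eliminated dimensions
    return np.linalg.norm(p - (centroid + y @ pcs))

def recluster(points, clusters, max_recon_dist):
    """Step 4: put each point in the FIRST cluster that represents it
    within MaxReconDist; points fitting no cluster become outliers."""
    members = {i: [] for i in range(len(clusters))}
    outliers = []
    for p in points:
        for i, (c, pcs, k) in enumerate(clusters):
            if recon_dist(p, c, pcs, k) <= max_recon_dist:
                members[i].append(p)
                break
        else:
            outliers.append(p)
    return members, outliers
```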

74
Experiments (Part 1)
  • Precision Experiments
  • Compare information loss in GDR and LDR for same
    reduced dimensionality
  • precision = |original-space result| /
    |reduced-space result| (for range queries)
  • note: precision measures efficiency, not answer
    quality

75
Datasets
  • Synthetic dataset
  • 64-d data, 100,000 points, generates clusters in
    different subspaces (cluster sizes and subspace
    dimensionalities follow Zipf distribution),
    contains noise
  • Real dataset
  • 64-d data (8x8 color histograms extracted from
    70,000 images in the Corel collection), available
    at http://kdd.ics.uci.edu/databases/CorelFeatures

76
Precision Experiments (1)
77
Precision Experiments (2)
78
Index structure
  • Root node: contains pointers to the root of each
    cluster index (also stores the PCs and subspace
    dimensionality of each cluster)
  • Set of outliers (no index: sequential scan)
  • One index per cluster (Index on Cluster 1, ...,
    Index on Cluster K)
  • Properties: (1) disk based;
    (2) height <= 1 + height(original space index);
    (3) almost balanced
79
Experiments (Part 2)
  • Cost Experiments
  • Compare linear scan, Original Space Index(OSI),
    GDR and LDR in terms of I/O and CPU costs. We
    used hybrid tree index structure for OSI, GDR and
    LDR.
  • Cost formulae
  • linear scan: I/O cost (random accesses) =
    file_size/10, plus CPU cost
  • OSI: I/O cost = number of index nodes visited,
    plus CPU cost
  • GDR: I/O cost = index cost + post-processing cost
    (to eliminate false positives), plus CPU cost
  • LDR: I/O cost = index cost + post-processing cost
    + outlier_file_size/10, plus CPU cost

80
I/O Cost (random disk accesses)
81
CPU Cost (only computation time)
82
Summary of LDR
  • LDR is a powerful dimensionality reduction
    technique for high dimensional data
  • reduces dimensionality with lower loss in
    distance information compared to GDR
  • achieves significantly lower query cost compared
    to linear scan, original space index and GDR
  • LDR is a general technique to deal with high
    dimensionality
  • our experience shows that high dimensional
    datasets often have local correlations -- LDR is
    the only technique that can discover and exploit
    them
  • applications beyond indexing: selectivity
    estimation, data mining, etc. on high dimensional
    data (currently exploring)