Transcript and Presenter's Notes

Title: Indexing Multidimensional Feature Spaces

1
Indexing Multidimensional Feature Spaces
  • Overview of Multidimensional Index Structure
  • Hybrid Tree, Chakrabarti et al., ICDE 1999
  • Local Dimensionality Reduction, Chakrabarti et
    al., VLDB 2000

2
Queries over Feature Spaces
  • Consider a d-dimensional feature space
  • e.g., color histogram, texture, ...
  • Nature of Queries
  • range queries: objects that reside within the
    region specified in the query
  • k-nearest neighbor queries: objects that are
    closest to a query object based on a distance
    metric
  • approx. nearest neighbor queries: the retrieved
    object is within (1 + epsilon) of the real nearest
    neighbor
  • all-pair (similarity join) queries: retrieve all
    pairs of objects within an epsilon threshold
  • A search algorithm may include
  • false positives: objects that do not meet the
    query condition but are retrieved anyway. We
    aim to minimize false positives
  • false negatives: objects that meet the query
    condition but are not returned. Approaches
    usually avoid false negatives

3
Approach: Utilize Single-Dimensional Indexes
  • Index each attribute independently
  • Project the query range onto each attribute and
    determine the matching pointers
  • Intersect the pointer sets (see the sketch below)
  • Go to the database and retrieve the objects in the
    intersection

May result in very high I/O cost
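To make the approach concrete, here is a minimal Python sketch (illustrative, not from the slides) in which a sorted list of (value, id) pairs stands in for each single-dimensional index: the query range is projected onto each attribute, and the resulting pointer sets are intersected before touching the database.

```python
from bisect import bisect_left, bisect_right

def range_lookup(index, low, high):
    """Return the set of object ids whose value falls in [low, high]."""
    keys = [v for v, _ in index]            # index is sorted by value
    return {oid for _, oid in index[bisect_left(keys, low):
                                    bisect_right(keys, high)]}

# One stand-in "index" per attribute, each sorted by attribute value.
idx_x = [(0.2, 'a'), (0.5, 'b'), (0.7, 'c'), (0.9, 'd')]
idx_y = [(0.1, 'b'), (0.4, 'c'), (0.6, 'a'), (0.8, 'd')]

# Project the query region onto each attribute, then intersect pointers.
candidates = range_lookup(idx_x, 0.4, 0.8) & range_lookup(idx_y, 0.3, 0.7)
print(candidates)  # only these objects are fetched from the database
```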
4
Multiple Key Index
  • Index on one attribute provides pointers to an
    index on the other
  • Cannot support partial match queries on the second
    attribute
  • Performance of range search is not much better
    than the independent-attribute approach
  • The secondary indices may be of different sizes
    -- specifically, some of them may be very small

[Figure: an index on the first attribute whose leaves point to
indexes on the second attribute]
5
R-tree Data Structure
  • Extension of B-tree to multidimensional space.
  • Paginated, balanced, guaranteed storage
    utilization.
  • Can support both point data and data with spatial
    extent
  • Groups objects into possibly overlapping clusters
    (rectangles in our case)
  • Search for range query proceeds along all paths
    that overlap with the query.

6
R-tree Insert (Object E)
  • Step I1
  • ChooseLeaf L to insert E /* find position to
    insert */
  • Step I2
  • If L has room, install E
  • Else SplitNode(L)
  • Step I3
  • AdjustTree /* propagate changes */
  • Step I4
  • If the node split propagated to the root, adjust
    the height of the tree

7
ChooseLeaf
  • Step CL1
  • Set N to be root
  • Step CL2
  • If N is a leaf return N
  • Step CL3
  • If N is not a leaf, let F be the entry whose
    rectangle needs least enlargement to include the
    object
  • break ties by choosing the smaller rectangle
  • Step CL4
  • Set N to be the child node pointed to by entry F
  • goto Step CL2 (see the sketch below)
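A minimal Python sketch of ChooseLeaf. The Node/Entry attributes (node.is_leaf, node.entries, entry.rect as ((xlo, ylo), (xhi, yhi)), entry.child) are hypothetical names used for illustration; the area() and enlarge() helpers are reused by the later split and search sketches.

```python
def area(rect):
    """Area of an axis-aligned rectangle ((xlo, ylo), (xhi, yhi))."""
    (xlo, ylo), (xhi, yhi) = rect
    return (xhi - xlo) * (yhi - ylo)

def enlarge(r1, r2):
    """Smallest rectangle covering both r1 and r2."""
    (alo, ahi), (blo, bhi) = r1, r2
    return (tuple(min(a, b) for a, b in zip(alo, blo)),
            tuple(max(a, b) for a, b in zip(ahi, bhi)))

def choose_leaf(node, obj_rect):
    """Descend from the root (CL1) until a leaf is reached (CL2)."""
    while not node.is_leaf:
        def cost(entry):
            growth = area(enlarge(entry.rect, obj_rect)) - area(entry.rect)
            return (growth, area(entry.rect))   # CL3: least enlargement,
        best = min(node.entries, key=cost)      # ties -> smaller rectangle
        node = best.child                       # CL4: descend, goto CL2
    return node
```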

8
Split Node
  • Given a node, split it into two nodes that are
    each at least half full
  • Multiple objectives
  • minimize overlap
  • minimize covered area
  • R-tree minimizes covered area
  • What is the optimal criterion?

Minimize covered area
Minimize overlap
9
Minimizing Covered Area
  • Group objects into 2 parts such that the covered
    area is minimized
  • NP-hard!
  • Hence use heuristics
  • Two heuristics explored
  • quadratic and linear

10
Basic Split Strategy
  • /* Divide the set of M+1 entries into 2 groups G1
    and G2 */
  • PickSeeds for G1 and G2
  • Invoke PickNext to assign objects to groups one at
    a time until either all objects are assigned or
    one of the groups becomes half full.
  • If one group becomes half full, assign the rest of
    the objects to the other group.

11
Quadratic Split
  • PickSeeds
  • for each pair of entries E1 and E2, compose a
    rectangle J including E1.rect and E2.rect
  • let d = area(J) - area(E1.rect) - area(E2.rect)
    /* d is the wasted space */
  • choose the most wasteful pair, the one with the
    largest d, as seeds for groups G1 and G2
  • PickNext /* select the next entry to put in a group */
  • determine the cost of putting each entry in
    groups G1 and G2
  • for each unassigned entry calculate
  • d1 = area increase required in the covering
    rectangle of group G1 to include the entry
  • d2 = area increase required in the covering
    rectangle of group G2 to include the entry
  • select the entry with the greatest preference for
    a group, i.e., any entry with the maximum difference
    between d1 and d2 (see the sketch below)
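A sketch of PickSeeds and PickNext for the quadratic split, reusing area() and enlarge() from the ChooseLeaf sketch above; rects is a plain list of rectangles standing in for node entries.

```python
from itertools import combinations

def waste(r1, r2):
    """Dead area d if r1 and r2 were covered by one rectangle J."""
    return area(enlarge(r1, r2)) - area(r1) - area(r2)

def pick_seeds(rects):
    """PickSeeds: the most wasteful pair seeds groups G1 and G2."""
    return max(combinations(range(len(rects)), 2),
               key=lambda pair: waste(rects[pair[0]], rects[pair[1]]))

def pick_next(rects, unassigned, cover1, cover2):
    """PickNext: the entry with the strongest preference for a group,
    i.e. the maximum |d1 - d2| of required area increases."""
    def preference(i):
        d1 = area(enlarge(cover1, rects[i])) - area(cover1)
        d2 = area(enlarge(cover2, rects[i])) - area(cover2)
        return abs(d1 - d2)
    return max(unassigned, key=preference)
```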

12
Linear Split
  • PickSeeds
  • find extreme rectangles along each dimension:
    the entry with the highest low side and the
    entry with the lowest high side
  • record the separation
  • normalize the separation by the width of the
    extent along that dimension
  • choose as seeds the pair with the greatest
    normalized separation along any dimension
  • PickNext
  • randomly choose the entry to assign

13
R-tree Search (Range Search on range S)
  • Start from root
  • If node T is not leaf
  • check entries E in T to determine if E.rectangle
    overlaps S
  • for all overlapping entries invoke search
    recursively
  • If T is a leaf
  • check each entry to see if it satisfies the
    range query (see the sketch below)
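A recursive sketch of the range search, over the same hypothetical Node/Entry shapes used in the ChooseLeaf sketch.

```python
def overlaps(r, s):
    """True if axis-aligned rectangles r and s intersect."""
    (rlo, rhi), (slo, shi) = r, s
    return all(a <= d and c <= b
               for a, b, c, d in zip(rlo, rhi, slo, shi))

def range_search(node, query):
    """Follow every subtree whose rectangle overlaps query range S."""
    results = []
    if node.is_leaf:
        results.extend(e for e in node.entries
                       if overlaps(e.rect, query))
    else:
        for e in node.entries:
            if overlaps(e.rect, query):
                results.extend(range_search(e.child, query))
    return results
```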

14
R-tree Delete
  • Step D1
  • find the object and delete entry
  • Step D2
  • Condense Tree
  • Step D3
  • if the root is left with only one child, shorten
    the tree height

15
Condense Tree
  • If a node is underfull
  • delete its entry from the parent and add its
    entries to a set Q
  • Adjust the bounding rectangle of the parent
  • Do the above recursively for all levels
  • Reinsert all the orphaned entries
  • insert entries at the same level from which they
    were deleted

16
Other Multidimensional Data Structures
  • Many generalizations of R-tree
  • different splitting criteria
  • different shapes of clusters (e.g., d-dimensional
    spheres)
  • adding redundancy to reduce search cost
  • store objects in multiple rectangles instead of
    a single rectangle to reduce the cost of retrieval.
    But now insert has to store objects in many
    clusters. This strategy also increases overlap,
    causing search performance to deteriorate.
  • Space Partitioning Data Structures
  • unlike R-tree which group objects into possibly
    overlapping clusters, these methods attempt to
    partition space into non-overlapping regions.
  • E.g., KD tree, quad tree, grid files, KD-Btree,
    HB-tree, hybrid tree.
  • Space-filling curves
  • superimpose an ordering on multidimensional space
    that preserves proximity in that space
    (Z-ordering, Hilbert ordering)
  • use a B-tree as an index on that ordering

17
KD-tree
  • A main memory data structure based on binary
    search trees
  • can be adapted to block model of storage
    (KD-Btree)
  • Levels rotate among the dimensions, partitioning
    the space based on a value for that dimension
  • KD-tree is not necessarily balanced.

18
KD-Tree Example
[Figure: example KD-tree over points in the plane; levels alternate
between splits on x (e.g., x=5 at the root, then x=8, x=7, x=3) and
splits on y (e.g., y=6, y=5, y=2)]
19
KD-Tree Operations
  • Search
  • straightforward. Just descend down the tree like
    binary search trees.
  • Insertion
  • look up the record to be inserted, reaching the
    appropriate leaf
  • if there is room in the leaf block, insert it there
  • else, find a suitable value for the appropriate
    dimension and split the leaf block (see the sketch
    below)
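A minimal main-memory KD-tree sketch with one point per node (leaf blocks and the block split are omitted); class and function names are illustrative.

```python
class KDNode:
    def __init__(self, point):
        self.point = point
        self.left = self.right = None

def kd_insert(node, point, depth=0, k=2):
    """Descend like a BST; the discriminating dimension rotates per level."""
    if node is None:
        return KDNode(point)
    axis = depth % k
    if point[axis] < node.point[axis]:
        node.left = kd_insert(node.left, point, depth + 1, k)
    else:
        node.right = kd_insert(node.right, point, depth + 1, k)
    return node

def kd_search(node, point, depth=0, k=2):
    """Exact-match search: a straightforward descent."""
    if node is None or node.point == point:
        return node
    axis = depth % k
    child = node.left if point[axis] < node.point[axis] else node.right
    return kd_search(child, point, depth + 1, k)

root = None
for p in [(5, 4), (2, 6), (8, 1), (7, 7)]:
    root = kd_insert(root, p)
print(kd_search(root, (8, 1)).point)   # (8, 1)
```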

20
Adapting KD Tree to Block Model
  • Similar to B-tree, tree nodes split many ways
    instead of two ways
  • Risk
  • insertion becomes quite complex and expensive.
  • No storage utilization guarantee since when a
    higher level node splits, the split has to be
    propagated all the way to leaf level resulting in
    many empty blocks.
  • Pack many interior nodes (forming a subtree) into
    a block.
  • Risk
  • it may not be feasible to group nodes at lower
    level into a block productively.
  • Many interesting papers on how to optimally pack
    nodes into blocks have been published recently.

21
Quad Tree
  • Nodes split along all dimensions simultaneously
  • Division fixed by quadrants
  • As with KD-tree we cannot make quadtree levels
    uniform

22
Quad Tree Example
[Figure: example quad tree; each internal node splits space into NW,
NE, SW, SE quadrants, with splits around x=3, x=5, x=7, x=8]
23
Quad Tree Operations
  • Insert
  • Find Leaf node to which point belongs
  • If room, put it there
  • Else, make the leaf an interior node and give it
    leaves for each quadrant. Split the points among
    the new leaves.
  • Search
  • straightforward: just descend into the appropriate
    subtree

24
Grid Files
  • Space-partitioning strategy, but different from a
    tree
  • Select dividers along each dimension and partition
    space into cells
  • Unlike the KD-tree, dividers cut all the way
    across
  • Each cell corresponds to 1 disk page
  • Many cells can point to the same page
  • The cell directory is potentially exponential in
    the number of dimensions

25
Grid File Implementation
  • Maintain linear scales for each dimension that
    contain split positions for the dimension
  • Cell directory implemented as a multidimensional
    array /* can be large and may not fit in memory */

26
Grid File Search
  • Exact-match search: at most 2 I/Os, assuming the
    linear scales fit in memory
  • first use the linear scales to determine the index
    into the cell directory
  • access the cell directory to retrieve the bucket
    address (may cost 1 I/O if the cell directory does
    not fit in memory)
  • access the appropriate bucket (1 I/O)
  • Range queries
  • use the linear scales to determine the indices
    into the cell directory
  • access the cell directory to retrieve the bucket
    addresses of the buckets to visit
  • access the buckets (see the sketch below)
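A sketch of the exact-match path, with dicts standing in for the disk-resident cell directory and bucket pages; the layout is hypothetical and for illustration only.

```python
from bisect import bisect_right

def exact_match(scales, directory, buckets, point):
    """Linear scales -> cell coordinates (no I/O), then at most 2 I/Os."""
    cell = tuple(bisect_right(s, x) for s, x in zip(scales, point))
    addr = directory[cell]                            # up to 1 I/O
    return [r for r in buckets[addr] if r == point]   # 1 I/O

# 2-d example: split positions x=5 and y=3 give a 2x2 cell directory;
# two cells share bucket page 'p0'.
scales = [[5], [3]]
directory = {(0, 0): 'p0', (0, 1): 'p0', (1, 0): 'p1', (1, 1): 'p2'}
buckets = {'p0': [(2, 2), (4, 7)], 'p1': [(8, 1)], 'p2': [(9, 9)]}
print(exact_match(scales, directory, buckets, (4, 7)))  # [(4, 7)]
```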

27
Grid File Insert
  • Determine the bucket into which insertion must
    occur.
  • If space in bucket, insert.
  • Else, split bucket
  • how do we choose a good dimension to split?
  • If the bucket split causes the cell directory to
    split, do so and adjust the linear scales
  • /* notice that a cell directory split creates
    p^(d-1) new entries in the cell directory, where p
    is the number of cells per dimension */
  • inserting these new entries potentially
    requires a complete reorganization of the cell
    directory -- expensive!

28
Grid File Insert
  • Inserting a new split position requires the
    cell directory to grow by one column: in d-dim
    space, it creates p^(d-1) new entries

29
Space Filling Curve
  • Assumption
  • finite precision in representing each coordinate.

[Figure: 4x4 grid with coordinates 00, 01, 10, 11 along each axis,
containing regions A, B, and C]
Z(A) = shuffle(x_A, y_A) = shuffle(00, 11) = 0101 = 5
Z(B) = 11 = 3 (common prefix to all its blocks)
Z(C1) = 0010 = 2; Z(C2) = 1000 = 8
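The shuffle is just bit interleaving (a Morton code); a small sketch reproducing Z(A) from the figure.

```python
def z_value(x, y, bits):
    """Interleave the bits of x and y, most significant bit first."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)   # next bit of x
        z = (z << 1) | ((y >> i) & 1)   # next bit of y
    return z

# Point A with (x_A, y_A) = (00, 11):
print(format(z_value(0b00, 0b11, 2), '04b'))  # 0101, i.e. 5
```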
30
Deriving Z-Values for a Region
  • Obtain a quad-tree decomposition of an object by
    recursively dividing it into blocks until blocks
    are homogeneous.

[Figure: quad-tree decomposition of an object on a grid whose
quadrants are labeled 00, 01, 10, 11]
The object's representation is {0001, 0011, 01}
31
Disk Based Storage
  • For disk storage, represent object based on its
    Z-value
  • Use a B-tree index.
  • Range Query
  • translate query range to Z values
  • search B-tree with Z-values of data regions for
    matches

32
Nearest Neighbor Search
  • Retrieve the nearest neighbor of query point Q
  • Simple strategy
  • convert the nearest neighbor search to a range
    search
  • guess a range around Q that contains at least one
    object, say O
  • if the current guess does not include any
    answers, increase the range size until an object
    is found
  • compute the distance d between Q and O
  • re-execute the range query with distance d
    around Q
  • compute the distance of Q from each retrieved
    object. The object at minimum distance is the
    nearest neighbor! Why?
  • Issues: how to guess the range; the retrieval may
    be sub-optimal if an incorrect range is guessed.
    This becomes a problem in high dimensional spaces.
    (See the sketch below.)
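A sketch of this simple strategy over an in-memory point set (a real system would run the range queries against a multidimensional index instead of scanning a list); initial_r is an arbitrary first guess.

```python
import math

def nn_via_range(points, q, initial_r=1.0):
    """Nearest neighbor via repeated range searches around Q."""
    r = initial_r
    while True:                               # grow the guess until an
        hits = [p for p in points             # object O is found
                if math.dist(p, q) <= r]
        if hits:
            break
        r *= 2
    d = min(math.dist(p, q) for p in hits)    # distance d to some object O
    # Re-execute with range d: the true NN is at distance <= d, so it is
    # in this result set; taking the minimum over it is exact.
    final = [p for p in points if math.dist(p, q) <= d]
    return min(final, key=lambda p: math.dist(p, q))

pts = [(1, 1), (4, 5), (9, 2)]
print(nn_via_range(pts, (3, 4)))   # (4, 5)
```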

33
Nearest Neighbor Search using Range Searches
[Figure: an initial range search around Q, then a revised range
search whose radius is the distance between Q and the first found
object A]
An optimal strategy that performs the minimum possible number of
I/Os is possible using priority queues.
34
Alternative Strategy to Evaluating K-NN
  • Let Q be the query point
  • Traverse nodes in the data structure in increasing
    order of MINDIST(Q, N), where
  • MINDIST(Q, N) = dist(Q, N), if N is an object
  • MINDIST(Q, N) = minimum distance between Q and any
    object in N, if N is an interior node

[Figure: query point Q with MINDIST(Q, A), MINDIST(Q, B), and
MINDIST(Q, C) to bounding rectangles A, B, and C]
35
MINDIST Between Rectangle and Point
[Figure: MINDIST between a point Q and a rectangle: 0 if Q lies
inside; otherwise the distance from Q to the nearest face, edge, or
corner]
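A sketch combining both slides: MINDIST from a point to a rectangle, and a best-first traversal using a priority queue ordered by MINDIST. The node objects are assumed to carry .rect, .is_object, and .children (for a point object, rect degenerates to the point itself, so MINDIST equals dist); these attribute names are illustrative.

```python
import heapq
import math

def mindist(q, rect):
    """MINDIST(Q, N): 0 if Q is inside the rectangle, else the distance
    to the nearest face/edge/corner."""
    lo, hi = rect
    return math.sqrt(sum(max(l - qi, 0.0, qi - h) ** 2
                         for qi, l, h in zip(q, lo, hi)))

def best_first_nn(root, q):
    """Pop nodes in MINDIST order; the first object popped is the NN,
    since each unexplored node's MINDIST lower-bounds its contents."""
    heap = [(mindist(q, root.rect), id(root), root)]
    while heap:
        d, _, n = heapq.heappop(heap)
        if n.is_object:
            return n, d
        for child in n.children:
            heapq.heappush(heap, (mindist(q, child.rect), id(child), child))
```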
36
Generalized Search Trees
  • Motivation
  • disparate applications require different data
    structures and access methods.
  • Requires separate code for each data structure to
    be integrated with the database code
  • too much effort.
  • Vendors will not spend time and energy unless the
    application is very important or the data structure
    has general applicability
  • Generalized search trees abstract the notion of a
    data structure into a template
  • Basic observation: most tree data structures are
    similar, and a lot of the bookkeeping and
    implementation details are the same
  • Different data structures can be seen as
    refinements of the basic GiST structure;
    refinements are specified by registering a set of
    functions per data structure with the GiST

37
GiST supports extensibility both in terms of data
types and queries
  • GiST is like a template - it defines its
    interface in terms of ADT rather than physical
    elements (like nodes, pointers etc.)
  • An access method (AM) implementor customizes GiST
    by defining their own ADT class, i.e., you just
    define the ADT class and you have your access
    method implemented!
  • No concern about search/insertion/deletion or
    structural modifications like node splits

38
Integrating Multidimensional Index Structures as
AMs in DBMS
Generalized Search Trees (GiSTs)
[Figure: a GiST whose entries hold arbitrary predicates, e.g.
x > 5 and y > 4, x > 4 and y <= 3, x + y <= 12, x + y > 12, x <= 3,
x <= 6, x > 6, y <= 5, y > 5, with data nodes containing points]
39
Problems with Existing Approaches for Feature
Indexing
  • Very high dimensionality of feature spaces --
    e.g., shape may define a 100-d space
  • Traditional multidim. data structures perform
    worse than linear scan at such high
    dimensionality (the dimensionality curse)
  • Arbitrary distance functions -- e.g., distance
    functions may change across iterations of
    relevance feedback
  • Traditional multidim. data structures support a
    fixed distance measure -- usually Euclidean (L2)
    or Lmax
  • No support for multi-point queries -- as in query
    expansion
  • executing k-NN for each query point and merging
    results to generate the k-NN of a multi-point
    query is very expensive
  • No support for refinement
  • queries in later iterations do not diverge
    greatly from queries in previous iterations;
    effort spent in previous iterations should be
    exploited for evaluating k-NN in future iterations

40
High Dimensional Feature Indexing
  • Multidim. Data Structures
  • design data structures that scale to high dim.
    spaces
  • Existing proposals perform worse than linear scan
    over > 10-dim. spaces [Weber et al., VLDB 1998]
  • Fundamental limitation: a dimensionality beyond
    which linear scan wins over indexing! (approx.
    6-10)
  • Dimensionality Reduction
  • transform points in high dim. space to low dim.
    space
  • works well when data correlated into a few
    dimensions only
  • difficult to manage in dynamic environments

41
Classification of Multidimensional Index
Structures
  • Data Partitioning (DP)
  • Bounding Region (BR) Based e.g., R-tree, X-tree,
    SS-tree, SR-tree, M-tree
  • All k dim. used to represent partitioning
  • Poor scalability to dimensionality due to high
    degree of overlap and low fanout at high
    dimensions
  • seq. scan wins for > 10-d
  • Space Partitioning(SP)
  • Based on disjoint partitioning of space e.g.,
    KDB-tree, hB-tree, LSDh-tree, VP tree, MVP tree
  • no overlap and fanout independent of dimensions
  • Poor scalability to dimensionality due to either
    poor storage utilization or redundant information
    storage requirements.

42
Classification of Multidimensional Data Structures
[Figure: taxonomy of data-partitioning (BR-based) vs.
space-partitioning index structures]
43
Hybrid Tree Space Partitioning (SP) instead of
Data Partitioning (DP)
[Figure: hybrid tree node layout. The non-leaf nodes of the hybrid
tree are organized as a kd-tree (e.g., split on dim1 at pos 3, then
dim2 at pos 3 and pos 2), whose leaves R1-R4 point to child nodes
A-D containing data points]
44
Splitting of Non-Leaf Nodes (Easy case)
[Figure: easy-case split. The node's internal kd-tree is cut at its
root (dim1, pos 4); regions A-F and the kd-subtrees (dim2 pos 3,
dim2 pos 4, dim1 pos 2, dim1 pos 6) are distributed between the two
new nodes]
A clean split is possible without violating node utilization.
45
Splitting of Non-Leaf Nodes (Difficult case)
When a clean split is not possible without violating node
utilization, there are three options:
  • always clean split: leads to downward cascading
    splits (empty nodes)
  • complex splits: space overhead, the tree becomes
    large
  • allow overlap: avoid overlap by relaxing node
    utilization where possible, otherwise minimize it
    (the Hybrid Tree approach)
46
Splitting of Non-Leaf Nodes (Difficult case)
[Figure: difficult-case split. The kd-tree is cut along dim1 with an
overlapping split interval; each kd-tree node records a low and a
high split position (e.g., dim2 pos (3,4), dim1 pos (4,4), dim2
pos (3,3), dim2 pos (4,4), dim1 pos (2,2), dim1 pos (6,6), dim2
pos (5,5), dim2 pos (2,2), dim1 pos (7,7)) bounding regions A
through I]
47
Choosing Split dimension and position EDA
(Expected Disk Accesses) Analysis
Consider a range (cube) query with side length r along each
dimension. A node whose bounding region (BR) has side w_j along
dimension j is accessed iff the query point falls within the node BR
expanded by r/2 on each side along each dimension (the Minkowski
sum). Assuming a (0,1)^d space and uniformly distributed queries,
the probability of a range query accessing the node is the product
over j of (w_j + r). Splitting the node adds the probability that
the query accesses both resulting nodes (the increase in EDA).
Choose the split dimension and position that minimize this increase
in EDA.
[Figure: node BR of width w, expanded by r/2 per side to width
w + r]
48
Choosing Split Dimension and Position (based on EDA analysis)
  • Data node splitting
  • split dimension: split along the maximum-spread
    dimension
  • split position: split as close to the middle as
    possible (without violating node utilization)
  • Index node splitting
  • split dimension: argmin_j of the integral of
    P(r) (w_j + r)/(s_j + r) dr
  • depends on the distribution of query sizes
  • argmin_j (w_j + R)/(s_j + R) when all queries
    are cubes with side length R
  • split position: avoid overlap if possible, else
    introduce as little overlap as possible without
    violating utilization constraints (see the sketch
    below)
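A sketch of the two pieces that follow directly from the slides: the Minkowski-sum access probability for a cube query of side r, and the data-node split rule (max-spread dimension, position near the middle). The index-node criterion with the query-size integral is omitted; function names are illustrative.

```python
def access_prob(widths, r):
    """P(cube query of side r hits a node with BR side lengths widths),
    assuming a (0,1)^d space: expand the BR by r/2 per side, i.e.
    multiply out (w_j + r)."""
    p = 1.0
    for w in widths:
        p *= min(1.0, w + r)
    return p

def data_node_split(points):
    """Data node: max-spread dimension, split as close to the middle
    as possible (utilization constraints omitted here)."""
    dims = range(len(points[0]))
    spread = lambda j: (max(p[j] for p in points) -
                        min(p[j] for p in points))
    j = max(dims, key=spread)
    pos = sorted(p[j] for p in points)[len(points) // 2]
    return j, pos
```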

49
Dead Space Elimination
[Figure: the same four data regions A-D indexed three ways: space
partitioning (hybrid tree) without dead space elimination, where
partitions R1-R4 include dead space; data partitioning (R-tree),
with no dead space; and space partitioning (hybrid tree) with dead
space elimination]
50
Dead Space Elimination
  • Live space encoding using 3-bit precision
    (ELS_PRECISION = 3)
  • Encoded Live Space (ELS) BR = (001, 001, 101, 111)
  • Bits required = 2 * num_dims * ELS_PRECISION
  • Compression ratio = ELS_PRECISION/32
  • Only applied to leaf nodes

[Figure: the live-space BR quantized onto an 8x8 grid with
coordinates 000-111 along each axis]
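A rough sketch of one plausible live-space encoding at 3-bit precision; the exact bit layout in the hybrid tree paper may differ, so treat this as illustrative.

```python
import math

def encode_live_space(node_br, live_br, precision=3):
    """Quantize the live-space BR onto a 2^precision grid inside the
    node BR, rounding outward so no live point is cut off.
    Storage: 2 * num_dims * precision bits."""
    cells = 1 << precision
    (nlo, nhi), (llo, lhi) = node_br, live_br
    code = []
    for d in range(len(nlo)):
        step = (nhi[d] - nlo[d]) / cells
        lo = int((llo[d] - nlo[d]) / step)                   # round down
        hi = min(cells - 1,
                 math.ceil((lhi[d] - nlo[d]) / step) - 1)    # round up
        code.append((lo, hi))
    return code

# Unit-square node BR with live space [0.2, 0.8] x [0.2, 1.0]:
print(encode_live_space(((0, 0), (1, 1)), ((0.2, 0.2), (0.8, 1.0))))
```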
51
Tree operations
  • Search
  • point, range, NN, and other distance-based
    searches, as in DP techniques
  • reason: the BR representation can be derived from
    the kd-tree representation
  • exploit the kd-tree organization (pruning) for
    fast intra-node search
  • Insertion
  • recursively choose the space partition that
    contains the point
  • break ties arbitrarily
  • no volume computation (which would cause floating
    point exceptions at high dims)
  • Deletion
  • details in the thesis

52
Mapping of kd-tree representation to Bounding
Rectangle (BR) representation
Search algorithms developed for R-tree can be
used directly
[Figure: the kd-tree from the earlier split example (dim2 pos (3,4),
dim1 pos (4,4), dim1 pos (2,2), dim1 pos (6,6), dim2 pos (5,5),
dim2 pos (2,2), dim1 pos (7,7)) mapped to the bounding rectangles of
regions A through I]
54
Other Queries (Lp metrics and weights)
[Figure: query shapes for range and k-NN queries under different
distance functions: Euclidean distance (circle), weighted Euclidean
(axis-aligned ellipse), weighted Manhattan (diamond)]
55
Advantages of Hybrid Tree
  • More scalable to high dimensionalities than DP
    techniques (R-tree-like index structures)
  • fanout independent of dimensionality: high fanout
    even at high dims
  • faster intra-node search due to the kd-tree-based
    organization
  • no overlap at the lowest level, low overlap at
    higher levels
  • More scalable than SP techniques
  • guaranteed node utilization
  • no costly cascading splits
  • EDA-optimal choice of splits
  • Supports arbitrary distance functions


56
Experiments
  • Effect of ELS encoding
  • Test scalability of hybrid tree to high
    dimensionalities
  • Compare performance of hybrid tree with SR-tree
    (data partitioning), hB-tree (space partitioning)
    and sequential scan
  • Data Sets
  • Fourier Data set (16-d Fourier vectors, 1.2
    million)
  • Color Histograms for COREL images (64-d color
    histograms from 70K images)

57
Experimental Results
[Figure: I/O cost comparison; the cost factor between sequential and
random I/O is accounted for]
58
Summary of Results
  • Hybrid Tree scales well to high dimensionalities
  • Outperforms linear scan even at 64-d (mainly due
    to significantly lower CPU cost)
  • Order of magnitude better than SR-tree (DP) and
    hB-tree (SP) both in terms of I/O and CPU costs
    at all dimensionalities
  • Performance gap increases with the increase in
    dimensionality
  • Efficiently supports arbitrary distance functions

59
Exploiting Correlation in Data
  • Dimensionality curse persists
  • To achieve further scalability, dimensionality
    reduction (DR) is commonly used in conjunction
    with index structures
  • Exploit correlations in high dimensional data

[Figure: expected query cost vs. dimensionality graph (hand drawn)]
60
Dimensionality Reduction
  • First perform Principal Component Analysis (PCA),
    then build index on reduced space
  • Distances in reduced space lower bound distances
    in original space
  • Range queries
  • map the query point, run the range query with the
    same range, and eliminate false positives
  • k-NN queries (a bit more complex)
  • DR increases efficiency, not the quality of
    answers

[Figure: points projected onto the first principal component (PC);
a range query of radius r keeps the same radius r in the reduced
space]
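A numpy sketch of GDR: project onto the top-k principal components and answer a range query with the same radius. Because projection onto an orthonormal basis can only shrink distances, the candidate set has no false negatives; false positives are removed against the original vectors. Illustrative code, not the authors'.

```python
import numpy as np

def pca_reduce(X, k):
    """Return the k-dim projection of X plus the PCs and mean."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    pcs = vt[:k]                                # rows = top-k PCs
    return (X - mean) @ pcs.T, pcs, mean

rng = np.random.default_rng(0)
X = rng.random((1000, 64))
Xr, pcs, mean = pca_reduce(X, 8)

q, r = X[0], 0.9
qr = (q - mean) @ pcs.T
# Reduced-space distances lower-bound original ones: no false negatives.
cand = np.where(np.linalg.norm(Xr - qr, axis=1) <= r)[0]
hits = [i for i in cand if np.linalg.norm(X[i] - q) <= r]
```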
61
Global Dimensionality Reduction (GDR)
[Figure: globally correlated data with a single first principal
component (PC) vs. data split into clusters, each with its own
first PC]
  • works well only when data is globally correlated
  • otherwise too many false positives result in high
    query cost
  • solution: find local correlations instead of a
    single global correlation

62
Local Dimensionality Reduction (LDR)
[Figure: on locally correlated data, GDR fits one global first PC
poorly, while LDR fits a separate first PC per cluster]
63
Overview of LDR Technique
  • Identify Correlated Clusters in dataset
  • Definition of correlated clusters
  • Bounding loss of information
  • Clustering Algorithm
  • Indexing the Clusters
  • Index Structure
  • Point Search, Range search and k-NN search
  • Insertion and deletion

64
Correlated Cluster
[Figure: a correlated cluster. The first PC is the retained
dimension and the second PC is eliminated; the cluster centroid is
the projection of the mean of all points on the eliminated
dimension]
A correlated cluster is a set of locally correlated points:
<PCs, subspace dimensionality, centroid, points>
65
Reconstruction Distance
[Figure: ReconDist(Q, S) is the distance between point Q and its
projection onto the retained dimension (first PC) of cluster S,
measured along the eliminated dimension (second PC)]
66
Reconstruction Distance Bound
[Figure: all points of the cluster lie within MaxReconDist of the
retained subspace on either side]
ReconDist(P, S) <= MaxReconDist, for all P in S
67
Other constraints
  • Dimensionality bound: a cluster must not retain
    more dimensions than necessary, and the subspace
    dimensionality must be <= MaxDim
  • Size bound: the number of points in the cluster
    must be >= MinSize

68
Clustering Algorithm Step 1 Construct Spatial
Clusters
  • Choose a set of well-scattered points as
    centroids (a piercing set) from a random sample
  • Group each point P in the dataset with its
    closest centroid C if Dist(P, C) <= epsilon

69
Clustering Algorithm Step 2 Choose PCs for each
cluster
  • Compute PCs

70
Clustering AlgorithmStep 3 Compute Subspace
Dimensionality
  • Assign each point to the cluster that needs the
    fewest dimensions to accommodate it
  • The subspace dimensionality of each cluster is the
    minimum number of dimensions that must be retained
    to keep most of its points

71
Clustering Algorithm Step 4 Recluster points
  • Assign each point P to the first cluster S such
    that ReconDist(P, S) <= MaxReconDist
  • assigning to the first such cluster, rather than
    splitting points among all qualifying clusters,
    overcomes the splitting problem

[Figure: reclustering may leave some clusters empty]
72
Clustering algorithmStep 5 Map points
  • Eliminate small clusters
  • Map each point to the subspace of its cluster
    (also store its reconstruction distance)

[Figure: points mapped from the original space to each cluster's
subspace]
73
Clustering algorithmStep 6 Iterate
  • Iterate for more clusters as long as new clusters
    are being found among the outliers
  • Overall complexity: 3 passes, O(N * D^2 * K)
    (see the sketch below)
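A sketch of ReconDist and the reclustering step. Each cluster is assumed to be a tuple (centroid, pcs, k) with pcs a full orthonormal matrix whose rows are the PCs and k the subspace dimensionality; for simplicity the cluster mean serves as the reference point, whereas the slides define the centroid as a projection of the mean, so treat this as illustrative.

```python
import numpy as np

def recon_dist(p, centroid, pcs, k):
    """Distance from p to its projection onto the cluster's k-dim
    retained subspace (spanned by the top-k PCs through centroid)."""
    y = (p - centroid) @ pcs.T       # coordinates in the PC basis
    y[k:] = 0.0                      # drop the eliminated dimensions
    return np.linalg.norm(p - (centroid + y @ pcs))

def recluster(points, clusters, max_recon_dist):
    """Step 4: put each point in the FIRST cluster that represents it
    within MaxReconDist; points fitting no cluster become outliers."""
    members = {i: [] for i in range(len(clusters))}
    outliers = []
    for p in points:
        for i, (c, pcs, k) in enumerate(clusters):
            if recon_dist(p, c, pcs, k) <= max_recon_dist:
                members[i].append(p)
                break
        else:
            outliers.append(p)
    return members, outliers
```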

74
Experiments (Part 1)
  • Precision Experiments
  • Compare information loss in GDR and LDR for same
    reduced dimensionality
  • precision = |original-space result| /
    |reduced-space result| (for range queries)
  • note: precision measures efficiency, not answer
    quality

75
Datasets
  • Synthetic dataset
  • 64-d data, 100,000 points, generates clusters in
    different subspaces (cluster sizes and subspace
    dimensionalities follow Zipf distribution),
    contains noise
  • Real dataset
  • 64-d data (8x8 color histograms extracted from
    70,000 images in the Corel collection), available
    at http://kdd.ics.uci.edu/databases/CorelFeatures

76
Precision Experiments (1)
77
Precision Experiments (2)
78
Index structure
  • Root node: contains pointers to the root of each
    cluster index (also stores the PCs and subspace
    dimensionality of each cluster)
  • Set of outliers (no index: sequential scan)
  • One index per cluster (Index on Cluster 1, ...,
    Index on Cluster K)
  • Properties: (1) disk based;
    (2) height <= 1 + height(original space index);
    (3) almost balanced
79
Experiments (Part 2)
  • Cost Experiments
  • Compare linear scan, Original Space Index(OSI),
    GDR and LDR in terms of I/O and CPU costs. We
    used hybrid tree index structure for OSI, GDR and
    LDR.
  • Cost formulae
  • linear scan: I/O cost (random accesses) =
    file_size/10, plus CPU cost
  • OSI: I/O cost = number of index nodes visited,
    plus CPU cost
  • GDR: I/O cost = index cost + post-processing cost
    (to eliminate false positives), plus CPU cost
  • LDR: I/O cost = index cost + post-processing cost
    + outlier_file_size/10, plus CPU cost

80
I/O Cost (random disk accesses)
81
CPU Cost (only computation time)
82
Summary of LDR
  • LDR is a powerful dimensionality reduction
    technique for high dimensional data
  • reduces dimensionality with lower loss in
    distance information compared to GDR
  • achieves significantly lower query cost compared
    to linear scan, original space index and GDR
  • LDR is a general technique to deal with high
    dimensionality
  • our experience shows that high dimensional
    datasets often have local correlations -- LDR is
    the only technique that can discover and exploit
    them
  • applications beyond indexing: selectivity
    estimation, data mining, etc. on high dimensional
    data (currently exploring)