Spatial Indexing - PowerPoint PPT Presentation

1 / 34

About This Presentation

Title:

Spatial Indexing

Description:

... Mesh invented at Johns Hopkins University now used by several projects. ... Possible do-it-yourself approach: split table into separate DBMS instances ... – PowerPoint PPT presentation

Number of Views:84

Avg rating:3.0/5.0

Slides: 35

Provided by: GilesMur7

Category:

more less

Transcript and Presenter's Notes

Title: Spatial Indexing

1
Spatial Indexing

Clive Page

2
(No Transcript)
3
Formats of Raw Data

Radio
Complex visibility for each polarisation at set
of points sampling the complex (u,v) plane.
Infra-red, Optical, Ultra-violet
Images from 1k1k to 18k20k, collected at
intervals of a few seconds to several minutes.
X-ray, Gamma-ray
Photon-even lists get properties (x, y, time,
energy) for each detected photon. May get a list
of millions of these over an integration period
of a few hours.

4
Formats of Reduced Data

Images
Time-series
Spectra
Catalogues of astronomical objects
Vital to cross-identify objects from different
wavebands, basis for many subsequent data mining
investigations.
Problem tables can be long or wide

Catalogue Rows Columns
Optical USNO-B 1,045,913,669 30
Infra-red 2MASS 470,992,970 60
X-ray 1XMM 56,711 379
5
Main Functionality

SELECT objects in a given small patch of sky
In a rectangle to cover same region as an image
In a circle to cover a radius around point of
interest (also known as search in a cone).
In a polygon e.g. around extended object.
Spatial JOIN Cross-match objects from, for
example, two wavebands or two epochs.
Principal matching criterion is overlap of
error-circles.
Often important to find objects in one table
which are NOT matched in the other need a left
outer join.

6
Functionality (continued)

Self JOIN useful in some applications, for
example
Search for clusters of galaxies
Search for double stars
Find spatial distribution or spatial correlation
functions for particular classes of objects
Those observing with adaptive optics need a
suitable reference star in or near each field,
e.g. might want to find all AGNs near to bright
stars to plan an observing campaign.

7
Current on-line services

Search in a cone many on-line services
Cross-matching (spatial JOIN) few current
examples
Astrobrowse (at GSFC) has pre-computed joins for
a limited number of important catalogues.
Vizier service (CDS) allows cross-match of users
own table of positions with list of catalogues.
But results of list of N sources and M catalogues
are presented as unmerged list of MN separate
tables.
Skyserver joins between SDSS and other tables.
Self-join do not know of any on-line services at
present.

8
Problems handling object catalogues

Positions use spherical-polar coordinates (RA,
Dec)
Right Ascension corresponds to geographic
longitude
Declination corresponds to geographic latitude
There are singularities at the poles and
distortions in the scales everywhere except at
the equator.
RA wraps from 24 hours (360 degrees) to zero.
All object positions are imprecise ? positions
have an error radius.
Distances between points must use a great-circle
distance function not cartesian distance.
Two-dimensional indexing is really needed

9
Indexing Possibilities

Home-brew slicing of the sky
Use B-tree on one spatial axis only
Use 1-d to 2-d mapping function then simple
B-tree
Use true spatial index such as R-tree.

10
Home-brew slicing of sky

Both USNO-B and 2MASS issued as separate files,
each covering 0.1 degree in declination, sorted
in RA within each strip.
Software from Harvard-Smithsonian Center for
Astrophysics (WCSTOOLS) allows fairly efficient
access to data stored in these strips.
WFCSTOOLS is used by many astronomical archives
around the world, (including LEDAS
www.ledas.ac.uk).
Ok for cone-search services, not suitable for
spatial joins.

11
Index one spatial axis only

Widely used, including Vizier collection of 3000
astronomical tables.
Index on declination (avoids RA wrap-around
problem)
Poor (but acceptable) performance on cone-search
E.g. consider USNO-B a table of a billion rows
Typical search/join uses a radius of say 3
arc-seconds.
17 chance of finding a match at a random
position.
Index on declination effectively searches a strip
360 x 6 arc-seconds get around 10,000 rows
matching. Need to check all these to find the (0
or 1) true matches.
Conclusion might gain five orders of magnitude
in efficiency by using a true 2-d index.

12
One-dimensional index spatial join problem

Very inefficient, and efficiency is needed when
joining one large table with another
Join criterion is, typically
SELECT FROM cat1,cat2 WHERE
ABS(dec1 - dec2) lt combinedError
AND
I have not yet found any DBMS with an optimiser
which uses an index when confronted with a join
criterion of this form (or indeed any other
expression for joining on an overlap of error
ranges).

13
Mapping functions (2-d to 1-d)

Cover the space with cells (pixels) and number
them.
Create conventional B-tree on resulting set of
integers.
Each point in the sky maps to a single integer.
An area maps to a set of integers.
A seductive idea
If a small spatial area maps to a smallish range
of integers, then a search over a spatial area
might be done by a B-tree search over a small
integer range.
Various space-filling curves have been used in
the hope that this works in practice, e.g.
Z-order index, Hilbert, Peano curves, and many
others.

14
Hilbert Curve
15
Z-order (bit-interleaved) Mapping Function
16
Space-filling Curves as Mapping Functions

Excellent performance when searching for single
points.
For area searches, the median performance is
adequate, but all curves suffer the same
drawback
There exist cases in which nearby points in space
are very far apart on the curve (in the Z-order
index this corresponds to a high-order bit
flipping).
Performance tests confirm this defect the
worst-case performance is so abysmal that the
average performance is very poor.
Another problem simple cartesian grids also
unsuited to spherical-polar coordinate searches
as there are too many tiny distorted pixels near
the poles.

17
Better Mapping Functions

Aim cover sky uniformly with pixels - integer
pixel-codes.
HEALPix - Hierarchical Equal Area iso-Latitude
Pixelisation invented at ESO for COBE.
Pseudo-square pixels, 2-d arrays of pixels
suitable for analysis of large-scale spatial
structures.
HTM - Hierarchical Triangular Mesh invented at
Johns Hopkins University now used by several
projects.
Triangular pixels.
Hierarchical numbering high-order bits are a
valid HTM index for coarser grid.
Both algorithms perfectly good when searching for
points.

18
Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix)
19
Hierarchical Triangular Mesh (HTM)
20
Spatial Join using Pixel-code Method

Given table CAT1 with columns
ID1 primary key
RA
DEC
POSERR
MAGNITUDE
etc

Given table CAT2 with columns
ID2 primary key
RA
DEC
POSERR
FLUX
etc

21
Create tables P1, P2 with pixel-code column

Given table CAT1 with columns
ID1 primary key
RA
DEC
POSERR
MAGNITUDE
etc
Create table P1
ID1 foreign key
PCODE1 primary key
Note P1 and P2 have extra rows where
error-circle overlaps two or more pixels.

Given table CAT2 with columns
ID2 primary key
RA
DEC
POSERR
FLUX
etc
Create table P2
ID2 foreign key
PCODE2 primary key

22
Join P1 and P2 on pixel-codes creating PJOIN

Given table CAT1 with columns
ID1 primary key
RA
DEC
POSERR
MAGNITUDE
etc
Create table P1
ID1 foreign key
PCODE1 primary key
Create table PJOIN by joining P1 and P2 on
PCODE1PCODE2
ID1
ID2

Given table CAT2 with columns
ID2 primary key
RA
DEC
POSERR
FLUX
etc
Create table P2
ID2 foreign key
PCODE2 primary key

23
PJOIN table

Join using SELECT DISTINCT to remove any
duplicates (two error-circles may each overlap
two or more pixels).
Table PJOIN identifies all pixels where
error-circles potentially overlap
Circles may or may not actually overlap, may just
be nearby in the same pixels.
Next step create B-tree index on PJOIN(ID1)

24
Use PJOIN table to match catalogue rows

Three-way join then produces required results,
e.g.
SELECT cols FROM CAT1, PJOIN, CAT2
WHERE CAT1.ID1PJOIN.ID1
AND PJOIN.ID2CAT2.ID2
AND (2 asin(sqrt(pow(sin((cat1.dec-
cat2.dec)/2),2) cos(cat1.dec) cos(cat2.dec)
pow(sin((cat1.ra- cat2.ra)/2),2)))
lt cat1.poserrcat2.poserr)
This works, and speed appears good, but more
testing is needed on a wider range of datasets.

25
True Spatial Indexing

Hot topic of research in computer science
departments for more than 20 years
Very many algorithms have been proposed
BANG file, BV-tree, Buddy tree, Cell tree,
G-tree, GBD-tree, Gridfile, hB-tree, kd-tree,
LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid
tree, Q0-tree, Quadtree, R-tree, SKD-tree,
SR-tree, SS-tree, TV-tree, UB-tree.
So many alternatives, but none of them has
properties as good as the B-tree in one dimension
(e.g. compact and efficient, with fairly good
worst-case performance).
R-tree one of the earliest structures, one of the
best.
R-trees are built into several modern DBMS.

26
Spatial Options in current DBMS
DBMS Spatial Index
DB2 Spatial Extender multi-level grid file
Informix R-tree indexing included
Ingres
Interbase/Firebird
Microsoft SQL Server
MySQL R-trees on arbitrary polygons
Microsoft SQL Server
Oracle Spatial option R-trees
Postgres R-trees on rectangular boxes
Sybase Spatial Option (Boeing SQS) R-trees
27
Using R-trees

Must draw a rectangular box outside each error
circle
Boxes get rather extended (along RA axis) near
poles
Need a subsequent filter to remove spurious
matches where rectangles overlap but
error-circles do not.
If the error criterion alters (e.g. user wants
99 probability circle rather than 90) need to
recreate column of boxes, and then recreate the
R-tree index.
Solution always use a box as large as anyone
could possible want, subsequent filtering on
error-circles is still quite cheap.

28
R-tree Performance

Postgres R-tree indexing works as advertised
R-trees are large, creation is slow, e.g. 2 hours
for table of 3.5 million rows.
MySQL latest version allows R-trees on any
polygon (but anything above 4 sides is wasteful).
Very verbose external data format
Works as advertised, not yet measured
performance.
Informix Kalpakis et al. (ADASS conference
report) loaded part of USNO-A2 and found data
load and R-tree creation would have taken 39 days
for the entire table of 500 million rows.

29
Comparison of Pixel-code and R-tree Methods

Advantages
Pcode join seems to be faster (but not yet
benchmarked with identical systems).
Takes up less disc space in total.
Can use any DBMS, not just those with an R-tree
or other spatial data option.
Disadvantages
Additional tables and indices have to be created.
More complex set of joins.
Needs external code as neither HTM or HEALPix can
be expressed as an SQL-callable function (because
they return a variable-length array of integers).

30
Indexing summary

Indexing on just one spatial axis is simply too
inefficient for large tables and cannot support
joins.
R-trees are powerful and easy to use, but index
creation times are a serious cause for concern.
2d?1d mapping functions such as HTM or HEALPix
are more complicated to use but may be faster in
some cases.
AstroGrid will continue work in this area.

31
Idea HTM as Universal Position Locator?

Can choose HTM (or HEALPix) resolution as high as
needed
32-bit words pixels around 24 arcsec on a side.
More pixels than objects in largest current
catalogue
64-bit words pixels around 0.4 milliarcsec on a
side.
Ample precision for some time to come
Maybe every object catalogue should list the UPL
of each celestial object, allowing fast and easy
searches and joins?
Fine for points in sky, but real objects have
finite extent, and some will correspond to a more
than one UPL
Some objects move should pixel-code change with
time?
Reference frame also moves

32
UPLs for extended objects possible work-arounds

Choose a pixel size large enough to encompass the
error region in each case
In some cases this would be very large indeed
Different objects would have different pixel size
Have extra rows in tables for objects crossing
pixel boundaries.
Need to alter original tables, not always
acceptable.
Add variable-length vector to table to contain
list of UPLs
Non-normalised table not allowed by most DBMS
Introduce separate table to map object IDs to
UPLs.
Adds significant complexity to search/join
operations

33
Speeding up Spatial Data Access

Use Parallel Hardware such as Beowulf Clusters
We may be able to spread a large table over the
disc drives of several nodes in a cluster for
faster searches.
Some support for this is built into DBMS such as
Oracle, DB2 and SQL Server. There is currently
nothing to support it in Open Source products
such as MySQL and Postgres.
Possible do-it-yourself approach split table
into separate DBMS instances on separate nodes.
Farm out query to all instances, then combine the
results.
Column-oriented storage
Have tested Sybase-IQ performance good, but
expensive. No multi-dimensional indexing.

34
Can we do Distributed Searches and Joins?

Object catalogues of interest are located on data
archive servers all over the world. Is it
feasible to search or join without copying entire
tables?
Cone search services are becoming widespread. A
few servers allow a single search to be farmed
out to many servers, no attempt is made to merge
the results, they are just concatenated. Room
for improvement here?
An outer join requires every row in table1 to be
present in the output, so there is little point
in attempting a join over the network, when it is
simpler to copy table1 to the host holding
table2.
Perhaps need to set up astronomical data
warehouse with services to import tables and join
them.