Title: Spatial Indexing
1Spatial Indexing
2(No Transcript)
3Formats of Raw Data
- Radio
- Complex visibility for each polarisation at set
of points sampling the complex (u,v) plane. - Infra-red, Optical, Ultra-violet
- Images from 1k1k to 18k20k, collected at
intervals of a few seconds to several minutes. - X-ray, Gamma-ray
- Photon-even lists get properties (x, y, time,
energy) for each detected photon. May get a list
of millions of these over an integration period
of a few hours.
4Formats of Reduced Data
- Images
- Time-series
- Spectra
- Catalogues of astronomical objects
- Vital to cross-identify objects from different
wavebands, basis for many subsequent data mining
investigations. - Problem tables can be long or wide
Catalogue Rows Columns
Optical USNO-B 1,045,913,669 30
Infra-red 2MASS 470,992,970 60
X-ray 1XMM 56,711 379
5Main Functionality
- SELECT objects in a given small patch of sky
- In a rectangle to cover same region as an image
- In a circle to cover a radius around point of
interest (also known as search in a cone). - In a polygon e.g. around extended object.
- Spatial JOIN Cross-match objects from, for
example, two wavebands or two epochs. - Principal matching criterion is overlap of
error-circles. - Often important to find objects in one table
which are NOT matched in the other need a left
outer join.
6Functionality (continued)
- Self JOIN useful in some applications, for
example - Search for clusters of galaxies
- Search for double stars
- Find spatial distribution or spatial correlation
functions for particular classes of objects - Those observing with adaptive optics need a
suitable reference star in or near each field,
e.g. might want to find all AGNs near to bright
stars to plan an observing campaign.
7Current on-line services
- Search in a cone many on-line services
- Cross-matching (spatial JOIN) few current
examples - Astrobrowse (at GSFC) has pre-computed joins for
a limited number of important catalogues. - Vizier service (CDS) allows cross-match of users
own table of positions with list of catalogues. - But results of list of N sources and M catalogues
are presented as unmerged list of MN separate
tables. - Skyserver joins between SDSS and other tables.
- Self-join do not know of any on-line services at
present.
8Problems handling object catalogues
- Positions use spherical-polar coordinates (RA,
Dec) - Right Ascension corresponds to geographic
longitude - Declination corresponds to geographic latitude
- There are singularities at the poles and
distortions in the scales everywhere except at
the equator. - RA wraps from 24 hours (360 degrees) to zero.
- All object positions are imprecise ? positions
have an error radius. - Distances between points must use a great-circle
distance function not cartesian distance. - Two-dimensional indexing is really needed
9Indexing Possibilities
- Home-brew slicing of the sky
- Use B-tree on one spatial axis only
- Use 1-d to 2-d mapping function then simple
B-tree - Use true spatial index such as R-tree.
10Home-brew slicing of sky
- Both USNO-B and 2MASS issued as separate files,
each covering 0.1 degree in declination, sorted
in RA within each strip. - Software from Harvard-Smithsonian Center for
Astrophysics (WCSTOOLS) allows fairly efficient
access to data stored in these strips. - WFCSTOOLS is used by many astronomical archives
around the world, (including LEDAS
www.ledas.ac.uk). - Ok for cone-search services, not suitable for
spatial joins.
11Index one spatial axis only
- Widely used, including Vizier collection of 3000
astronomical tables. - Index on declination (avoids RA wrap-around
problem) - Poor (but acceptable) performance on cone-search
- E.g. consider USNO-B a table of a billion rows
- Typical search/join uses a radius of say 3
arc-seconds. - 17 chance of finding a match at a random
position. - Index on declination effectively searches a strip
360 x 6 arc-seconds get around 10,000 rows
matching. Need to check all these to find the (0
or 1) true matches. - Conclusion might gain five orders of magnitude
in efficiency by using a true 2-d index.
12One-dimensional index spatial join problem
- Very inefficient, and efficiency is needed when
joining one large table with another - Join criterion is, typically
- SELECT FROM cat1,cat2 WHERE
- ABS(dec1 - dec2) lt combinedError
- AND
- I have not yet found any DBMS with an optimiser
which uses an index when confronted with a join
criterion of this form (or indeed any other
expression for joining on an overlap of error
ranges).
13Mapping functions (2-d to 1-d)
- Cover the space with cells (pixels) and number
them. - Create conventional B-tree on resulting set of
integers. - Each point in the sky maps to a single integer.
- An area maps to a set of integers.
- A seductive idea
- If a small spatial area maps to a smallish range
of integers, then a search over a spatial area
might be done by a B-tree search over a small
integer range. - Various space-filling curves have been used in
the hope that this works in practice, e.g.
Z-order index, Hilbert, Peano curves, and many
others.
14Hilbert Curve
15Z-order (bit-interleaved) Mapping Function
16Space-filling Curves as Mapping Functions
- Excellent performance when searching for single
points. - For area searches, the median performance is
adequate, but all curves suffer the same
drawback - There exist cases in which nearby points in space
are very far apart on the curve (in the Z-order
index this corresponds to a high-order bit
flipping). - Performance tests confirm this defect the
worst-case performance is so abysmal that the
average performance is very poor. - Another problem simple cartesian grids also
unsuited to spherical-polar coordinate searches
as there are too many tiny distorted pixels near
the poles.
17Better Mapping Functions
- Aim cover sky uniformly with pixels - integer
pixel-codes. - HEALPix - Hierarchical Equal Area iso-Latitude
Pixelisation invented at ESO for COBE. - Pseudo-square pixels, 2-d arrays of pixels
suitable for analysis of large-scale spatial
structures. - HTM - Hierarchical Triangular Mesh invented at
Johns Hopkins University now used by several
projects. - Triangular pixels.
- Hierarchical numbering high-order bits are a
valid HTM index for coarser grid. - Both algorithms perfectly good when searching for
points.
18Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix)
19Hierarchical Triangular Mesh (HTM)
20Spatial Join using Pixel-code Method
- Given table CAT1 with columns
- ID1 primary key
- RA
- DEC
- POSERR
- MAGNITUDE
- etc
- Given table CAT2 with columns
- ID2 primary key
- RA
- DEC
- POSERR
- FLUX
- etc
21Create tables P1, P2 with pixel-code column
- Given table CAT1 with columns
- ID1 primary key
- RA
- DEC
- POSERR
- MAGNITUDE
- etc
- Create table P1
- ID1 foreign key
- PCODE1 primary key
- Note P1 and P2 have extra rows where
error-circle overlaps two or more pixels.
- Given table CAT2 with columns
- ID2 primary key
- RA
- DEC
- POSERR
- FLUX
- etc
- Create table P2
- ID2 foreign key
- PCODE2 primary key
22Join P1 and P2 on pixel-codes creating PJOIN
- Given table CAT1 with columns
- ID1 primary key
- RA
- DEC
- POSERR
- MAGNITUDE
- etc
- Create table P1
- ID1 foreign key
- PCODE1 primary key
- Create table PJOIN by joining P1 and P2 on
PCODE1PCODE2 - ID1
- ID2
- Given table CAT2 with columns
- ID2 primary key
- RA
- DEC
- POSERR
- FLUX
- etc
- Create table P2
- ID2 foreign key
- PCODE2 primary key
23PJOIN table
- Join using SELECT DISTINCT to remove any
duplicates (two error-circles may each overlap
two or more pixels). - Table PJOIN identifies all pixels where
error-circles potentially overlap - Circles may or may not actually overlap, may just
be nearby in the same pixels. - Next step create B-tree index on PJOIN(ID1)
24Use PJOIN table to match catalogue rows
- Three-way join then produces required results,
e.g. - SELECT cols FROM CAT1, PJOIN, CAT2
- WHERE CAT1.ID1PJOIN.ID1
- AND PJOIN.ID2CAT2.ID2
- AND (2 asin(sqrt(pow(sin((cat1.dec-
cat2.dec)/2),2) cos(cat1.dec) cos(cat2.dec)
pow(sin((cat1.ra- cat2.ra)/2),2))) - lt cat1.poserrcat2.poserr)
- This works, and speed appears good, but more
testing is needed on a wider range of datasets.
25True Spatial Indexing
- Hot topic of research in computer science
departments for more than 20 years - Very many algorithms have been proposed
- BANG file, BV-tree, Buddy tree, Cell tree,
G-tree, GBD-tree, Gridfile, hB-tree, kd-tree,
LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid
tree, Q0-tree, Quadtree, R-tree, SKD-tree,
SR-tree, SS-tree, TV-tree, UB-tree. - So many alternatives, but none of them has
properties as good as the B-tree in one dimension
(e.g. compact and efficient, with fairly good
worst-case performance). - R-tree one of the earliest structures, one of the
best. - R-trees are built into several modern DBMS.
26Spatial Options in current DBMS
DBMS Spatial Index
DB2 Spatial Extender multi-level grid file
Informix R-tree indexing included
Ingres
Interbase/Firebird
Microsoft SQL Server
MySQL R-trees on arbitrary polygons
Microsoft SQL Server
Oracle Spatial option R-trees
Postgres R-trees on rectangular boxes
Sybase Spatial Option (Boeing SQS) R-trees
27Using R-trees
- Must draw a rectangular box outside each error
circle - Boxes get rather extended (along RA axis) near
poles - Need a subsequent filter to remove spurious
matches where rectangles overlap but
error-circles do not. - If the error criterion alters (e.g. user wants
99 probability circle rather than 90) need to
recreate column of boxes, and then recreate the
R-tree index. - Solution always use a box as large as anyone
could possible want, subsequent filtering on
error-circles is still quite cheap.
28R-tree Performance
- Postgres R-tree indexing works as advertised
- R-trees are large, creation is slow, e.g. 2 hours
for table of 3.5 million rows. - MySQL latest version allows R-trees on any
polygon (but anything above 4 sides is wasteful). - Very verbose external data format
- Works as advertised, not yet measured
performance. - Informix Kalpakis et al. (ADASS conference
report) loaded part of USNO-A2 and found data
load and R-tree creation would have taken 39 days
for the entire table of 500 million rows.
29Comparison of Pixel-code and R-tree Methods
- Advantages
- Pcode join seems to be faster (but not yet
benchmarked with identical systems). - Takes up less disc space in total.
- Can use any DBMS, not just those with an R-tree
or other spatial data option. - Disadvantages
- Additional tables and indices have to be created.
- More complex set of joins.
- Needs external code as neither HTM or HEALPix can
be expressed as an SQL-callable function (because
they return a variable-length array of integers).
30Indexing summary
- Indexing on just one spatial axis is simply too
inefficient for large tables and cannot support
joins. - R-trees are powerful and easy to use, but index
creation times are a serious cause for concern. - 2d?1d mapping functions such as HTM or HEALPix
are more complicated to use but may be faster in
some cases. - AstroGrid will continue work in this area.
31Idea HTM as Universal Position Locator?
- Can choose HTM (or HEALPix) resolution as high as
needed - 32-bit words pixels around 24 arcsec on a side.
- More pixels than objects in largest current
catalogue - 64-bit words pixels around 0.4 milliarcsec on a
side. - Ample precision for some time to come
- Maybe every object catalogue should list the UPL
of each celestial object, allowing fast and easy
searches and joins? - Fine for points in sky, but real objects have
finite extent, and some will correspond to a more
than one UPL - Some objects move should pixel-code change with
time? - Reference frame also moves
32UPLs for extended objects possible work-arounds
- Choose a pixel size large enough to encompass the
error region in each case - In some cases this would be very large indeed
- Different objects would have different pixel size
- Have extra rows in tables for objects crossing
pixel boundaries. - Need to alter original tables, not always
acceptable. - Add variable-length vector to table to contain
list of UPLs - Non-normalised table not allowed by most DBMS
- Introduce separate table to map object IDs to
UPLs. - Adds significant complexity to search/join
operations
33Speeding up Spatial Data Access
- Use Parallel Hardware such as Beowulf Clusters
- We may be able to spread a large table over the
disc drives of several nodes in a cluster for
faster searches. - Some support for this is built into DBMS such as
Oracle, DB2 and SQL Server. There is currently
nothing to support it in Open Source products
such as MySQL and Postgres. - Possible do-it-yourself approach split table
into separate DBMS instances on separate nodes.
Farm out query to all instances, then combine the
results. - Column-oriented storage
- Have tested Sybase-IQ performance good, but
expensive. No multi-dimensional indexing.
34Can we do Distributed Searches and Joins?
- Object catalogues of interest are located on data
archive servers all over the world. Is it
feasible to search or join without copying entire
tables? - Cone search services are becoming widespread. A
few servers allow a single search to be farmed
out to many servers, no attempt is made to merge
the results, they are just concatenated. Room
for improvement here? - An outer join requires every row in table1 to be
present in the output, so there is little point
in attempting a join over the network, when it is
simpler to copy table1 to the host holding
table2. - Perhaps need to set up astronomical data
warehouse with services to import tables and join
them.