Spatial Indexing - PowerPoint PPT Presentation

1 / 34
About This Presentation
Title:

Spatial Indexing

Description:

... Mesh invented at Johns Hopkins University now used by several projects. ... Possible do-it-yourself approach: split table into separate DBMS instances ... – PowerPoint PPT presentation

Number of Views:84
Avg rating:3.0/5.0
Slides: 35
Provided by: GilesMur7
Category:
Tags: indexing | spatial

less

Transcript and Presenter's Notes

Title: Spatial Indexing


1
Spatial Indexing
  • Clive Page

2
(No Transcript)
3
Formats of Raw Data
  • Radio
  • Complex visibility for each polarisation at set
    of points sampling the complex (u,v) plane.
  • Infra-red, Optical, Ultra-violet
  • Images from 1k1k to 18k20k, collected at
    intervals of a few seconds to several minutes.
  • X-ray, Gamma-ray
  • Photon-even lists get properties (x, y, time,
    energy) for each detected photon. May get a list
    of millions of these over an integration period
    of a few hours.

4
Formats of Reduced Data
  • Images
  • Time-series
  • Spectra
  • Catalogues of astronomical objects
  • Vital to cross-identify objects from different
    wavebands, basis for many subsequent data mining
    investigations.
  • Problem tables can be long or wide

Catalogue Rows Columns
Optical USNO-B 1,045,913,669 30
Infra-red 2MASS 470,992,970 60
X-ray 1XMM 56,711 379
5
Main Functionality
  • SELECT objects in a given small patch of sky
  • In a rectangle to cover same region as an image
  • In a circle to cover a radius around point of
    interest (also known as search in a cone).
  • In a polygon e.g. around extended object.
  • Spatial JOIN Cross-match objects from, for
    example, two wavebands or two epochs.
  • Principal matching criterion is overlap of
    error-circles.
  • Often important to find objects in one table
    which are NOT matched in the other need a left
    outer join.

6
Functionality (continued)
  • Self JOIN useful in some applications, for
    example
  • Search for clusters of galaxies
  • Search for double stars
  • Find spatial distribution or spatial correlation
    functions for particular classes of objects
  • Those observing with adaptive optics need a
    suitable reference star in or near each field,
    e.g. might want to find all AGNs near to bright
    stars to plan an observing campaign.

7
Current on-line services
  • Search in a cone many on-line services
  • Cross-matching (spatial JOIN) few current
    examples
  • Astrobrowse (at GSFC) has pre-computed joins for
    a limited number of important catalogues.
  • Vizier service (CDS) allows cross-match of users
    own table of positions with list of catalogues.
  • But results of list of N sources and M catalogues
    are presented as unmerged list of MN separate
    tables.
  • Skyserver joins between SDSS and other tables.
  • Self-join do not know of any on-line services at
    present.

8
Problems handling object catalogues
  • Positions use spherical-polar coordinates (RA,
    Dec)
  • Right Ascension corresponds to geographic
    longitude
  • Declination corresponds to geographic latitude
  • There are singularities at the poles and
    distortions in the scales everywhere except at
    the equator.
  • RA wraps from 24 hours (360 degrees) to zero.
  • All object positions are imprecise ? positions
    have an error radius.
  • Distances between points must use a great-circle
    distance function not cartesian distance.
  • Two-dimensional indexing is really needed

9
Indexing Possibilities
  1. Home-brew slicing of the sky
  2. Use B-tree on one spatial axis only
  3. Use 1-d to 2-d mapping function then simple
    B-tree
  4. Use true spatial index such as R-tree.

10
Home-brew slicing of sky
  • Both USNO-B and 2MASS issued as separate files,
    each covering 0.1 degree in declination, sorted
    in RA within each strip.
  • Software from Harvard-Smithsonian Center for
    Astrophysics (WCSTOOLS) allows fairly efficient
    access to data stored in these strips.
  • WFCSTOOLS is used by many astronomical archives
    around the world, (including LEDAS
    www.ledas.ac.uk).
  • Ok for cone-search services, not suitable for
    spatial joins.

11
Index one spatial axis only
  • Widely used, including Vizier collection of 3000
    astronomical tables.
  • Index on declination (avoids RA wrap-around
    problem)
  • Poor (but acceptable) performance on cone-search
  • E.g. consider USNO-B a table of a billion rows
  • Typical search/join uses a radius of say 3
    arc-seconds.
  • 17 chance of finding a match at a random
    position.
  • Index on declination effectively searches a strip
    360 x 6 arc-seconds get around 10,000 rows
    matching. Need to check all these to find the (0
    or 1) true matches.
  • Conclusion might gain five orders of magnitude
    in efficiency by using a true 2-d index.

12
One-dimensional index spatial join problem
  • Very inefficient, and efficiency is needed when
    joining one large table with another
  • Join criterion is, typically
  • SELECT FROM cat1,cat2 WHERE
  • ABS(dec1 - dec2) lt combinedError
  • AND
  • I have not yet found any DBMS with an optimiser
    which uses an index when confronted with a join
    criterion of this form (or indeed any other
    expression for joining on an overlap of error
    ranges).

13
Mapping functions (2-d to 1-d)
  • Cover the space with cells (pixels) and number
    them.
  • Create conventional B-tree on resulting set of
    integers.
  • Each point in the sky maps to a single integer.
  • An area maps to a set of integers.
  • A seductive idea
  • If a small spatial area maps to a smallish range
    of integers, then a search over a spatial area
    might be done by a B-tree search over a small
    integer range.
  • Various space-filling curves have been used in
    the hope that this works in practice, e.g.
    Z-order index, Hilbert, Peano curves, and many
    others.

14
Hilbert Curve
15
Z-order (bit-interleaved) Mapping Function
16
Space-filling Curves as Mapping Functions
  • Excellent performance when searching for single
    points.
  • For area searches, the median performance is
    adequate, but all curves suffer the same
    drawback
  • There exist cases in which nearby points in space
    are very far apart on the curve (in the Z-order
    index this corresponds to a high-order bit
    flipping).
  • Performance tests confirm this defect the
    worst-case performance is so abysmal that the
    average performance is very poor.
  • Another problem simple cartesian grids also
    unsuited to spherical-polar coordinate searches
    as there are too many tiny distorted pixels near
    the poles.

17
Better Mapping Functions
  • Aim cover sky uniformly with pixels - integer
    pixel-codes.
  • HEALPix - Hierarchical Equal Area iso-Latitude
    Pixelisation invented at ESO for COBE.
  • Pseudo-square pixels, 2-d arrays of pixels
    suitable for analysis of large-scale spatial
    structures.
  • HTM - Hierarchical Triangular Mesh invented at
    Johns Hopkins University now used by several
    projects.
  • Triangular pixels.
  • Hierarchical numbering high-order bits are a
    valid HTM index for coarser grid.
  • Both algorithms perfectly good when searching for
    points.

18
Hierarchical Equal Area iso-Latitude Pixelisation
(HEALPix)
19
Hierarchical Triangular Mesh (HTM)
20
Spatial Join using Pixel-code Method
  • Given table CAT1 with columns
  • ID1 primary key
  • RA
  • DEC
  • POSERR
  • MAGNITUDE
  • etc
  • Given table CAT2 with columns
  • ID2 primary key
  • RA
  • DEC
  • POSERR
  • FLUX
  • etc

21
Create tables P1, P2 with pixel-code column
  • Given table CAT1 with columns
  • ID1 primary key
  • RA
  • DEC
  • POSERR
  • MAGNITUDE
  • etc
  • Create table P1
  • ID1 foreign key
  • PCODE1 primary key
  • Note P1 and P2 have extra rows where
    error-circle overlaps two or more pixels.
  • Given table CAT2 with columns
  • ID2 primary key
  • RA
  • DEC
  • POSERR
  • FLUX
  • etc
  • Create table P2
  • ID2 foreign key
  • PCODE2 primary key

22
Join P1 and P2 on pixel-codes creating PJOIN
  • Given table CAT1 with columns
  • ID1 primary key
  • RA
  • DEC
  • POSERR
  • MAGNITUDE
  • etc
  • Create table P1
  • ID1 foreign key
  • PCODE1 primary key
  • Create table PJOIN by joining P1 and P2 on
    PCODE1PCODE2
  • ID1
  • ID2
  • Given table CAT2 with columns
  • ID2 primary key
  • RA
  • DEC
  • POSERR
  • FLUX
  • etc
  • Create table P2
  • ID2 foreign key
  • PCODE2 primary key

23
PJOIN table
  • Join using SELECT DISTINCT to remove any
    duplicates (two error-circles may each overlap
    two or more pixels).
  • Table PJOIN identifies all pixels where
    error-circles potentially overlap
  • Circles may or may not actually overlap, may just
    be nearby in the same pixels.
  • Next step create B-tree index on PJOIN(ID1)

24
Use PJOIN table to match catalogue rows
  • Three-way join then produces required results,
    e.g.
  • SELECT cols FROM CAT1, PJOIN, CAT2
  • WHERE CAT1.ID1PJOIN.ID1
  • AND PJOIN.ID2CAT2.ID2
  • AND (2 asin(sqrt(pow(sin((cat1.dec-
    cat2.dec)/2),2) cos(cat1.dec) cos(cat2.dec)
    pow(sin((cat1.ra- cat2.ra)/2),2)))
  • lt cat1.poserrcat2.poserr)
  • This works, and speed appears good, but more
    testing is needed on a wider range of datasets.

25
True Spatial Indexing
  • Hot topic of research in computer science
    departments for more than 20 years
  • Very many algorithms have been proposed
  • BANG file, BV-tree, Buddy tree, Cell tree,
    G-tree, GBD-tree, Gridfile, hB-tree, kd-tree,
    LSD-tree, P-tree, PK-tree, PLOP hashing, Pyramid
    tree, Q0-tree, Quadtree, R-tree, SKD-tree,
    SR-tree, SS-tree, TV-tree, UB-tree.
  • So many alternatives, but none of them has
    properties as good as the B-tree in one dimension
    (e.g. compact and efficient, with fairly good
    worst-case performance).
  • R-tree one of the earliest structures, one of the
    best.
  • R-trees are built into several modern DBMS.

26
Spatial Options in current DBMS
DBMS Spatial Index
DB2 Spatial Extender multi-level grid file
Informix R-tree indexing included
Ingres
Interbase/Firebird
Microsoft SQL Server
MySQL R-trees on arbitrary polygons
Microsoft SQL Server
Oracle Spatial option R-trees
Postgres R-trees on rectangular boxes
Sybase Spatial Option (Boeing SQS) R-trees
27
Using R-trees
  • Must draw a rectangular box outside each error
    circle
  • Boxes get rather extended (along RA axis) near
    poles
  • Need a subsequent filter to remove spurious
    matches where rectangles overlap but
    error-circles do not.
  • If the error criterion alters (e.g. user wants
    99 probability circle rather than 90) need to
    recreate column of boxes, and then recreate the
    R-tree index.
  • Solution always use a box as large as anyone
    could possible want, subsequent filtering on
    error-circles is still quite cheap.

28
R-tree Performance
  • Postgres R-tree indexing works as advertised
  • R-trees are large, creation is slow, e.g. 2 hours
    for table of 3.5 million rows.
  • MySQL latest version allows R-trees on any
    polygon (but anything above 4 sides is wasteful).
  • Very verbose external data format
  • Works as advertised, not yet measured
    performance.
  • Informix Kalpakis et al. (ADASS conference
    report) loaded part of USNO-A2 and found data
    load and R-tree creation would have taken 39 days
    for the entire table of 500 million rows.

29
Comparison of Pixel-code and R-tree Methods
  • Advantages
  • Pcode join seems to be faster (but not yet
    benchmarked with identical systems).
  • Takes up less disc space in total.
  • Can use any DBMS, not just those with an R-tree
    or other spatial data option.
  • Disadvantages
  • Additional tables and indices have to be created.
  • More complex set of joins.
  • Needs external code as neither HTM or HEALPix can
    be expressed as an SQL-callable function (because
    they return a variable-length array of integers).

30
Indexing summary
  • Indexing on just one spatial axis is simply too
    inefficient for large tables and cannot support
    joins.
  • R-trees are powerful and easy to use, but index
    creation times are a serious cause for concern.
  • 2d?1d mapping functions such as HTM or HEALPix
    are more complicated to use but may be faster in
    some cases.
  • AstroGrid will continue work in this area.

31
Idea HTM as Universal Position Locator?
  • Can choose HTM (or HEALPix) resolution as high as
    needed
  • 32-bit words pixels around 24 arcsec on a side.
  • More pixels than objects in largest current
    catalogue
  • 64-bit words pixels around 0.4 milliarcsec on a
    side.
  • Ample precision for some time to come
  • Maybe every object catalogue should list the UPL
    of each celestial object, allowing fast and easy
    searches and joins?
  • Fine for points in sky, but real objects have
    finite extent, and some will correspond to a more
    than one UPL
  • Some objects move should pixel-code change with
    time?
  • Reference frame also moves

32
UPLs for extended objects possible work-arounds
  • Choose a pixel size large enough to encompass the
    error region in each case
  • In some cases this would be very large indeed
  • Different objects would have different pixel size
  • Have extra rows in tables for objects crossing
    pixel boundaries.
  • Need to alter original tables, not always
    acceptable.
  • Add variable-length vector to table to contain
    list of UPLs
  • Non-normalised table not allowed by most DBMS
  • Introduce separate table to map object IDs to
    UPLs.
  • Adds significant complexity to search/join
    operations

33
Speeding up Spatial Data Access
  • Use Parallel Hardware such as Beowulf Clusters
  • We may be able to spread a large table over the
    disc drives of several nodes in a cluster for
    faster searches.
  • Some support for this is built into DBMS such as
    Oracle, DB2 and SQL Server. There is currently
    nothing to support it in Open Source products
    such as MySQL and Postgres.
  • Possible do-it-yourself approach split table
    into separate DBMS instances on separate nodes.
    Farm out query to all instances, then combine the
    results.
  • Column-oriented storage
  • Have tested Sybase-IQ performance good, but
    expensive. No multi-dimensional indexing.

34
Can we do Distributed Searches and Joins?
  • Object catalogues of interest are located on data
    archive servers all over the world. Is it
    feasible to search or join without copying entire
    tables?
  • Cone search services are becoming widespread. A
    few servers allow a single search to be farmed
    out to many servers, no attempt is made to merge
    the results, they are just concatenated. Room
    for improvement here?
  • An outer join requires every row in table1 to be
    present in the output, so there is little point
    in attempting a join over the network, when it is
    simpler to copy table1 to the host holding
    table2.
  • Perhaps need to set up astronomical data
    warehouse with services to import tables and join
    them.
Write a Comment
User Comments (0)
About PowerShow.com