Title: Cross-matching Catalogues / Column-based storage for data exploration
1
- Clive Page
- University of Leicester
- Meeting at ROE - 2005 January 25
- Cross-matching Catalogues
- Column-based storage for data exploration
3 Data Federation
- In order to extract full scientific value from our datasets we have to combine information from different wavebands.
- Data Federation is one of the principal aims of the VO projects, but facilities to implement it seem very slow in arriving.
- Federation can be done from images, but images are made in quite different ways in the radio, optical, and X-ray bands, so combining them is difficult.
- Usually images are searched for sources, and these form source catalogues.
4 Cross-matching Catalogues
- Catalogues are large tables, so they are normally stored in a relational database management system (DBMS).
- The principal matching criterion is approximate spatial coincidence, so the main part of the cross-matching process is a spatial join.
- Most DBMS support spatial joins with some form of spatial index:
  - DB2, Informix, MySQL, and Postgres all have spatial indexing built in (mostly using R-trees).
  - Oracle and Sybase have optional spatial data modules (R-trees).
  - Microsoft SQL Server has no support for spatial data.
- If we want a generic solution for the VO, it would be better to use a spatial join which only needs B-tree indexing.
5 Spatial Join Algorithms
- R-trees (and similar spatial indices)
  - Easy to use and quite efficient, but the syntax is not standardised.
- Sort/sweep algorithm of Dave Abel et al. (CSIRO)
  - Cannot be done within a DBMS: extract the data in binary form.
  - Only efficient for large, comparably sized tables.
- Declination Zone method of Jim Gray et al.
  - Seems to be efficient only on SQL Server.
- Pixel-code method: cover the sky with a grid of pixels
  - E.g. use HEALPix, HTM, or a simple igloo grid.
  - Works quite well on all DBMS used so far.
  - Complicated: we are joining regions (e.g. circles), not points, and regions often cross pixel boundaries.
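The pixel-code idea can be sketched in Python with a simple igloo-style grid, as mentioned on the slide. The cell size, the band/cell packing scheme, and the function name here are illustrative assumptions, not the talk's actual scheme:

```python
import math

CELL_DEG = 0.5  # illustrative cell size; a real system tunes this to the error radii

def igloo_pixel(ra_deg: float, dec_deg: float) -> int:
    """Map (RA, Dec) in degrees to an integer pixel code on a simple 'igloo'
    grid: fixed-height declination bands, each split into RA cells whose
    count shrinks towards the poles to keep cells roughly equal-area."""
    band = int((dec_deg + 90.0) / CELL_DEG)               # declination band index
    band_centre = -90.0 + (band + 0.5) * CELL_DEG
    n_ra = max(1, int(round(360.0 * math.cos(math.radians(band_centre)) / CELL_DEG)))
    ra_cell = int((ra_deg % 360.0) / (360.0 / n_ra))
    return band * 100000 + ra_cell                        # pack band and cell into one code

# Two nearby sources usually share a pixel code, so an equi-join on the code
# (which only needs a B-tree index) restricts comparison to possible matches.
print(igloo_pixel(180.2, 45.05) == igloo_pixel(180.3, 45.10))
```

Because circles of error can straddle a cell boundary, a real implementation must emit every pixel code a region overlaps, which is the complication the slide refers to.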
6 Standard Catalogue - TableS
7 Create Pixel-code table - PixelS
8 Ingest User's own table - TableU
9 Generate its pixel-table - PixelU
10 Create B-tree Indices
11 Join the four tables: loop over rows of PixelU, use indices on the other three.
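The four-table scheme of slides 6-11 can be sketched in near-standard SQL; SQLite stands in for the DBMS here, and the column names and sample values are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE TableS (id INTEGER PRIMARY KEY, ra REAL, dec REAL);  -- standard catalogue
    CREATE TABLE PixelS (pixel INTEGER, id INTEGER);                  -- its pixel codes
    CREATE TABLE TableU (id INTEGER PRIMARY KEY, ra REAL, dec REAL);  -- user's catalogue
    CREATE TABLE PixelU (pixel INTEGER, id INTEGER);                  -- its pixel codes
    -- B-tree indices: these are what make the four-way join cheap.
    CREATE INDEX ps_pix ON PixelS (pixel);
    CREATE INDEX pu_pix ON PixelU (pixel);
""")
db.executemany("INSERT INTO TableS VALUES (?,?,?)", [(1, 180.0, 45.0), (2, 10.0, -30.0)])
db.executemany("INSERT INTO PixelS VALUES (?,?)", [(7001, 1), (3002, 2)])
db.executemany("INSERT INTO TableU VALUES (?,?,?)", [(1, 180.01, 45.01)])
db.executemany("INSERT INTO PixelU VALUES (?,?)", [(7001, 1)])

# Loop over rows of PixelU; the indices serve the other three tables.
rows = db.execute("""
    SELECT u.id, s.id
      FROM PixelU pu
      JOIN PixelS ps ON ps.pixel = pu.pixel
      JOIN TableU u  ON u.id = pu.id
      JOIN TableS s  ON s.id = ps.id
""").fetchall()
print(rows)  # candidate pairs; a final distance test still filters them
```

Only plain equi-joins and B-tree indices appear here, which is why the method ports to practically any DBMS.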
12 Pixel-code method - advantages
- Seems to be efficient enough when a small-to-medium sized table is joined to a very large one (more benchmarks needed).
- Can be done with practically standard SQL, so should be usable on all DBMS.
13 Joining Catalogues: four cases to consider
- Cone-search: find matches to a single source/position.
- User uploads own source-list to match with standard catalogues.
- User selects a subset from some standard catalogue, wants to cross-match with other large catalogues.
- Cross-match of the whole of one large catalogue with another.
14 Joining Catalogues: practical cases
- Cone-search: find matches to a single source/position.
  - Simple case, already supported by several services.
- User uploads own source-list to match with standard catalogues.
- User selects a subset from some standard catalogue, wants to cross-match with other large catalogues.
- Cross-match of the whole of one large catalogue with another.
  - Substantial project: use a sort/sweep join outside the DBMS?
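The cone-search case above reduces to an angular-separation test. A minimal sketch, with the haversine form used for numerical stability at small angles (the function names and the in-memory catalogue are illustrative, not from the talk):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sd = math.sin((dec2 - dec1) / 2.0) ** 2
    sr = math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(sd + sr)))

def cone_search(catalogue, ra, dec, radius_deg):
    """Return rows within radius_deg of (ra, dec); catalogue is a list of
    (id, ra, dec) tuples -- a stand-in for a real catalogue table."""
    return [row for row in catalogue
            if angular_sep_deg(ra, dec, row[1], row[2]) <= radius_deg]

cat = [(1, 180.0, 45.0), (2, 180.5, 45.0), (3, 10.0, -30.0)]
print(cone_search(cat, 180.0, 45.0, 0.3))
```

A service backs the same test with a spatial or pixel-code index so that only a handful of rows is ever examined.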
15 Joining a user's Catalogue with a Standard Catalogue
- Practical problems, especially on tables uploaded by the user:
  - Support a range of physical units and layouts: radians, degrees, HMS/DMS, etc.
  - Various equinoxes, e.g. J2000, B1950, etc.
  - Error may be fixed for the whole table, or specified row by row.
  - Error regions may be circles, ellipses, or other shapes.
  - Error values may be specified as 1-sigma, 90%, etc.
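The units-and-layouts problem above starts with sexagesimal conversion. A minimal sketch (the function names are illustrative; a real ingest pipeline also has to parse the many textual layouts of HMS/DMS):

```python
def hms_to_deg(h, m, s):
    """Convert Right Ascension in hours/minutes/seconds to degrees (1 hour = 15 deg)."""
    return 15.0 * (h + m / 60.0 + s / 3600.0)

def dms_to_deg(d, m, s, sign=1):
    """Convert Declination in degrees/arcmin/arcsec to decimal degrees.
    The sign is passed separately so that values like -0d 30' convert correctly."""
    return sign * (abs(d) + m / 60.0 + s / 3600.0)

print(hms_to_deg(12, 0, 0))      # 180.0
print(dms_to_deg(45, 30, 0))     # 45.5
print(dms_to_deg(0, 30, 0, -1))  # -0.5
```

Equinox conversion (B1950 to J2000) is a genuine coordinate rotation and is best delegated to an existing astrometry library rather than reimplemented.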
16 Software Development Proposed
- A little more development of the pixel-code method, to benchmark and check efficiency with existing DBMS on a wider range of datasets.
- A web application to allow users to upload their own tables:
  - Convert them to a DBMS-compatible form.
  - Ingest into the DBMS.
  - Create the necessary pixel-code table and indices.
- Needs a way of allocating disc space for users on the server, e.g. using MySpace (AstroGrid) or MyDB (NVO).
17 Cross-matching system - Conclusions
- Fairly easy to develop the required software to handle users' tables and cross-match them with standard catalogues.
- Would provide new facilities that astronomers currently lack and which might actually get used.
- A basic system could be set up in a matter of weeks (assuming the required MySpace/MyDB facilities were available).
18 And now for something completely different
- Column-based data storage for data exploration/mining
20 Data Exploration and Data Mining
- Operations very often performed on tabular datasets.
- Source catalogues, especially, are increasing in size: all-sky surveys, large-format CCDs, higher time resolutions.
- SDSS DR3 has about 1 Terabyte of tabular data.
- Several catalogues are now 10^9 rows in length.
- The width of tables is also increasing.
21 Storage of Tables
- Tables are normally stored in a relational DBMS.
- There are two obvious ways to store tabular data:
  - Row-oriented, i.e. all fields in a row stored contiguously.
  - Column-oriented, i.e. all values in a column stored as a vector in contiguous blocks on disc.
- RDBMS are designed and optimised for transactions (insert/delete/update), so they almost always use row-oriented storage.
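The two layouts can be sketched with the standard library alone; the tiny three-row table and its (id, bmag, vmag) schema are illustrative:

```python
from array import array
import struct

# Row-oriented: every field of each record packed together (as an RDBMS does).
rows = [(1, 19.5, 18.2), (2, 21.0, 20.1), (3, 17.3, 17.9)]   # (id, bmag, vmag)
row_store = b"".join(struct.pack("<idd", *r) for r in rows)  # 20 bytes per row

# Column-oriented: each column is a contiguous vector.
ids   = array("i", [r[0] for r in rows])
bmags = array("d", [r[1] for r in rows])
vmags = array("d", [r[2] for r in rows])

# A scan that needs only bmag reads 8 bytes per row from the column store,
# but the full 20 bytes per row from the row store; for real catalogues with
# hundreds of columns the ratio, and hence the I/O saving, is far larger.
print(len(bmags.tobytes()), len(row_store))
```

This byte-count argument is exactly why column-oriented storage wins for scan-heavy, read-mostly workloads and loses for row-at-a-time transactions.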
22 Can we keep up with Moore's Law?
- Astronomical data volume, processing power, and disc storage are doubling about every 2 years.
- I/O bandwidth and seek times are improving much more slowly (around 10% per year).
- Increasingly we want to manipulate datasets which are too large to be held in memory (even though memory sizes are also increasing), so disc speeds are very important.
- Conclusion: I/O is becoming a more serious bottleneck.
23 Query Types
- Index-accelerated queries
  - Selection on a value (or range of values) in an indexed column.
- Sequential scans are necessary when:
  - Creating an index
  - Populating new columns, e.g. using UPDATE
  - Selecting on an expression, e.g. SELECT ... FROM table WHERE (bmag - vmag) > 0.5
  - Computing statistics
  - Joining one table with another (must scan at least one table; can use an index on the other)
  - Selecting where index selectivity is too small (a random seek takes ~10 ms; 500 disc blocks can be scanned in that time).
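The selectivity point above is simple arithmetic, sketched here with the slide's own order-of-magnitude figures (the function name and the cost model, which ignores caching and read-ahead, are illustrative):

```python
# ~10 ms per random seek; enough bandwidth to scan ~500 disc blocks in that time.
SEEK_MS = 10.0
BLOCKS_PER_SEEK_TIME = 500

def prefer_index(selected_blocks: int, total_blocks: int) -> bool:
    """An index wins only when its random seeks cost less than one full scan."""
    seek_cost_ms = selected_blocks * SEEK_MS
    scan_cost_ms = total_blocks / BLOCKS_PER_SEEK_TIME * SEEK_MS
    return seek_cost_ms < scan_cost_ms

# With 1,000,000 blocks the break-even is 2,000 selected blocks (0.2%):
print(prefer_index(1000, 1_000_000))    # True:  10 s of seeks vs a 20 s scan
print(prefer_index(10_000, 1_000_000))  # False: 100 s of seeks vs a 20 s scan
```

So an index helps only for very selective queries; everything else degenerates into the sequential scans the next slide times.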
24 Query Timing
- Index-accelerated queries
  - Can return results in as little as 20 ms, the time for a couple of disc seeks.
- Sequential scans of a table
  - Time depends (obviously) on table length.
  - For tables of a billion rows, any scan operation takes at least 30 minutes.
  - Highly desirable to speed these up so they can be interactive, not batch, operations.
- For queries which scan large tables but only access a few columns out of many, row-based storage is very inefficient; column-oriented storage would be much faster.
25 Systems using column-oriented tables
- Sybase-IQ: the best-known commercial product, from the makers of the Sybase-ASE relational database. Fast, but:
  - No Linux version yet
  - No spatial indexing
  - Very expensive licence costs
- ESO/MIDAS table system
- STSDAS table system (part of IRAF)
- FITS files can (just) store data in column-based form, but such files are too peculiar for most FITS utilities to cope with.
26 Using HDF5 for column-oriented storage
- Hierarchical Data Format 5
  - Comes from NCSA; well-supported library and tools.
  - Compatible with Globus, GridFTP.
  - Designed for efficient access to big files (no 2 GB limit).
  - Simple API (bindings to C, Fortran90, Java, Python).
  - Can attach unlimited metadata to tables and columns.
- A basic prototype was constructed and benchmarked:
  - Used the query parser and evaluator originally written for Starlink's CURSA around 1992.
  - Around a 10% sample of the 2MASS catalogue was put into HDF5 column-based format.
  - Queries were found to be 13 to 50 times faster using HDF5 than with the fastest RDBMS tried (MySQL).
27 Development of an HDF5-based data explorer?
- HDF5 library: reliable, efficient, easy to use.
- Query parser and evaluator: CURSA, or a Java library.
- Graphics package: a great variety of them available.
- Statistics routines: many available.
- Indexing: many B-tree libraries available.
- Spatial indexing: free R-tree code exists, or use pixel-code methods.
- Integration with the VO for authentication, MySpace, etc.
- Provide as a stand-alone package, and as a Web Service.
- User interface allowing iterative filtering, undo, etc.
  - Probably the most difficult part of the design.
28 Additional Advantages
- Makes distributed cross-match more feasible: only the positional columns need be transferred from one site to another.
- Simple API, so users with their own data-mining code (clustering, classification, etc.) can read HDF5 datasets directly.
- Easy to add access to tables in other formats such as FITS, VOTable, CSV, or (using JDBC) those in other DBMS.
- Can provide the user with integrated graphics, statistics, and a host of other desirable data-exploration features missing from commercial RDBMS.
29 Column-oriented Data Explorer - Conclusions
- Developing a column-oriented data storage system using HDF5 is not a trivial exercise, but it does not appear to be too difficult.
- It is not intended to be a general-purpose DBMS; it should be seen as complementing the DBMS in a data-archive system or astronomical data warehouse.
- Storing the same data in both row-oriented and column-oriented form may be seen as wasteful, but disc is now cheap.