Title: Cross-matching Catalogues / Column-based storage for data exploration
1
- Clive Page
- University of Leicester
- Meeting at ROE - 2005 January 25
- Cross-matching Catalogues
- Column-based storage for data exploration
3 Data Federation
- In order to extract full scientific value from our datasets we have to combine information from different wavebands.
- Data Federation is one of the principal aims of the VO projects, but facilities to implement it seem very slow in arriving.
- Federation can be done from images, but images are made in quite different ways in the radio, optical, and X-ray bands, so combining them is difficult.
- Usually images are searched for sources, and these form source catalogues.
4 Cross-matching Catalogues
- Catalogues are large tables, so they are normally stored in a relational database management system (DBMS).
- The principal matching criterion is approximate spatial coincidence, so the main part of the cross-matching process is a spatial join.
- Most DBMS support spatial joins with some form of spatial index:
  - DB2, Informix, MySQL, and Postgres all have spatial indexing built in (mostly using R-trees).
  - Oracle and Sybase have optional spatial data modules (R-trees).
  - Microsoft SQL Server has no support for spatial data.
- If we want a generic solution for the VO, it would be better to use a spatial join which only needs B-tree indexing.
5 Spatial Join Algorithms
- R-trees (and similar spatial indices)
  - Easy to use and quite efficient, but the syntax is not standardised.
- Sort/sweep algorithm of Dave Abel et al. (CSIRO)
  - Cannot be done within a DBMS: extract the data in binary form.
  - Only efficient for large, comparably sized tables.
- Declination Zone method of Jim Gray et al.
  - Seems to be efficient only on SQL Server.
- Pixel-code method: cover the sky with a grid of pixels
  - E.g. use HEALPix, HTM, or a simple igloo grid.
  - Works quite well on all DBMS used so far.
  - Complicated: we are joining regions (e.g. circles), not points, and regions often cross pixel boundaries.
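The pixel-code idea can be sketched in Python with a simple igloo-style grid, as mentioned on the slide. The cell size, the band/cell packing scheme, and the function name here are illustrative assumptions, not the talk's actual scheme:

```python
import math

CELL_DEG = 0.5  # illustrative cell size; a real system tunes this to the error radii

def igloo_pixel(ra_deg: float, dec_deg: float) -> int:
    """Map (RA, Dec) in degrees to an integer pixel code on a simple 'igloo'
    grid: fixed-height declination bands, each split into RA cells whose
    count shrinks towards the poles to keep cells roughly equal-area."""
    band = int((dec_deg + 90.0) / CELL_DEG)               # declination band index
    band_centre = -90.0 + (band + 0.5) * CELL_DEG
    n_ra = max(1, int(round(360.0 * math.cos(math.radians(band_centre)) / CELL_DEG)))
    ra_cell = int((ra_deg % 360.0) / (360.0 / n_ra))
    return band * 100000 + ra_cell                        # pack band and cell into one code

# Two nearby sources usually share a pixel code, so an equi-join on the code
# (which only needs a B-tree index) restricts comparison to possible matches.
print(igloo_pixel(180.2, 45.05) == igloo_pixel(180.3, 45.10))
```

Because circles of error can straddle a cell boundary, a real implementation must emit every pixel code a region overlaps, which is the complication the slide refers to.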
6 Standard Catalogue - TableS
7 Create Pixel-code table - PixelS
8 Ingest User's own table - TableU
9 Generate its pixel-table - PixelU
10 Create B-tree Indices
11 Join the four tables: loop over rows of PixelU, use indices on the other three.
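The four-table scheme of slides 6-11 can be sketched in near-standard SQL; SQLite stands in for the DBMS here, and the column names and sample values are illustrative:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE TableS (id INTEGER PRIMARY KEY, ra REAL, dec REAL);  -- standard catalogue
    CREATE TABLE PixelS (pixel INTEGER, id INTEGER);                  -- its pixel codes
    CREATE TABLE TableU (id INTEGER PRIMARY KEY, ra REAL, dec REAL);  -- user's catalogue
    CREATE TABLE PixelU (pixel INTEGER, id INTEGER);                  -- its pixel codes
    -- B-tree indices: these are what make the four-way join cheap.
    CREATE INDEX ps_pix ON PixelS (pixel);
    CREATE INDEX pu_pix ON PixelU (pixel);
""")
db.executemany("INSERT INTO TableS VALUES (?,?,?)", [(1, 180.0, 45.0), (2, 10.0, -30.0)])
db.executemany("INSERT INTO PixelS VALUES (?,?)", [(7001, 1), (3002, 2)])
db.executemany("INSERT INTO TableU VALUES (?,?,?)", [(1, 180.01, 45.01)])
db.executemany("INSERT INTO PixelU VALUES (?,?)", [(7001, 1)])

# Loop over rows of PixelU; the indices serve the other three tables.
rows = db.execute("""
    SELECT u.id, s.id
      FROM PixelU pu
      JOIN PixelS ps ON ps.pixel = pu.pixel
      JOIN TableU u  ON u.id = pu.id
      JOIN TableS s  ON s.id = ps.id
""").fetchall()
print(rows)  # candidate pairs; a final distance test still filters them
```

Only plain equi-joins and B-tree indices appear here, which is why the method ports to practically any DBMS.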
12 Pixel-code method - advantages
- Seems to be efficient enough when a small-to-medium sized table is joined to a very large one (more benchmarks needed).
- Can be done with practically standard SQL, so should be usable on all DBMS.
13 Joining Catalogues: four cases to consider
- Cone-search: find matches to a single source/position.
- User uploads own source-list to match with standard catalogues.
- User selects a subset from some standard catalogue, wants to cross-match with other large catalogues.
- Cross-match of the whole of one large catalogue with another.
14 Joining Catalogues: practical cases
- Cone-search: find matches to a single source/position.
  - Simple case, already supported by several services.
- User uploads own source-list to match with standard catalogues.
- User selects a subset from some standard catalogue, wants to cross-match with other large catalogues.
- Cross-match of the whole of one large catalogue with another.
  - Substantial project: use a sort/sweep join outside the DBMS?
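The cone-search case above reduces to an angular-separation test. A minimal sketch, with the haversine form used for numerical stability at small angles (the function names and the in-memory catalogue are illustrative, not from the talk):

```python
import math

def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees between two (RA, Dec) positions."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sd = math.sin((dec2 - dec1) / 2.0) ** 2
    sr = math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2
    return math.degrees(2.0 * math.asin(math.sqrt(sd + sr)))

def cone_search(catalogue, ra, dec, radius_deg):
    """Return rows within radius_deg of (ra, dec); catalogue is a list of
    (id, ra, dec) tuples -- a stand-in for a real catalogue table."""
    return [row for row in catalogue
            if angular_sep_deg(ra, dec, row[1], row[2]) <= radius_deg]

cat = [(1, 180.0, 45.0), (2, 180.5, 45.0), (3, 10.0, -30.0)]
print(cone_search(cat, 180.0, 45.0, 0.3))
```

A service backs the same test with a spatial or pixel-code index so that only a handful of rows is ever examined.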
15 Joining a user's Catalogue with a Standard Catalogue
- Practical problems, especially on tables uploaded by the user:
  - Support a range of physical units and layouts: radians, degrees, HMS/DMS, etc.
  - Various equinoxes, e.g. J2000, B1950, etc.
  - Error may be fixed for the whole table, or specified row by row.
  - Error regions may be circles, ellipses, or other shapes.
  - Error values may be specified as 1-sigma, 90%, etc.
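The units-and-layouts problem above starts with sexagesimal conversion. A minimal sketch (the function names are illustrative; a real ingest pipeline also has to parse the many textual layouts of HMS/DMS):

```python
def hms_to_deg(h, m, s):
    """Convert Right Ascension in hours/minutes/seconds to degrees (1 hour = 15 deg)."""
    return 15.0 * (h + m / 60.0 + s / 3600.0)

def dms_to_deg(d, m, s, sign=1):
    """Convert Declination in degrees/arcmin/arcsec to decimal degrees.
    The sign is passed separately so that values like -0d 30' convert correctly."""
    return sign * (abs(d) + m / 60.0 + s / 3600.0)

print(hms_to_deg(12, 0, 0))      # 180.0
print(dms_to_deg(45, 30, 0))     # 45.5
print(dms_to_deg(0, 30, 0, -1))  # -0.5
```

Equinox conversion (B1950 to J2000) is a genuine coordinate rotation and is best delegated to an existing astrometry library rather than reimplemented.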
16 Software Development Proposed
- A little more development of the pixel-code method, to benchmark and check efficiency with existing DBMS on a wider range of datasets.
- A web application to allow users to upload their own tables:
  - Convert them to a DBMS-compatible form.
  - Ingest into the DBMS.
  - Create the necessary pixel-code table and indices.
- Needs a way of allocating disc space for users on the server, e.g. using MySpace (AstroGrid) or MyDB (NVO).
17 Cross-matching system - Conclusions
- Fairly easy to develop the required software to handle users' tables and cross-match them with standard catalogues.
- Would provide new facilities that astronomers currently lack and which might actually get used.
- A basic system could be set up in a matter of weeks (assuming the required MySpace/MyDB facilities were available).
18 And now for something completely different
- Column-based data storage for data exploration/mining
20 Data Exploration and Data Mining
- Operations very often performed on tabular datasets.
- Source catalogues, especially, are increasing in size: all-sky surveys, large-format CCDs, higher time resolutions.
- SDSS DR3 has about 1 Terabyte of tabular data.
- Several catalogues are now 10^9 rows in length.
- The width of tables is also increasing.
21 Storage of Tables
- Tables are normally stored in a relational DBMS.
- There are two obvious ways to store tabular data:
  - Row-oriented, i.e. all fields in a row stored contiguously.
  - Column-oriented, i.e. all values in a column stored as a vector in contiguous blocks on disc.
- RDBMS are designed and optimised for transactions (insert/delete/update), so they almost always use row-oriented storage.
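The two layouts can be sketched with the standard library alone; the tiny three-row table and its (id, bmag, vmag) schema are illustrative:

```python
from array import array
import struct

# Row-oriented: every field of each record packed together (as an RDBMS does).
rows = [(1, 19.5, 18.2), (2, 21.0, 20.1), (3, 17.3, 17.9)]   # (id, bmag, vmag)
row_store = b"".join(struct.pack("<idd", *r) for r in rows)  # 20 bytes per row

# Column-oriented: each column is a contiguous vector.
ids   = array("i", [r[0] for r in rows])
bmags = array("d", [r[1] for r in rows])
vmags = array("d", [r[2] for r in rows])

# A scan that needs only bmag reads 8 bytes per row from the column store,
# but the full 20 bytes per row from the row store; for real catalogues with
# hundreds of columns the ratio, and hence the I/O saving, is far larger.
print(len(bmags.tobytes()), len(row_store))
```

This byte-count argument is exactly why column-oriented storage wins for scan-heavy, read-mostly workloads and loses for row-at-a-time transactions.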
22 Can we keep up with Moore's Law?
- Astronomical data volume, processing power, and disc storage are doubling about every 2 years.
- I/O bandwidth and seek times are improving much more slowly (around 10% per year).
- Increasingly we want to manipulate datasets which are too large to be held in memory (even though memory sizes are also increasing), so disc speeds are very important.
- Conclusion: I/O is becoming a more serious bottleneck.
23 Query Types
- Index-accelerated queries
  - Selection on a value (or range of values) in an indexed column.
- Sequential scans are necessary when:
  - Creating an index
  - Populating new columns, e.g. using UPDATE
  - Selecting on an expression, e.g. SELECT ... FROM table WHERE (bmag - vmag) > 0.5
  - Computing statistics
  - Joining one table with another (must scan at least one table; can use an index on the other)
  - Selecting where index selectivity is too small (a random seek takes ~10 ms; 500 disc blocks can be scanned in that time).
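The selectivity point above is simple arithmetic, sketched here with the slide's own order-of-magnitude figures (the function name and the cost model, which ignores caching and read-ahead, are illustrative):

```python
# ~10 ms per random seek; enough bandwidth to scan ~500 disc blocks in that time.
SEEK_MS = 10.0
BLOCKS_PER_SEEK_TIME = 500

def prefer_index(selected_blocks: int, total_blocks: int) -> bool:
    """An index wins only when its random seeks cost less than one full scan."""
    seek_cost_ms = selected_blocks * SEEK_MS
    scan_cost_ms = total_blocks / BLOCKS_PER_SEEK_TIME * SEEK_MS
    return seek_cost_ms < scan_cost_ms

# With 1,000,000 blocks the break-even is 2,000 selected blocks (0.2%):
print(prefer_index(1000, 1_000_000))    # True:  10 s of seeks vs a 20 s scan
print(prefer_index(10_000, 1_000_000))  # False: 100 s of seeks vs a 20 s scan
```

So an index helps only for very selective queries; everything else degenerates into the sequential scans the next slide times.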
24 Query Timing
- Index-accelerated queries
  - Can return results in as little as 20 ms, the time for a couple of disc seeks.
- Sequential scans of a table
  - Time depends (obviously) on table length.
  - For tables of a billion rows, any scan operation takes at least 30 minutes.
  - Highly desirable to speed these up so they can be interactive, not batch, operations.
- For queries which scan large tables but only access a few columns out of many, row-based storage is very inefficient; column-oriented storage would be much faster.
25 Systems using column-oriented tables
- Sybase-IQ: the best-known commercial product, from the makers of the Sybase-ASE relational database. Fast, but:
  - No Linux version yet
  - No spatial indexing
  - Very expensive licence costs
- ESO/MIDAS table system
- STSDAS table system (part of IRAF)
- FITS files can (just) store data in column-based form, but such files are too peculiar for most FITS utilities to cope with.
26 Using HDF5 for column-oriented storage
- Hierarchical Data Format 5
  - Comes from NCSA; well-supported library and tools.
  - Compatible with Globus, GridFTP.
  - Designed for efficient access to big files (no 2 GB limit).
  - Simple API (bindings to C, Fortran90, Java, Python).
  - Can attach unlimited metadata to tables and columns.
- A basic prototype was constructed and benchmarked:
  - Used the query parser and evaluator originally written for Starlink's CURSA around 1992.
  - Around a 10% sample of the 2MASS catalogue was put into HDF5 column-based format.
  - Queries were found to be 13 to 50 times faster using HDF5 than with the fastest RDBMS tried (MySQL).
27 Development of an HDF5-based data explorer?
- HDF5 library: reliable, efficient, easy to use.
- Query parser and evaluator: CURSA, or a Java library.
- Graphics package: a great variety of them available.
- Statistics routines: many available.
- Indexing: many B-tree libraries available.
- Spatial indexing: free R-tree code exists, or use pixel-code methods.
- Integration with the VO for authentication, MySpace, etc.
- Provide as a stand-alone package, and as a Web Service.
- User interface allowing iterative filtering, undo, etc.
  - Probably the most difficult part of the design.
28 Additional Advantages
- Makes distributed cross-match more feasible: only the positional columns need be transferred from one site to another.
- Simple API, so users with their own data-mining code (clustering, classification, etc.) can read HDF5 datasets directly.
- Easy to add access to tables in other formats such as FITS, VOTable, CSV, or (using JDBC) those in other DBMS.
- Can provide the user with integrated graphics, statistics, and a host of other desirable data-exploration features missing from commercial RDBMS.
29 Column-oriented Data Explorer - Conclusions
- Developing a column-oriented data storage system using HDF5 is not a trivial exercise, but it does not appear to be too difficult.
- It is not intended to be a general-purpose DBMS; it should be seen as complementing the DBMS in a data-archive system or astronomical data warehouse.
- Storing the same data in both row-oriented and column-oriented form may be seen as wasteful, but disc is now cheap.