Support range of physical units and layouts. Radians, degre PowerPoint PPT Presentation

presentation player overlay
1 / 29
About This Presentation
Transcript and Presenter's Notes

Title: Support range of physical units and layouts. Radians, degre


1
  • Clive Page
  • University of Leicester
  • Meeting at ROE - 2005 January 25
  • Cross-matching Catalogues
  • Column-based storage for data exploring

2
(No Transcript)
3
Data Federation
  • In order to extract full scientific value from
    our datasets we have to combine information from
    different wavebands.
  • Data Federation is one of the principal aims of
    the VO projects but facilities to implement it
    seem to be very slow in arriving.
  • Federation can be done from images, but images
    are made in quite different ways in radio,
    optical, and X-ray bands, so combining them is
    difficult.
  • Usually images are searched for sources and these
    make source catalogues.

4
Cross-matching Catalogues
  • Catalogues are large tables, so normally stored
    in a relational database management system.
  • The principal matching criterion is approximate
    spatial coincidence, so the main part of the
    cross-matching process is a spatial join.
  • Most DBMS support spatial joins with some form of
    spatial index-
  • DB2, Informix, MySQL, Postgres all have spatial
    indexing built-in (mostly using R-trees).
  • Oracle, Sybase have optional spatial data
    modules (R-trees)
  • Microsoft SQL Server no support for spatial
    data.
  • If we want a generic solution for the VO it would
    be better to use a spatial join which only needs
    B-tree indexing.

5
Spatial Join Algorithms
  • R-trees (and similar spatial indices)
  • Easy to use, quite efficient, but syntax not
    standardised.
  • Sort/sweep algorithm of Dave Abel et al (CSIRO)
  • Cannot be done in DBMS extract data in binary
    form
  • Only efficient for large comparably sized tables.
  • Declination Zone method of Jim Grey et al
  • Seems to be efficient only on SQL Server.
  • Pixel-code method cover sky with grid of pixels
  • E.g. use HEALPix, HTM, or simple igloo grid.
  • Works quite well on all DBMS used so far.
  • Complicated we are joining regions (e.g.
    circles) not points, and regions often cross
    pixel-boundaries.

6
Standard Catalogue - TableS

7
Create Pixel-code table - PixelS

8
Ingest Users own table - TableU

9
Generate its pixel-table - PixelU

10
Create B-tree Indices

11
Join the four tables loop over rows of PixelU,
use indices on other three.

12
Pixel-code method - advantages
  • Seems to be efficient enough when small to medium
    sized table joined to very large one (more
    benchmarks needed).
  • Can be done with practically standard SQL, so
    should be usable on all DBMS.

13
Joining Catalogues four cases to consider
  • Cone-search find matches to single
    source/position.
  • User uploads own source-list to match with std
    catalogues
  • User selects subset from some standard catalogue,
    wants to cross-match with other large catalogues.
  • Cross-match of whole of one large catalogue with
    another.

14
Joining Catalogues practical cases
  • Cone-search find matches to single
    source/position.
  • Simple case, already supported by several
    services
  • User uploads own source-list to match with
    standard cats
  • User selects subset from some standard catalogue,
    wants to cross-match with other large catalogues.
  • Cross-match of whole of one large catalogue with
    another.
  • Substantial project use sort/sweep join outside
    DBMS?

15
Joining users Catalogue with Standard Catalogue
  • Practical problems especially on tables
    uploaded by the user
  • Support range of physical units and layouts
  • Radians, degrees, HMS/DMS, etc.
  • Various equinoxes, e.g. J2000, B1950 etc.
  • Error may be fixed for whole table, or specified
    by row(s).
  • Error regions may be circles, ellipses, or other
    shapes
  • Error values may be specified as 1-sigma, 90,
    etc.

16
Software Development Proposed
  • A little more development of pixel-code method,
    to benchmark and check efficiency with existing
    DBMS on a wider range of datasets.
  • Web application to allow users to upload their
    own tables
  • Convert them to a DBMS-compatible form
  • Ingest to DBMS
  • Create necessary pixel-code table, indices.
  • Needs way of allocating disc-space for users on
    the server e.g using MySpace (AstroGrid) or MyDB
    (NVO).

17
Cross-matching system - Conclusions
  • Fairly easy to develop required software to
    handle users tables and cross-match with standard
    catalogues.
  • Would provide new facilities that astronomers
    currently lack and which might actually get
    used.
  • A basic system could be set up in a matter of
    weeks (assuming the required MySpace/MyDB
    facilities were available).

18
And now for something completely different
  • Column-based data storage
  • for data exploration/mining

19
Data Exploration and Data Mining
  • Operations very often performed on tabular
    datasets.
  • Source catalogues, especially, are increasing in
    size
  • all-sky surveys, large format CCDs, higher time
    resolutions.
  • SDSS DR3 has about 1 Terabyte of tabular data.
  • Several catalogues now 109 rows in length.

20
Data Exploration and Data Mining
  • Operations very often performed on tabular
    datasets.
  • Source catalogues, especially, are increasing in
    size
  • all-sky surveys, large format CCDs, higher time
    resolutions.
  • SDSS DR3 has about 1 Terabyte of tabular data.
  • Several catalogues now 109 rows in length.
  • Width of tables also increasing

21
Storage of Tables
  • Tables are normally stored in relational DBMS
  • There are two obvious ways to store tabular data
  • Row-oriented, i.e. all fields in a row stored
    contiguously
  • Column-oriented, i.e. all values in a column
    stored as a vector in contiguous blocks on disc.
  • RDBMS are designed and optimised for transactions
    (insert/delete/update) so they almost always use
    row-oriented storage.

22
Can we keep up with Moores Law?
  • Astronomical data volume, processing power, disc
    storage doubling in about every 2 years.
  • I/O bandwidth and seek times are improving much
    more slowly (10 per year)
  • Increasingly we want to manipulate datasets which
    are too large to be held in memory (even though
    memory sizes are also increasing) so disc
    speeds are very important.
  • Conclusion I/O is becoming a more serious
    bottleneck.

23
Query Types
  • Index-accelerated queries
  • Selection on value (or range of values) in an
    indexed column.
  • Sequential scans necessary when
  • Creating an index
  • Populating new columns e.g. using UPDATE
  • Selecting on an expression
  • SELECT FROM table WHERE (bmag-vmag) 0.5
  • Computing statistics
  • Joining one table with another (must scan at
    least one table, can use index on the other)
  • Selecting, where index selectivity is too small (
    a random seek takes 10ms, can scan 500 disc
    blocks in that time).

24
Query Timing
  • Index-accelerated queries
  • Can return results in as little as 20 ms the
    time for a couple of disk-seeks.
  • Sequential scans of a table
  • Time depends (obviously) on table length
  • For tables of a billion rows, any scan operation
    takes at least 30 minutes.
  • Highly desirable to speed these up so can be
    interactive not batch operations.
  • For queries which scan large tables but only
    access a few columns out of many, row-based
    storage is very inefficient, column-oriented
    storage would be much faster.

25
Systems using column-oriented tables
  • Sybase-IQ best known commercial product from
    makers of Sybase-ASE relational database. Fast,
    but
  • No Linux version yet
  • No spatial indexing
  • Very expensive licence costs
  • ESO/MIDAS table system
  • STSDAS table system (part of IRAF)
  • FITS files can (just) store data in column-based
    form but these files too peculiar for most FITS
    utilities to cope.

26
Using HDF5 for column-oriented storage
  • Hierarchical Data Format 5
  • Comes from NCSA, well-supported library and tools
  • Compatible with Globus, GridFTP
  • Designed for efficient access to big files (no 2
    GB limit)
  • Simple API (bindings to C, Fortran90, Java,
    Python)
  • Can attach unlimited metadata to tables and
    columns.
  • Basic prototype was constructed and benchmarked
  • Used query parser and evaluator originally
    written for Starlinks CURSA about 1992.
  • Around 10 sample of 2MASS catalogue put into
    HDF5 column-based format.
  • Queries found to be 13 to 50 times faster using
    HDF5 than with fastest RDBMS (MySQL).

27
Development of an HDF5-based data explorer?
  • HDF5 library reliable, efficient, easy to use.
  • Query parser and evaluator CURSA, or Java
    library
  • Graphics package great variety of them
    available
  • Statistics routines many available
  • Indexing many B-tree libraries available
  • Spatial indexing free R-tree code exists, or
    use pixel-code methods.
  • Integration with VO for authentication, MySPACE,
    etc.
  • Provide as stand-alone package, and as a Web
    Service.
  • User interface allowing iterative filtering,
    undo, etc.
  • Probably the most difficult part of the design.

28
Additional Advantages
  • Makes distributed cross-match more feasible only
    transfer positional columns from one site to
    another.
  • Simple API - so users with their own data mining
    code (clustering, classification, etc) can read
    HDF5 datasets directly.
  • Easy to add access to tables in other formats
    such as FITS, VOTable, CSV, or (using JDBC) those
    in other DBMS.
  • Can provide user with integrated graphics,
    statistics, etc. and a host of other desirable
    data exploration features missing from commercial
    RDBMS.

29
Column-oriented Data Explorer - Conclusions
  • Developing a column-oriented data storage system
    using HDF5 is not a trivial exercise, but does
    not appear to be too difficult.
  • It is not intended to be a general-purpose DBMS,
    and should be seen as something complementing the
    DBMS in a data archive system or astronomical
    data warehouse.
  • Storage of same data in row-oriented and
    column-oriented form may be seen as wasteful, but
    disc is now cheap.
Write a Comment
User Comments (0)
About PowerShow.com