Title: Databases Meet Astronomy a db view of astronomy data
1Databases Meet Astronomya db view of astronomy
data
- Jim Gray
- Microsoft Research
- Collaborating with
- Alex Szalay, Peter Kunszt, Ani Thakar _at_ JHU
- Robert Brunner, Roy Williams _at_ Caltech
- George Djorgovski, Julian Bunn _at_ Caltech
2Outline
- Astronomy data
- The Virtual Observatory Concept
- The Sloan Digital Sky Survey
3ComputationalScience
- Traditional Empirical Science
- Scientist gathers data by direct observation
- Scientist analyzes data
- Computational Science
- Data captured by instrumentsOr data generated by
simulator - Processed by software
- Placed in a database
- Scientist analyzes database
4Astronomy Data Growth
- In the old days astronomers took photos.
- Starting in the 1960s they began to digitize.
- New instruments are digital (100s of GB/nite)
- Detectors are following Moores law.
- Data avalanche double every 2 years
Total area of 3m telescopes in the world in m2,
total number of CCD pixels in megapixel, as a
function of time. Growth over 25 years is a
factor of 30 in glass, 3000 in pixels.
3 M telescopes area m2
Courtesy of Alex Szalay
CCD area mpixels
5Universal Access to Astronomy Data
- Astronomers have a few Petabytes now.
- 1 pixel (byte) / sq arc second 4TB
- Multi-spectral, temporal, ? 1PB
- They mine it looking for new (kinds of) objects
or more of interesting ones (quasars),
density variations in 400-D space correlations
in 400-D space - Data doubles every 2 years.
- Data is public after 2 years.
- So, 50 of the data is public.
- Some have private access to 5 more data.
- So 50 vs 55 access for everyone
6Astronomy Data Publishing and Access
- But..
- How do I get at that 50 of the data?
- Astronomers have culture of publishing.
- FITS files and many tools.http//fits.gsfc.nasa.g
ov/fits_home.html - Encouraged by NASA.
- But, data details are hard to document.
Astronomers want to do it but it is VERY
hard.(What programs where used? What were the
processing steps? How were errors treated?) - The answer is 42.
7Astronomy Data Access
- And by the way, few astronomers have a spare
petabyte of storage in their pocket. - But that is getting better
- Public SDSS is 5 of total
- Public SDSS is 50GB 500GB of images (5TB raw)
- data fits on a 200 disk, 2000 computer.
- (more on that later).
- THESIS Challenging problems are publishing
data providing good query visualization
tools
8Astronomy Data Characteristics
- Lots of it (petabytes)
- Hundreds of dimensions per object
- Cross-correlation is challenging because
- Multi-resolution
- Time varying
- Data is dirty (cosmic rays, airplanes)
9Time and Spectral DimensionsThe Multiwavelength
Crab Nebulae
Crab star 1053 AD
X-ray, optical, infrared, and radio views of
the nearby Crab Nebula, which is now in a state
of chaotic expansion after a supernova explosion
first sighted in 1054 A.D. by Chinese Astronomers.
Slide courtesy of Robert Brunner _at_ CalTech.
10M51 in many wavelengths
Slide courtesy of Robert Brunner _at_ CalTech.
11Even in optical images are very different
Optical Near-Infrared Galaxy Image Mosaics
BJ RF IN J H K
Slide courtesy of Robert Brunner _at_ CalTech.
12Exploring Parameter SpaceManual or Automatic
Data Mining
- There is LOTS of data
- people cannot examine most of it.
- Need computers to do analysis.
- Manual or Automatic Exploration
- Manual person suggests hypothesis, computer
checks hypothesis - Automatic Computer suggests hypothesis person
evaluates significance - Given an arbitrary parameter space
- Data Clusters
- Points between Data Clusters
- Isolated Data Clusters
- Isolated Data Groups
- Holes in Data Clusters
- Isolated Points
Nichol et al. 2001 Slide courtesy of and adapted
fromRobert Brunner _at_ CalTech.
13Survey Cross-Identification
- Billions of Sources
- High Source Densities
- Multi-Wavelength Radio to g-Ray
- All Sky - Thousands of Sq. Degrees
- Computational Challenge
- Probabilistic Associations
- Optimized Likelihood Ratios
- A Priori Astrophysical Knowledge Important
- Secondary Parameters
- Temporal Variability
- Dynamic Static Associations
- User-Defined Cross-Identification Algorithms
Optical-Infrared-Radio Quasar-Environment Survey
Radio Survey Cross-Identification Steep Spectrum
Sources
Optical-Infrared-X-Ray Serendipitous Chandra
Identification
Slide courtesy of Robert Brunner _at_ CalTech.
14Data Federation A Computational Challenge
- 2MASS vs. DPOSS Cross-identification
- 2MASS J lt 15
- DPOSS IN lt 18
15Outline
- Astronomy data
- The Virtual Observatory Concept
- The Sloan Digital Sky Survey
16Virtual Observatoryhttp//www.astro.caltech.edu/n
voconf/http//www.voforum.org/
- Premise Most data is (or could be online)
- So, the Internet is the worlds best telescope
- It has data on every part of the sky
- In every measured spectral band optical, x-ray,
radio.. - As deep as the best instruments (2 years ago).
- It is up when you are up.The seeing is always
great (no working at night, no clouds no moons
no..). - Its a smart telescope links objects and
data to literature on them.
17The Age of Mega-Surveys
- Large number of new surveys
- multi-TB in size, 100 million objects or more
- Data publication an integral part of the survey
- Software bill a major cost in the survey
- The next generation mega-surveys are different
- top-down design
- large sky coverage
- sound statistical plans
- well controlled/documented data processing
- Each survey has a publication plan
- Federating these archives
- ? Virtual Observatory
MACHO 2MASS DENIS SDSS PRIME DPOSS GSC-II COBE
MAP NVSS FIRST GALEX ROSAT OGLE ...
Slide courtesy of Alex Szalay, modified by Jim
18Virtual Observatory Challenges
- Size multi-Petabyte
- 40,000 square degrees is 2 Trillion pixels
- One band (at 1 sq arcsec) 4 Terabytes
- Multi-wavelength 10-100
Terabytes - Time dimension gtgt 10 Petabytes
- Need auto parallelism tools
- Unsolved MetaData problem
- Hard to publish data programs
- How to federate Archives
- Hard to find/understand data programs
- Current tools inadequate
- new analysis visualization tools
- Data Federation is problematic
- Transition to the new astronomy
- Sociological issues
193-steps to Virtual Observatory
- Get SDSS and Palomar online
- Alex Szalay, Jan Vandenberg, Ani Thacker.
- Roy Williams, Robert Brunner, Julian Bunn
- Do some local queries and crossID matches with
CalTech and SDSS to expose - Schema, Units,
- Dataset problems
- the typical use scenarios.
- Implement Web Service that lets you get data from
both CalTech and SDSS (WSDL/SOAP/)
20Demo of VirtualSky
- Roy Williams _at_ CaltechPalomar Data with links to
NED. - Shows multiple themes, shows link to other sites
(NED, VizeR, Sinbad, ) - http//virtualsky.org/servlet/Page?T3S21P1X
0Y0W4F1 - And
- NED _at_ http//nedwww.ipac.caltech.edu/index.html
21Demo of Sky Server
- Alex Szalay of Johns Hopkins built SkyServer
(based on TerraServer design). - http//skyserver.fnal.gov/
22Virtual Observatory and Education
- In the beginning science was empirical.
- Then theoretical branches evolved.
- Now, we have a computational branches.
- The computational branch has been simulation
- It is becoming data analysis/visualization
- The Virtual Observatory can be used to
- Teach astronomy make it interactive,
demonstrate ideas and phenomena - Teach computational science skills
23What Next?(after the data online, after the web
servers)
- How to federate the Archives to make a VO?
- Send XML a non-answer equivalent to send
Unicode - Bytes is the wrong abstractionPublish Methods
on Objects. - Define a set of Astronomy Objects and methods.
- Based on UDDI, WSDL, SOAP.
- Each archive is a web service
- We have started this with TerraService
- http//TerraService.net/ shows the idea.
- Working with Caltech (Brunner, Williams,
Djorgovski, Bunn) and JHU (Szalay et al) on this
24SkyServer as a WebServerWSDLSOAPjust add
details ?
- Archive ss new VOService(SkyServer)
- Attributes A ss.GetObjects(ra,dec,radius)
-
- ?? What are the objects (attributes)?
- ?? What are the methods (GetObjects()...)?
- ?? Is the query language SQL or Xquery or what?
25Outline
- Astronomy data
- The Virtual Observatory Concept
- The Sloan Digital Sky Survey
26Sloan Digital Sky Survey http//www.sdss.org/
- For the last 12 years a group of astronomers has
been building a telescope (with funding from
Sloan Foundation, NSF, and a dozen universities).
90M. - Last year was engineer, calibrate, commission
They are making the calibration data public. - 5 of the survey, 600 sq degrees, 15 M objects
60GB. - This data includes most of the known high z
quasars. - It has a lot of science left in it but that is
just the start. - Now the data is arriving
- 250GB/nite (20 nights per year).
- 100 M stars, 100 M galaxies, 1 M spectra.
- http//www.sdss.org/ and http//www.sdss.jhu.edu/
27SDSS what I have been doing
- Work with Alex Szalay, Don Slutz, and others to
define 20 canonical queries and 10 visualization
tasks. - Don Slutz did a first cut of the queries, Im
continuing that work. - Working with Alex Szalay on building Sky Server
and making data it public (send out 80GB
SQL DBs)
28Two kinds of SDSS data
- 15M Photo Objects 400 attributes
20K Spectra with 10 lines/ spectrum
29Spatial Data Access(Szalay, Kunszt,
Brunner)http//www.sdss.jhu.edu/ look at the HTM
link
- Implemented Hierarchical Triangular Mesh (HTM) as
table-valued function for spatial joins. - Every object has a 20-deep Mesh ID.
- Given a spatial definitionRoutine returns up to
10 covering triangles. - Spatial query is then up to 10 range queries.
- Very fast 10,000 triangles / second / cpu.
30The 20 Queries
- Q11 Find all elliptical galaxies with spectra
that have an anomalous emission line. - Q12 Create a grided count of galaxies with u-ggt1
and rlt21.5 over 60ltdeclinationlt70, and 200ltright
ascensionlt210, on a grid of 2, and create a map
of masks over the same grid. - Q13 Create a count of galaxies for each of the
HTM triangles which satisfy a certain color cut,
like 0.7u-0.5g-0.2ilt1.25 rlt21.75, output it in
a form adequate for visualization. - Q14 Find stars with multiple measurements and
have magnitude variations gt0.1. Scan for stars
that have a secondary object (observed at a
different time) and compare their magnitudes. - Q15 Provide a list of moving objects consistent
with an asteroid. - Q16 Find all objects similar to the colors of a
quasar at 5.5ltredshiftlt6.5. - Q17 Find binary stars where at least one of them
has the colors of a white dwarf. - Q18 Find all objects within 30 arcseconds of one
another that have very similar colors that is
where the color ratios u-g, g-r, r-I are less
than 0.05m. - Q19 Find quasars with a broad absorption line in
their spectra and at least one galaxy within 10
arcseconds. Return both the quasars and the
galaxies. - Q20 For each galaxy in the BCG data set
(brightest color galaxy), in 160ltright
ascensionlt170, -25ltdeclinationlt35 count of
galaxies within 30"of it that have a photoz
within 0.05 of that galaxy.
- Q1 Find all galaxies without unsaturated pixels
within 1' of a given point of ra75.327,
dec21.023 - Q2 Find all galaxies with blue surface
brightness between and 23 and 25 mag per square
arcseconds, and -10ltsuper galactic latitude (sgb)
lt10, and declination less than zero. - Q3 Find all galaxies brighter than magnitude 22,
where the local extinction is gt0.75. - Q4 Find galaxies with an isophotal surface
brightness (SB) larger than 24 in the red band,
with an ellipticitygt0.5, and with the major axis
of the ellipse having a declination of between
30 and 60arc seconds. - Q5 Find all galaxies with a deVaucouleours
profile (r¼ falloff of intensity on disk) and the
photometric colors consistent with an elliptical
galaxy. The deVaucouleours profile - Q6 Find galaxies that are blended with a star,
output the deblended galaxy magnitudes. - Q7 Provide a list of star-like objects that are
1 rare. - Q8 Find all objects with unclassified spectra.
- Q9 Find quasars with a line width gt2000 km/s and
2.5ltredshiftlt2.7. - Q10 Find galaxies with spectra that have an
equivalent width in Ha gt40Ã… (Ha is the main
hydrogen spectral line.)
Also some good queries at http//www.sdss.jhu.edu
/ScienceArchive/sxqt/sxQT/Example_Queries.html
31An easy oneQ7 Provide a list of star-like
objects that are 1 rare.
- Found 14,681 buckets, first 140 buckets have
99 time 104 seconds - Disk bound, reads 3 disks at 68 MBps.
Select cast((u-g) as int) as ug, cast((g-r) as
int) as gr, cast((r-i) as int) as ri,
cast((i-z) as int) as iz, count()
as Population from stars group by cast((u-g) as
int), cast((g-r) as int), cast((r-i) as int),
cast((i-z) as int) order by count()
32An easy oneQ15 Provide a list of moving objects
consistent with an asteroid.
- Sounds hard but there are 5 pictures of the
object at 5 different times (colors) and so can
compute velocity. - Image pipeline computes velocity.
- Computing it from the 5 color x,y would also be
fast - Finds 2167 objects in 7 minutes, 70MBps.
select object_id, -- return object ID
sqrt(power(rowv,2)power(colv,2)) as velocity
from sxPhotObj -- check each
object. where (power(rowv,2) power(colv, 2)) gt
50 -- square of velocity and rowv gt 0
and colv gt0 -- negative values indicate error
33Q15 Fast Moving Objects
- Find near earth asteroids
- SELECT r.objID as rId, g.objId as gId, r.run,
r.camcol, r.field as field, g.field as gField, - r.ra as ra_r, r.dec as dec_r, g.ra as ra_g,
g.dec as dec_g, - sqrt( power(r.cx -g.cx,2) power(r.cy-g.cy,2)pow
er(r.cz-g.cz,2) )(10800/PI()) as distance - FROM PhotoObj r, PhotoObj g
- WHERE
- r.run g.run and r.camcolg.camcol and
abs(g.field-r.field)lt2 -- the match criteria - -- the red selection criteria
- and ((power(r.q_r,2) power(r.u_r,2)) gt
0.111111 ) - and r.fiberMag_r between 6 and 22 and
r.fiberMag_r lt r.fiberMag_g and r.fiberMag_r lt
r.fiberMag_i - and r.parentID0 and r.fiberMag_r lt r.fiberMag_u
and r.fiberMag_r lt r.fiberMag_z - and r.isoA_r/r.isoB_r gt 1.5 and r.isoA_rgt2.0
- -- the green selection criteria
- and ((power(g.q_g,2) power(g.u_g,2)) gt
0.111111 ) - and g.fiberMag_g between 6 and 22 and
g.fiberMag_g lt g.fiberMag_r and g.fiberMag_g lt
g.fiberMag_i - and g.fiberMag_g lt g.fiberMag_u and g.fiberMag_g
lt g.fiberMag_z - and g.parentID0 and g.isoA_g/g.isoB_g gt 1.5 and
g.isoA_g gt 2.0 - -- the matchup of the pair
- and sqrt(power(r.cx -g.cx,2) power(r.cy-g.cy,2)
power(r.cz-g.cz,2))(10800/PI())lt 4.0
34(No Transcript)
35A Hard One Q14 Find stars with multiple
measurements that have magnitude variations
gt0.1.
- This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.
- This should work, but SQL Server does not allow
table values to be piped to table-valued
functions.
36A Hard one Second TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.
- Write a program with a cursor, ran for 2 days
--------------------------------------------------
----------------------------- -- Table-valued
function that returns the binary stars within a
certain radius -- of another (in arc-minutes)
(typically 5 arc seconds). -- Returns the ID
pairs and the distance between them (in
arcseconds). create function BinaryStars(_at_MaxDista
nceArcMins float) returns _at_BinaryCandidatesTable
table( S1_object_ID bigint not null, -- Star
1 S2_object_ID bigint not null, -- Star
2 distance_arcSec float) -- distance between
them as begin declare _at_star_ID bigint,
_at_binary_ID bigint-- Star's ID and binary ID
declare _at_ra float, _at_dec float -- Star's
position declare _at_u float, _at_g float, _at_r float,
_at_i float,_at_z float -- Star's colors Â
----------------Open a cursor over stars and get
position and colors declare star_cursor cursor
for select object_ID, ra, dec, u, g, r, i,
z from Stars open star_cursor  while
(11) -- for each star begin -- get its
attribues fetch next from star_cursor into
_at_star_ID, _at_ra, _at_dec, _at_u, _at_g, _at_r, _at_i, _at_z if
(_at__at_fetch_status -1) break -- end if no more
stars insert into _at_BinaryCandidatesTable --
insert its binaries select _at_star_ID,
S1.object_ID, -- return stars pairs
sqrt(N.DotProd)/PI()10800 -- and distance in
arc-seconds from getNearbyObjEq(_at_ra, _at_dec,
-- Find objects nearby S. _at_MaxDistanceArcMins)
as N, -- call them N. Stars as S1 --
S1 gets N's color values where _at_star_ID lt
N.Object_ID -- S1 different from S and
N.objType dbo.PhotoType('Star') -- S1 is a
star and N.object_ID S1.object_ID -- join
stars to get colors of S1N and
(abs(_at_u-S1.u) gt 0.1 -- one of the colors is
different. or abs(_at_g-S1.g) gt 0.1 or
abs(_at_r-S1.r) gt 0.1 or abs(_at_i-S1.i) gt 0.1
or abs(_at_z-S1.z) gt 0.1 ) end -- end
of loop over all stars -------------- Looped
over all stars, close cursor and exit. close
star_cursor -- deallocate star_cursor
return -- return table end -- end of
BinaryStars GO select from dbo.BinaryStars(.05)
37A Hard one Third TryQ14 Find stars with
multiple measurements that have magnitude
variations gt0.1.
- Use pre-computed neighbors table.
- Ran in 17 minutes, found 31k pairs.
-- Plan 2 Use
the precomputed neighbors table select top 100
S.object_ID, S1.object_ID, -- return star pairs
and distance str(N.Distance_mins 60,6,1) as
DistArcSec from Stars S, -- S is a
star Neighbors N, -- N within 3 arcsec (10
pixels) of S. Stars S1 -- S1 N has the
color attibutes where S.Object_ID
N.Object_ID -- connect S and N. and
S.Object_ID lt N.Neighbor_Object_ID -- S1
different from S and N.Neighbor_objType
dbo.PhotoType('Star')-- S1 is a star (an
optimization) and N.Distance_mins lt .05 --
the 3 arcsecond test and N.Neighbor_object_ID
S1.Object_ID -- N S1 and (
abs(S.u-S1.u) gt 0.1 -- one of the colors is
different. or abs(S.g-S1.g) gt 0.1 or
abs(S.r-S1.r) gt 0.1 or abs(S.i-S1.i) gt 0.1 or
abs(S.z-S1.z) gt 0.1 ) -- Found 31,355 pairs
(out of 4.4 m stars) in 17 min 14 sec.
38The Pain of Going Outside SQL(its fortunate that
all the queries are single statements)
- Use a cursor
- No cpu parallelism
- CPU bound
- 6 MBps, 2.7 k rps
- 5,450 seconds (10x slower)
- Count parent objects
- 503 seconds for 14.7 M objects in 33.3 GB
- 66 MBps
- IO bound (30 of one cpu)
- 100 k records/cpu sec
declare _at_count int declare _at_sum int set _at_sum
0 declare PhotoCursor cursor for select nChild
from sxPhotoObj open PhotoCursor while (11)
begin fetch next from PhotoCursor into
_at_count if (_at__at_fetch_status -1) break set
_at_sum _at_sum _at_count end close
PhotoCursor deallocate PhotoCursor print 'Sum
is 'cast(_at_sum as varchar(12))
select count() from sxPhotoObj where nChild
gt 0
39(No Transcript)
40(No Transcript)
41Performance
- Run times on 3k PC (2 cpu, 4 disk, 256MB)
- Some take 10 minutes
- Some take 1 minute
- Most take 15 seconds.
- Ghz processors are fast!
- (10 mips/IO, 200 ins/IO)
100 IO/cpu sec 5MB/cpu sec
42Summary of Queries
- 16 of the queries are simple
- 2 are iterative, 2 are unknown
- Many are sequential one-pass and two-pass over
data - Covering indices make scans run fast
- Table valued functions are wonderful but
limitations on parameters are a pain. - Counting is VERY common.
- Binning (grouping by some set of attributes) is
common - Did not request cube, but that may be cultural.
43Reflections on the 20 Queries
- Data loading/scrubbing is labor intensive
tedious - AUTOMATE!!!
- This is 5 of the data, and some queries take 1/2
hour. - But this is not tuned (disk bound).
- All queries benefit from parallelism (both disk
and cpu)(if you can state the query inside SQL). - Parallel database machines will do well on this
- Hash machines
- Data pumps
- See paper in word or pdf on my web site.
- Conclusion SQL looks good.Once you get the
answers, you need visualization
44Call to Action
- If you do data visualization we need you(and we
know it). - If you do databaseshere is some data you can
practice on. - If you do distributed systemshere is a
federation you can practice on. - If you do astronomy educational outreachhere is
a tool for you. - The astronomy folks are very good, and very
smart, and a pleasure to work with, and the
questions are cosmic, so