Astronomy Data Bases - PowerPoint PPT Presentation

About This Presentation

Title:

Astronomy Data Bases

Description:

Astronomy Data Bases Jim Gray Microsoft Research The Evolution of Science Observational Science Scientist gathers data by direct observation Scientist analyzes data ... – PowerPoint PPT presentation

Slides: 28

Provided by: JimG52

Transcript and Presenter's Notes

Title: Astronomy Data Bases

1
Astronomy Data Bases

Jim Gray
Microsoft Research

2
The Evolution of Science

Observational Science
Scientist gathers data by direct observation
Scientist analyzes data
Analytical Science
Scientist builds analytical model
Makes predictions.
Computational Science
Simulate analytical model
Validate model and make predictions
Data Exploration Science
Data captured by instruments
Or data generated by simulator
Processed by software
Placed in a database / files
Scientist analyzes database / files

3
Computational Science Evolves

Historically, Computational Science = simulation.
New emphasis on informatics
Capturing,
Organizing,
Summarizing,
Analyzing,
Visualizing
Largely driven by observational science, but
also needed by simulations.
Too soon to say if comp-X and X-info will unify
or compete.

[Images: BaBar, Stanford; PE Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
4
Information Avalanche

Both
better observational instruments and
Better simulations
are producing a data avalanche
Examples
Turbulence: 100 TB simulation, then mine the
information
BaBar: grows 1 TB/day; 2/3 simulation information,
1/3 observational information
CERN: LHC will generate 1 GB/s, 10 PB/y
VLBA (NRAO) generates 1 GB/s today
NCBI: only ½ TB but doubling each year, very
rich dataset.
Pixar: 100 TB/movie

Images courtesy of Charles Meneveau and Alex Szalay
@ JHU
5
What X-info Needs from us (cs)
(not drawn to scale)
6
Next-Generation Data Analysis

Looking for
Needles in haystacks: the Higgs particle
Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Global statistics have poor scaling
Correlation functions are N^2, likelihood
techniques N^3
As data and computers grow at same rate, we can
only keep up with N log N
A way out?
Discard notion of optimal (data is fuzzy, answers
are approximate)
Don't assume infinite computational resources or
memory
Requires combination of statistics and computer
science
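
The scaling claim on this slide can be checked with a few lines of arithmetic. The sketch below assumes, purely for illustration, that both the data volume N and the available compute speed double every year. The doubling rate and the starting catalog size are assumptions, not figures from the talk. It prints how long each class of algorithm takes relative to year 0.

```python
import math

def relative_runtime(cost, n0, years, growth=2.0):
    """Runtime after `years`, relative to year 0, when data size and
    compute speed both grow by `growth` per year (an assumed rate)."""
    n = n0 * growth ** years
    speedup = growth ** years
    return (cost(n) / speedup) / cost(n0)

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n0 = 1e9  # hypothetical starting catalog size (objects)
for name, cost in costs.items():
    ratios = [relative_runtime(cost, n0, y) for y in (0, 5, 10)]
    print(name, ["%.3g" % r for r in ratios])
```

Under these assumptions the N log N runtime grows only slowly, while the N^2 and N^3 runtimes grow by the same factor as the data, or by its square.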

7
Analysis and Databases

Much statistical analysis deals with
Creating uniform samples
data filtering
Assembling relevant subsets
Estimating completeness
censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a
database
Move Mohamed to the mountain, not the mountain to
Mohamed.
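
As a minimal sketch of "doing it inside the database", the example below filters out flagged rows and builds a histogram with one SQL statement, so only the small result leaves the database. It uses Python's built-in sqlite3 module as a stand-in for a real astronomy DBMS. The table name `obj`, its columns, and the random data are all hypothetical.

```python
import random
import sqlite3

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flags INTEGER)")
# Hypothetical catalog: magnitudes between 14 and 22, about a quarter flagged as bad.
rows = [(i, random.uniform(14.0, 22.0), random.choice([0, 0, 0, 1])) for i in range(10000)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor flagged rows and count objects per 1-magnitude bin,
# all inside the database; only the histogram crosses to the client.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS mag_bin, COUNT(*) AS n
    FROM obj
    WHERE flags = 0
    GROUP BY mag_bin
    ORDER BY mag_bin
    """
).fetchall()

for mag_bin, n in histogram:
    print(mag_bin, n)
```

The same filtering and counting done on files would mean shipping all 10,000 rows to the program first; here the program receives at most a handful of bins.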

8
Data Access is hitting a wall: FTP and GREP are
not adequate

You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years.
Oh!, and 1 PB is about 5,000 disks
At some point you need indices to limit
search, and parallel data search and analysis
This is where databases can help

You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (1 $/GB)
1 TB: 2 days and 1K$
1 PB: 3 years and 1M$
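
The orders of magnitude on this slide follow from simple division. The sketch below assumes a sustained rate of 10 MB/s per sequential stream. That rate is an assumption chosen because it roughly reproduces the petabyte figure; the slide's own numbers imply somewhat different rates at each size. It also shows what spreading the scan over 5,000 disks would do.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

def scan_seconds(size_bytes, bytes_per_sec=10e6, n_disks=1):
    """Time to read everything once, split evenly over n_disks."""
    return size_bytes / (bytes_per_sec * n_disks)

TB, PB = 1e12, 1e15

print("1 TB, one stream: %.1f days" % (scan_seconds(TB) / SECONDS_PER_DAY))
print("1 PB, one stream: %.1f years" % (scan_seconds(PB) / SECONDS_PER_YEAR))
print("1 PB, 5,000 disks in parallel: %.1f hours"
      % (scan_seconds(PB, n_disks=5000) / 3600))
```

A single sequential pass over a petabyte takes years at this rate, while the same pass run on every disk at once takes hours, which is the case for parallel search; an index reduces the bytes read in the first place.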

9
Data Federations of Web Services

Massive datasets live near their owners
Near the instruments' software pipeline
Near the applications
Near data knowledge and curation
Super Computer centers become Super Data Centers
Each Archive publishes a web service
Schema documents the data
Methods on objects (queries)
Scientists get personalized extracts
Uniform access to multiple Archives
A common global schema

Federation
10
Web Services: The Key?

Web SERVER
Given a URL and parameters
Returns a web page (often dynamic)
Web SERVICE
Given an XML document (SOAP msg)
Returns an XML document
Tools make this look like an RPC.
F(x,y,z) returns (u, v, w)
Distributed objects for the web.
naming, discovery, security,..
Internet-scale distributed computing
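
The "XML in, XML out, made to look like an RPC" idea can be sketched without any network. In the example below, `service` stands in for a remote endpoint and `F` is the client stub that tools would normally generate. This is not real SOAP: there is no envelope, namespace, or service description, and the element names and the arithmetic inside `service` are invented for illustration.

```python
import xml.etree.ElementTree as ET

def service(request_xml: str) -> str:
    """Hypothetical service endpoint: XML document in, XML document out."""
    req = ET.fromstring(request_xml)
    x, y, z = (float(req.findtext(name)) for name in ("x", "y", "z"))
    resp = ET.Element("FResponse")
    for name, value in (("u", x + y), ("v", y * z), ("w", x - z)):
        ET.SubElement(resp, name).text = repr(value)
    return ET.tostring(resp, encoding="unicode")

def F(x, y, z):
    """Client stub: hides the XML so the call looks like a local function."""
    req = ET.Element("F")
    for name, value in (("x", x), ("y", y), ("z", z)):
        ET.SubElement(req, name).text = repr(float(value))
    # In a real deployment this call would be an HTTP POST to the service.
    response_xml = service(ET.tostring(req, encoding="unicode"))
    resp = ET.fromstring(response_xml)
    return tuple(float(resp.findtext(name)) for name in ("u", "v", "w"))

print(F(1, 2, 3))
```

The caller only ever sees `F(x, y, z)` returning `(u, v, w)`; the XML documents exist only on the wire between stub and service.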

[Diagram: Your program sends http to a Web Server and gets back a Web page; Your program sends soap to a Web Service and gets back an object in xml, as data in your address space]
11
Grid and Web Services Synergy

I believe the Grid will be many web services
IETF standards Provide
Naming
Authorization / Security / Privacy
Distributed Objects
Discovery, Definition, Invocation, Object Model
Higher level services: workflow, transactions,
DB,..
Synergy: commercial Internet and Grid tools

12
World Wide Telescope
Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/

Premise: Most data is (or could be) online
So, the Internet is the world's best telescope
It has data on every part of the sky
In every measured spectral band: optical, x-ray,
radio..
As deep as the best instruments (2 years ago).
It is up when you are up.
The seeing is always great (no working at night,
no clouds, no moons, no..).
It's a smart telescope: links objects and
data to literature on them.

13
Why Astronomy Data?

It has no commercial value
No privacy concerns
Can freely share results with others
Great for experimenting with algorithms
It is real and well documented
High-dimensional data (with confidence
intervals)
Spatial data
Temporal data
Many different instruments from many different
places and many different times
Federation is a goal
There is a lot of it (petabytes)
Great sandbox for data mining algorithms
Can share cross company
University researchers
Great way to teach both Astronomy and
Computational Science

14
Put Your Data In a File?

Simple
Reliable
Common Practice
Matches C/Java/programming model (streams)

Metadata in program, not in database
Recovery is old-master/new-master rather than
transaction
Procedural access for queries
No indices unless you do it yourself
No parallelism unless you do it yourself
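
The contrast between procedural file access and a declarative, indexed query can be sketched as follows. The same hypothetical records are searched twice: once by reading every line of a CSV "file" (held in memory here), and once through an sqlite3 table with an index on the searched column. The table, column, and index names are illustrative assumptions.

```python
import csv
import io
import random
import sqlite3

random.seed(1)
records = [(i, random.uniform(0.0, 360.0)) for i in range(5000)]

# "File" version: the program knows the column layout (metadata in code)
# and must read every record to answer a range question.
buf = io.StringIO()
csv.writer(buf).writerows(records)
buf.seek(0)
from_file = sorted(
    int(row[0]) for row in csv.reader(buf) if 10.0 <= float(row[1]) <= 11.0
)

# Database version: the schema lives with the data, the query says what
# is wanted, and the engine is free to use the index instead of a scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra REAL)")
conn.executemany("INSERT INTO obj VALUES (?, ?)", records)
conn.execute("CREATE INDEX idx_obj_ra ON obj (ra)")
query = "SELECT id FROM obj WHERE ra BETWEEN 10.0 AND 11.0 ORDER BY id"
from_db = [row[0] for row in conn.execute(query)]

print(from_file == from_db)
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])  # SQLite's own description of the plan it chose
```

Both paths return the same ids; the difference is that the file version hard-codes the layout and the scan, while the database version leaves the access path to the query planner.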

15
Put Your Data In a DB?

Schematized: schema evolution, data
independence
Reliable: transactions, online backup,..
Query tools: parallelism, non-procedural
Scales to large datasets
Web services tools

Complicated
New programming model
Depend on a vendor: all give an extended subset
of the standard
Expensive

Product X
sql
16
My Conclusion

Despite the drawbacks
DB is the only choice
for large datasets
for complex datasets (schema)
for complex query
for shared access (read/write)
But try to present standard SQL
Power users need full power of SQL

17
The SDSS Experience

It takes a village. MANY different skills

18
The SDSS Experience: not all DBMSs are DBMSs

DB1
Schema evolves.
Crash / reload on evolution.
No easy way to evolve
No query tools
Poor indices
Dismal sequential performance (.5 MB/s)
Had to build their own parallelism.
This database system had virtually none of the
DB benefits and all of the DB pain.

19
The SDSS Experience

DB2 (a fairly pure relational system)
Schema evolution was easy.
Query tools, indices, parallelism work
Many admin tools for loading
Good sequential performance (1 GB/s,
5 M records/second/cpu)
Reliable
Had good vendor support (me)
Seduced by vendor extensions
Some query optimizer bugs (bad plans) are a
constant nuisance.

20
Astronomy DBs

Data starts with Pixels (10s of TB today)
Optical is pixels (flux @ (ra,dec))
Radio is cube (f(band) @ (ra,dec))
Many things vary with time
Pixels converted to objects (Billions today)
@ (ra,dec): hundreds of attributes, each with
estimated error
Most queries on object space.
Drill down to pixel space or to cube.
Many queries are spatial: need HTM or ..
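
A typical spatial query is a cone search: find every object within some angular radius of a sky position. The sketch below is a brute-force version that tests every row with the haversine formula. It is not HTM; it shows the question that a spatial index such as HTM is meant to answer without touching every object. The catalog rows are hypothetical.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(objects, ra, dec, radius_deg):
    """Return ids of objects within radius_deg of (ra, dec) by checking every row."""
    return [obj_id for obj_id, o_ra, o_dec in objects
            if angular_separation_deg(ra, dec, o_ra, o_dec) <= radius_deg]

# Hypothetical catalog rows: (id, ra, dec) in degrees.
catalog = [(1, 185.00, 0.00), (2, 185.01, 0.01), (3, 185.50, 0.00), (4, 5.00, 89.99)]
print(cone_search(catalog, 185.0, 0.0, 0.05))
```

With billions of objects, the per-row check is the part that has to go: an index narrows the candidates to a small sky region first, and only those candidates get the exact distance test.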

21
Demo

Show pixel space and object space explorers.

22
A Simple Schema
Photo
Spectro
23
How to Design the Database?

Decide what it is for: the 20 questions approach has
worked well
Design it to answer those 20 questions
Iterate (it is easy to change designs).
BUT.. Be careful about names
reddening ? extinction causes problems
fuzzy definitions cause problems
documenting what a value means is hard

24
The Answer is 42

But what is the accuracy and precision?
What is the derivation?
Needs a man page

25
The SDSS Experience

DB has worked out well
Tools are very important (especially data
loading)
Integration with web servers/services is very
important
Need more than single-node parallelism
Need better query plans
But overall a success.
Have been able to clone it for several other
datasets (FIRST, 2MASS, SSS, INT)
Database replicated at many sites (25?)
Built an interesting data-ingest system.

26
Traffic Analysis

SDSS DR1 has been online for a while.
Peak hour is 12M records/hour
Peak query is 500,000 rows (limit)

27
The Future

Things will get better.
Code is moving into the DB
easier to add spatial and other functions
better performance
No Inside/Outside dichotomy
XML Schema (XSD) describes data on the wire.
I love DataSets (a schematized network of
records)
XSD described
collections of record sets
With foreign keys
With updategrams
XML and XQuery are coming
This may help some things
This may confuse things (more choices)
Probably both.