Title: Astronomy Data Bases
1Astronomy Data Bases
- Jim Gray
- Microsoft Research
2The Evolution of Science
- Observational Science
- Scientist gathers data by direct observation
- Scientist analyzes data
- Analytical Science
- Scientist builds analytical model
- Makes predictions.
- Computational Science
- Simulate analytical model
- Validate model and make predictions
- Data Exploration Science
- Data captured by instruments, or data generated by simulator
- Processed by software
- Placed in a database / files
- Scientist analyzes database / files
3Computational Science Evolves
- Historically, Computational Science = simulation.
- New emphasis on informatics
- Capturing,
- Organizing,
- Summarizing,
- Analyzing,
- Visualizing
- Largely driven by observational science, but also needed by simulations.
- Too soon to say if comp-X and X-info will unify or compete.
BaBar, Stanford
PE Gene Sequencer, from http://www.genome.uci.edu/
Space Telescope
4Information Avalanche
- Both
- better observational instruments and
- Better simulations
- are producing a data avalanche
- Examples
- Turbulence: 100 TB simulation, then mine the information
- BaBar: grows 1 TB/day (2/3 simulation information, 1/3 observational information)
- CERN: LHC will generate 1 GB/s, 10 PB/y
- VLBA (NRAO) generates 1 GB/s today
- NCBI: only ½ TB but doubling each year, very rich dataset.
- Pixar: 100 TB/movie
Images courtesy of Charles Meneveau and Alex Szalay @ JHU
5What X-info Needs from us (cs) (not drawn to scale)
6Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: dark matter, dark energy
- Needles are easier than haystacks
- Global statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- As data and computers grow at the same rate, we can only keep up with N log N
- A way out?
- Discard notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Requires combination of statistics and computer science
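The scaling claim above can be checked with a little arithmetic. The sketch below is an illustration, not part of the original slides: it assumes data volume and compute speed both grow by the same factor, and the names `relative_runtime`, `n`, and `growth` are hypothetical.

```python
import math

def relative_runtime(n, growth, cost):
    """Runtime after growth divided by runtime before growth.

    n      -- current number of objects
    growth -- factor by which BOTH data size and compute speed grow (assumption)
    cost   -- function mapping N to an operation count
    """
    before = cost(n)
    after = cost(n * growth) / growth  # the faster machine divides the work
    return after / before

n = 1_000_000
growth = 10
ratios = {
    "N log N": relative_runtime(n, growth, lambda x: x * math.log2(x)),
    "N^2": relative_runtime(n, growth, lambda x: x ** 2),
    "N^3": relative_runtime(n, growth, lambda x: x ** 3),
}
for name, ratio in ratios.items():
    print(f"{name:8s} runtime grows by a factor of {ratio:.2f}")
```

Under these assumptions the N log N ratio stays close to 1, while the N^2 and N^3 ratios grow with the data, which is the sense in which only N log N algorithms "keep up".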
7Analysis and Databases
- Much statistical analysis deals with
- Creating uniform samples
- data filtering
- Assembling relevant subsets
- Estimating completeness
- censoring bad data
- Counting and building histograms
- Generating Monte-Carlo subsets
- Likelihood calculations
- Hypothesis testing
- Traditionally these are performed on files
- Most of these tasks are much better done inside a database
- Move Mohamed to the mountain, not the mountain to Mohamed.
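As a rough illustration of doing these tasks inside a database, the sketch below filters bad data, builds a histogram, and draws a random subset with SQL. It uses Python's built-in `sqlite3` module as a stand-in; the table `obj`, its columns `mag` and `flag`, and the synthetic values are all assumptions made for the example.

```python
import random
import sqlite3

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")

# Synthetic catalog: magnitudes between 14 and 24, about 10% flagged as bad.
rows = [(i, random.uniform(14.0, 24.0), int(random.random() < 0.1))
        for i in range(10_000)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Censor bad data (flag = 1) and build a 1-magnitude histogram in one query.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flag = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()

# Draw a random (Monte-Carlo style) subset of the good objects.
sample = conn.execute(
    "SELECT id FROM obj WHERE flag = 0 ORDER BY RANDOM() LIMIT 100"
).fetchall()
```

Only the small result sets (`histogram`, `sample`) leave the database; the full table is never copied into the program.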
8Data Access is hitting a wall: FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years.
- Oh!, and 1 PB ~ 5,000 disks
- At some point you need indices to limit search, plus parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (~ $1/GB)
- 1 TB: 2 days and $1K
- 1 PB: 3 years and $1M
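To make the point about indices concrete, here is a minimal sketch comparing a GREP-style linear scan with a lookup in a sorted index. It is an illustration only: the functions `scan_lookup` and `index_lookup` and the synthetic keys are assumptions, and a real database index (for example a B-tree) is more elaborate than a binary search over a list.

```python
def scan_lookup(keys, target):
    """Linear scan: touch records one by one, GREP-style."""
    touched = 0
    for key in keys:
        touched += 1
        if key == target:
            return True, touched
    return False, touched

def index_lookup(sorted_keys, target):
    """Binary search over a sorted index, counting the entries probed."""
    lo, hi, touched = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        touched += 1
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == target
    return found, touched

keys = list(range(0, 2_000_000, 2))  # one million sorted keys
target = keys[-1]                    # worst case for the scan
scan_found, scan_touched = scan_lookup(keys, target)
index_found, index_touched = index_lookup(keys, target)
print(f"scan touched {scan_touched} records, index touched {index_touched}")
```

The scan touches every record, while the index touches a number of entries that grows only with the logarithm of the data size.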
9Data Federations of Web Services
- Massive datasets live near their owners
- Near the instruments' software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes a web service
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
Federation
10Web Services: The Key?
- Web SERVER
- Given a URL + parameters
- Returns a web page (often dynamic)
- Web SERVICE
- Given an XML document (SOAP msg)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- naming, discovery, security,..
- Internet-scale distributed computing
[Diagram: Your program <-> Web Server over HTTP, returning a web page. Your program <-> Web Service over SOAP, returning an object in XML as data in your address space.]
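The "XML document in, XML document out, wrapped to look like an RPC" idea can be sketched in a few lines. This is a toy illustration, not real SOAP and not any particular toolkit: the XML layout, the `service` function, and the `stats` stub are hypothetical, and the "network" is a direct function call.

```python
import xml.etree.ElementTree as ET

def encode_call(method, **params):
    """Client side: turn F(x, y, z) into an XML request document."""
    root = ET.Element("call", {"method": method})
    for name, value in params.items():
        ET.SubElement(root, "param", {"name": name}).text = repr(float(value))
    return ET.tostring(root, encoding="unicode")

def service(request_xml):
    """Service side: given an XML document, return an XML document."""
    call = ET.fromstring(request_xml)
    if call.get("method") != "stats":
        raise ValueError("unknown method")
    values = [float(p.text) for p in call.findall("param")]
    results = (("u", min(values)), ("v", max(values)),
               ("w", sum(values) / len(values)))
    reply = ET.Element("reply")
    for name, value in results:
        ET.SubElement(reply, "value", {"name": name}).text = repr(value)
    return ET.tostring(reply, encoding="unicode")

def stats(x, y, z):
    """Client stub: F(x, y, z) returns (u, v, w), hiding the XML exchange."""
    reply = ET.fromstring(service(encode_call("stats", x=x, y=y, z=z)))
    return tuple(float(v.text) for v in reply.findall("value"))

u, v, w = stats(1.0, 2.0, 6.0)
print(u, v, w)
```

The caller of `stats` sees an ordinary function; the XML documents exist only between the stub and the service, which is the role the slide assigns to the tools.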
11Grid and Web Services Synergy
- I believe the Grid will be many web services
- IETF standards provide
- Naming
- Authorization / Security / Privacy
- Distributed Objects
- Discovery, Definition, Invocation, Object Model
- Higher level services: workflow, transactions, DB, ..
- Synergy: commercial Internet and Grid tools
12World Wide Telescope / Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
- Premise: Most data is (or could be) online
- So, the Internet is the world's best telescope
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio..
- As deep as the best instruments (2 years ago).
- It is up when you are up. The seeing is always great (no working at night, no clouds, no moons, no..).
- It's a smart telescope: links objects and data to literature on them.
13Why Astronomy Data?
- It has no commercial value
- No privacy concerns
- Can freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional data (with confidence intervals)
- Spatial data
- Temporal data
- Many different instruments from many different places and many different times
- Federation is a goal
- There is a lot of it (petabytes)
- Great sandbox for data mining algorithms
- Can share cross company
- University researchers
- Great way to teach both Astronomy and
Computational Science
14Put Your Data In a File?
- Simple
- Reliable
- Common Practice
- Matches C/Java/programming model (streams)
- Metadata in program, not in database
- Recovery is old-master new-master rather than transaction
- Procedural access for queries
- No indices unless you do it yourself
- No parallelism unless you do it yourself
15Put Your Data In a DB?
- Schematized: schema evolution, data independence
- Reliable: transactions, online backup, ..
- Query tools: parallelism, non-procedural
- Scales to large datasets
- Web services tools
- Complicated
- New programming model
- Depend on a vendor: all give an "extended subset" of the standard
- Expensive
[Diagram labels: Product X, SQL]
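The contrast between the two slides (procedural access on files versus non-procedural queries in a database) can be sketched side by side. This is an illustration under assumptions: the four-row catalog, the column names, and the magnitude cut are invented, and Python's built-in `sqlite3` stands in for a full DBMS.

```python
import csv
import io
import sqlite3

CSV_TEXT = """id,ra,dec,mag
1,10.0,-5.0,17.2
2,10.5,-4.8,21.9
3,200.1,33.0,16.4
4,200.3,33.2,19.0
"""

# File approach: the schema lives in the program and the query is a loop.
bright_from_file = []
for row in csv.DictReader(io.StringIO(CSV_TEXT)):
    if float(row["mag"]) < 18.0:
        bright_from_file.append(int(row["id"]))

# Database approach: the schema lives in the database and the query is declarative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)")
conn.executemany(
    "INSERT INTO obj VALUES (?, ?, ?, ?)",
    [(int(r["id"]), float(r["ra"]), float(r["dec"]), float(r["mag"]))
     for r in csv.DictReader(io.StringIO(CSV_TEXT))],
)
conn.execute("CREATE INDEX obj_mag ON obj (mag)")  # the index comes from the DB, not from us
bright_from_db = [r[0] for r in conn.execute(
    "SELECT id FROM obj WHERE mag < 18.0 ORDER BY id")]

print(bright_from_file, bright_from_db)
```

Both approaches return the same objects; the difference is who owns the schema, the index, and the access plan.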
16My Conclusion
- Despite the drawbacks
- DB is the only choice for large datasets, for complex datasets (schema), for complex queries, for shared access (read and write)
- But try to present standard SQL
- Power users need full power of SQL
17The SDSS Experience
- It takes a village. MANY different skills
18The SDSS Experience: not all DBMSs are DBMSs
- DB1
- Schema evolves.
- Crash and reload on evolution.
- No easy way to evolve
- No query tools
- Poor indices
- Dismal sequential performance (.5 MB/s)
- Had to build their own parallelism.
- This database system had virtually none of the DB benefits and all of the DB pain.
19The SDSS Experience
- DB2 (a fairly pure relational system)
- Schema evolution was easy.
- Query tools, indices, parallelism work
- Many admin tools for loading
- Good sequential performance (1 GB/s, 5 M records/second/cpu)
- Reliable
- Had good vendor support (me)
- Seduced by vendor extensions
- Some query optimizer bugs (bad plans) are a
constant nuisance.
20Astronomy DBs
- Data starts with Pixels (10s of TB today)
- Optical is pixels (flux @ (ra,dec))
- Radio is cube (f(band) @ (ra,dec))
- Many things vary with time
- Pixels converted to objects (Billions today)
- @ (ra,dec): hundreds of attributes, each with estimated error
- Most queries on object space.
- Drill down to pixel space or to cube.
- Many queries are spatial: need HTM or ..
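A typical spatial query on object space is a cone search: find all objects within some angle of a position on the sky. The sketch below is a brute-force version for illustration only. It is not HTM and not the SDSS implementation; the catalog, the function names, and the search radius are assumptions, and a spatial index such as HTM exists precisely to avoid testing every object like this.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert (ra, dec) in degrees to a unit vector on the celestial sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def cone_search(objects, ra_deg, dec_deg, radius_deg):
    """Return ids of objects within radius_deg of (ra_deg, dec_deg)."""
    center = unit_vector(ra_deg, dec_deg)
    min_dot = math.cos(math.radians(radius_deg))
    hits = []
    for obj_id, ra, dec in objects:
        v = unit_vector(ra, dec)
        # A dot product above cos(radius) means the separation is below radius.
        if sum(a * b for a, b in zip(center, v)) >= min_dot:
            hits.append(obj_id)
    return hits

# Hypothetical catalog rows: (id, ra, dec) in degrees.
catalog = [(1, 180.0, 0.0), (2, 180.05, 0.02), (3, 181.5, 0.0), (4, 0.0, 89.9)]
nearby = cone_search(catalog, 180.0, 0.0, 0.1)
print(nearby)
```

Working with unit vectors avoids special cases at the poles and at the ra = 0/360 wrap-around, which is one reason sphere-based schemes are attractive for sky catalogs.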
21Demo
- Show pixel space and object space explorers.
22A Simple Schema
Photo
Spectro
23How to Design the Database?
- Decide what it is for: 20 questions approach has worked well
- Design it to answer those 20 questions
- Iterate (it is easy to change designs).
- BUT.. Be careful about names
- reddening ? extinction causes problems
- Fuzzy definitions cause problems
- Documenting what a value means is hard
24The Answer is 42
- But what is the accuracy and precision?
- What is the derivation?
- Needs a man page
25The SDSS Experience
- DB has worked out well
- Tools are very important (especially data loading)
- Integration with web servers/services is very important
- Need more than single-node parallelism
- Need better query plans
- But overall a success.
- Have been able to clone it for several other datasets (FIRST, 2MASS, SSS, INT)
- Database replicated at many sites (25?)
- Built an interesting data-ingest system.
26Traffic Analysis
- SDSS DR1 has been online for a while.
- Peak hour is 12M records/hour
- Peak query is 500,000 rows (limit)
27The Future
- Things will get better.
- Code is moving into the DB
- Easier to add spatial and other functions
- Better performance
- No inside/outside dichotomy
- XML Schema (XSD) describes data on the wire.
- I love DataSets (a schematized network of records)
- XSD described
- collections of record sets
- With foreign keys
- With updategrams
- XML and XQuery are coming
- This may help some things
- This may confuse things (more choices)
- Probably both.
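The point about code moving into the database can be illustrated with a user-defined function that runs inside the query engine. The sketch below uses SQLite's `create_function` hook from Python's built-in `sqlite3` module as a small stand-in for the server-side extensibility the slide describes; the function name `ang_dist`, the table `obj`, and the data are assumptions.

```python
import math
import sqlite3

def angular_distance_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

conn = sqlite3.connect(":memory:")
# Register the Python function so SQL can call it as ang_dist(...).
conn.create_function("ang_dist", 4, angular_distance_deg)

conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra REAL, dec REAL)")
conn.executemany(
    "INSERT INTO obj VALUES (?, ?, ?)",
    [(1, 180.0, 0.0), (2, 180.05, 0.02), (3, 181.5, 0.0)],
)

# The spatial predicate is evaluated by the query, not by a client-side loop.
near = [r[0] for r in conn.execute(
    "SELECT id FROM obj WHERE ang_dist(ra, dec, 180.0, 0.0) < 0.1 ORDER BY id")]
print(near)
```

Once the function is registered, the spatial test reads like any other SQL predicate, so there is no separate "inside" and "outside" code path for the caller.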