Astronomy Data Bases - PowerPoint PPT Presentation

Slides: 28
Provided by: JimG52
Transcript and Presenter's Notes

Title: Astronomy Data Bases


1
Astronomy Data Bases
  • Jim Gray
  • Microsoft Research

2
The Evolution of Science
  • Observational Science
  • Scientist gathers data by direct observation
  • Scientist analyzes data
  • Analytical Science
  • Scientist builds analytical model
  • Makes predictions.
  • Computational Science
  • Simulate analytical model
  • Validate model and makes predictions
  • Data Exploration Science
  • Data captured by instruments, or data generated by
    simulator
  • Processed by software
  • Placed in a database / files
  • Scientist analyzes database / files

3
Computational Science Evolves
  • Historically, Computational Science = simulation.
  • New emphasis on informatics
  • Capturing,
  • Organizing,
  • Summarizing,
  • Analyzing,
  • Visualizing
  • Largely driven by observational science, but
    also needed by simulations.
  • Too soon to say if comp-X and X-info will unify
    or compete.

[Images: BaBar, Stanford; PE Gene Sequencer (from http://www.genome.uci.edu/); Space Telescope]
4
Information Avalanche
  • Both
  • better observational instruments and
  • Better simulations
  • are producing a data avalanche
  • Examples
  • Turbulence: 100 TB simulation, then mine the
    information
  • BaBar: grows 1 TB/day; 2/3 simulation information,
    1/3 observational information
  • CERN: LHC will generate 1 GB/s, 10 PB/y
  • VLBA (NRAO) generates 1 GB/s today
  • NCBI: only ½ TB but doubling each year; very
    rich dataset.
  • Pixar: 100 TB/movie

Images courtesy of Charles Meneveau & Alex Szalay @ JHU
5
What X-info Needs from Us (CS) (not drawn to scale)
6
Next-Generation Data Analysis
  • Looking for
  • Needles in haystacks: the Higgs particle
  • Haystacks: dark matter, dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
  • Correlation functions are N^2, likelihood
    techniques N^3
  • As data and computers grow at the same rate, we can
    only keep up with N log N
  • A way out?
  • Discard notion of optimal (data is fuzzy, answers
    are approximate)
  • Don't assume infinite computational resources or
    memory
  • Requires a combination of statistics and computer
    science
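
The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming data volume and compute power both grow by the same factor; the starting size, the growth factor, and the function name `relative_runtime` are illustrative choices, not figures from the talk.

```python
import math

def relative_runtime(cost, n, growth):
    """Runtime after data and compute both grow by `growth`, relative to before.

    cost: function mapping dataset size N to an operation count.
    """
    return cost(n * growth) / (growth * cost(n))

n = 1_000_000   # assumed starting dataset size
growth = 10     # assumed: data and compute both grow 10x

costs = {
    "N log N": lambda N: N * math.log(N),
    "N^2": lambda N: N ** 2,
    "N^3": lambda N: N ** 3,
}

# How much longer each algorithm takes after the growth step
slowdown = {name: relative_runtime(f, n, growth) for name, f in costs.items()}
for name, s in slowdown.items():
    print(f"{name:8s} takes {s:.2f}x as long")
```

Under these assumptions the N log N algorithm stays nearly flat, while the N^2 and N^3 algorithms fall behind by factors of 10 and 100 for every 10x growth step.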

7
Analysis and Databases
  • Much statistical analysis deals with
  • Creating uniform samples
  • data filtering
  • Assembling relevant subsets
  • Estimating completeness
  • censoring bad data
  • Counting and building histograms
  • Generating Monte-Carlo subsets
  • Likelihood calculations
  • Hypothesis testing
  • Traditionally these are performed on files
  • Most of these tasks are much better done inside a
    database
  • Move Mohamed to the mountain, not the mountain to
    Mohamed.
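
Two of the tasks listed above, censoring bad data and building a histogram, can be sketched as a single query run inside the database. This is only an illustration using Python's built-in `sqlite3` module; the table `obj`, its columns, and the sample rows are hypothetical and not taken from any real survey schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")

# Hypothetical objects: (id, magnitude, quality flag); flag != 0 marks bad data
rows = [(1, 14.2, 0), (2, 14.8, 0), (3, 15.1, 0),
        (4, 15.6, 1), (5, 15.9, 0), (6, 16.3, 0)]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?)", rows)

# Filter out flagged rows, then count objects per 1-magnitude bin.
# Only the small result set leaves the database, not the raw rows.
histogram = conn.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obj
    WHERE flag = 0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
print(histogram)
```

The point of the sketch is the direction of data movement: the analysis goes to the data, and only a handful of (bin, count) pairs come back.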

8
Data Access is hitting a wall: FTP and GREP are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1 PB is about 5,000 disks
  • At some point you need indices to limit
    search, plus parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (about $1/GB)
  • 1 TB takes 2 days and $1K
  • 1 PB takes 3 years and $1M
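
The order of magnitude of these numbers can be reproduced with simple arithmetic. The sketch below assumes a single sequential stream of about 10 MB/s, which is my assumption rather than a figure from the slide, so the results match the slide only roughly. The 5,000-stream case assumes one independent stream per disk.

```python
SCAN_RATE = 10e6  # bytes/second; assumed rate of one sequential stream

def scan_seconds(size_bytes, rate=SCAN_RATE, streams=1):
    """Time to read size_bytes sequentially, split evenly over `streams`."""
    return size_bytes / (rate * streams)

sizes = {"1 GB": 1e9, "1 TB": 1e12, "1 PB": 1e15}
for label, size in sizes.items():
    days = scan_seconds(size) / 86_400
    print(f"{label}: {days:,.2f} days on one stream")

# Same petabyte, scanned by 5,000 disks in parallel
pb_parallel_hours = scan_seconds(1e15, streams=5_000) / 3_600
print(f"1 PB on 5,000 streams: {pb_parallel_hours:.1f} hours")
```

A serial scan of a petabyte takes years at this rate, while the parallel scan takes hours, which is the case for parallel search. An index helps in a different way, by avoiding most of the scan altogether.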

9
Data Federations of Web Services
  • Massive datasets live near their owners
  • Near the instrument's software pipeline
  • Near the applications
  • Near data knowledge and curation
  • Super Computer centers become Super Data Centers
  • Each Archive publishes a web service
  • Schema documents the data
  • Methods on objects (queries)
  • Scientists get personalized extracts
  • Uniform access to multiple Archives
  • A common global schema

Federation
10
Web Services: The Key?
  • Web SERVER
  • Given a URL + parameters
  • Returns a web page (often dynamic)
  • Web SERVICE
  • Given an XML document (SOAP message)
  • Returns an XML document
  • Tools make this look like an RPC.
  • F(x,y,z) returns (u, v, w)
  • Distributed objects for the web.
  • naming, discovery, security,..
  • Internet-scale distributed computing
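
The "XML in, XML out, looks like an RPC" idea can be sketched without any network. In the illustration below, `service` is a local stand-in for a remote web service and `F` is a hand-written client stub; the element names and the arithmetic are made up, and the messages are simplified XML rather than real SOAP envelopes. Real toolkits generate such stubs from a service description.

```python
import xml.etree.ElementTree as ET

def service(request_xml: str) -> str:
    """Stand-in for a remote web service: XML document in, XML document out."""
    req = ET.fromstring(request_xml)
    x, y, z = (float(req.findtext(k)) for k in ("x", "y", "z"))
    resp = ET.Element("response")
    # Arbitrary illustrative computation producing three results
    for name, value in (("u", x + y), ("v", y + z), ("w", x * z)):
        ET.SubElement(resp, name).text = repr(value)
    return ET.tostring(resp, encoding="unicode")

def F(x, y, z):
    """Client stub: hides the XML so the call looks like a local procedure."""
    req = ET.Element("request")
    for name, value in (("x", x), ("y", y), ("z", z)):
        ET.SubElement(req, name).text = repr(float(value))
    resp = ET.fromstring(service(ET.tostring(req, encoding="unicode")))
    return tuple(float(resp.findtext(k)) for k in ("u", "v", "w"))

u, v, w = F(1, 2, 3)
print(u, v, w)
```

The caller only sees `F(x, y, z)` returning `(u, v, w)`; the XML documents exist only on the wire between stub and service.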

[Diagram: your program calls a Web Server over HTTP and gets back a web page; your program calls a Web Service over SOAP and gets back an object in XML, as data in your address space]
11
Grid and Web Services Synergy
  • I believe the Grid will be many web services
  • IETF standards Provide
  • Naming
  • Authorization / Security / Privacy
  • Distributed Objects
  • Discovery, Definition, Invocation, Object Model
  • Higher-level services: workflow, transactions,
    DB, ...
  • Synergy: commercial Internet & Grid tools

12
World Wide Telescope / Virtual Observatory
http://www.astro.caltech.edu/nvoconf/
http://www.voforum.org/
  • Premise: Most data is (or could be) online
  • So, the Internet is the world's best telescope
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray,
    radio, ...
  • As deep as the best instruments (2 years ago).
  • It is up when you are up. The seeing is always
    great (no working at night, no clouds, no moons,
    no ...).
  • It's a smart telescope: links objects and
    data to literature on them.

13
Why Astronomy Data?
  • It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
  • It is real and well documented
  • High-dimensional data (with confidence
    intervals)
  • Spatial data
  • Temporal data
  • Many different instruments from many different
    places and many different times
  • Federation is a goal
  • There is a lot of it (petabytes)
  • Great sandbox for data mining algorithms
  • Can share cross company
  • University researchers
  • Great way to teach both Astronomy and
    Computational Science

14
Put Your Data In a File?
  • Simple
  • Reliable
  • Common Practice
  • Matches C/Java/programming model (streams)
  • Metadata in program, not in database
  • Recovery is old-master / new-master rather than
    transaction
  • Procedural access for queries
  • No indices unless you do it yourself
  • No parallelism unless you do it yourself

15
Put Your Data In a DB?
  • Schematized: schema evolution, data
    independence
  • Reliable: transactions, online backup, ...
  • Query tools: parallelism, non-procedural
  • Scales to large datasets
  • Web services tools
  • Complicated
  • New programming model
  • Depend on a vendor: all give an extended subset
    of the standard
  • Expensive
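
The contrast between the two slides can be sketched side by side: the same question answered by a procedural scan over a file and by a non-procedural query against an indexed table. The data, the column names, and the magnitude cut are hypothetical, and `sqlite3` is used only because it ships with Python.

```python
import csv
import io
import sqlite3

# Hypothetical catalog extract, standing in for a flat file on disk
data = (
    "id,ra_deg,dec_deg,mag\n"
    "1,10.5,-1.2,17.1\n"
    "2,185.0,2.3,15.4\n"
    "3,185.2,2.1,19.8\n"
    "4,200.1,30.0,14.9\n"
)

# File approach: the program, not the file, knows what the columns mean,
# and every query is a hand-written loop over all rows.
bright_from_file = []
for row in csv.DictReader(io.StringIO(data)):
    if float(row["mag"]) < 16.0:
        bright_from_file.append(int(row["id"]))

# Database approach: the schema carries the metadata, and the query
# states what is wanted; the engine decides whether to use the index.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obj (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL, mag REAL)"
)
conn.execute("CREATE INDEX obj_mag ON obj (mag)")
records = [
    (int(r["id"]), float(r["ra_deg"]), float(r["dec_deg"]), float(r["mag"]))
    for r in csv.DictReader(io.StringIO(data))
]
conn.executemany("INSERT INTO obj VALUES (?, ?, ?, ?)", records)
bright_from_db = [
    row[0] for row in conn.execute("SELECT id FROM obj WHERE mag < 16.0 ORDER BY id")
]
print(bright_from_file, bright_from_db)
```

Both approaches give the same answer on four rows. The difference shows up at scale, where the file loop must always read everything and the database can use its index.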

[Figure labels: Product X, SQL]
16
My Conclusion
  • Despite the drawbacks
  • DB is the only choice for large datasets, for
    complex datasets (schema), for complex
    queries, and for shared access (read/write)
  • But try to present standard SQL
  • Power users need full power of SQL

17
The SDSS Experience
  • It takes a village. MANY different skills

18
The SDSS Experience: not all DBMSs are DBMSs
  • DB1
  • Schema evolves.
  • Crash and reload on evolution.
  • No easy way to evolve.
  • No query tools.
  • Poor indices.
  • Dismal sequential performance (.5 MB/s).
  • Had to build their own parallelism.
  • This database system had virtually none of the
    DB benefits and all of the DB pain.

19
The SDSS Experience
  • DB2 (a fairly pure relational system)
  • Schema evolution was easy.
  • Query tools, indices, parallelism work.
  • Many admin tools for loading.
  • Good sequential performance (1 GB/s,
    5 M records/second/cpu).
  • Reliable.
  • Had good vendor support (me)
  • Seduced by vendor extensions
  • Some query optimizer bugs (bad plans) are a
    constant nuisance.

20
Astronomy DBs
  • Data starts with pixels (10s of TB today)
  • Optical is pixels (flux @ (ra,dec))
  • Radio is cube (f(band) @ (ra,dec))
  • Many things vary with time
  • Pixels converted to objects (billions today)
  • @ (ra,dec): hundreds of attributes, each with
    estimated error
  • Most queries on object space.
  • Drill down to pixel space or to cube.
  • Many queries are spatial: need HTM or ...
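
A typical spatial query is a cone search: find all objects within some angular radius of a point on the sky. The sketch below is a brute-force version that tests every object using unit vectors; it is not an HTM implementation. A spatial index such as HTM exists to avoid exactly this full scan. The object list and function names are hypothetical.

```python
import math

def unit_vector(ra_deg, dec_deg):
    """Convert (ra, dec) in degrees to a unit vector on the celestial sphere."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))

def cone_search(objects, ra_deg, dec_deg, radius_deg):
    """Return ids of objects within radius_deg of (ra_deg, dec_deg).

    Brute force: compares the dot product with the cone axis against
    cos(radius) for every object.
    """
    cx, cy, cz = unit_vector(ra_deg, dec_deg)
    min_dot = math.cos(math.radians(radius_deg))
    hits = []
    for obj_id, ra, dec in objects:
        x, y, z = unit_vector(ra, dec)
        if x * cx + y * cy + z * cz >= min_dot:
            hits.append(obj_id)
    return hits

# Hypothetical objects: (id, ra in degrees, dec in degrees)
objects = [(1, 185.0, 2.3), (2, 185.2, 2.1), (3, 10.5, -1.2), (4, 359.9, 0.0)]
near = cone_search(objects, 185.1, 2.2, 0.5)
wrap = cone_search(objects, 0.1, 0.0, 0.5)   # cone straddles ra = 0
print(near, wrap)
```

Working with unit vectors rather than raw (ra, dec) differences means the ra = 0/360 wraparound and the poles need no special cases.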

21
Demo
  • Show pixel space and object space explorers.

22
A Simple Schema
[Schema diagram: Photo, Spectro]
23
How to Design the Database?
  • Decide what it is for: the 20 questions approach has
    worked well
  • Design it to answer those 20 questions
  • Iterate (it is easy to change designs).
  • BUT... be careful about names
  • reddening vs. extinction causes problems
  • Fuzzy definitions cause problems
  • Documenting what a value means is hard

24
The Answer is 42
  • But what is the accuracy and precision?
  • What is the derivation?
  • Needs a man page

25
The SDSS Experience
  • DB has worked out well
  • Tools are very important (especially data
    loading)
  • Integration with web servers/services is very
    important
  • Need more than single-node parallelism
  • Need better query plans
  • But overall a success.
  • Have been able to clone it for several other
    datasets (FIRST, 2MASS, SSS, INT)
  • Database replicated at many sites (25?)
  • Built an interesting data-ingest system.

26
Traffic Analysis
  • SDSS DR1 has been online for a while.
  • Peak hour is 12M records/hour
  • Peak query is 500,000 rows (limit)

27
The Future
  • Things will get better.
  • Code is moving into the DB
  • Easier to add spatial and other functions
  • Better performance
  • No Inside/Outside dichotomy
  • XML Schema (XSD) describes data on the wire.
  • I love DataSets (a schematized network of
    records)
  • XSD described
  • collections of record sets
  • With foreign keys
  • With updategrams
  • XML and XQuery are coming
  • This may help some things
  • This may confuse things (more choices)
  • Probably both.