Managing Data for the World Wide Telescope aka: The Virtual Observatory

1
Managing Data for the World Wide Telescope, aka:
The Virtual Observatory
  • Jim Gray
  • Alex Szalay
  • SLAC Data Management Workshop

2
The Evolution of Science
  • Observational Science
  • Scientist gathers data by direct observation
  • Scientist analyzes data
  • Analytical Science
  • Scientist builds analytical model
  • Makes predictions.
  • Computational Science
  • Simulate analytical model
  • Validates model and makes predictions
  • Data Exploration Science
  • Data captured by instruments, or generated
    by simulator
  • Processed by software
  • Placed in a database / files
  • Scientist analyzes database / files

3
Information Avalanche
  • In science, industry, government, ...
  • better observational instruments and
  • better simulations
  • are producing a data avalanche
  • Examples
  • BaBar: grows 1 TB/day (2/3 simulation information,
    1/3 observational information)
  • CERN: LHC will generate 1 GB/s, ~10 PB/y
  • VLBA (NRAO) generates 1 GB/s today
  • Pixar: 100 TB/movie
  • New emphasis on informatics
  • Capturing, Organizing, Summarizing, Analyzing,
    Visualizing

[Images: courtesy C. Meneveau & A. Szalay @ JHU; BaBar, Stanford; PE Gene Sequencer, from http://www.genome.uci.edu/; Space Telescope]
4
The Big Picture
[Diagram: facts flow in from Experiments & Instruments, Other Archives, Literature, and Simulations; questions go in and answers come out.]
The Big Problems
  • Data ingest
  • Managing a petabyte
  • Common schema
  • How to organize it?
  • How to reorganize it?
  • How to coexist with others?
  • Query and Vis tools
  • Support/training
  • Performance
  • Execute queries in a minute
  • Batch query scheduling

5
FTP - GREP
  • Download (FTP and GREP) are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1 PB ~ 3,000 disks
  • At some point we need indices to limit
    search, and parallel data search and analysis
  • This is where databases can help
  • Next-generation technique: Data Exploration
  • Bring the analysis to the data!
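
The rules of thumb above can be turned into implied scan rates. This is a minimal sketch, assuming decimal units (1 MB = 10^6 bytes) and a 365-day year; the function name is illustrative, not from the talk.

```python
# Implied scan throughput behind each GREP rule of thumb.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

# (data size in bytes, elapsed seconds); decimal units are an assumption
rules_of_thumb = {
    "1 MB in a second": (1e6, 1),
    "1 GB in a minute": (1e9, 60),
    "1 TB in 2 days": (1e12, 2 * SECONDS_PER_DAY),
    "1 PB in 3 years": (1e15, 3 * SECONDS_PER_YEAR),
}


def implied_rate_mb_per_s(size_bytes, seconds):
    """Throughput in MB/s needed to scan size_bytes in the given time."""
    return size_bytes / seconds / 1e6


for label, (size, secs) in rules_of_thumb.items():
    print(f"{label}: {implied_rate_mb_per_s(size, secs):.1f} MB/s")
```

Every case works out to somewhere between 1 and 17 MB/s, so a single sequential scan stays at roughly the same rate while the data grows by nine orders of magnitude.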

6
The Speed Problem
  • Many users want to search the whole DB: ad hoc
    queries, often combinatorial
  • Want 1 minute response
  • Brute force (parallel search)
  • 1 disk = 50 MBps => ~1M disks/PB ~ $300M/PB
  • Indices (limit search, do column store)
  • 1,000x less equipment: $1M/PB
  • Pre-compute answer
  • No one knows how to do it for all questions.
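
The brute-force estimate can be checked with a few lines of arithmetic. This is a sketch, assuming 1 PB = 10^15 bytes, the slide's 50 MB/s per disk, and the 1-minute response target; `disks_for_scan` is an illustrative name.

```python
import math


def disks_for_scan(data_bytes, disk_rate_bytes_per_s, deadline_s):
    """Disks that must stream in parallel to read data_bytes before the deadline."""
    required_rate = data_bytes / deadline_s
    return math.ceil(required_rate / disk_rate_bytes_per_s)


PB = 1e15
full_scan_disks = disks_for_scan(PB, 50e6, 60)        # read everything
indexed_disks = disks_for_scan(PB / 1000, 50e6, 60)   # index prunes 1,000x

print(full_scan_disks, indexed_disks)
```

Under these assumptions the full scan needs a few hundred thousand disks, within a factor of about three of the slide's rounded figure, and an index that cuts the data read by 1,000x cuts the disk count by the same factor.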

7
Next-Generation Data Analysis
  • Looking for
  • Needles in haystacks: the Higgs particle
  • Haystacks: dark matter, dark energy
  • Needles are easier than haystacks
  • Global statistics have poor scaling
  • Correlation functions are N^2, likelihood
    techniques N^3
  • As data and computers grow at same rate, we can
    only keep up with N log N
  • A way out?
  • Relax notion of optimal (data is fuzzy, answers
    are approximate)
  • Don't assume infinite computational resources or
    memory
  • Combination of statistics & computer science
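
The scaling argument can be made concrete. The sketch below assumes that data volume and compute speed both grow by the same factor and asks how much longer each kind of algorithm then takes. The starting size and growth factor are arbitrary illustrative values.

```python
import math


def relative_runtime(cost, n, growth):
    """Runtime after data and compute both grow by `growth`, relative to today."""
    return cost(growth * n) / (growth * cost(n))


costs = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n0 = 1e9        # assumed catalog size today
growth = 1000   # assumed growth of both data and compute

for name, cost in costs.items():
    print(f"{name}: {relative_runtime(cost, n0, growth):.3g}x slower")
```

With these numbers N log N slows down by only a third, while N^2 becomes 1,000 times slower and N^3 a million times slower.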

8
Analysis and Databases
  • Much statistical analysis deals with
  • Creating uniform samples
  • data filtering
  • Assembling relevant subsets
  • Estimating completeness
  • censoring bad data
  • Counting and building histograms
  • Generating Monte-Carlo subsets
  • Likelihood calculations
  • Hypothesis testing
  • Traditionally these are performed on files
  • Most of these tasks are much better done inside a
    database
  • Move Mohamed to the mountain, not the mountain to
    Mohamed.
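
As an illustration of doing such a task inside a database, here is a minimal sketch that censors bad rows and builds a histogram in SQL, so only the bin counts leave the engine. It uses Python's built-in `sqlite3` as a stand-in; the `objects` table and `mag` column are hypothetical.

```python
import sqlite3


def magnitude_histogram(conn):
    """Count objects per 1-magnitude bin inside the database engine."""
    rows = conn.execute(
        "SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) "
        "FROM objects WHERE mag IS NOT NULL "   # censor rows with no measurement
        "GROUP BY bin ORDER BY bin"
    ).fetchall()
    return dict(rows)


# Hypothetical toy catalog; a real archive would hold millions of rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (id INTEGER PRIMARY KEY, mag REAL)")
conn.executemany(
    "INSERT INTO objects (mag) VALUES (?)",
    [(14.2,), (14.9,), (15.1,), (15.5,), (15.8,), (17.3,), (None,)],
)
hist = magnitude_histogram(conn)
print(hist)
```

The client receives one row per bin rather than one row per object, which is the point of moving the analysis to the data.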

9
Organization & Algorithms
  • Use of clever data structures (trees, cubes)
  • Up-front creation cost, but only N logN access
    cost
  • Large speedup during the analysis
  • Tree-codes for correlations (A. Moore et al 2001)
  • Data Cubes for OLAP (all vendors)
  • Fast, approximate heuristic algorithms
  • No need to be more accurate than cosmic variance
  • Fast CMB analysis by Szapudi et al (2001)
  • N log N instead of N^3 => 1 day instead of 10
    million years
  • Take cost of computation into account
  • Controlled level of accuracy
  • Best result in a given time, given our computing
    resources
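
The payoff of an up-front organizing step can be shown with a toy pair count. This is a one-dimensional stand-in using a sorted array and binary search, not the tree codes or CMB methods cited above. It counts pairs of points closer than `r`, first by brute force and then after sorting once.

```python
from bisect import bisect_right


def count_pairs_brute(xs, r):
    """O(N^2): compare every pair of points."""
    n = len(xs)
    return sum(
        1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r
    )


def count_pairs_sorted(xs, r):
    """O(N log N): sort once, then one binary search per point."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last point lying within r to the right of x.
        hi = bisect_right(s, x + r)
        total += hi - i - 1
    return total


points = [5, 1, 9, 3, 3, 12, 8]
print(count_pairs_brute(points, 2), count_pairs_sorted(points, 2))
```

Both functions return the same count; only the second stays affordable as N grows.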

10
World Wide Telescope / Virtual Observatory: http://www.ivoa.net/
  • Premise: most data is (or could be) online
  • The Internet is the world's best telescope
  • It has data on every part of the sky
  • In every measured spectral band: optical, x-ray,
    radio, ...
  • As deep as the best instruments (2 years ago).
  • It is up when you are up.
  • The "seeing" is always great (no working at night,
    no clouds, no moons, no ...).
  • It's a smart telescope: links objects and
    data to literature on them.

11
Why Astronomy?
  • Community has lots of data
  • Data is real and well documented
  • High-dimensional (with confidence intervals)
  • Spatial, temporal
  • Diverse and distributed
  • Many different instruments from many different
    places and many different times
  • Community wants to share/cross compare
  • Can freely share data and algorithms.
  • "Data Mining, Not Data MINE!!" (Mark Ellisman,
    UCSD)
  • They are well organized
  • Community is small and homogeneous
  • No commercial or privacy concerns
  • All the problems are technical or social.

12
The WWT Components
  • Data Sources
  • Literature
  • Archives
  • Unified Definitions
  • Units,
  • Semantics/Concepts/Metrics, Representations,
  • Provenance
  • Object model
  • Classes and methods
  • Portals

13
Data Sources
  • Literature online and cross indexed
  • Simbad, ADS, NED: http://simbad.u-strasbg.fr/Simbad,
    http://adswww.harvard.edu/,
    http://nedwww.ipac.caltech.edu/
  • Many curated archives online
  • FIRST, DPOSS, 2MASS, USNO, IRAS, SDSS, VizieR, ...
  • Typically files with English meta-data and some
    programs
  • Groups, Researchers, Amateurs Publish
  • Datasets online in various formats
  • Data publications are ephemeral (may disappear)
  • Many have unknown provenance
  • Documentation varies: some good and some none.

14
Unified Definitions
  • Universal Content Definitions:
    http://vizier.u-strasbg.fr/doc/UCD.htx
  • Collated all table heads from all the literature
  • 100,000 terms reduced to 1,500
  • Rough consensus that this is the right thing.
  • Refinement in progress as people use UCDs
  • Defines
  • Units
  • gram, radian, second, jansky, ...
  • Semantic Concepts / Metrics
  • Std error, Chi^2 fit, magnitude, flux @ passband,
    velocity, ...

15
Provenance
  • Most data will be derived.
  • To do science, need to trace derived data back
    to source.
  • So programs and inputs must be registered.
  • Must be able to re-run them.
  • Example: Space Telescope Calibrated Data
  • Run on demand
  • Can specify software version (to get old answers)
  • Scientific Data Provenance and Curation are
    largely unsolved problems (some ideas but no
    science).
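
One way to picture "programs and inputs must be registered" is a record that names the program, its version, its inputs, and its parameters, plus a hash that identifies the derived product. This is only a sketch of the idea. The field names, file names, and hashing scheme are assumptions, not an existing provenance standard.

```python
import hashlib
import json


def provenance_record(program, version, inputs, params):
    """Describe a derivation well enough to re-run it later."""
    record = {
        "program": program,
        "version": version,        # lets a user ask for the old answer
        "inputs": sorted(inputs),  # order of inputs should not change the identity
        "params": params,
    }
    # A stable hash of the record serves as an identifier for the derived data.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["id"] = hashlib.sha256(canonical).hexdigest()
    return record


# Hypothetical calibration run
rec = provenance_record("calibrate", "2.1", ["raw_b.fits", "raw_a.fits"], {"flat": True})
print(rec["id"][:12], rec["inputs"])
```

Re-running with a different software version yields a different identifier, so old and new answers can coexist and be traced back to their sources.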

16
Object Model
[Diagram, top: Your program talks to a Web Server]
  • General acceptance of XML
  • Recent acceptance of XML Schema (XSD over DTD)
  • Wait-and-see about SOAP/WSDL/...
  • "Web Services are just Corba with angle
    brackets."
  • "FTP is good enough for me."
  • Personal opinion
  • Web Services are much more than Corba with <>
  • Huge focus on interop
  • Huge focus on integrated tools
  • But the community says: "Show me!"
  • Many technologists convinced, but not yet the
    astronomers

[Diagram: over http, the Web Server returns a Web page to your program; over soap, a Web Service returns an object in XML, i.e. data in your address space.]
17
Classes and Methods
  • First class: VOTable http://www.us-vo.org/VOTable/
  • Represents an answer set in XML
  • Defined by an XML Schema (XSD)
  • Metadata (in terms of UCDs)
  • Data representation (numbers and text)
  • First method
  • Cone Search: get objects in this cone
    http://voservices.org/cone/
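
What a cone search computes can be sketched locally: keep the objects whose angular distance from a sky position is within a given radius. The code below is an illustration on an in-memory list, not the web service itself. The function names, row layout, and toy catalog are assumptions.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(catalog, ra, dec, radius_deg):
    """Return catalog rows whose position lies within radius_deg of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(ra, dec, row["ra"], row["dec"]) <= radius_deg]


# Hypothetical catalog rows: positions in decimal degrees
catalog = [
    {"id": "a", "ra": 180.00, "dec": 0.0},
    {"id": "b", "ra": 180.05, "dec": 0.0},
    {"id": "c", "ra": 181.00, "dec": 0.5},
    {"id": "d", "ra": 10.00, "dec": -45.0},
]
hits = cone_search(catalog, ra=180.0, dec=0.0, radius_deg=0.1)
print([row["id"] for row in hits])
```

A real service would answer from an indexed archive and return the rows as a VOTable; the linear scan here only shows the geometry.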

18
Other Classes
  • Space-Time class
  • http://hea-www.harvard.edu/arots/nvometa/STCdoc.pdf
  • Image Class (returns pixels)
  • SdssCutout
  • Simple Image Access Protocol:
    http://bill.cacr.caltech.edu/cfdocs/usvo-pubs/files/ACF8DE.pdf
  • HyperAtlas: http://bill.cacr.caltech.edu/usvo-pubs/files/hyperatlas.pdf
  • Spectral
  • Simple Spectral Access Protocol
  • 500K spectra available at http://voservices.net/wave
  • Query Services
  • ADQL and SkyNode: http://skyservice.pha.jhu.edu/develop/vo/adql/
  • And http://SkyQuery.Net
  • Registry
  • see below

19
The Registry
  • UDDI seemed inappropriate
  • Complex
  • Irrelevant questions
  • Relevant questions missing
  • Evolved Dublin Core
  • Represent Datasets, Services, Portals
  • Needs to be machine readable
  • Federation (DNS model)
  • Push & Pull: register, then harvest
  • http://www.ivoa.net/twiki/bin/view/IVOA/IvoaResReg
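
The register-then-harvest pattern can be sketched with a small in-memory registry. This is a toy model, not the IVOA registry interface. The record borrows the Dublin Core element names `identifier`, `title`, and `creator`; the class names, the `kind` field, and the logical clock are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceRecord:
    identifier: str
    title: str
    creator: str
    kind: str      # "dataset", "service", or "portal"
    updated: int   # logical timestamp assigned by the registry


@dataclass
class Registry:
    records: dict = field(default_factory=dict)
    clock: int = 0

    def register(self, identifier, title, creator, kind):
        """Push: a publisher registers or updates its own record."""
        self.clock += 1
        self.records[identifier] = ResourceRecord(
            identifier, title, creator, kind, self.clock
        )

    def harvest(self, since=0):
        """Pull: a peer registry fetches the records changed after `since`."""
        return [r for r in self.records.values() if r.updated > since]


# Hypothetical entries
reg = Registry()
reg.register("example/sdss", "SDSS catalog", "SDSS", "dataset")
reg.register("example/cone", "Cone search", "JHU", "service")
mark = reg.clock                      # a peer remembers where it stopped
reg.register("example/portal", "SkyQuery", "JHU", "portal")
print([r.identifier for r in reg.harvest(since=mark)])
```

Because each peer only pulls what changed since its last harvest, registries can federate without a central master copy.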

20
Demo
  • SkyServer
  • navigator showing cutout web service
  • List showing many calls and variant use.
  • SkyQuery
  • Show integration of various archives.
  • Explain spatial join xMatch operator.

21
SkyServer.SDSS.org
  • A modern Astronomy archive
  • Raw Pixel data lives in file servers
  • Catalog data (derived objects) lives in Database
  • Online query to any and all
  • Also used for education
  • 150 hours of online Astronomy
  • Implicitly teaches data analysis
  • Interesting things
  • Spatial data search
  • Client query interface via Java Applet
  • Query interface via Emacs
  • Popular
  • Cloned by other surveys (a template design)
  • Web services are core of it.

22
SkyQuery: A Prototype WWT
  • Started with SDSS data and schema
  • Imported 12 other datasets into that spine
    schema (a day per dataset plus load time)
  • Unified them with a portal
  • Implicit spatial join among the datasets.
  • All built on Web Services
  • Pure XML
  • Pure SOAP
  • Used .NET toolkit
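
A spatial join between two catalogs can be pictured as pairing each object with its nearest counterpart within a positional tolerance. The sketch below is a brute-force illustration of that idea, not the portal's actual join operator. It uses a small-angle approximation, ignores RA wraparound at 0/360 degrees, and all object names are hypothetical.

```python
import math


def separation_arcsec(a, b):
    """Small-angle separation of two (ra, dec) positions in degrees, in arcseconds."""
    ra1, dec1 = a
    ra2, dec2 = b
    # Scale the RA difference by cos(dec) so it is a true angle on the sky.
    dra = (ra2 - ra1) * math.cos(math.radians((dec1 + dec2) / 2))
    ddec = dec2 - dec1
    return math.hypot(dra, ddec) * 3600.0


def cross_match(left, right, tolerance_arcsec):
    """Pair each left object with its nearest right object within the tolerance."""
    matches = []
    for lid, lpos in left.items():
        best = min(right.items(),
                   key=lambda item: separation_arcsec(lpos, item[1]),
                   default=None)
        if best is not None and separation_arcsec(lpos, best[1]) <= tolerance_arcsec:
            matches.append((lid, best[0]))
    return matches


# Two hypothetical archives observing the same patch of sky
optical = {"opt1": (180.0, 0.0), "opt2": (10.0, 20.0)}
infrared = {"ir7": (180.0002, 0.0001), "ir9": (50.0, -5.0)}
pairs = cross_match(optical, infrared, tolerance_arcsec=2.0)
print(pairs)
```

A production join would use a spatial index instead of comparing every pair, for the scaling reasons discussed earlier in the talk.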

23
Federation: SkyQuery.Net
  • Combine 4 archives initially
  • Added 9 more
  • Send query to portal, portal joins data from
    archives.
  • Problem: want to do multi-step data analysis
    (not just single query).
  • Solution: allow personal databases on portal
  • Problem: some queries are monsters
  • Solution: batch schedule on portal server,
    deposit answer in personal database.

24
SkyQuery Structure
  • Each SkyNode publishes
  • Schema Web Service
  • Database Web Service
  • Portal is
  • Plans Query (2 phase)
  • Integrates answers
  • Is a web service

25
MyDB: http://skyservice.pha.jhu.edu/devel/casjobs/
  • Portal allows federation of data but
  • Intermediate results may be large.
  • Intermediate results feed into next analysis
    step.
  • Sending them back-and-forth to client is costly
    and sometimes infeasible.
  • Solution: create a working DB for client at
    Portal: MyDB

26
MyDB: http://skyservice.pha.jhu.edu/devel/casjobs/
  • Anyone can create a personal DB at SkyServer
    portal.
  • It is about 100 MB
  • It is private
  • Simple queries done immediately
  • Complex queries done by batch scheduler
  • All queries can create/read/write MyDB tables
  • Very popular with serious users.
  • MyDB will be sharable by a group.
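
The MyDB idea, keeping intermediate results next to the data, can be sketched with `sqlite3`. Here one in-memory database stands in for the portal and a `mydb_` table prefix stands in for a user's private workspace. Both are assumptions of this sketch, as are the table and column names.

```python
import sqlite3

# Stand-in for the portal's archive (hypothetical schema and rows)
portal = sqlite3.connect(":memory:")
portal.execute(
    "CREATE TABLE photo_obj (obj_id INTEGER PRIMARY KEY, ra REAL, dec REAL, mag REAL)"
)
portal.executemany(
    "INSERT INTO photo_obj VALUES (?, ?, ?, ?)",
    [(1, 180.0, 0.0, 14.2), (2, 180.1, 0.1, 19.5),
     (3, 181.0, -0.2, 15.0), (4, 182.0, 0.3, 21.0)],
)

# Step 1: keep the intermediate result at the portal instead of shipping it
# to the client.
portal.execute("CREATE TABLE mydb_bright AS SELECT * FROM photo_obj WHERE mag < 16")

# Step 2: the next analysis step reads the stored intermediate result.
count, mean_mag = portal.execute(
    "SELECT COUNT(*), AVG(mag) FROM mydb_bright"
).fetchone()
print(count, mean_mag)
```

Only the final summary crosses the network; the intermediate table stays on the server for later steps.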

27
Open SkyQuery
  • SkyQuery being adopted by AstroGrid as reference
    implementation for OGSA-DAI (Open Grid Services
    Architecture, Data Access and Integration).
  • SkyNode: basic archive object
    http://www.ivoa.net/twiki/bin/view/IVOA/SkyNode
  • SkyQuery Language (VoQL) is evolving.
    http://www.ivoa.net/twiki/bin/view/IVOA/IvoaVOQL

28
The WWT Components
  • Outline
  • Data Sources
  • Literature
  • Archives
  • Unified Definitions
  • Units,
  • Semantics/Concepts/Metrics, Representations,
  • Provenance
  • Object model
  • Classes and methods
  • Portals
  • WWT is a poster child for the Data Grid.
  • What we learned
  • Astro is a community of 10,000
  • Homogeneous & cooperative
  • If you can't do it for Astro, do not bother with
    3M bio-info.
  • Agreement
  • Takes time
  • Takes endless meetings
  • Big problems are non-technical
  • Legacy is a big problem.
  • Plumbing and tools are there. But:
  • What is the object model?
  • What do you want to save?
  • How to document provenance?