Publishing Large Astronomical Data Sets: The Virtual Observatory

1
Publishing Large Astronomical Data Sets: The Virtual Observatory
  • Alex Szalay, Johns Hopkins
  • Jim Gray, Microsoft Research

2
Living in an Exponential World
  • Astronomers have a few hundred TB now
  • 1 pixel (byte) / sq arc second ~ 4 TB
  • Multi-spectral, temporal, ... => 1 PB
  • They mine it looking for new (kinds of) objects
    or more of interesting ones (quasars),
    density variations in 400-D space,
    correlations in 400-D space
  • Data doubles every year
  • Data is public after 1 year
  • So, 50% of the data is public
  • Some have private access to 5% more data
  • Same access for everyone
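
The 50% figure follows from the doubling assumption: if the archive doubles every year, everything older than one year is half of the current total. A minimal sketch of that arithmetic, where the function name and the starting volume are illustrative assumptions and not from the slides:

```python
def public_fraction(years: int, initial_tb: float = 1.0) -> float:
    """Fraction of the archive that is public after `years` years.

    Assumes the total volume doubles every year and that data becomes
    public exactly one year after it is collected (both from the slide).
    `initial_tb` is an arbitrary starting volume; it cancels out.
    """
    total_now = initial_tb * 2 ** years              # everything collected so far
    total_year_ago = initial_tb * 2 ** (years - 1)   # what already existed a year ago
    return total_year_ago / total_now


for y in (1, 5, 10):
    print(y, public_fraction(y))
```

The ratio does not depend on the year or on the starting volume, which is why the slide can state it as a constant.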

3
Science is hitting a wall
  • FTP and GREP are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1 PB ~ 10,000 disks
  • At some point you need indices to limit
    search, plus parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (~ $1/GB)
  • 1 TB: 2 days and $1K
  • 1 PB: 3 years and $1M
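
The scaling behind these bullets can be sketched as a simple size-over-rate estimate. The rate below is taken from the "1 GB in a minute" bullet and held constant, which is an assumption; the slide's figures are round numbers, so a constant rate reproduces them only to order of magnitude. The function name and the even split across readers are also illustrative assumptions.

```python
def scan_seconds(size_bytes: float, rate_bytes_per_s: float, workers: int = 1) -> float:
    """Time to read `size_bytes` sequentially, split evenly over `workers` readers."""
    return size_bytes / (rate_bytes_per_s * workers)


GB, TB, PB = 1e9, 1e12, 1e15
RATE = GB / 60      # assumed constant rate: "1 GB in a minute", about 17 MB/s
DAY = 86_400        # seconds per day

for name, size in (("1 GB", GB), ("1 TB", TB), ("1 PB", PB)):
    print(f"{name}: {scan_seconds(size, RATE) / DAY:.3f} days on one reader")

# The same 1 PB scan spread over 10,000 disks that are read in parallel.
parallel = scan_seconds(PB, RATE, workers=10_000)
print(f"1 PB over 10,000 readers: {parallel / 60:.0f} minutes")
```

A full scan grows linearly with the data, so the only ways out are reading less (indices) or reading in parallel, which is the point of the slide.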

4
Making Discoveries
  • When and where are discoveries made?
  • Always at the edges and boundaries
  • Going deeper, using more colors.
  • Metcalfe's law
  • Utility of computer networks grows as the number
    of possible connections: O(N^2)
  • VO: Federation of N archives
  • Possibilities for new discoveries grow as O(N^2)
  • Current sky surveys have proven this
  • Very early discoveries from SDSS, 2MASS,
    DPOSS: brown dwarfs, high-redshift quasars
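
The O(N^2) growth is just the count of distinct archive pairs, N(N-1)/2. A small sketch that enumerates the pairs for the three surveys named above; the helper name is an assumption made for illustration:

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of distinct pairs among n archives: n * (n - 1) / 2."""
    return n * (n - 1) // 2


archives = ["SDSS", "2MASS", "DPOSS"]
pairs = list(combinations(archives, 2))   # every possible two-archive comparison
print(pairs)

for n in (3, 10, 100):
    print(n, "archives ->", pair_count(n), "possible pairwise comparisons")
```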

5
Publishing Data
Roles         Authors          Publishers         Curators          Consumers
Traditional   Scientists       Journals           Libraries         Scientists
Emerging      Collaborations   Project www site   Bigger Archives   Scientists

6
Changing Roles
  • Exponential growth
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
  • More responsibility on projects
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • A lot of development is duplicated, wasted
  • All documentation is contained in the archive
  • More standards are needed
  • Easier data interchange, fewer tools
  • More templates are needed
  • Develop less software on your own

7
Emerging New Concepts
  • Standardizing distributed data
  • Web Services, supported on all platforms
  • XML: eXtensible Markup Language
  • SOAP: Simple Object Access Protocol
  • WSDL: Web Services Description Language
  • Standardizing distributed computing
  • Grid Services
  • Build your own remote computer, and discard
  • Move your analysis closer to the data
  • Virtual Data: new data sets on demand
  • Process and fetch the data via web services
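
To make "fetch the data via web services" concrete, here is a minimal sketch of a SOAP 1.1 request body built with Python's standard library. Only the SOAP envelope namespace is standard; the service namespace, the `GetCutout` operation and its parameters are hypothetical names chosen for illustration and do not describe any real archive's interface.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"   # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/vo"                    # hypothetical service namespace


def build_request(operation: str, params: dict) -> bytes:
    """Wrap one remote call and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation and parameter names, for illustration only.
request = build_request("GetCutout", {"ra": 180.0, "dec": 0.5, "size_arcmin": 2})
print(request.decode("utf-8"))
```

In practice a WSDL file would describe the available operations and their parameter types, so a client library could generate this envelope automatically.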

8
Features of the SDSS
  • Goal: create the most detailed map of the Northern sky in 5 years
  • 2.5m telescope, Apache Point, NM
  • 3 degree field of view
  • ¼ of the whole sky
  • Two surveys in one
  • Photometric survey in 5 bands
  • Spectroscopic redshift survey
  • Automated data reduction
  • 150 man-years of development
  • Very high data volume
  • 40 TB of raw data
  • 5 TB processed catalogs
  • Data is public
  • Participants: The University of Chicago, Princeton University,
    The Johns Hopkins University, The University of Washington,
    New Mexico State University, Fermi National Accelerator Laboratory,
    US Naval Observatory, The Japanese Participation Group,
    The Institute for Advanced Study, Max Planck Inst, Heidelberg
  • Funding: Sloan Foundation, NSF, DOE, NASA

9
The Imaging Survey
  • Continuous data rate of 8 Mbytes/sec
  • drift scan of 10,000 square degrees
  • 24k x 1M pixel panoramic images
  • 5 colors: broad-band filters (u, g, r, i, z)
  • exposure time: 55 sec

10
The Spectroscopic Survey
[Figure: elliptical galaxy]
  • Expanding universe: redshift = distance => 3D map
  • SDSS Redshift Survey
  • 1 million galaxies
  • 100,000 quasars
  • 100,000 stars
  • Two spectrographs, 640 spectra simultaneously
  • Features
  • Automated reduction
  • Very high sampling density and completeness

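
The "redshift = distance" step can be sketched with the low-redshift Hubble relation d = c z / H0. The value of H0 below is an assumption (it is not given in the slides), and the linear relation is only a reasonable approximation for z much smaller than 1; distant quasars need a full cosmological model.

```python
C_KM_S = 299_792.458   # speed of light in km/s
H0 = 70.0              # assumed Hubble constant in km/s/Mpc (not from the slides)


def approx_distance_mpc(z: float) -> float:
    """Low-redshift approximation d = c * z / H0, in megaparsecs."""
    return C_KM_S * z / H0


for z in (0.01, 0.05, 0.1):
    print(f"z = {z}: about {approx_distance_mpc(z):.0f} Mpc")
```

Combining this third coordinate with each galaxy's position on the sky is what turns the 2D imaging catalog into a 3D map.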
11
Data Flow
12
SkyServer
  • Based on the TerraServer design
  • Designed for high school students
  • Contains 150 hours of interactive courses
  • Experiment for easy visual interfaces
  • Opened June 5, 2001
  • After 2 years
  • 12M page hits
  • Now 1M/mo
  • Added Web Services
  • Cutout, SkyQuery

http://skyserver.sdss.org/
13
Public Data Release
  • June 2000: EDR
  • Early Data Release
  • July 2003: DR1
  • Contains 30% of final data
  • 200 million photo objects
  • 4 versions of the data
  • Target, best, runs, spectro
  • Total catalog volume 1.7TB
  • See Terascale sneakernet paper
  • Published releases served forever
  • EDR, DR1, DR2, ...
  • Soon to include email archives, annotations
  • O(N^2) only possible because of Moore's Law!

14
Why Is Astronomy Data Special?
  • It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
  • It is real and well documented
  • High-dimensional (with confidence intervals)
  • Spatial
  • Temporal
  • Diverse and distributed
  • Many different instruments from many different
    places and many different times
  • The questions are interesting
  • There is a lot of it (soon petabytes)

15
Virtual Observatory
  • Many new surveys are coming
  • SDSS is a dry run for the next ones
  • LSST will be 5TB/night
  • All the data will be on the Internet
  • But how? ftp, webservice
  • Data and apps will be associated with the
    instruments
  • Distributed world wide
  • Cross-indexed
  • Federation is a must, but how?
  • Will be the best telescope in the world
  • World Wide Telescope
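
One way to picture "cross-indexed" federation is a positional cross-match: objects from two archives are treated as the same source when they lie within a small angle of each other. The sketch below is a brute-force illustration under assumed names, an assumed 2 arcsecond radius and made-up coordinates; it is not the method of any specific service, and a real archive would use a spatial index instead of comparing every pair.

```python
import math


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees between two sky positions (haversine formula)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cross_match(cat_a, cat_b, radius_arcsec=2.0):
    """Return (id_a, id_b) for every pair closer than `radius_arcsec`.

    Each catalog is a list of (id, ra_deg, dec_deg) tuples. This compares
    all pairs, so the cost is O(len(cat_a) * len(cat_b)).
    """
    radius_deg = radius_arcsec / 3600.0
    return [(id_a, id_b)
            for id_a, ra_a, dec_a in cat_a
            for id_b, ra_b, dec_b in cat_b
            if angular_sep_deg(ra_a, dec_a, ra_b, dec_b) <= radius_deg]


# Made-up coordinates, for illustration only.
survey_a = [("a1", 180.0, 0.0), ("a2", 10.0, 20.0)]
survey_b = [("b1", 180.0002, 0.0001), ("b2", 50.0, -5.0)]
print(cross_match(survey_a, survey_b))
```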

16
National Virtual Observatory
  • NSF ITR project "Building the Framework for the
    National Virtual Observatory" is a collaboration
    of 17 funded and 3 unfunded organizations
  • Astronomy data centers
  • National observatories
  • Supercomputer centers
  • University departments
  • Computer science/information technology
    specialists
  • PI and project director: Alex Szalay (JHU)
  • CoPI: Roy Williams (Caltech/CACR)
  • Project Manager: Bob Hanisch (STScI)

17
International Collaboration
  • Similar efforts now in more than 10 countries
  • USA, Canada, UK, France, Germany, Italy, Japan,
    Australia, India, China, Russia
  • Total awarded funding world-wide is over $60M
  • Active collaboration among projects
  • Standards, common demos
  • International VO roadmap being developed
  • Regular telecons over 10 timezones
  • Formal collaboration
  • International Virtual Observatory Alliance (IVOA)

18
Summary
  • Publishing so much data requires a new model
  • Multiple challenges for different communities
  • publishing, data mining, data visualization,
    educational, web services poster-child
  • Information at your fingertips
  • Students see the same data as professional
    astronomers
  • More data coming: petabytes/year by 2010
  • We need scalable solutions
  • Same thing happening in all sciences
  • High energy physics, genomics, cancer
    research, medical imaging, oceanography, remote
    sensing, ...
  • Data Exploration: an emerging new branch of
    science