Title: Publishing Large Astronomical Data Sets: The Virtual Observatory
1. Publishing Large Astronomical Data Sets: The Virtual Observatory
- Alex Szalay, Johns Hopkins
- Jim Gray, Microsoft Research
2. Living in an Exponential World
- Astronomers have a few hundred TB now
- 1 pixel (byte) per square arcsecond: ~4 TB
- Multi-spectral, temporal: ~1 PB
- They mine it looking for new (kinds of) objects, or for more of interesting ones (quasars), density variations in 400-D space, correlations in 400-D space
- Data doubles every year (see the sketch below)
- Data is public after 1 year
- So, ~50% of the data is public
- Some have private access to ~5% more data
- Same access for everyone
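A back-of-the-envelope sketch, in Python, of why yearly doubling plus a one-year proprietary period keeps roughly half of the archive public at any time; the starting volume is an illustrative assumption, not a figure from the talk.

```python
# If the total data volume doubles every year and each year's data becomes
# public one year later, the public share hovers around 50%.
start_tb = 100   # illustrative starting volume in TB (assumption, not from the slides)
total = start_tb

for year in range(1, 11):
    public = total      # everything collected up to a year ago is now public
    total *= 2          # the archive doubles over the following year
    print(f"year {year}: total={total} TB, public={public} TB ({public/total:.0%})")
```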
3. Science is hitting a wall
- FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years
- Oh, and 1 PB is ~10,000 disks
- At some point you need indices to limit the search, and parallel data search and analysis
- This is where databases can help (see the sketch below)
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB in a minute (at ~$1/GB)
- 1 TB: 2 days and ~$1K
- 1 PB: 3 years and ~$1M
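A rough Python sketch of the arithmetic behind these numbers; the single-stream scan rate and disk size are illustrative assumptions, not figures from the talk.

```python
# Back-of-the-envelope scan times for brute-force GREP-style search.
SCAN_MB_PER_SEC = 10    # assumed single-stream scan rate
DISK_GB = 100           # assumed disk capacity, early-2000s class

def scan_seconds(size_mb, parallel_disks=1):
    """Time to scan size_mb of data when the work is split across disks."""
    return size_mb / (SCAN_MB_PER_SEC * parallel_disks)

for label, size_mb in [("1 MB", 1), ("1 GB", 1e3), ("1 TB", 1e6), ("1 PB", 1e9)]:
    print(f"{label}: ~{scan_seconds(size_mb) / 86400:.3g} days sequential")

print("disks needed for 1 PB:", int(1e9 / (DISK_GB * 1000)))           # ~10,000
print("1 PB over 10,000 disks:", scan_seconds(1e9, 10_000) / 3600, "hours")
```

Splitting the scan across thousands of disks is the parallel search the slide calls for; indices go further by avoiding most of the scan entirely.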
4. Making Discoveries
- When and where are discoveries made?
- Always at the edges and boundaries
- Going deeper, using more colors
- Metcalfe's law
- Utility of computer networks grows as the number of possible connections: O(N²)
- VO: federation of N archives
- Possibilities for new discoveries grow as O(N²) (see the sketch below)
- Current sky surveys have proven this
- Very early discoveries from SDSS, 2MASS, DPOSS: brown dwarfs, high-redshift quasars
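A minimal illustration of the O(N²) claim: the number of distinct archive pairs available for cross-matching grows quadratically with the number of federated archives.

```python
# Distinct pairs among N federated archives: N choose 2, which grows as O(N^2).
def archive_pairs(n: int) -> int:
    return n * (n - 1) // 2

for n in (2, 5, 10, 20):
    print(f"{n} archives -> {archive_pairs(n)} possible pairwise cross-matches")
```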
5. Publishing Data

Roles        Authors          Publishers         Curators          Consumers
Traditional  Scientists       Journals           Libraries         Scientists
Emerging     Collaborations   Project www site   Bigger archives   Scientists
6. Changing Roles
- Exponential growth
- Projects last at least 3-5 years
- Data sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
- Becoming Publishers and Curators
- Larger fraction of budget spent on software
- A lot of development is duplicated and wasted
- All documentation is contained in the archive
- More standards are needed
- Easier data interchange, fewer tools
- More templates are needed
- Develop less software on your own
7. Emerging New Concepts
- Standardizing distributed data
- Web Services, supported on all platforms
- XML: Extensible Markup Language
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- Standardizing distributed computing
- Grid Services
- Build your own remote computer, and discard it
- Move your analysis closer to the data
- Virtual Data: new data sets on demand
- Process and fetch the data via web services (sketched below)
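To make the web-services idea concrete, here is a minimal Python sketch of a SOAP request built and sent with the standard library. The endpoint URL, XML namespace, and operation name (GetObjectsInRegion) are hypothetical placeholders; a real VO service advertises its actual operations in its WSDL.

```python
import urllib.request

# Hypothetical service endpoint and operation; not a real VO interface.
ENDPOINT = "http://example.org/vo/SearchService"
SOAP_BODY = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetObjectsInRegion xmlns="http://example.org/vo/">
      <ra>180.0</ra><dec>0.0</dec><radiusArcmin>5.0</radiusArcmin>
    </GetObjectsInRegion>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=SOAP_BODY.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8"},
)
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))   # XML reply listing matching objects
```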
8. Features of the SDSS
- Goal: create the most detailed map of the Northern sky in 5 years
- 2.5m telescope, Apache Point, NM
- 3 degree field of view
- ¼ of the whole sky
- Two surveys in one
- Photometric survey in 5 bands
- Spectroscopic redshift survey
- Automated data reduction
- 150 man-years of development
- Very high data volume
- 40 TB of raw data
- 5 TB processed catalogs
- Data is public
- Participants: The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, New Mexico State University, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Inst., Heidelberg
- Funding: Sloan Foundation, NSF, DOE, NASA
9. The Imaging Survey
- Continuous data rate of 8 Mbytes/sec (see the arithmetic sketch below)
- Drift scan of 10,000 square degrees
- 24k x 1M pixel panoramic images
- 5 broad-band color filters (u, g, r, i, z)
- Exposure time: 55 sec
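A quick arithmetic sketch of what the 8 Mbytes/sec drift scan adds up to; the hours of observing per night are an illustrative assumption, while the 40 TB raw-data figure comes from the SDSS slide above.

```python
# Rough nightly data volume for a continuous 8 MB/s drift scan.
RATE_MB_PER_SEC = 8
HOURS_PER_NIGHT = 10   # assumed observing time per clear night

gb_per_hour = RATE_MB_PER_SEC * 3600 / 1000
gb_per_night = gb_per_hour * HOURS_PER_NIGHT
nights_for_40tb = 40_000 / gb_per_night   # 40 TB of raw data over the survey

print(f"{gb_per_hour:.1f} GB/hour, {gb_per_night:.0f} GB/night")
print(f"~{nights_for_40tb:.0f} full nights of scanning to accumulate 40 TB")
```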
10. The Spectroscopic Survey
[Figure: example spectrum of an elliptical galaxy]
- Expanding universe: redshift → distance → 3D map (see the sketch below)
- SDSS Redshift Survey: 1 million galaxies, 100,000 quasars, 100,000 stars
- Two spectrographs, 640 spectra simultaneously
- Features: automated reduction, very high sampling density and completeness
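The redshift → distance → 3D map step can be illustrated with the low-redshift form of Hubble's law, d ≈ cz/H₀. This is a deliberate simplification: at larger redshifts the full cosmological distance-redshift relation is needed.

```python
# Low-redshift approximation of Hubble's law: distance ≈ c * z / H0.
C_KM_S = 299_792.458   # speed of light in km/s
H0 = 70.0              # Hubble constant in km/s/Mpc (approximate standard value)

def approx_distance_mpc(z: float) -> float:
    """Approximate distance in Mpc; only valid for z << 1."""
    return C_KM_S * z / H0

for z in (0.01, 0.05, 0.1):
    print(f"z = {z}: ~{approx_distance_mpc(z):.0f} Mpc")
```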
11. Data Flow
12. SkyServer
- Based on the TerraServer design
- Designed for high school students
- Contains 150 hours of interactive courses
- An experiment in easy visual interfaces
- Opened June 5, 2001
- After 2 years: 12M page hits
- Now 1M/month
- Added Web Services
- Cutout, SkyQuery (see the sketch below)
http://skyserver.sdss.org/
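As an example of the web services SkyServer added, here is a sketch of requesting an image cutout over HTTP. The endpoint path and parameter names are assumptions modeled on later SkyServer data releases, so check the current SkyServer documentation before relying on them.

```python
import urllib.parse, urllib.request

# Assumed cutout endpoint; the path differs between data releases.
BASE = "http://skyserver.sdss.org/dr16/SkyServerWS/ImgCutout/getjpeg"
params = {
    "ra": 179.689,   # right ascension in degrees
    "dec": -0.454,   # declination in degrees
    "scale": 0.4,    # arcsec per pixel
    "width": 512,
    "height": 512,
}

url = BASE + "?" + urllib.parse.urlencode(params)
with urllib.request.urlopen(url) as response, open("cutout.jpg", "wb") as out:
    out.write(response.read())   # JPEG image of the requested sky region
```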
13. Public Data Release
- June 2001: EDR (Early Data Release)
- July 2003: DR1
- Contains 30% of the final data
- 200 million photo objects
- 4 versions of the data: target, best, runs, spectro
- Total catalog volume: 1.7 TB
- See the Terascale Sneakernet paper
- Published releases served forever
- EDR, DR1, DR2, ...
- Soon to include email archives, annotations
- O(N²) only possible because of Moore's Law!
14. Why Is Astronomy Data Special?
- It has no commercial value
- No privacy concerns
- Can freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional (with confidence intervals)
- Spatial
- Temporal
- Diverse and distributed
- Many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (soon petabytes)
15. Virtual Observatory
- Many new surveys are coming
- SDSS is a dry run for the next ones
- LSST will be 5 TB/night
- All the data will be on the Internet
- But how? FTP, web services?
- Data and apps will be associated with the instruments
- Distributed worldwide
- Cross-indexed
- Federation is a must, but how? (see the sketch below)
- Will be the best telescope in the world
- World Wide Telescope
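A minimal sketch of what federation involves in practice: take objects near a position from two archives and pair them up by angular separation. The example catalogs are hypothetical stand-ins for archive web-service results; a real federated system (e.g. SkyQuery) would push this cross-match into an indexed database rather than loop over all pairs.

```python
import math

# Sketch of federating two archives by positional cross-match. In a real
# federation the catalogs would come from each archive's web service; here
# they are small in-memory lists of dicts with "ra" and "dec" in degrees.

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation between two sky positions, in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    cos_sep = (math.sin(dec1) * math.sin(dec2)
               + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.degrees(math.acos(min(1.0, cos_sep)))

def cross_match(cat_a, cat_b, tolerance_arcsec=2.0):
    """Pair up objects from two catalogs that lie within the tolerance.
    The double loop is O(N*M); a real service would use a spatial index."""
    tol_deg = tolerance_arcsec / 3600.0
    return [(a, b) for a in cat_a for b in cat_b
            if angular_separation_deg(a["ra"], a["dec"], b["ra"], b["dec"]) < tol_deg]

# Hypothetical example catalogs standing in for two federated archives:
sdss_like = [{"id": 1, "ra": 180.0000, "dec": 0.0000}]
other_survey = [{"id": "A", "ra": 180.0003, "dec": 0.0002}]
print(cross_match(sdss_like, other_survey))   # one matched pair, ~1.3 arcsec apart
```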
16. National Virtual Observatory
- The NSF ITR project "Building the Framework for the National Virtual Observatory" is a collaboration of 17 funded and 3 unfunded organizations
- Astronomy data centers
- National observatories
- Supercomputer centers
- University departments
- Computer science/information technology specialists
- PI and project director: Alex Szalay (JHU)
- Co-PI: Roy Williams (Caltech/CACR)
- Project Manager: Bob Hanisch (STScI)
17. International Collaboration
- Similar efforts now in more than 10 countries
- USA, Canada, UK, France, Germany, Italy, Japan, Australia, India, China, Russia
- Total awarded funding worldwide is over $60M
- Active collaboration among projects
- Standards, common demos
- International VO roadmap being developed
- Regular telecons across 10 time zones
- Formal collaboration
- International Virtual Observatory Alliance (IVOA)
18. Summary
- Publishing so much data requires a new model
- Multiple challenges for different communities
- Publishing, data mining, data visualization, educational, web services poster-child
- Information at your fingertips
- Students see the same data as professional astronomers
- More data coming: petabytes/year by 2010
- We need scalable solutions
- Same thing happening in all sciences
- High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, ...
- Data Exploration: an emerging new branch of science