Title: The World-Wide Telescope: Astronomy with Terabytes
1The World-Wide Telescope: Astronomy with Terabytes
- Alex Szalay, The Johns Hopkins University
- Jim Gray, Microsoft Research
2Outline
- Challenges: A New Kind of Science
- Publishing the Data
- Data Federation
- The Virtual Observatory
- Analyzing Terabytes of Data
3Living in an Exponential World
- Astronomers have a few hundred TB now
- 1 pixel (byte) / sq arc second ~ 4TB
- Multi-spectral, temporal, ... → 1PB
- They mine it looking for new (kinds of) objects
or more of interesting ones (quasars),
density variations in 400-D space, correlations
in 400-D space
- Data doubles every year
- Data is public after 1 year
- So, 50% of the data is public
- Same access for everyone
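The 50% figure follows from the doubling time. A minimal sketch, assuming idealized exponential growth (the function name and its parameters are illustrative, not from the talk):

```python
def public_fraction(doubling_time_years: float, embargo_years: float) -> float:
    """Fraction of all data collected so far that is older than the embargo.

    Assumes the archive volume grows as 2 ** (t / doubling_time_years),
    so the volume that existed `embargo_years` ago is a fixed fraction
    of the volume that exists today.
    """
    return 2.0 ** (-embargo_years / doubling_time_years)


if __name__ == "__main__":
    # Doubling every year, public after 1 year.
    print(public_fraction(1.0, 1.0))  # 0.5
```

Under the same assumption, a two-year proprietary period would leave only a quarter of the data public.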
4The Challenges
- Data Collection
- Exponential data growth: distributed collections, soon Petabytes
- Discovery and Analysis
- New analysis paradigm: data federations, move analysis to data
- Publishing
- New publishing paradigm: scientists are publishers and curators
5Evolving Science
- Thousand years ago: science was empirical
- describing natural phenomena
- Last few hundred years: theoretical branch
- using models, generalizations
- Last few decades: a computational branch
- simulating complex phenomena
- Today: data exploration (eScience)
- synthesizing theory, experiment and computation with advanced data management and statistics
6Outline
- Challenges: A New Kind of Science
- Publishing the Data
- Data Federation
- The Virtual Observatory
- Analyzing Terabytes of Data
7Publishing Data
- Exponential growth
- Projects last at least 3-5 years
- Data sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
- Becoming Publishers and Curators
- Data will reside with projects
- Analyses must be close to the data
8Data Access is Hitting a Wall
FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years
- Oh!, and 1PB ~ 4,000 disks
- At some point you need indices to limit search, plus parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (~ $1/GB)
- 2 days and $1K
- 3 years and $1M
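The scaling above is just data volume divided by sustained throughput. A rough sketch, assuming a single sequential reader at 10 MB/s (an illustrative rate of the same order as the slide's figures, not a number from the talk) and decimal units:

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def scan_seconds(size_bytes: float, bytes_per_second: float, readers: int = 1) -> float:
    """Time to read every byte once, split evenly across `readers` parallel scans."""
    if readers < 1:
        raise ValueError("readers must be >= 1")
    return size_bytes / (bytes_per_second * readers)


if __name__ == "__main__":
    rate = 10e6  # assumed sustained rate per reader, in bytes/second
    for label, size in [("1 TB", 1e12), ("1 PB", 1e15)]:
        t = scan_seconds(size, rate)
        print(f"{label}: {t / SECONDS_PER_DAY:.1f} days ({t / SECONDS_PER_YEAR:.2f} years)")
    # One reader per disk across ~4,000 disks turns the 1 PB scan into hours.
    hours = scan_seconds(1e15, rate, readers=4000) / 3600
    print(f"1 PB on 4000 readers: {hours:.1f} hours")
```

At this assumed rate a single reader needs roughly three years for 1 PB, which is why the slide turns to indices (read less) and parallel search (read faster).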
9Analysis and Databases
- Much statistical analysis deals with
- Creating uniform samples
- data filtering
- Assembling relevant subsets
- Estimating completeness
- censoring bad data
- Counting and building histograms
- Generating Monte-Carlo subsets
- Likelihood calculations
- Hypothesis testing
- Traditionally these are performed on files
- Most of these tasks are much better done inside a database
- Move Mohamed to the mountain, not the mountain to Mohamed.
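As an illustration of doing filtering and histogramming inside the database rather than on exported files, here is a small sketch. SQLite stands in for the database purely for convenience; the table, column names and synthetic data are all hypothetical:

```python
import sqlite3


def build_demo_db(n: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory table of synthetic objects (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE objects (obj_id INTEGER PRIMARY KEY, mag REAL, flags INTEGER)"
    )
    rows = (
        # Magnitudes spread evenly over 14.00 .. 23.99; every 10th row is flagged bad.
        (i, 14.0 + (i * 7919 % 1000) / 100.0, 1 if i % 10 == 0 else 0)
        for i in range(n)
    )
    conn.executemany("INSERT INTO objects VALUES (?, ?, ?)", rows)
    return conn


def magnitude_histogram(conn: sqlite3.Connection, bin_width: float = 1.0):
    """Censor bad data and count per magnitude bin inside the database.

    Only one (bin_start, count) row per bin leaves the engine,
    instead of every object row.
    """
    sql = """
        SELECT CAST(mag / ? AS INTEGER) * ? AS bin_start, COUNT(*) AS n
        FROM objects
        WHERE flags = 0
        GROUP BY bin_start
        ORDER BY bin_start
    """
    return conn.execute(sql, (bin_width, bin_width)).fetchall()


if __name__ == "__main__":
    for bin_start, count in magnitude_histogram(build_demo_db()):
        print(bin_start, count)
```

The client receives ten rows rather than ten thousand; the same pattern applies to sample selection and completeness counts.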
10Smart Data
- If there is too much data to move around,
- take the analysis to the data!
- Do all data manipulations at the database
- Build custom procedures and functions in the database
- Automatic parallelism guaranteed
- Easy to build in custom functionality
- Databases & Procedures being unified
- Example temporal and spatial indexing
- Pixel processing
- Easy to reorganize the data
- Multiple views, each optimal for certain analyses
- Building hierarchical summaries is trivial
- Scalable to Petabyte datasets
- Active databases!
11Outline
- Challenges: A New Kind of Science
- Publishing the Data
- Data Federation
- The Virtual Observatory
- Analyzing Terabytes of Data
12Making Discoveries
- Where are discoveries made?
- At the edges and boundaries
- Going deeper, collecting more data, using more colors
- Metcalfe's law
- Utility of computer networks grows as the number of possible connections: O(N^2)
- Federating data
- Federation of N archives has utility O(N^2)
- Possibilities for new discoveries grow as O(N^2)
- Current sky surveys have proven this
- Very early discoveries from SDSS, 2MASS, DPOSS
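The O(N^2) claim counts possible pairings of archives. A minimal sketch (the function name is illustrative):

```python
from itertools import combinations


def pairwise_links(n: int) -> int:
    """Number of distinct pairs among n archives: n * (n - 1) / 2, i.e. O(N^2)."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    archives = ["SDSS", "2MASS", "DPOSS"]
    # Every pair of surveys is a potential cross-comparison.
    print(list(combinations(archives, 2)))
    for n in (3, 10, 100):
        print(n, pairwise_links(n))
```

Going from 10 to 100 federated archives raises the number of pairwise combinations from 45 to 4,950.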
13Data Federations
- Massive datasets live near their owners
- Near the instrument's software pipeline
- Near the applications
- Near data knowledge and curation
- Super Computer centers become Super Data Centers
- Each Archive publishes (web) services
- Schema documents the data
- Methods on objects (queries)
- Scientists get personalized extracts
- Uniform access to multiple Archives
- A common global schema
14The Key Web Services
- Web SERVER
- Given a URL + parameters
- Returns a web page (often dynamic)
- Web SERVICE
- Given an XML document (SOAP message)
- Returns an XML document
- Tools make this look like an RPC.
- F(x,y,z) returns (u, v, w)
- Distributed objects for the web.
- naming, discovery, security,..
- Internet-scale distributed computing
[Figure: Your program → Web Server over http, returning a web page; Your program → Web Service over soap, returning an object in xml that becomes data in your address space]
NVO WESIX service: build your object catalog in 5 mins
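To make the "XML document in, XML document out, looks like an RPC" idea concrete, here is a sketch that builds and parses SOAP 1.1 envelopes by hand. The operation name F, its fields, the service namespace and the fake transport are all hypothetical; real toolkits generate this plumbing from a service description.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:demo-service"  # hypothetical service namespace


def build_message(element_name: str, fields: dict) -> str:
    """Wrap named fields in a SOAP envelope: Envelope > Body > element_name."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    payload = ET.SubElement(body, f"{{{SERVICE_NS}}}{element_name}")
    for name, value in fields.items():
        ET.SubElement(payload, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_message(xml_text: str, element_name: str) -> dict:
    """Extract the fields of Body > element_name as a {name: text} dict."""
    root = ET.fromstring(xml_text)
    payload = root.find(f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}{element_name}")
    if payload is None:
        raise ValueError(f"no {element_name} element in SOAP body")
    return {child.tag.split("}", 1)[1]: child.text for child in payload}


def call_f(x, y, z, transport):
    """Client-side stub: reads like a local call F(x, y, z) -> (u, v, w)."""
    request = build_message("F", {"x": x, "y": y, "z": z})
    reply = read_message(transport(request), "FResponse")
    return reply["u"], reply["v"], reply["w"]


def fake_transport(request_xml: str) -> str:
    """Stand-in for the network and the remote service (no HTTP involved)."""
    args = read_message(request_xml, "F")
    return build_message("FResponse", {"u": args["z"], "v": args["y"], "w": args["x"]})


if __name__ == "__main__":
    print(call_f(1, 2, 3, fake_transport))
```

The caller only sees `call_f`; the XML stays hidden behind the stub, which is the point of the "tools make this look like an RPC" bullet.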
15SkyQuery (http://skyquery.net/)
- Distributed Query tool using a set of web services
- Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England)
- Feasibility study, built in 6 weeks
- Tanu Malik (JHU CS grad student)
- Tamas Budavari (JHU astro postdoc)
- With help from Szalay, Thakar, Gray
- Implemented in C# and .NET
- Allows queries like
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2
Now: http://openskyquery.net/
16SkyQuery Structure
- Each SkyNode publishes
- Schema Web Service
- Database Web Service
- The Portal
- Plans the query (2 phase)
- Integrates answers
- Is itself a web service
17Outline
- Challenges: A New Kind of Science
- Publishing the Data
- Data Federation
- The Virtual Observatory
- Analyzing Terabytes of Data
18Why Is Astronomy Special?
- Especially attractive for the wide public
- It has no commercial value
- No privacy concerns, freely share results with others
- Great for experimenting with algorithms
- It is real and well documented
- High-dimensional (with confidence intervals)
- Spatial, temporal
- Diverse and distributed
- Many different instruments from many different places and many different times
- The questions are interesting
- There is a lot of it (soon petabytes)
19The Virtual Observatory
- Premise: most data is (or could be) online
- So, the Internet is the world's best telescope
- It has data on every part of the sky
- In every measured spectral band: optical, x-ray, radio...
- As deep as the best instruments (2 years ago)
- It is up when you are up
- The seeing is always great
- It's a smart telescope: links objects and data to literature on them
- Software became the capital expense
- Share, standardize, reuse..
- It has to be SIMPLE
20National Virtual Observatory
- NSF ITR project, "Building the Framework for the National Virtual Observatory", is a collaboration of 17 funded and 3 unfunded organizations
- Astronomy data centers
- National observatories
- Supercomputer centers
- University departments
- Computer science/information technology specialists
- Natural cohesion with Grid Computing
http://us-vo.org/
21International Collaboration
- Similar efforts now in 15 countries
- USA, UK, Canada, France, Germany, Italy, Holland, Japan, Australia, India, China, Russia, Hungary, South Korea, ESO, Spain
- Total awarded funding world-wide is over $60M
- Active collaboration among projects
- Standards, common demos
- International VO roadmap being developed
- Regular telecons over 10 timezones
- Formal collaboration
- International Virtual Observatory Alliance (IVOA)
22Boundary Conditions
- Standards driven by evolving new technologies
- Exchange of rich and structured data (XML)
- DB connectivity, Web Services, Grid computing
- Application to astronomy domain
- Data dictionaries (UCDs)
- Data models
- Protocols
- Registries and resource/service discovery
- Provenance, data quality, DATA CURATION!!!!
- Dealing with the astronomy legacy
- FITS data format
- Software systems
23Main VO Challenges
- How to avoid trying to be everything for everybody?
- Database connectivity is essential
- Bring the analysis to the data
- Core web services, higher level applications on top
- Use the 90-10 rule
- Define the standards and interfaces
- Build the framework
- Build the 10% of services that are used by 90%
- Let the users build the rest from the components
- Rapidly changing outside world
- Make it simple!!!
24NVO from research to services
- First two years spent on
- Team building
- Defining the standards
- Building prototypes and pilot studies
- Get feedback from astronomy SW community
- Third year
- Define core applications
- Prototypes to services
- Build them
- Document them
25First Light
- Jan 2005: the first real applications
- Taking some of the most common tasks
- Discovery and data access
- Analysis and exploration
- Visualization
- Immediate future
- engage the whole astronomy community
26OpenSkyQuery
Cross-match your data with numerous catalogs
OpenSkyQuery allows you to cross-match
astronomical catalogs and select subsets of
catalogs with a general and powerful query
language. You can also import a personal catalog
of objects and cross-match it against selected
databases.
27Spectrum Services
Search, plot, and retrieve SDSS, 2dF, and other
spectra
The Spectrum Services web site is dedicated to
spectrum related VO services. On this site you
will find tools and tutorials on how to access
close to 500,000 spectra from the Sloan Digital
Sky Survey (SDSS DR1) and the 2 degree Field
redshift survey (2dFGRS). The services are open
to everyone to publish their own spectra in the
same framework. The tutorials on XML Web Services
show how to integrate the 45 GB spectrum and
passband database into your programs with a few
lines of code.
28Web Enabled Source Identification with
Cross-Matching (WESIX)
Upload images to SExtractor and cross-correlate
the objects found with selected survey catalogs.
This NVO service does source extraction and
cross-matching for any astrometric FITS image.
The user uploads a FITS image, and the remote
service runs the SExtractor software for source
extraction. The resulting catalog can be
cross-matched with any of several major surveys,
and the results returned as a VOTable. The web
page also allows use of Aladin or VOPlot to
visualize results.
29SkyServer
- Sloan Digital Sky Survey: Pixels + Objects
- About 500 attributes per object, 300M objects
- Spectra for 1M objects
- Currently 2TB fully public
- Prototype eScience lab
- Moving analysis to the data
- Fast searches: color, spatial
- Visual tools
- Join pixels with objects
- Prototype in data publishing
- 70 million web hits in 3.5 years
- http://skyserver.sdss.org/
30DB Loading
- Automated table driven workflow system for loading
- Included lots of verification code
- Over 16K lines of SQL code
- Loading process was extremely painful
- Lack of systems engineering for the pipelines
- Poor testing (lots of foreign key mismatch)
- Detected data bugs even a month ago
- Most of the time spent on scrubbing data
- Fixing corrupted files (RAID5 disk errors)
- Once data is clean, everything loads in 1 week
- Neighbors calculation took about 10 hours
- Reorganization of data took about 1 week of
experiments in partitioning/layouts
31Data Delivery
- Small requests (< 100 MB)
- Putting data on the stream
- Medium requests (< 1 GB)
- Use DIME attachments to SOAP messages
- Large requests (> 1 GB)
- Save data in scratch area and use asynch delivery
- Only practical for large/long queries
- Iterative requests
- Save data in temp tables in user space
- Let user manipulate via web browser
- Paradox: if we use a web browser to submit, users want an immediate response from batch-size queries
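The size tiers above amount to a simple dispatch rule. A sketch, using the slide's 100 MB and 1 GB thresholds; the mode names are hypothetical labels, and treating a result of exactly 1 GB as "large" is an assumption, since the slide does not say:

```python
MB = 10**6
GB = 10**9


def choose_delivery(result_bytes: int) -> str:
    """Pick a delivery mode from the estimated result size in bytes."""
    if result_bytes < 0:
        raise ValueError("result size cannot be negative")
    if result_bytes < 100 * MB:
        return "stream"            # small: put the data on the response stream
    if result_bytes < 1 * GB:
        return "dime_attachment"   # medium: attach to the SOAP message
    return "async_scratch"         # large: save to scratch area, deliver asynchronously


if __name__ == "__main__":
    for size in (5 * MB, 500 * MB, 20 * GB):
        print(size, choose_delivery(size))
```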
32Queue Management
- Need to register batch power users
- Query output goes to MyDB
- Can be joined with source database
- Results are materialized from MyDB upon request
- Users can do
- Insert, Drop, Create, Select Into, Functions, Procedures
- Publish their tables to a group area
- Data delivery via the CASJobs (C# WS)
33Spatial Features
- Precomputed Neighbors
- All objects within 30"
- Boundaries, Masks and Outlines
- 27,000 spatial objects
- Stored as spatial polygons
- Time Domain
- Precomputed Match
- All objects within 1", observed at different times
- Found duplicates due to telescope tracking errors
- Manual fix, recorded in the database
- MatchHead
- The first observation of the linked list is used as the unique id for the chain of observations of the same object
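A precomputed neighbors table can be pictured as every pair of objects closer than a fixed radius, stored once so later queries become simple joins. A brute-force sketch, assuming the radius on the slide is 30 arcseconds (the unit symbol is missing in the source) and using a hypothetical (id, ra, dec) tuple layout in degrees:

```python
import math


def angular_sep_arcsec(ra1: float, dec1: float, ra2: float, dec2: float) -> float:
    """Great-circle separation in arcseconds (haversine formula, inputs in degrees)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600.0


def precompute_neighbors(objects, radius_arcsec: float = 30.0):
    """Return (id, neighbor_id, distance) rows for every pair within the radius.

    Each pair is stored in both directions, so a lookup by either id works.
    This is O(N^2) and only suitable for a small illustration.
    """
    rows = []
    for i, (id_a, ra_a, dec_a) in enumerate(objects):
        for id_b, ra_b, dec_b in objects[i + 1:]:
            dist = angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b)
            if dist <= radius_arcsec:
                rows.append((id_a, id_b, dist))
                rows.append((id_b, id_a, dist))
    return rows


if __name__ == "__main__":
    demo = [(1, 180.0, 0.0), (2, 180.005, 0.0), (3, 181.0, 0.0)]
    for row in precompute_neighbors(demo):
        print(row)
```

The same routine with a 1 arcsecond radius, applied across observation epochs, gives the flavor of the precomputed match list.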
34Things Can Get Complex
353 Ways To Do Spatial
- Hierarchical Triangular Mesh (extension to SQL)
- Uses table valued stored procedures
- Acts as a new spatial access method
- Ported to Yukon CLR for a 17x speedup.
- Zones: fits SQL well
- Surprisingly simple, good on a fixed scale
- Constraints: a novel idea
- Lets us do algebra on regions, implemented in pure SQL
- Paper: There Goes the Neighborhood: Relational Algebra for Spatial Data Search
36Footprint Poster Child App
- Used as footprint service.
- Take many footprints
- Fuzz them (buffer) to make coarser footprint (convex hull of vertices)
- See if two footprints overlap
- 20 lines of code, 130 lines of logic/comments
37CrossMatch Zone Approach
- Divide space into zones
- Key points by Zone, offset (on the sphere this needs a wrap-around margin)
- Point search: look in a few zones at a limited offset in ra, a bounding box that has 1-π/4 false positives
- All inside the relational engine
- Avoids impedance mismatch
- Can batch all-all comparisons
- Faster and 60,000x parallel: 1 hours, not 6 months!
- This is Maria Nieto Santisteban's PhD thesis
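[Figure: search circle of radius r cut by the zone boundary zoneMax. Half-width of the ra window: x = sqrt(r^2 - (ra-zoneMax)^2) / cos(radians(zoneMax)); search range Ra ± x]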
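The zone idea can be sketched outside the database as follows. This is an in-memory illustration, not the SQL implementation from the talk; the zone height, the tuple layout and the conservative ra window r / cos(|dec| + r) are assumptions of the sketch. Candidates from the bounding box are then filtered with an exact distance test, which removes the false positives in the corners.

```python
import math
from bisect import bisect_left, bisect_right
from collections import defaultdict


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec, zone_height):
    """Zone number of a declination: horizontal stripes counted from dec = -90."""
    return int(math.floor((dec + 90.0) / zone_height))


def build_zones(points, zone_height):
    """Key points by (zone, ra): one ra-sorted list of (ra, dec, id) per zone."""
    zones = defaultdict(list)
    for obj_id, ra, dec in points:
        zones[zone_of(dec, zone_height)].append((ra % 360.0, dec, obj_id))
    for bucket in zones.values():
        bucket.sort()
    return zones


def search(zones, zone_height, ra, dec, radius):
    """Return sorted ids of points within `radius` degrees of (ra, dec)."""
    ra %= 360.0
    lo_zone = zone_of(max(dec - radius, -90.0), zone_height)
    hi_zone = zone_of(min(dec + radius, 90.0), zone_height)
    # Conservative ra half-width: the window widens away from the equator.
    edge = abs(dec) + radius
    half = 180.0 if edge >= 90.0 else min(180.0, radius / math.cos(math.radians(edge)))
    if half >= 180.0:
        windows = [(0.0, 360.0)]
    else:
        windows = [(max(ra - half, 0.0), min(ra + half, 360.0))]
        if ra - half < 0.0:      # wrap-around margin below ra = 0
            windows.append((ra - half + 360.0, 360.0))
        if ra + half > 360.0:    # wrap-around margin above ra = 360
            windows.append((0.0, ra + half - 360.0))
    hits = []
    for zone in range(lo_zone, hi_zone + 1):
        bucket = zones.get(zone, [])
        for lo, hi in windows:
            start = bisect_left(bucket, (lo,))
            stop = bisect_right(bucket, (hi, math.inf))
            for cand_ra, cand_dec, obj_id in bucket[start:stop]:
                # Exact test on the few candidates inside the bounding box.
                if angular_sep_deg(ra, dec, cand_ra, cand_dec) <= radius:
                    hits.append(obj_id)
    return sorted(hits)


if __name__ == "__main__":
    pts = [(1, 10.0, 20.0), (2, 10.001, 20.0005), (3, 10.5, 20.0),
           (4, 359.9995, 0.0), (5, 0.0005, 0.0)]
    zones = build_zones(pts, zone_height=0.01)
    print(search(zones, 0.01, 10.0, 20.0, 0.002))
    print(search(zones, 0.01, 359.9995, 0.0, 0.002))
```

Because each zone is independent, an all-to-all comparison can be split zone by zone, which is what makes the batch cross-match easy to parallelize.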
38Zones allow 60,000 Parallel Jobs. Partition Parallelism: 3.7 hours
[Figure: 2MASS:USNOB cross-match pipeline. Source Tables → Build Index → Zoned Tables → Zone:Zone Comparison → Merge Answer]
- Source tables: 2MASS 471 Mrec 140 GB; USNOB 1.1 Brec 233 GB
- Zoned tables: 2MASS 471 Mrec 65 GB; USNOB 1.1 Brec 106 GB
- Zone:Zone comparisons: 0:-1 64 Mrec 2 GB; 0:0 260 Mrec 9 GB; 0:+1 26 Mrec 1 GB
- Merged answers: 2MASS→USNOB 350 Mrec 12 GB; USNOB→2MASS 350 Mrec 12 GB
- Stage times: 2 hours, 1.2 hour, .5 hour
39Pipeline Parallelism: 2.5 hours. Or as fast as we can read USNOB + .5 hours
[Figure: the same 2MASS:USNOB pipeline. Source Tables → Build Index → Zones → Zone:Zone Comparison → Merge Answer, with zones passed along one at a time ("Next zone")]
- Source tables: 2MASS 471 Mrec 140 GB; USNOB 1.1 Brec 233 GB
- Zone:Zone comparisons: 0:-1 64 Mrec 2 GB; 0:0 260 Mrec 9 GB; 0:+1 26 Mrec 1 GB
- Merged answers: 2MASS→USNOB 350 Mrec 12 GB; USNOB→2MASS 350 Mrec 12 GB
- Stage times: 2 hours, .5 hour
40Outline
- Challenges: A New Kind of Science
- Publishing the Data
- Data Federation
- The Virtual Observatory
- Analyzing Terabytes of Data
41Next-Generation Data Analysis
- Looking for
- Needles in haystacks: the Higgs particle
- Haystacks: Dark matter, Dark energy
- Needles are easier than haystacks
- Optimal statistics have poor scaling
- Correlation functions are N^2, likelihood techniques N^3
- For large data sets main errors are not statistical
- As data and computers grow with Moore's Law, we can only keep up with N logN
- A way out?
- Discard notion of optimal (data is fuzzy, answers are approximate)
- Don't assume infinite computational resources or memory
- Requires combination of statistics & computer science
42Organization & Algorithms
- Use of clever data structures (trees, cubes)
- Up-front creation cost, but only N logN access cost
- Large speedup during the analysis
- Tree-codes for correlations (A. Moore et al 2001)
- Data Cubes for OLAP (all vendors)
- Fast, approximate heuristic algorithms
- No need to be more accurate than cosmic variance
- Fast CMB analysis by Szapudi et al (2001)
- N logN instead of N^3 => 1 day instead of 10 million years
- Take cost of computation into account
- Controlled level of accuracy
- Best result in a given time, given our computing resources
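The gap between N^3 and N logN can be made concrete with a back-of-the-envelope calculation. The sketch below counts abstract operations only; the problem size and the operations-per-second rate are illustrative assumptions, not figures from the talk, and constant factors are ignored.

```python
import math

SECONDS_PER_YEAR = 365.25 * 86_400


def cost_ratio(n: int) -> float:
    """How many times more operations an N^3 method needs than an N log2(N) one."""
    return n ** 3 / (n * math.log2(n))


def runtime_years(operations: float, ops_per_second: float) -> float:
    """Wall-clock years for a given operation count at a fixed rate."""
    return operations / ops_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    n = 10 ** 6      # assumed number of data points
    rate = 1e9       # assumed operations per second
    print(f"ratio N^3 / (N log N): {cost_ratio(n):.3g}")
    print(f"N^3:     {runtime_years(n ** 3, rate):.3g} years")
    print(f"N log N: {runtime_years(n * math.log2(n), rate):.3g} years")
```

Since the ratio grows as N^2 / log N, hardware that improves by a constant factor per year cannot rescue the cubic method once the data doubles yearly.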
43Trends
- CMB Surveys
- 1990 COBE 1000
- 2000 Boomerang 10,000
- 2002 CBI 50,000
- 2003 WMAP 1 Million
- 2008 Planck 10 Million
- Angular Galaxy Surveys
- 1970 Lick 1M
- 1990 APM 2M
- 2005 SDSS 200M
- 2008 VISTA 1000M
- 2012 LSST 3000M
- Galaxy Redshift Surveys
- 1986 CfA 3500
- 1996 LCRS 23000
- 2003 2dF 250000
- 2005 SDSS 750000
- Time Domain
- QUEST
- SDSS Extension survey
- Dark Energy Camera
- PanStarrs
- SNAP
- LSST
Petabytes/year by the end of the decade
44Summary
- Data growing exponentially
- Publishing so much data requires a new model
- Multiple challenges for different communities
- publishing, visualization, statistics, algorithms, educational
- Information at your fingertips
- Students see the same data as professional astronomers
- More data coming: Petabytes/year by 2010
- Need scalable solutions
- Move analysis to the data!
- Same thing happening in all sciences
- High energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, ...
- eScience: an emerging new branch of science