The World-Wide Telescope: Astronomy with Terabytes

1
The World-Wide Telescope: Astronomy with Terabytes

Alex Szalay, The Johns Hopkins University
Jim Gray, Microsoft Research

2
Outline

Challenges: A New Kind of Science
Publishing the Data
Data Federation
The Virtual Observatory
Analyzing Terabytes of Data

3
Living in an Exponential World

Astronomers have a few hundred TB now
1 pixel (byte) / sq arc second ~ 4TB
Multi-spectral, temporal, ... → 1PB
They mine it looking for new (kinds of) objects
or more of interesting ones (quasars),
density variations in 400-D space, correlations
in 400-D space
Data doubles every year
Data is public after 1 year
So, 50% of the data is public
Same access for everyone
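
The one-half public fraction follows from the two bullets before it. A short derivation, assuming a clean doubling every year and a release delay of exactly one year (both idealizations of the slide's round numbers; D_0 and t, in years, are notation introduced here):

```latex
D(t) = D_0 \, 2^{t}, \qquad
\frac{D_{\mathrm{public}}(t)}{D(t)}
  = \frac{D(t-1)}{D(t)}
  = \frac{D_0 \, 2^{t-1}}{D_0 \, 2^{t}}
  = \frac{1}{2}
```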

4
The Challenges
Exponential data growth: distributed
collections, soon Petabytes
Data Collection
Discovery and Analysis
Publishing
New analysis paradigm: data federations,
move analysis to data
New publishing paradigm: scientists are
publishers and curators
5
Evolving Science

Thousand years ago: science was empirical,
describing natural phenomena
Last few hundred years: theoretical branch,
using models, generalizations
Last few decades: a computational branch,
simulating complex phenomena
Today: data exploration (eScience),
synthesizing theory, experiment and
computation with advanced data management and
statistics

6
Outline

Challenges: A New Kind of Science
Publishing the Data
Data Federation
The Virtual Observatory
Analyzing Terabytes of Data

7
Publishing Data

Exponential growth
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Data will reside with projects
Analyses must be close to the data

8
Data Access is Hitting a Wall
FTP and GREP are not adequate

You can GREP 1 MB in a second
You can GREP 1 GB in a minute
You can GREP 1 TB in 2 days
You can GREP 1 PB in 3 years
Oh!, and 1PB ~ 4,000 disks
At some point you need indices to limit
search, plus parallel data search and analysis
This is where databases can help

You can FTP 1 MB in 1 sec
You can FTP 1 GB / min (~ 1 $/GB)
... 2 days and 1K$
... 3 years and 1M$
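
The GREP figures are order-of-magnitude estimates for one sequential pass over the data. A minimal sketch of the arithmetic, assuming a sustained rate of 10 MB/s (an assumed value, chosen because it reproduces the 3-year figure for 1 PB; the slide states no rate and rounds its smaller sizes differently):

```python
SECONDS_PER_DAY = 86_400


def scan_days(n_bytes: float, bytes_per_second: float = 10e6) -> float:
    """Days needed to read n_bytes once at a fixed sequential rate."""
    return n_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for label, size in [("1 TB", 1e12), ("1 PB", 1e15)]:
        days = scan_days(size)
        print(f"{label}: {days:,.1f} days ({days / 365:.2f} years)")
```

The cost is linear in the data size, so a thousand times more data means a thousand times longer scan. Indices (read less) and parallel readers (read concurrently) are the two ways off that line.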

9
Analysis and Databases

Much statistical analysis deals with
Creating uniform samples
data filtering
Assembling relevant subsets
Estimating completeness
censoring bad data
Counting and building histograms
Generating Monte-Carlo subsets
Likelihood calculations
Hypothesis testing
Traditionally these are performed on files
Most of these tasks are much better done inside a
database
Move Mohamed to the mountain, not the mountain to
Mohamed.
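
As an illustration of doing filtering and histogram building inside the database, here is a minimal sketch using SQLite as a stand-in engine. The table name `obj`, its columns, and the sample rows are all made up for the example; they are not the SDSS schema.

```python
import sqlite3


def demo_connection() -> sqlite3.Connection:
    """In-memory table with made-up rows: (id, r magnitude, quality flag)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, r REAL, flags INTEGER)")
    conn.executemany(
        "INSERT INTO obj VALUES (?, ?, ?)",
        [(1, 17.2, 0), (2, 17.8, 0), (3, 18.1, 1), (4, 18.4, 0), (5, 19.9, 0)],
    )
    return conn


def magnitude_histogram(conn: sqlite3.Connection):
    """Count clean objects per 1-magnitude bin, entirely inside the database."""
    return conn.execute(
        """
        SELECT CAST(r AS INTEGER) AS mag_bin, COUNT(*) AS n
        FROM obj
        WHERE flags = 0                 -- censor bad data
        GROUP BY CAST(r AS INTEGER)     -- build the histogram
        ORDER BY mag_bin
        """
    ).fetchall()


if __name__ == "__main__":
    print(magnitude_histogram(demo_connection()))
```

Only the handful of (bin, count) rows leaves the database; the raw rows never move.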

10
Smart Data

If there is too much data to move around,
take the analysis to the data!
Do all data manipulations at database
Build custom procedures and functions in the
database
Automatic parallelism guaranteed
Easy to build in custom functionality
Databases & procedures being unified
Example: temporal and spatial indexing,
pixel processing
Easy to reorganize the data
Multiple views, each optimal for certain analyses
Building hierarchical summaries is trivial
Scalable to Petabyte datasets

active databases!
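
A custom function living inside the database can be sketched with SQLite's user-defined functions. This is a stand-in for the idea only: SQLite offers no automatic parallelism, and the function name `sep_arcsec`, the table `obj`, and the sample coordinates are assumptions made for the example.

```python
import math
import sqlite3


def sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds (haversine formula, inputs in degrees)."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((a2 - a1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h)))) * 3600


def demo_connection() -> sqlite3.Connection:
    """In-memory table of made-up positions, with sep_arcsec registered as a SQL function."""
    conn = sqlite3.connect(":memory:")
    conn.create_function("sep_arcsec", 4, sep_arcsec)
    conn.execute("CREATE TABLE obj (id INTEGER PRIMARY KEY, ra_deg REAL, dec_deg REAL)")
    conn.executemany(
        "INSERT INTO obj VALUES (?, ?, ?)",
        [(1, 181.3, -0.76), (2, 181.3005, -0.7602), (3, 182.0, -0.5)],
    )
    return conn


def close_pairs(conn: sqlite3.Connection, limit_arcsec: float):
    """Pairs of objects closer than limit_arcsec, evaluated by the database engine."""
    return conn.execute(
        "SELECT a.id, b.id FROM obj a JOIN obj b ON a.id < b.id "
        "WHERE sep_arcsec(a.ra_deg, a.dec_deg, b.ra_deg, b.dec_deg) < ? "
        "ORDER BY a.id, b.id",
        (limit_arcsec,),
    ).fetchall()


if __name__ == "__main__":
    print(close_pairs(demo_connection(), 3.5))
```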
11
Outline

Challenges: A New Kind of Science
Publishing the Data
Data Federation
The Virtual Observatory
Analyzing Terabytes of Data

12
Making Discoveries

Where are discoveries made?
At the edges and boundaries
Going deeper, collecting more data, using more
colors.
Metcalfe's law
Utility of computer networks grows as the number
of possible connections: O(N^2)
Federating data
Federation of N archives has utility O(N^2)
Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS

13
Data Federations

Massive datasets live near their owners
Near the instrument's software pipeline
Near the applications
Near data knowledge and curation
Super Computer centers become Super Data Centers
Each Archive publishes (web) services
Schema documents the data
Methods on objects (queries)
Scientists get personalized extracts
Uniform access to multiple Archives
A common global schema

Federation
14
The Key Web Services

Web SERVER
Given a URL + parameters
Returns a web page (often dynamic)
Web SERVICE
Given an XML document (SOAP message)
Returns an XML document
Tools make this look like an RPC.
F(x,y,z) returns (u, v, w)
Distributed objects for the web.
naming, discovery, security,..
Internet-scale distributed computing

[Diagram: your program sends an http request to a Web Server and gets back a web page; your program sends a soap message to a Web Service and gets back an object in XML, i.e. data in your address space]
NVO WESIX service: build your object catalog in 5 mins
15
SkyQuery (http://skyquery.net/)

Distributed Query tool using a set of web
services
Many astronomy archives from Pasadena, Chicago,
Baltimore, Cambridge (England).
Feasibility study, built in 6 weeks
Tanu Malik (JHU CS grad student)
Tamas Budavari (JHU astro postdoc)
With help from Szalay, Thakar, Gray
Implemented in C# and .NET
Allows queries like

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.I - t.m_j) > 2

Now http://openskyquery.net/
16
SkyQuery Structure

Each SkyNode publishes
Schema Web Service
Database Web Service

Portal is
Plans Query (2 phase)
Integrates answers
Is itself a web service

17
Outline

Challenges: A New Kind of Science
Publishing the Data
Data Federation
The Virtual Observatory
Analyzing Terabytes of Data

18
Why Is Astronomy Special?

Especially attractive for the wide public
It has no commercial value
No privacy concerns, freely share results with
others
Great for experimenting with algorithms
It is real and well documented
High-dimensional (with confidence intervals)
Spatial, temporal
Diverse and distributed
Many different instruments from many different
places and many different times
The questions are interesting
There is a lot of it (soon petabytes)

19
The Virtual Observatory

Premise: most data is (or could be) online
So, the Internet is the world's best telescope
It has data on every part of the sky
In every measured spectral band: optical, x-ray,
radio..
As deep as the best instruments (2 years ago).
It is up when you are up
The seeing is always great
It's a smart telescope: links objects and
data to literature on them
Software became the capital expense
Share, standardize, reuse..
It has to be SIMPLE

20
National Virtual Observatory

NSF ITR project, "Building the Framework for the
National Virtual Observatory", is a collaboration
of 17 funded and 3 unfunded organizations
Astronomy data centers
National observatories
Supercomputer centers
University departments
Computer science/information technology
specialists
Natural cohesion with Grid Computing

http://us-vo.org/
21
International Collaboration

Similar efforts now in 15 countries
USA, UK, Canada, France, Germany, Italy, Holland,
Japan, Australia, India, China, Russia, Hungary,
South Korea, ESO, Spain
Total awarded funding world-wide is over $60M
Active collaboration among projects
Standards, common demos
International VO roadmap being developed
Regular telecons over 10 timezones
Formal collaboration
International Virtual Observatory Alliance (IVOA)

22
Boundary Conditions

Standards driven by evolving new technologies
Exchange of rich and structured data (XML)
DB connectivity, Web Services, Grid computing

Application to astronomy domain
Data dictionaries (UCDs)
Data models
Protocols
Registries and resource/service discovery
Provenance, data quality, DATA CURATION!!!!

Boundary conditions

Dealing with the astronomy legacy
FITS data format
Software systems

23
Main VO Challenges

How to avoid trying to be everything for
everybody?
Database connectivity is essential
Bring the analysis to the data
Core web services, higher level applications on
top
Use the 90-10 rule
Define the standards and interfaces
Build the framework
Build the 10% of services that are used by 90%
Let the users build the rest from the components
Rapidly changing outside world
Make it simple!!!

24
NVO from research to services

First two years spent on
Team building
Defining the standards
Building prototypes and pilot studies
Get feedback from astronomy SW community
Third year
Define core applications
Prototypes to services
Build them
Document them

25
First Light

Jan 2005: the first real applications
Taking some of the most common tasks
Discovery and data access
Analysis and exploration
Visualization
Immediate future:
engage the whole astronomy community

26
OpenSkyQuery
Cross-match your data with numerous catalogs
OpenSkyQuery allows you to cross-match
astronomical catalogs and select subsets of
catalogs with a general and powerful query
language. You can also import a personal catalog
of objects and cross-match it against selected
databases.
27
Spectrum Services
Search, plot, and retrieve SDSS, 2dF, and other
spectra
The Spectrum Services web site is dedicated to
spectrum related VO services. On this site you
will find tools and tutorials on how to access
close to 500,000 spectra from the Sloan Digital
Sky Survey (SDSS DR1) and the 2 degree Field
redshift survey (2dFGRS). The services are open
to everyone to publish their own spectra in the
same framework. Reading the tutorials on XML Web
Services, you can learn how to integrate the 45
GB spectrum and passband database with your
programs with a few lines of code.
28
Web Enabled Source Identification with
Cross-Matching (WESIX)
Upload images to SExtractor and cross-correlate
the objects found with selected survey catalogs.
This NVO service does source extraction and
cross-matching for any astrometric FITS image.
The user uploads a FITS image, and the remote
service runs the SExtractor software for source
extraction. The resulting catalog can be
cross-matched with any of several major surveys,
and the results returned as a VOTable. The web
page also allows use of Aladin or VOPlot to
visualize results.
29
SkyServer

Sloan Digital Sky Survey: Pixels + Objects
About 500 attributes per object, 300M objects
Spectra for 1M objects
Currently 2TB fully public
Prototype eScience lab
Moving analysis to the data
Fast searches: color, spatial
Visual tools
Join pixels with objects
Prototype in data publishing
70 million web hits in 3.5 years
http://skyserver.sdss.org/

30
DB Loading

Automated table driven workflow system for
loading
Included lots of verification code
Over 16K lines of SQL code
Loading process was extremely painful
Lack of systems engineering for the pipelines
Poor testing (lots of foreign key mismatch)
Detected data bugs even a month ago
Most of the time spent on scrubbing data
Fixing corrupted files (RAID5 disk errors)
Once data is clean, everything loads in 1 week
Neighbors calculation took about 10 hours
Reorganization of data took about 1 week of
experiments in partitioning/layouts

31
Data Delivery

Small requests (<100MB)
Putting data on the stream
Medium requests (<1GB)
Use DIME attachments to SOAP messages
Large requests (>1GB)
Save data in scratch area and use asynch delivery
Only practical for large/long queries
Iterative requests
Save data in temp tables in user space
Let user manipulate via web browser
Paradox: if we use web browser to submit, users
want immediate response from batch-size queries
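
The size tiers above amount to a simple dispatch rule. A minimal sketch, assuming decimal megabytes and gigabytes and using made-up mode names; the slide does not say how a result of exactly 1GB is treated, so the boundary choice here is an assumption:

```python
MB = 10**6
GB = 10**9


def delivery_mode(result_bytes: int) -> str:
    """Pick a delivery mechanism from the size of the query result."""
    if result_bytes < 100 * MB:
        return "stream"            # small: put data on the response stream
    if result_bytes < 1 * GB:
        return "dime_attachment"   # medium: DIME attachment to the SOAP message
    return "async_scratch"         # large (assumed to include exactly 1GB): scratch area, asynch delivery


if __name__ == "__main__":
    for size in (5 * MB, 500 * MB, 20 * GB):
        print(size, delivery_mode(size))
```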

32
Queue Management

Need to register batch power users
Query output goes to MyDB
Can be joined with source database
Results are materialized from MyDB upon request
Users can do
Insert, Drop, Create, Select Into, Functions,
Procedures
Publish their tables to a group area
Data delivery via the CASJobs (C# WS)

33
Spatial Features

Precomputed Neighbors
All objects within 30 arcsec
Boundaries, Masks and Outlines
27,000 spatial objects
Stored as spatial polygons
Time Domain
Precomputed Match
All objects within 1 arcsec, observed at different times
Found duplicates due to telescope tracking errors
Manual fix, recorded in the database
MatchHead
The first observation of the linked list used as
unique id to chain of observations of the same
object

34
Things Can Get Complex
35
3 Ways To Do Spatial

Hierarchical Triangular Mesh (extension to SQL)
Uses table valued stored procedures
Acts as a new spatial access method
Ported to Yukon CLR for a 17x speedup.
Zones: fits SQL well
Surprisingly simple & good on a fixed scale
Constraints: a novel idea
Lets us do algebra on regions, implemented in
pure SQL
Paper: There Goes the Neighborhood: Relational
Algebra for Spatial Data Search

36
Footprint Poster Child App

Used as footprint service.
Take many footprints
Fuzz them (buffer) to make coarser
footprint = convex hull of vertices
See if two footprints overlap
20 lines of code, 130 lines of logic/comments
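
The hull-then-overlap steps can be sketched in flat 2-D geometry. This is a planar approximation written for illustration, not the SQL implementation the slide refers to: it ignores spherical geometry and the buffering ("fuzz") step, and uses two textbook techniques, Andrew's monotone chain for the hull and a separating-axis test for overlap.

```python
def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull of 2-D points, counter-clockwise (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and _cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower, upper = half(pts), half(reversed(pts))
    return lower[:-1] + upper[:-1]


def overlaps(hull_a, hull_b):
    """Separating-axis test for two convex polygons (each with at least 3 vertices)."""
    for poly in (hull_a, hull_b):
        for i in range(len(poly)):
            (x1, y1), (x2, y2) = poly[i], poly[(i + 1) % len(poly)]
            nx, ny = y2 - y1, x1 - x2          # normal to this edge
            proj_a = [nx * x + ny * y for x, y in hull_a]
            proj_b = [nx * x + ny * y for x, y in hull_b]
            if max(proj_a) < min(proj_b) or max(proj_b) < min(proj_a):
                return False                   # found a separating axis
    return True


if __name__ == "__main__":
    a = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
    b = convex_hull([(1, 1), (3, 1), (3, 3), (1, 3)])
    print(a, overlaps(a, b))
```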

37
CrossMatch Zone Approach

Divide space into zones
Key points by (Zone, offset) (on the sphere this
needs a wrap-around margin.)
Point search: look in a few zones at a limited
offset ra ± a: a bounding box that has
1-π/4 false positives
All inside the relational engine
Avoids impedance mismatch
Can batch all-all comparisons
faster and 60,000x parallel: 1 hour, not 6
months!
This is Maria Nieto Santisteban's PhD thesis

[Figure: search circle of radius r crossing a zone at zoneMax; the ra window is Ra ± x, with x = √(r² - (ra-zoneMax)²) / cos(radians(zoneMax))]
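
The zone idea can be sketched outside the database in a few lines. This is an illustrative in-memory version, not the relational implementation described on the slide: function names are introduced here, the ra window uses one conservative width for all zones instead of the per-zone x from the figure, and wrap-around at ra 0/360 and the poles is deliberately not handled.

```python
import bisect
import math
from collections import defaultdict


def zone_of(dec, height):
    """Zone number of a declination (degrees) for zones of the given height (degrees)."""
    return int(math.floor((dec + 90.0) / height))


def build_zones(points, height):
    """Bucket (ra, dec) points by zone; each bucket is sorted by ra so it acts as an index."""
    zones = defaultdict(list)
    for ra, dec in points:
        zones[zone_of(dec, height)].append((ra, dec))
    for bucket in zones.values():
        bucket.sort()
    return zones


def sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula)."""
    a1, d1, a2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((a2 - a1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def neighbors(zones, height, ra, dec, r):
    """Points within r degrees of (ra, dec): coarse box per zone, then exact distance test."""
    # Half-width of the ra window, widened by 1/cos(dec) because ra circles shrink with dec.
    x = r / math.cos(math.radians(min(abs(dec) + r, 89.9)))
    hits = []
    for z in range(zone_of(dec - r, height), zone_of(dec + r, height) + 1):
        bucket = zones.get(z, [])
        start = bisect.bisect_left(bucket, (ra - x,))   # index lookup on ra
        for p_ra, p_dec in bucket[start:]:
            if p_ra > ra + x:
                break                                   # left the bounding box
            if sep_deg(ra, dec, p_ra, p_dec) <= r:      # discard box-corner false positives
                hits.append((p_ra, p_dec))
    return hits


if __name__ == "__main__":
    pts = [(10.0, 0.0), (10.0005, 0.0003), (10.5, 0.0), (10.0, 0.5)]
    print(neighbors(build_zones(pts, 0.01), 0.01, 10.0, 0.0, 0.001))
```

The 1-π/4 on the slide is the share of a square's area lying outside its inscribed circle, about 21%, which is why the cheap box test is always followed by the exact distance test.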
38
Zones allow 60,000 Parallel Jobs
Partition Parallelism: 3.7 hours

[Diagram: 2MASS x USNOB Zone:Zone Comparison; labels: Source Tables, Build Index, Zoned Tables, Merge Answer]
Source Tables: 2MASS 471 Mrec 140 GB; USNOB 1.1 Brec 233 GB
Zoned Tables: 2MASS 471 Mrec 65 GB; USNOB 1.1 Brec 106 GB
Zone comparisons: 0:-1 64 Mrec 2 GB; 0:0 260 Mrec 9 GB; 0:+1 26 Mrec 1 GB
Merged answers: 2MASS→USNOB 350 Mrec 12 GB; USNOB→2MASS 350 Mrec 12 GB
Times: 2 hours, 1.2 hours, .5 hour
39
Pipeline Parallelism: 2.5 hours
Or as fast as we can read USNOB + .5 hours

[Diagram: 2MASS x USNOB Zone:Zone Comparison, processed zone by zone ("Next zone"); labels: Source Tables, Build Index, Zones, Merge Answer]
Source Tables: 2MASS 471 Mrec 140 GB; USNOB 1.1 Brec 233 GB
Zone comparisons: 0:-1 64 Mrec 2 GB; 0:0 260 Mrec 9 GB; 0:+1 26 Mrec 1 GB
Merged answers: 2MASS→USNOB 350 Mrec 12 GB; USNOB→2MASS 350 Mrec 12 GB
Times: 2 hours, .5 hour
40
Outline

Challenges: A New Kind of Science
Publishing the Data
Data Federation
The Virtual Observatory
Analyzing Terabytes of Data

41
Next-Generation Data Analysis

Looking for
Needles in haystacks: the Higgs particle
Haystacks: Dark matter, Dark energy
Needles are easier than haystacks
Optimal statistics have poor scaling
Correlation functions are N^2, likelihood
techniques N^3
For large data sets main errors are not
statistical
As data and computers grow with Moore's Law, we
can only keep up with N logN
A way out?
Discard notion of optimal (data is fuzzy, answers
are approximate)
Don't assume infinite computational resources or
memory
Requires combination of statistics & computer
science
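
The scaling argument can be made concrete by asking how much more work each class of algorithm needs when the data doubles. A minimal sketch (the helper name and the sample N are chosen here for illustration):

```python
import math

COSTS = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}


def growth_when_data_doubles(cost, n):
    """Factor by which the work grows when the data set goes from n to 2n items."""
    return cost(2 * n) / cost(n)


if __name__ == "__main__":
    n = 10**9
    for name, cost in COSTS.items():
        print(f"{name}: x{growth_when_data_doubles(cost, n):.2f}")
```

If computing power doubles over the same period as the data, only the first row stays roughly affordable; the N^2 and N^3 methods fall further behind with every doubling.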

42
Organization & Algorithms

Use of clever data structures (trees, cubes)
Up-front creation cost, but only N logN access
cost
Large speedup during the analysis
Tree-codes for correlations (A. Moore et al 2001)
Data Cubes for OLAP (all vendors)
Fast, approximate heuristic algorithms
No need to be more accurate than cosmic variance
Fast CMB analysis by Szapudi et al (2001)
N logN instead of N^3 => 1 day instead of 10
million years
Take cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing
resources
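
The payoff from organizing data up front can be shown with a toy pair count in one dimension. This is a deliberately simple stand-in for the tree codes cited above, not the algorithm of those papers: sorting plays the role of the data structure, and binary search replaces the inner loop of the brute-force count.

```python
import bisect


def pairs_brute_force(xs, r):
    """Count pairs with |xi - xj| <= r by testing every pair: O(N^2)."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r)


def pairs_sorted(xs, r):
    """Same count after an up-front sort: O(N log N) for the sort plus N binary searches."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Elements at positions i+1 .. j-1 lie within r to the right of x.
        j = bisect.bisect_right(s, x + r)
        total += j - i - 1
    return total


if __name__ == "__main__":
    data = [0, 1, 3, 7, 8, 12]
    print(pairs_brute_force(data, 2), pairs_sorted(data, 2))
```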

43
Trends

CMB Surveys
1990 COBE 1000
2000 Boomerang 10,000
2002 CBI 50,000
2003 WMAP 1 Million
2008 Planck 10 Million

Angular Galaxy Surveys
1970 Lick 1M
1990 APM 2M
2005 SDSS 200M
2008 VISTA 1000M
2012 LSST 3000M

Galaxy Redshift Surveys
1986 CfA 3500
1996 LCRS 23000
2003 2dF 250000
2005 SDSS 750000

Time Domain
QUEST
SDSS Extension survey
Dark Energy Camera
PanStarrs
SNAP
LSST

Petabytes/year by the end of the decade
44
Summary

Data growing exponentially
Publishing so much data requires a new model
Multiple challenges for different communities
publishing, visualization, statistics,
algorithms, educational
Information at your fingertips
Students see the same data as professional
astronomers
More data coming: Petabytes/year by 2010
Need scalable solutions
Move analysis to the data!
Same thing happening in all sciences
High energy physics, genomics, cancer
research, medical imaging, oceanography, remote
sensing, ...
eScience: an emerging new branch of science