Analyzing Large Datasets in Astrophysics

About This Presentation

Title:

Analyzing Large Datasets in Astrophysics

Description:

Towards an International Virtual Observatory, Garching, 2002 (Living in an ... Upload dataset. Very fast spatial ... Clustering with Photo-z. w( ) by ...

Transcript and Presenter's Notes

Title: Analyzing Large Datasets in Astrophysics

1
Analyzing Large Datasets in Astrophysics
Towards an International Virtual Observatory, Garching, 2002
(Living in an exponential world.)

Alexander Szalay
The Johns Hopkins University

2
Outline

Collecting Data
Exponential Growth
Making Discoveries
Publishing Data
VO: How will it work?
Web Services
Atomic vs Composite services
Distributed queries with SkyQuery
Cross-Matching Algorithm
SkyNode Web Services Portal
Statistical Analysis of large data sets

3
The World is Exponential

Astrophysical data is growing exponentially
Doubling every year (Moore's Law): both data sizes and number of data sets
Computational resources scale the same way
Constant funding will keep up with the data
Main problem is the software component
Currently components are not reused
Software costs are an increasingly large fraction
Aggregate costs are growing exponentially

4
Making Discoveries

When and where are discoveries made?
Always at the edges and boundaries
Going deeper, using more colors.
Metcalfe's law
Utility of computer networks grows as the number of possible connections: O(N^2)
VO: Federation of N archives
Possibilities for new discoveries grow as O(N^2)
Current sky surveys have proven this
Very early discoveries from SDSS, 2MASS, DPOSS
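
As a quick check of the O(N^2) claim: the number of distinct pairs of archives that can be compared against each other is N(N-1)/2. A minimal sketch (the function name is ours, chosen for illustration):

```python
def possible_pairings(n_archives: int) -> int:
    """Number of distinct pairs among n_archives federated archives."""
    return n_archives * (n_archives - 1) // 2


# Doubling the number of archives roughly quadruples the number of pairs.
for n in (2, 5, 10, 20):
    print(n, possible_pairings(n))
```

Going from 10 to 20 archives raises the number of pairings from 45 to 190.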

5
Publishing Data

Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists

6
Changing Roles

Exponential growth
Projects last at least 3-5 years
Data sent upwards only at the end of the project
Data will never be centralized
More responsibility on projects
Becoming Publishers and Curators
Larger fraction of budget spent on software
A lot of development is duplicated and wasted
More standards are needed
Easier data interchange, fewer tools
More templates are needed
Develop less software on your own

7
Emerging New Concepts

Standardizing distributed data
Web Services, supported on all platforms
Custom configure remote data dynamically
XML: Extensible Markup Language
SOAP: Simple Object Access Protocol
WSDL: Web Services Description Language
Standardizing distributed computing
Grid Services
Custom configure remote computing dynamically
Build your own remote computer, and discard
Virtual Data: new data sets on demand

8
NVO: How Will It Work?

Define commonly used atomic services
Build higher level toolboxes/portals on top
We do not build everything for everybody
Use the 90-10 rule
Define the standards and interfaces
Build the framework
Build the 10% of services that are used by 90%
Let the users build the rest from the components

9
Atomic Services

Metadata: information about resources
Waveband
Sky coverage
Translation of names to universal dictionary (UCD)
Simple search patterns on the resources
Cone Search
Image mosaic
Unit conversions
Simple filtering, counting, histogramming
On-the-fly recalibrations
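
To make the Cone Search idea concrete: it returns every catalog entry within a given angular radius of a sky position. The sketch below is a plain linear scan over an in-memory list, not the actual service; the catalog rows and function names are assumptions made up for illustration.

```python
import math


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))


def cone_search(catalog, ra0, dec0, radius_deg):
    """Return the rows of `catalog` within radius_deg of (ra0, dec0)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra0, dec0) <= radius_deg]


# Hypothetical three-row catalog (positions in degrees).
catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.70},
    {"id": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, 181.3, -0.76, 0.1)
print([row["id"] for row in hits])
```

A production service would replace the linear scan with a spatial index so that the search does not touch every row.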

10
Higher Level Services

Built on Atomic Services
Perform more complex tasks
Examples
Automated resource discovery
Cross-identifications
Photometric redshifts
Outlier detections
Visualization facilities
Expectation
Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)

11
SkyQuery

Distributed Query tool using a set of services
Feasibility study, built in 6 weeks from scratch
Tanu Malik (JHU CS grad student)
Tamas Budavari (JHU astro postdoc)
Implemented in C# and .NET
Won 2nd prize of Microsoft XML Contest
Allows queries like

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND o.type = 3 AND (o.i - t.m_j) > 2

12
Architecture

[Diagram labels: Web Page, Image cutout, SkyQuery, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]

13
Cross-id Steps

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND (o.i - t.m_j) > 2 AND o.type = 3

Parse query
Get counts
Sort by counts
Make plan
Cross-match
Recursively, from small to large
Select necessary attributes only
Return output
Insert cutout image
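
The planning steps above (get counts, sort by counts, cross-match from small to large) can be sketched as follows. This is a simplified illustration, not the SkyQuery implementation: the archives are in-memory dictionaries with made-up objects, the counts are just list lengths, and the match criterion is a fixed angular tolerance against the first object's position, which is an assumption since the slides do not spell out how XMATCH is evaluated.

```python
import math


def sep_arcsec(a, b):
    """Small-angle separation in arcseconds between two (ra, dec) points in degrees."""
    dra = (a[0] - b[0]) * math.cos(math.radians(0.5 * (a[1] + b[1])))
    ddec = a[1] - b[1]
    return 3600.0 * math.hypot(dra, ddec)


def plan(archives):
    """Order archive names from the smallest candidate list to the largest."""
    return sorted(archives, key=lambda name: len(archives[name]))


def cross_match(archives, tol_arcsec):
    """Chain the match through the archives, smallest first.

    Each partial result is a tuple of object ids (one per archive visited so
    far) plus the position of the first object, used for the next match.
    """
    order = plan(archives)
    partial = [((obj_id,), pos) for obj_id, pos in archives[order[0]].items()]
    for name in order[1:]:
        extended = []
        for ids, pos in partial:
            for obj_id, other in archives[name].items():
                if sep_arcsec(pos, other) < tol_arcsec:
                    extended.append((ids + (obj_id,), pos))
        partial = extended  # only surviving tuples travel to the next archive
    return order, [ids for ids, _ in partial]


# Hypothetical objects: name -> {object id: (ra, dec) in degrees}.
archives = {
    "SDSS": {"s1": (181.3000, -0.7600), "s2": (181.3100, -0.7500), "s3": (181.4000, -0.8000)},
    "TWOMASS": {"t1": (181.3001, -0.7601), "t2": (181.5000, -0.9000)},
    "FIRST": {"f1": (181.3002, -0.7600)},
}
order, matches = cross_match(archives, tol_arcsec=3.5)
print(order, matches)
```

Starting from the smallest list keeps the intermediate results, and therefore the data shipped between nodes, as small as possible.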

14
Monte-Carlo Simulation

Comparing different algorithms for 3-way cross-identification (xid)
Transmit all the data
Transmit after filtering
Recursive cross-match
Surveys
SDSS
2MASS
FIRST
Random variables
Sky Area (0..10 sqdeg)
Selectivity of each subselect (0..1)
Efficiency of join (0.5..2)
Selectivity of common select (0..1)

15
SkyNode

Metadata functions (SOAP)
Info, Tables, Columns, Schema, Functions, Keysearch
Query functions (SOAP)
Dataset Query(String sqlCmd)
Dataset Xmatch(Dataset input, String sqlCmd, float eps)
Database
MS SQL Server
Upload dataset
Very fast spatial search engine (HTM-based): crossmatch takes < 3 ms/object over 15M in SDSS
User defined functions and stored procedures

16
Data Flow
query
http://www.skyquery.net

17
Optimal Statistics

The examples for optimal statistics have poor scaling
Correlation functions: N^2; likelihood techniques: N^3
As data sizes grow at Moore's law, computers can only keep up with at most N log N algorithms
What goes?
Notion of optimal is in the sense of statistical
errors
Assumes infinite computational resources
Assumes that only source of error is statistical
Cosmic variance: we can only observe the Universe from one location (finite sample size)
Solutions require a combination of Statistics and CS
New algorithms: not worse than N log N
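
The scaling argument on this slide can be checked with a few lines of arithmetic. Assume, as the talk does, that both the data size N and the available computing power double every year; the starting size of 10^6 objects is an arbitrary assumption for illustration.

```python
import math


def relative_runtime(cost, n0, years):
    """Wall-clock time after `years`, relative to today, when both the data
    size and the available computing power double every year."""
    n = n0 * 2 ** years
    speedup = 2 ** years
    return cost(n) / (cost(n0) * speedup)


costs = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

n0 = 10 ** 6  # assumed starting sample size, for illustration only
for name, cost in costs.items():
    print(name, [round(relative_runtime(cost, n0, t), 2) for t in (0, 5, 10)])
```

Under these assumptions the N^2 runtime doubles every year and the N^3 runtime quadruples, while the N log N runtime grows only by a slowly increasing logarithmic factor.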

18
Clever Data Structures

Heavy use of tree structures
Up-front cost, but only N log N
Large speedup later
Tree-codes for correlations (A. Moore et al. 2001)
Fast, approximate heuristic algorithms
No need to be more accurate than cosmic variance
Fast CMB analysis by Szapudi et al. (2001)
N log N instead of N^3 => 1 day instead of 10 million years
Take cost of computation into account
Controlled level of accuracy
Best result in a given time, given our computing
resources
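
To illustrate how a tree pays an up-front cost and then speeds up later work, here is a minimal sketch of pair counting, the raw ingredient of correlation function estimates, with a 2-D k-d tree. It is a generic textbook-style construction and not the algorithm of the papers cited above; the function names and the leaf size are our own choices. Whole subtrees are skipped when their bounding box is entirely outside the search radius and counted in one step when it is entirely inside.

```python
import random


def build(points, depth=0, leaf_size=8):
    """Build a 2-D k-d tree; each node stores its points and bounding box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    node = {"box": (min(xs), min(ys), max(xs), max(ys)), "points": points, "kids": None}
    if len(points) > leaf_size:
        axis = depth % 2
        ordered = sorted(points, key=lambda p: p[axis])
        mid = len(ordered) // 2
        node["kids"] = (build(ordered[:mid], depth + 1, leaf_size),
                        build(ordered[mid:], depth + 1, leaf_size))
    return node


def min_dist2(box, q):
    """Squared distance from q to the nearest point of the box."""
    dx = max(box[0] - q[0], 0.0, q[0] - box[2])
    dy = max(box[1] - q[1], 0.0, q[1] - box[3])
    return dx * dx + dy * dy


def max_dist2(box, q):
    """Squared distance from q to the farthest corner of the box."""
    dx = max(abs(q[0] - box[0]), abs(q[0] - box[2]))
    dy = max(abs(q[1] - box[1]), abs(q[1] - box[3]))
    return dx * dx + dy * dy


def count_within(node, q, r2):
    """Number of tree points whose squared distance to q is at most r2."""
    if min_dist2(node["box"], q) > r2:
        return 0                      # whole box too far away: prune
    if max_dist2(node["box"], q) <= r2:
        return len(node["points"])    # whole box inside: count in one step
    if node["kids"] is None:
        return sum((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 <= r2 for p in node["points"])
    return sum(count_within(kid, q, r2) for kid in node["kids"])


def pair_count_tree(points, r):
    """Number of distinct pairs separated by at most r, using the tree."""
    root = build(points)
    total = sum(count_within(root, q, r * r) for q in points)
    return (total - len(points)) // 2  # drop self-matches, count each pair once


def pair_count_brute(points, r):
    """Reference O(N^2) pair count."""
    r2 = r * r
    return sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= r2
               for i, a in enumerate(points) for b in points[i + 1:])


rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(300)]
print(pair_count_tree(points, 0.05) == pair_count_brute(points, 0.05))
```

This version is exact; the approximate algorithms mentioned on the slide go further and stop descending once the remaining error is below the accuracy that is actually needed.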

19
Angular Clustering with Photo-z

w(θ) by Peebles and Groth
The first example of publishing and analyzing
large data
Samples based on rest-frame quantities
Strictly volume limited samples
Largest angular correlation study to date
Very clear detection of luminosity and color dependence
Results consistent with 3D clustering

T. Budavari, A. Connolly, I. Csabai, I. Szapudi,
A. Szalay, S. Dodelson, J. Frieman, R. Scranton,
D. Johnston and the SDSS Collaboration
20
The Samples

2800 square degrees in 10 stripes, data in custom DB

Sample                       Size
All                          50M
m_r < 21                     15M
10 stripes                   10M
0.1 < z < 0.3, -20 > M_r     2.2M
0.1 < z < 0.5, -21.4 > M_r   3.1M
-20 > M_r > -21              1182k
-21 > M_r > -23              931k
-21 > M_r > -22              662k
-22 > M_r > -23              269k

21
The Stripes