Title: Analyzing Large Datasets in Astrophysics
1. Analyzing Large Datasets in Astrophysics
Towards an International Virtual Observatory, Garching, 2002
(Living in an exponential world.)
- Alexander Szalay
- The Johns Hopkins University
2. Outline
- Collecting Data
- Exponential Growth
- Making Discoveries
- Publishing Data
- VO: How will it work?
- Web Services
- Atomic vs Composite services
- Distributed queries with SkyQuery
- Cross-Matching Algorithm
- SkyNode Web Services Portal
- Statistical Analysis of large data sets
3. The World is Exponential
- Astrophysical data is growing exponentially
- Doubling every year (Moore's Law): both data sizes and number of data sets
- Computational resources scale the same way
- A constant budget will keep up with the data
- Main problem is the software component
- Currently components are not reused
- Software costs are an increasingly large fraction
- Aggregate costs are growing exponentially
4. Making Discoveries
- When and where are discoveries made?
- Always at the edges and boundaries
- Going deeper, using more colors.
- Metcalfe's law
- Utility of computer networks grows as the number of possible connections: O(N^2)
- VO: Federation of N archives
- Possibilities for new discoveries grow as O(N^2)
- Current sky surveys have proven this
- Very early discoveries from SDSS, 2MASS, DPOSS
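The O(N^2) claim can be checked by counting pairs: a federation of N archives has N(N-1)/2 distinct pairs that could be cross-matched. A minimal sketch follows; the function name and the choice of the three surveys as a toy federation are illustrative assumptions, not part of the talk.

```python
from itertools import combinations

def possible_connections(n_archives: int) -> int:
    """Number of distinct archive pairs, N(N-1)/2, which grows as O(N^2)."""
    return n_archives * (n_archives - 1) // 2

# Toy federation (assumption): the three surveys named on the slide.
archives = ["SDSS", "2MASS", "DPOSS"]
pairs = list(combinations(archives, 2))
print(pairs)

# Growth of the pair count with the number of federated archives.
for n in (3, 10, 100):
    print(n, possible_connections(n))
```

Going from 10 to 100 archives multiplies the number of archives by 10 but the number of possible pairings by about 100.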
5. Publishing Data
Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists
6. Changing Roles
- Exponential growth
- Projects last at least 3-5 years
- Data sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
- Becoming Publishers and Curators
- Larger fraction of budget spent on software
- A lot of development is duplicated and wasted
- More standards are needed
- Easier data interchange, fewer tools
- More templates are needed
- Develop less software on your own
7. Emerging New Concepts
- Standardizing distributed data
- Web Services, supported on all platforms
- Custom configure remote data dynamically
- XML: Extensible Markup Language
- SOAP: Simple Object Access Protocol
- WSDL: Web Services Description Language
- Standardizing distributed computing
- Grid Services
- Custom configure remote computing dynamically
- Build your own remote computer, and discard
- Virtual Data: new data sets on demand
8. NVO: How Will It Work?
- Define commonly used atomic services
- Build higher level toolboxes/portals on top
- We do not build everything for everybody
- Use the 90-10 rule
- Define the standards and interfaces
- Build the framework
- Build the 10% of services that are used by 90% of users
- Let the users build the rest from the components
9. Atomic Services
- Metadata: information about resources
- Waveband
- Sky coverage
- Translation of names to universal dictionary (UCD)
- Simple search patterns on the resources
- Cone Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histogramming
- On-the-fly recalibrations
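As an illustration of how small an atomic service can be, the sketch below implements a cone search over an in-memory list: return every object within a given angular radius of a position. The catalog layout, function names, and toy coordinates are assumptions made for this example; a real service would run the same test inside the archive's database.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine form, stable for small angles)."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    h = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra0, dec0, radius_deg):
    """Return the catalog rows within radius_deg of the position (ra0, dec0)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra0, dec0) <= radius_deg]

# Toy catalog (made-up positions, in degrees).
toy_catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.70},
    {"id": 3, "ra": 185.00, "dec": 2.00},
]
hits = cone_search(toy_catalog, 181.3, -0.76, 0.1)
print([row["id"] for row in hits])
```

This linear scan is only meant to show the interface; it does not scale to survey-sized catalogs, which is why the archives rely on a spatial index.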
10. Higher Level Services
- Built on Atomic Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Outlier detections
- Visualization facilities
- Expectation
- Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
11. SkyQuery
- Distributed Query tool using a set of services
- Feasibility study, built in 6 weeks from scratch
- Tanu Malik (JHU CS grad student)
- Tamas Budavari (JHU astro postdoc)
- Implemented in C# and .NET
- Won 2nd prize in the Microsoft XML Contest
- Allows queries like:
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.i - t.m_j) > 2
12. Architecture
[Architecture diagram; components: Web Page, Image cutout, SkyQuery, SkyNode SDSS, SkyNode 2MASS, SkyNode FIRST]
13. Cross-id Steps
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND (o.i - t.m_j) > 2 AND o.type = 3
- Parse query
- Get counts
- Sort by counts
- Make plan
- Cross-match
- Recursively, from small to large
- Select necessary attributes only
- Return output
- Insert cutout image
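The "get counts, sort by counts, cross-match from small to large" steps can be sketched with in-memory lists standing in for the SkyNodes. Everything here is a simplified assumption: the function names, the toy positions, and the matching rule. The slide does not describe how XMATCH scores a candidate, so this sketch simply requires each new object to lie within a fixed tolerance of the seed object.

```python
import math

def sep_arcsec(a, b):
    """Small-angle separation in arcseconds between two (ra, dec) points in degrees."""
    dra = (a[0] - b[0]) * math.cos(math.radians(0.5 * (a[1] + b[1])))
    ddec = a[1] - b[1]
    return 3600.0 * math.hypot(dra, ddec)

def cross_match_plan(nodes, tol_arcsec):
    """nodes: {archive_name: [(obj_id, ra, dec), ...]}, already filtered per archive."""
    # Get counts and sort, so the smallest result set starts the chain.
    order = sorted(nodes, key=lambda name: len(nodes[name]))
    # Partial matches, seeded from the smallest archive.
    partial = [{order[0]: obj} for obj in nodes[order[0]]]
    for name in order[1:]:
        extended = []
        for match in partial:
            anchor = match[order[0]]  # compare against the seed object (simplification)
            for obj in nodes[name]:
                if sep_arcsec(anchor[1:], obj[1:]) < tol_arcsec:
                    extended.append({**match, name: obj})
        partial = extended  # tuples without a counterpart drop out here
    return order, partial

# Toy data (made-up ids and positions, in degrees).
toy_nodes = {
    "SDSS":    [(1, 181.3000, -0.7600), (2, 181.3100, -0.7500), (3, 181.4000, -0.7000)],
    "TWOMASS": [(10, 181.3002, -0.7601), (11, 181.4001, -0.7000)],
    "FIRST":   [(100, 181.3001, -0.7600)],
}
order, matches = cross_match_plan(toy_nodes, tol_arcsec=3.5)
print(order)
print([{name: obj[0] for name, obj in m.items()} for m in matches])
```

Starting from the smallest set keeps the list of partial matches short at every step, so the amount of data passed from one node to the next is bounded by the most selective archive.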
14. Monte-Carlo Simulation
- Comparing different algorithms for 3-way cross-identification (xid)
- Transmit all the data
- Transmit after filtering
- Recursive cross-match
- Surveys
- SDSS
- 2MASS
- FIRST
- Random variables
- Sky area (0..10 sq. deg.)
- Selectivity of each subselect (0..1)
- Efficiency of join (0.5..2)
- Selectivity of common select (0..1)
15. SkyNode
- Metadata functions (SOAP)
- Info, Tables, Columns, Schema, Functions, Keysearch
- Query functions (SOAP)
- Dataset Query(String sqlCmd)
- Dataset Xmatch(Dataset input, String sqlCmd, float eps)
- Database
- MS SQL Server
- Upload dataset
- Very fast spatial search engine (HTM-based): crossmatch takes < 3 ms/object over 15M in SDSS
- User defined functions and stored procedures
16. Data Flow
[Data-flow diagram; label: query]
http://www.skyquery.net
17. Optimal Statistics
- The examples for optimal statistics have poor scaling
- Correlation functions N^2, likelihood techniques N^3
- As data sizes grow at Moore's law, computers can only keep up with at most N log N algorithms
- What goes?
- Notion of optimal is in the sense of statistical errors
- Assumes infinite computational resources
- Assumes that the only source of error is statistical
- Cosmic variance: we can only observe the Universe from one location (finite sample size)
- Solutions require a combination of Statistics and CS
- New algorithms: not worse than N log N
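The scaling argument can be made concrete with a few lines of arithmetic. The sketch below assumes, as the talk does, that both the data size and the available computing power double every year; the starting size n0 and the function name are arbitrary choices for illustration.

```python
import math

def relative_runtime(years, cost, n0=1_000_000):
    """Runtime after `years`, relative to today, if data size and CPU speed both double yearly."""
    growth = 2.0 ** years
    return (cost(n0 * growth) / cost(n0)) / growth

costs = {
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}
for label, cost in costs.items():
    print(label, [round(relative_runtime(y, cost), 2) for y in (0, 5, 10)])
```

Under these assumptions an N log N analysis takes only modestly longer after ten years, while an N^2 analysis takes about a thousand times longer and an N^3 analysis about a million times longer.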
18. Clever Data Structures
- Heavy use of tree structures
- Up-front cost, but only N log N
- Large speedup later
- Tree-codes for correlations (A. Moore et al. 2001)
- Fast, approximate heuristic algorithms
- No need to be more accurate than cosmic variance
- Fast CMB analysis by Szapudi et al. (2001)
- N log N instead of N^3 => 1 day instead of 10 million years
- Take cost of computation into account
- Controlled level of accuracy
- Best result in a given time, given our computing
resources
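The "up-front cost, large speedup later" idea can be shown in one dimension, with a sorted array standing in for a tree. This is not the tree-code cited above, only a minimal sketch of the same trade-off: sort once at N log N cost, then answer each neighbour query with a binary search instead of a full scan. The function names and the toy integer positions are assumptions.

```python
import bisect
import random

def count_pairs_brute(xs, r):
    """O(N^2): test every pair directly."""
    n = len(xs)
    return sum(1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r)

def count_pairs_indexed(xs, r):
    """O(N log N): sort once, then one binary search per point."""
    s = sorted(xs)  # up-front cost
    total = 0
    for i, x in enumerate(s):
        # hi is the index just past the last point with value <= x + r,
        # so the points at indices i+1 .. hi-1 each form a pair with x.
        hi = bisect.bisect_right(s, x + r)
        total += hi - (i + 1)
    return total

# Toy positions in arbitrary integer units (seeded for reproducibility).
rng = random.Random(0)
positions = [rng.randrange(10_000) for _ in range(500)]
print(count_pairs_brute(positions, 25) == count_pairs_indexed(positions, 25))
```

Both functions return the same pair count; only the indexed one stays practical as the number of points grows.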
19. Angular Clustering with Photo-z
- w(θ) by Peebles and Groth
- The first example of publishing and analyzing large data
- Samples based on rest-frame quantities
- Strictly volume limited samples
- Largest angular correlation study to date
- Very clear detection of
- Luminosity and color dependence
- Results consistent with 3D clustering
T. Budavari, A. Connolly, I. Csabai, I. Szapudi,
A. Szalay, S. Dodelson, J. Frieman, R. Scranton,
D. Johnston and the SDSS Collaboration
20. The Samples
2800 square degrees in 10 stripes, data in custom DB
Sample                       Galaxies
All                          50M
m_r < 21                     15M
10 stripes                   10M
0.1 < z < 0.3, -20 > M_r     2.2M
0.1 < z < 0.5, -21.4 > M_r   3.1M
-20 > M_r > -21              1182k
-21 > M_r > -23              931k
-21 > M_r > -22              662k
-22 > M_r > -23              269k
21. The Stripes
- 10 stripes over the SDSS area, covering about 2800 square degrees
- About 20% lost due to bad seeing
- Masks: seeing, bright stars, etc.
- Images generated from query by web service
22. The Masks
- Stripe 11 masks
- Masks are derived from the database
- Search and intersect extended objects with boundaries
23. The Analysis
- eSpICE: I. Szapudi, S. Colombi and S. Prunet
- Integrated with the database by T. Budavari
- Extremely fast processing (N log N)
- 1 stripe with about 1 million galaxies is processed in 3 mins
- Usual figure was 10 min for 10,000 galaxies => 70 days
- Each stripe processed separately for each cut
- 2D angular correlation function computed
- w(θ): average with rejection of pixels along the scan
- Flat field vector causes mock correlations
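The 70-day figure follows from the quoted benchmark if the usual method scales as N^2, the scaling given earlier for correlation functions. A short check is sketched below; the quadratic exponent is that assumption and the function name is illustrative.

```python
def scaled_runtime_minutes(t_ref_min, n_ref, n, exponent=2):
    """Runtime for n objects if the cost scales as n**exponent, calibrated at n_ref."""
    return t_ref_min * (n / n_ref) ** exponent

# 10 minutes for 10,000 galaxies, extrapolated to a stripe of 1 million galaxies.
minutes = scaled_runtime_minutes(10, 10_000, 1_000_000)
days = minutes / (60 * 24)
print(f"{minutes:.0f} min = {days:.1f} days")
```

One hundred times more galaxies costs 10,000 times more time, so 10 minutes becomes 100,000 minutes, or roughly 70 days, compared with the 3 minutes quoted for the N log N code.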
24. Angular Correlations I.
- Luminosity dependence: 3 cuts
- -20 > M > -21
- -21 > M > -22
- -22 > M > -23
25. Angular Correlations II.
- Color Dependence
- 4 bins by rest-frame SED type
26. Summary
- Exponential data growth: distributed data
- Web Services: hierarchical architecture
- Use the 90-10 rule (maybe 80-20)
- There are clever ways to federate datasets!
- Statistical analyses do not follow Moore's law
- Need to revisit optimal statistics
- Give interesting new tools into the hands of smart young people
- They will quickly turn them into cutting edge science
27. Virtual Observatory
Astronomy with an attitude