Analyzing Large Datasets in Astrophysics

1
Analyzing Large Datasets in Astrophysics
Towards an International Virtual Observatory, Garching, 2002
(Living in an exponential world.)
  • Alexander Szalay
  • The Johns Hopkins University

2
Outline
  • Collecting Data
  • Exponential Growth
  • Making Discoveries
  • Publishing Data
  • VO: How will it work?
  • Web Services
  • Atomic vs Composite services
  • Distributed queries with SkyQuery
  • Cross-Matching Algorithm
  • SkyNode Web Services Portal
  • Statistical Analysis of large data sets

3
The World is Exponential
  • Astrophysical data is growing exponentially
  • Doubling every year (Moore's Law): both data
    sizes and number of data sets
  • Computational resources scale the same way
  • Constant funding will keep up with the data
  • Main problem is the software component
  • Currently components are not reused
  • Software costs are an increasingly large fraction
  • Aggregate costs are growing exponentially

4
Making Discoveries
  • When and where are discoveries made?
  • Always at the edges and boundaries
  • Going deeper, using more colors.
  • Metcalfe's law
  • Utility of computer networks grows as the number
    of possible connections, O(N^2)
  • VO: Federation of N archives
  • Possibilities for new discoveries grow as O(N^2)
  • Current sky surveys have proven this
  • Very early discoveries from SDSS, 2MASS, DPOSS
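
The O(N^2) argument above can be made concrete by counting the distinct pairs of archives that could be cross-matched in a federation: N(N-1)/2. The following is only a small illustrative sketch; the function name is made up for this example, and the three surveys are the ones named on the slide.

```python
from itertools import combinations

def pairwise_links(archives):
    """Return every unordered pair of archives that could be cross-matched."""
    return list(combinations(archives, 2))

archives = ["SDSS", "2MASS", "DPOSS"]
links = pairwise_links(archives)
print(links)

# The pair count grows quadratically with the number of federated archives.
for n in (3, 10, 100):
    print(n, "archives ->", n * (n - 1) // 2, "possible pairings")
```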

5
Publishing Data
Roles        Authors         Publishers        Curators         Consumers
Traditional  Scientists      Journals          Libraries        Scientists
Emerging     Collaborations  Project www site  Bigger Archives  Scientists

6
Changing Roles
  • Exponential growth
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
  • More responsibility on projects
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • A lot of development is duplicated and wasted
  • More standards are needed
  • Easier data interchange, fewer tools
  • More templates are needed
  • Develop less software on your own

7
Emerging New Concepts
  • Standardizing distributed data
  • Web Services, supported on all platforms
  • Custom configure remote data dynamically
  • XML: Extensible Markup Language
  • SOAP: Simple Object Access Protocol
  • WSDL: Web Services Description Language
  • Standardizing distributed computing
  • Grid Services
  • Custom configure remote computing dynamically
  • Build your own remote computer, and discard
  • Virtual Data: new data sets on demand

8
NVO: How Will It Work?
  • Define commonly used atomic services
  • Build higher level toolboxes/portals on top
  • We do not build everything for everybody
  • Use the 90-10 rule
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90%
  • Let the users build the rest from the components

9
Atomic Services
  • Metadata: information about resources
  • Waveband
  • Sky coverage
  • Translation of names to universal dictionary
    (UCD)
  • Simple search patterns on the resources
  • Cone Search
  • Image mosaic
  • Unit conversions
  • Simple filtering, counting, histogramming
  • On-the-fly recalibrations
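
As an illustration of the simplest search pattern listed above, a cone search returns every catalog entry within a given angular radius of a sky position. The sketch below is a minimal in-memory version, not the interface of any actual VO service: the catalog rows, field names and the choice of degrees for the radius are all assumptions made for the example.

```python
import math

def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))

def cone_search(catalog, ra, dec, radius_deg):
    """Return the rows of `catalog` lying within `radius_deg` of (ra, dec)."""
    return [row for row in catalog
            if angular_separation_deg(row["ra"], row["dec"], ra, dec) <= radius_deg]

# Hypothetical toy catalog; positions are in degrees.
catalog = [
    {"id": 1, "ra": 181.30, "dec": -0.76},
    {"id": 2, "ra": 181.35, "dec": -0.70},
    {"id": 3, "ra": 190.00, "dec": 5.00},
]
hits = cone_search(catalog, ra=181.3, dec=-0.76, radius_deg=0.1)
print([row["id"] for row in hits])
```

A linear scan like this is fine for a toy list; a real archive would put a spatial index behind the same call.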

10
Higher Level Services
  • Built on Atomic Services
  • Perform more complex tasks
  • Examples
  • Automated resource discovery
  • Cross-identifications
  • Photometric redshifts
  • Outlier detections
  • Visualization facilities
  • Expectation
  • Build custom portals in a matter of days from
    existing building blocks (like today in IRAF or
    IDL)

11
SkyQuery
  • Distributed Query tool using a set of services
  • Feasibility study, built in 6 weeks from scratch
  • Tanu Malik (JHU CS grad student)
  • Tamas Budavari (JHU astro postdoc)
  • Implemented in C# and .NET
  • Won 2nd prize of Microsoft XML Contest
  • Allows queries like

SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND o.type = 3 AND (o.i - t.m_j) > 2

12
Architecture
[Architecture diagram; labels: Web Page, Image cutout, SkyQuery, SkyNode (SDSS), SkyNode (2MASS), SkyNode (FIRST)]

13
Cross-id Steps
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
  AND (o.i - t.m_j) > 2 AND o.type = 3
  • Parse query
  • Get counts
  • Sort by counts
  • Make plan
  • Cross-match
  • Recursively, from small to large
  • Select necessary attributes only
  • Return output
  • Insert cutout image
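
The "get counts, sort by counts, cross-match from small to large" steps above can be sketched roughly as follows. This is a simplified stand-in and not the SkyQuery implementation: it uses a plain small-angle distance cut against the position from the first archive in the plan, ignores RA wrap-around, and all names and toy positions are hypothetical.

```python
import math

def separation_arcsec(a, b):
    """Small-angle separation between two (ra, dec) positions given in degrees."""
    dra = (a[0] - b[0]) * math.cos(math.radians(0.5 * (a[1] + b[1])))
    ddec = a[1] - b[1]
    return 3600.0 * math.hypot(dra, ddec)

def make_plan(counts):
    """Order the archives from fewest to most candidate objects."""
    return sorted(counts, key=counts.get)

def cross_match(archives, eps_arcsec):
    """Chain the archives smallest first, keeping only tuples matched in all of them."""
    counts = {name: len(rows) for name, rows in archives.items()}
    plan = make_plan(counts)
    # Start with one partial tuple per object of the smallest archive.
    partial = [{plan[0]: row} for row in archives[plan[0]]]
    for name in plan[1:]:
        extended = []
        for match in partial:
            anchor = match[plan[0]]  # (id, ra, dec) from the first archive in the plan
            for row in archives[name]:
                if separation_arcsec(anchor[1:], row[1:]) < eps_arcsec:
                    extended.append({**match, name: row})
        partial = extended  # only surviving tuples move on to the next archive
    return plan, partial

# Hypothetical toy archives: rows are (id, ra_deg, dec_deg).
archives = {
    "SDSS": [(1, 181.3000, -0.7600), (2, 181.3100, -0.7500), (3, 181.3200, -0.7400)],
    "TWOMASS": [(10, 181.3001, -0.7601), (11, 181.4000, -0.7000)],
}
plan, matches = cross_match(archives, eps_arcsec=3.5)
print(plan)
print(matches)
```

Starting from the archive with the fewest candidates keeps the set of partial tuples passed along to the next archive as small as possible.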

14
Monte-Carlo Simulation
  • Comparing different algorithms for 3-way cross-identification
  • Transmit all the data
  • Transmit after filtering
  • Recursive cross-match
  • Surveys
  • SDSS
  • 2MASS
  • FIRST
  • Random variables
  • Sky Area (0..10 sq. deg)
  • Selectivity of each subselect (0..1)
  • Efficiency of join (0.5..2)
  • Selectivity of common select (0..1)

15
SkyNode
  • Metadata functions (SOAP)
  • Info, Tables, Columns, Schema, Functions,
    Keysearch
  • Query functions (SOAP)
  • Dataset Query(String sqlCmd)
  • Dataset Xmatch(Dataset input, String sqlCmd,
    float eps)
  • Database
  • MS SQL Server
  • Upload dataset
  • Very fast spatial search engine
    (HTM-based): crossmatch takes < 3 ms/object over
    15M objects in SDSS
  • User defined functions and stored procedures

16
Data Flow
[Data-flow diagram; label: query]
http://www.skyquery.net

17
Optimal Statistics
  • The examples for optimal statistics have poor
    scaling
  • Correlation functions: N^2; likelihood techniques:
    N^3
  • As data sizes grow at Moore's law, computers can
    only keep up with at most N log N algorithms
  • What goes?
  • Notion of optimal is in the sense of statistical
    errors
  • Assumes infinite computational resources
  • Assumes that only source of error is statistical
  • Cosmic Variance: we can only observe the
    Universe from one location (finite sample size)
  • Solutions require combination of Statistics and
    CS
  • New algorithms not worse than N log N
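
The scaling argument on this slide can be checked with a few lines of arithmetic. The sketch below assumes that both the data size and the available computing power double every year, and reports how the runtime of an algorithm changes relative to today; the starting size of 10^6 objects is an arbitrary choice for illustration.

```python
import math

def relative_runtime(cost, n0, years):
    """Runtime after `years` of yearly doubling of data and compute, relative to today."""
    n = n0 * 2 ** years          # data size after `years` doublings
    speedup = 2 ** years         # computers are faster by the same factor
    return cost(n) / cost(n0) / speedup

costs = {
    "N log N": lambda n: n * math.log(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}

for name, cost in costs.items():
    ratios = [round(relative_runtime(cost, 1e6, y), 2) for y in (0, 5, 10)]
    print(f"{name:8s} runtime ratio after 0, 5, 10 years: {ratios}")
```

Under these assumptions the N log N runtime grows only through the logarithm, while the N^2 and N^3 runtimes grow by the doubling factor and its square.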

18
Clever Data Structures
  • Heavy use of tree structures
  • Up-front cost, but only N log N
  • Large speedup later
  • Tree-codes for correlations (A. Moore et al. 2001)
  • Fast, approximate heuristic algorithms
  • No need to be more accurate than cosmic variance
  • Fast CMB analysis by Szapudi et al. (2001)
  • N log N instead of N^3: 1 day instead of 10
    million years
  • Take cost of computation into account
  • Controlled level of accuracy
  • Best result in a given time, given our computing
    resources
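
To illustrate how a tree structure trades an up-front build cost for faster queries later, here is a minimal 2-D k-d tree used to count pairs of points closer than a given radius, checked against a brute-force count. It is a generic sketch on a flat plane, not the tree code cited on the slide; re-sorting at every level keeps the build short at the price of an extra log factor.

```python
import random

def build(points, depth=0):
    """Build a 2-D k-d tree as nested tuples (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid], axis,
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def count_within(node, q, r):
    """Count tree points within distance r of q, skipping branches that are too far."""
    if node is None:
        return 0
    point, axis, left, right = node
    n = 1 if (point[0] - q[0]) ** 2 + (point[1] - q[1]) ** 2 <= r * r else 0
    delta = q[axis] - point[axis]
    if delta <= r:     # the left half-space can still contain points within r
        n += count_within(left, q, r)
    if delta >= -r:    # the right half-space can still contain points within r
        n += count_within(right, q, r)
    return n

def pair_count_tree(points, r):
    """Number of unordered pairs closer than r, using the tree for each query."""
    tree = build(points)
    return sum(count_within(tree, p, r) - 1 for p in points) // 2

def pair_count_brute(points, r):
    """Reference O(N^2) pair count."""
    return sum(1 for i, a in enumerate(points) for b in points[i + 1:]
               if (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2 <= r * r)

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]
print(pair_count_tree(pts, 0.1), pair_count_brute(pts, 0.1))
```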

19
Angular Clustering with Photo-z
  • w(θ) by Peebles and Groth
  • The first example of publishing and analyzing
    large data
  • Samples based on rest-frame quantities
  • Strictly volume limited samples
  • Largest angular correlation study to date
  • Very clear detection of
  • Luminosity and color dependence
  • Results consistent with 3D clustering

T. Budavari, A. Connolly, I. Csabai, I. Szapudi,
A. Szalay, S. Dodelson, J. Frieman, R. Scranton,
D. Johnston and the SDSS Collaboration
20
The Samples
2800 square degrees in 10 stripes, data in custom DB

Sample                        Size
All                           50M
m_r < 21                      15M
10 stripes                    10M
0.1 < z < 0.3, -20 > M_r      2.2M
0.1 < z < 0.5, -21.4 > M_r    3.1M
-20 > M_r > -21               1182k
-21 > M_r > -23               931k
-21 > M_r > -22               662k
-22 > M_r > -23               269k

21
The Stripes
  • 10 stripes over the SDSS area, covering about
    2800 square degrees
  • About 20% lost due to bad seeing
  • Masks: seeing, bright stars, etc.
  • Images generated from query by web service

22
The Masks
  • Stripe 11 masks
  • Masks are derived from the database
  • Search and intersect extended objects with
    boundaries

23
The Analysis
  • eSpICE: I. Szapudi, S. Colombi and S. Prunet
  • Integrated with the database by T. Budavari
  • Extremely fast processing (N log N)
  • 1 stripe with about 1 million galaxies is
    processed in 3 mins
  • Usual figure was 10 min for 10,000 galaxies, which
    scales to about 70 days for 1 million
  • Each stripe processed separately for each cut
  • 2D angular correlation function computed
  • w(θ) average with rejection of pixels along the
    scan
  • flat field vector causes mock correlations
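
The 70-day figure quoted above follows from scaling the usual 10 minutes for 10,000 galaxies up to a 1-million-galaxy stripe, assuming the older method scales as N^2 (the pair-counting scaling quoted earlier for correlation functions). A short check of that arithmetic, with a hypothetical helper name:

```python
def scaled_minutes(base_minutes, base_n, n, exponent=2):
    """Extrapolate a runtime from base_n to n objects, assuming cost ~ N**exponent."""
    return base_minutes * (n / base_n) ** exponent

# 10 minutes for 10,000 galaxies, extrapolated to 1 million galaxies.
minutes = scaled_minutes(10, 10_000, 1_000_000)
days = minutes / (60 * 24)
print(f"{days:.1f} days")

# Compared with the 3 minutes per stripe quoted for the N log N code.
print(f"speedup factor: {minutes / 3:.0f}")
```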

24
Angular Correlations I.
  • Luminosity dependence 3 cuts
  • -20 > M > -21
  • -21 > M > -22
  • -22 > M > -23

25
Angular Correlations II.
  • Color Dependence
  • 4 bins by rest-frame SED type

26
Summary
  • Exponential data growth: distributed data
  • Web Services: hierarchical architecture
  • Use the 90-10 rule (maybe 80-20)
  • There are clever ways to federate datasets!
  • Statistical analyses do not follow Moore's law
  • Need to revisit optimal statistics
  • Give interesting new tools into the hands of
    smart young people
  • They will quickly turn them into cutting edge
    science

27
Virtual Observatory
Astronomy with an attitude