Web Services and the VO: Using SDSS DR1

1
Web Services and the VO: Using SDSS DR1
  • Alex Szalay and Jim Gray
  • with
  • Tamas Budavari, Sam Carlisle, Vivek Haridas,
    Nolan Li, Tanu Malik, Maria Nieto-Santisteban,
    Wil O'Mullane, Ani Thakar

2
Changing Roles
  • Exponential growth
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
  • More responsibility on projects
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • A lot of development is duplicated and wasted
  • More standards are needed
  • Easier data interchange, fewer tools
  • More templates are needed
  • Develop less software on your own

3
Standards and Interoperability
  • Standards driven by e-business requirements
  • Exchange of rich and structured data (XML)
  • DB connectivity, Web Services, Grid computing
  • Application to astronomy domain
  • Data dictionaries (UCDs)
  • Data models
  • Protocols
  • Registries and resource/service discovery
  • Provenance, data quality
  • Dealing with the astronomy legacy
  • FITS data format
  • Software systems

Boundary conditions
4
Virtual Observatory
  • Many new surveys are coming
  • SDSS is a dry run for the next ones
  • LSST will be 5TB/night
  • All the data will be on the Internet
  • But how? FTP? Web services?
  • Data and apps will be associated with the
    instruments
  • Distributed world wide
  • Cross-indexed
  • Federation is a must, but how?
  • Will be the best telescope in the world
  • World Wide Telescope

5
SkyServer: SkyServer.SDSS.org or
Skyserver.Pha.Jhu.edu/DR1/
  • Sloan Digital Sky Survey: data, pixels, data mining
  • About 400 attributes per object
  • Spectrograms for 1% of objects
  • Demo: pixel space, record space, set space, teaching

6
Show Cutout Web Service
7
SkyQuery (http://skyquery.net/)
  • Distributed Query tool using a set of web
    services
  • Four astronomy archives from Pasadena, Chicago,
    Baltimore, Cambridge (England).
  • Feasibility study, built in 6 weeks
  • Tanu Malik (JHU CS grad student)
  • Tamas Budavari (JHU astro postdoc)
  • With help from Szalay, Thakar, Gray
  • Implemented in C# and .NET
  • Allows queries like

SELECT o.objId, o.r, o.type, t.objId FROM
SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.i - t.m_j) > 2
8
SkyQuery Structure
  • Each SkyNode publishes
  • Schema Web Service
  • Database Web Service
  • The portal:
  • Plans the query (2 phases; see the sketch below)
  • Integrates answers
  • Is itself a web service
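The two-phase plan is easiest to see at the SQL level. A rough sketch, assuming illustrative table names and a #partial temp table standing in for the result set shipped between nodes (the real portal generates these queries internally):

-- Phase 1 (illustrative): each SkyNode reports how many objects pass
-- its local predicates, so the portal can order the execution chain
-- from the most selective node outward.
SELECT COUNT(*) FROM PhotoPrimary WHERE type = 3

-- Phase 2 (illustrative): a node receives the partial match list from
-- the previous node, cross-matches it against its own catalog, and
-- forwards the joined rows to the next node in the chain.
CREATE TABLE #partial (objId bigint, ra float, dec float)
SELECT m.objId AS upstreamId, p.objId AS localId, p.ra, p.dec
FROM #partial AS m
JOIN PhotoPrimary AS p
  ON  p.ra  BETWEEN m.ra  - 0.001 AND m.ra  + 0.001   -- ~3.5 arcsec box;
  AND p.dec BETWEEN m.dec - 0.001 AND m.dec + 0.001   -- cos(dec) ignored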

9
National Virtual Observatory
  • NSF ITR project, "Building the Framework for the
    National Virtual Observatory", is a collaboration
    of 17 funded and 3 unfunded organizations
  • Astronomy data centers
  • National observatories
  • Supercomputer centers
  • University departments
  • Computer science/information technology
    specialists
  • Trying to build standards, interfaces and
    prototypes
  • Goal: federate datasets, enable new discoveries,
    make it easier to publish new data

10
International Collaboration
  • Similar efforts now in more than 12 countries
  • USA, Canada, UK, France, Germany, Italy, Japan,
    Australia, India, China, Korea, Russia, South
    Africa, Hungary, ...
  • Active collaboration among projects
  • Standards, common demos
  • International VO roadmap being developed
  • Regular telecons over 10 timezones
  • Formal collaboration
  • International Virtual Observatory Alliance (IVOA)

11
NVO: How Will It Work?
  • Huge pressure to build something useful today
  • We do not build everything for everybody
  • Use the 90-10 rule
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90%
  • Let the users build the rest from the components
  • Define commonly used core services
  • Build higher level toolboxes/portals on top

12
Core Services
  • Metadata: information about resources
  • Waveband
  • Sky coverage
  • Translation of names to universal dictionary
    (UCD)
  • Simple search patterns on the resources
  • Cone Search (see the sketch below)
  • Image mosaic
  • Unit conversions
  • Simple filtering, counting, histogramming
  • On-the-fly recalibrations
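As an example of how thin a core service can be: a Cone Search request (RA, Dec, radius) maps onto a single database call. A minimal sketch, assuming a SkyServer-style fGetNearbyObjEq table-valued function taking the radius in arcminutes (the function name and units are assumptions here):

-- All primary objects within 1 arcmin of (ra, dec) = (195.0, 2.5);
-- the web service just wraps this query and returns a VOTable.
SELECT p.objID, p.ra, p.dec, p.r, p.type
FROM fGetNearbyObjEq(195.0, 2.5, 1.0) AS n
JOIN PhotoPrimary AS p ON p.objID = n.objID
ORDER BY n.distance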

13
Higher Level Services
  • Built on Core Services
  • Perform more complex tasks
  • Examples
  • Automated resource discovery
  • Cross-identifications
  • Photometric redshifts
  • Outlier detections
  • Visualization facilities (connect pixels to
    objects)
  • Expectation
  • Build custom portals in a matter of days from
    existing building blocks (as is done today in IRAF
    or IDL)

14
Using SDSS DR1 as a Prototype
  • SDSS DR1 (Data Release 1) is now publicly
    available
  • http://skyserver.pha.jhu.edu/dr1/
  • About 1TB of catalog data
  • Using MS SQL Server 2000
  • Complex schema (72 Tables)
  • About 80 million photometric objects
  • Two versions (TARGET/BEST)
  • Automated documentation
  • Raw data at FNAL file server with URL access

15
DR1 SkyServer
  • Classic 2-node web server ($20k total)
  • 1TB database
  • 1M hits per month (180k on the peak day)
  • DSS load killing us: 12M rows per hour downloads
  • Answer set size follows power law

Jim Gray: Quick Performance Measure of SkyServer
DR1, 4 June 2003, 15:00 PST
16
Loading DR1
  • Automated table driven workflow system for
    loading
  • Included lots of verification code (one such check
    is sketched below)
  • Over 16K lines of SQL code
  • Loading process was extremely painful
  • Lack of systems engineering for the pipelines
  • Poor testing (lots of foreign key mismatches)
  • Data bugs were still turning up as recently as a
    month ago
  • Most of the time spent on scrubbing data
  • Fixing corrupted files (RAID5 disk errors)
  • Once data was clean, everything loaded in 3 days
  • Neighbors calculation took about 10 hours
  • Reorganization of data took about 1 week of
    experiments in partitioning/layouts
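A flavor of the verification code, as a sketch (table and column names are illustrative of the DR1 schema, not lifted from the loader):

-- One of the simplest checks: spectra whose photometric counterpart
-- is missing, i.e. the kind of foreign key mismatch that had to be
-- caught and scrubbed before the final load.
SELECT COUNT(*) AS orphanSpectra
FROM SpecObjAll AS s
LEFT JOIN PhotoObjAll AS p ON p.objID = s.targetObjID
WHERE s.targetObjID <> 0
  AND p.objID IS NULL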

17
Public Data Release Versions
  • June 2001: EDR
  • Early Data Release
  • July 2003: DR1
  • Contains 30% of the final data
  • 200 million photo objects
  • 4 versions of the data
  • Target, best, runs, spectro
  • Total catalog volume 1.7TB
  • See the TeraScale SneakerNet paper
  • Published releases served forever
  • EDR, DR1, DR2, ...
  • Soon to include email archives, annotations
  • O(N²) only possible because of Moore's Law!

18
Organization and Reorganization
  • Introduced partitions and filegroups
  • Photo, Tag, Neighbors, Spectro, Frame, Other,
    Profiles
  • Keep partitions under 100GB
  • Vertical partitioning tried and abandoned
  • Both partitioning and index builds are now table
    driven
  • Stored procedures create/drop indices at various
    granularities (see the sketch below)
  • Tremendous improvement in performance when doing
    this on a large memory machine (24GB)
  • Also much better performance afterwards
  • But all this was nice, not hard
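For concreteness, a minimal sketch of the filegroup layout under SQL Server 2000 (database name, file path, and sizes are illustrative; the real loader drives this from its tables):

-- A dedicated filegroup for the big photo tables, kept under ~100GB.
ALTER DATABASE DR1 ADD FILEGROUP PhotoFG
ALTER DATABASE DR1 ADD FILE
  (NAME = PhotoFG_1, FILENAME = 'G:\DR1\PhotoFG_1.ndf', SIZE = 80GB)
  TO FILEGROUP PhotoFG
GO
-- Building the clustered index on that filegroup places the table
-- there; the stored procedures create and drop such indices as the
-- driver tables dictate.
CREATE CLUSTERED INDEX ci_PhotoObj_objID
  ON PhotoObj(objID) ON PhotoFG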

19
Spatial Features
  • Precomputed Neighbors
  • All objects within 30 arcsec (example query below)
  • Boundaries, Masks and Outlines
  • Stored as spatial polygons
  • Time Domain
  • Precomputed Match
  • All objects within 1 arcsec, observed at different times
  • Found duplicates due to telescope tracking errors
  • Manual fix, recorded in the database
  • MatchHead
  • The first observation in the linked list is used
    as the unique id of the chain of observations of
    the same object
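An example of what the precomputed Neighbors table buys you (column names follow the SkyServer schema, where distance is stored in arcminutes):

-- Close galaxy pairs straight from the precomputed table: no spatial
-- search at query time, just an indexed join.
SELECT n.objID, n.neighborObjID, n.distance
FROM Neighbors AS n
JOIN PhotoPrimary AS p ON p.objID = n.objID
WHERE p.type = 3               -- galaxies
  AND n.neighborType = 3
  AND n.distance < 0.1         -- within 6 arcsec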

20
Spatial Data Access SQL extension
  • Szalay, Gray, Kunszt, Fekete, O'Mullane, Brunner:
    http://www.sdss.jhu.edu/htm
  • Added Hierarchical Triangular Mesh (HTM)
    table-valued functions for spatial joins
  • Every object has a 20-deep Mesh ID
  • Given a spatial definition, the routine returns up
    to 10 covering triangles (see the sketch below)
  • Spatial query is then up to 10 range queries
  • Fast: 10,000 triangles / second / CPU
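A minimal sketch of such a query; the cover function's name and its 'CIRCLE J2000 ra dec radius_arcmin' argument format are assumptions (they differ between releases), but the shape of the query is the point:

-- The cover function returns up to ~10 (HTMIDstart, HTMIDend) ranges
-- for the region, so the spatial search becomes that many range scans
-- on the htmID index.
SELECT p.objID, p.ra, p.dec
FROM fHTM_Cover('CIRCLE J2000 195.0 2.5 1.0') AS c
JOIN PhotoObj AS p
  ON p.htmID BETWEEN c.HTMIDstart AND c.HTMIDend
-- an exact distance test then trims the corners of the covering
-- triangles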

21
Web Services in Progress
  • Registry
  • Harvesting and querying, discovery of new
    services
  • Data Delivery
  • Query-driven queue management
  • MyDB, VOSpace, VOProfile: minimize data movement
  • Graphics and visualization
  • Query driven vs interactive
  • Show spatial objects (Chart/Navi/List)
  • Footprint/intersect
  • It is a fractal
  • Cross-matching
  • SkyQuery and SkyNode
  • Ferris-wheel
  • Distributed vs parallel

22
Graphics/Visualization Tools
  • Density plot
  • Show densities of attributes as a function of sky
    position
  • Chart/Navi/List
  • Tie together catalogs and pixels
  • Spectrum viewer
  • Display spectra of galaxies and stars drawn from
    the database
  • Filter profiles
  • Catalog of astronomical filters (optical
    bandpasses)
  • Mirage with VO connections
  • Linked multi-pane visualization tool (Bell Labs)
  • VO extensions built at JHU

23
Other Tools
  • Spatial
  • Cone Search
  • SkyNode
  • CrossMatch (SkyQuery)
  • Footprint
  • Information Management
  • Registry services
  • Name resolver
  • Cosmological calculator
  • CASService
  • MyDB
  • VOSpace
  • User Authentication

24
Registry: Easy Clients
  • Just use a SOAP toolkit (T. McGlynn and J. Lee have
    done a Perl client).
  • Easy in Java
  • java org.apache.axis.wsdl.WSDL2Java
    "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"
  • Gives set of Classes for accessing the service
  • Gives Classes for the XML which is returned (i.e.
    SimpleResource)
  • Still need to write a client, like:
  • RegistryLocator loc = new RegistryLocator();
  • RegistrySoap reg = loc.getRegistrySoap();
  • ArrayOfSimpleResource reses = null;
  • reses = reg.queryRegistry(args[0]);
  • http://skyservice.pha.jhu.edu/devel/registry/index.aspx

Demo
25
Archive Footprint
  • Footprint is a fractal
  • Result depends on context
  • all sky, degree scale, pixel scale
  • Translate to web services
  • Footprint(): returns a single region that contains
    the archive
  • Intersection(region, tolerance): takes a region and
    returns its intersection with the archive footprint
  • Contains(point): returns yes/no (maybe fuzzy)
    whether the point is inside the archive footprint

26
Cross-Matching
  • SkyQuery + SkyNode
  • Currently lots of proprietary features
  • Data transmitted via .NET DataSet => VOTable
  • Query plan written in MS T-SQL => ADQL
  • Spatial operator restricted to a cone => VORegion
  • Made-up metadata delivery => VORegistry
  • Data delivery in XML/HTML => VOTable
  • Catalogs in the near future
  • SDSS DR1, FIRST, 2MASS, INT
  • POSS-1, GSC-2, HST, ROSAT, 2dF
  • GALEX, IRAS, PSCZ

27
Spatial Cross-Match
  • For small areas HTM is close to optimal, but needs
    more speed
  • For all-sky surveys the zone algorithm is best
    (sketched below)
  • Current heuristic is a linear chain of all nodes
  • Easy to generalize to include precomputed
    neighbors
  • But for all-sky queries: a very large number of
    random reads instead of sequential ones
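A minimal sketch of the zone join, assuming each catalog carries a precomputed zoneID = floor((dec + 90) / zoneHeight) and illustrative table names; the point is that the join only touches adjacent declination stripes, so it runs as sequential scans:

DECLARE @tol float
SET @tol = 0.000972                    -- 3.5 arcsec, in degrees

SELECT s.objID AS sdssId, t.objID AS twomassId
FROM SdssZone    AS s                  -- (zoneID, objID, ra, dec)
JOIN TwomassZone AS t
  ON  t.zoneID BETWEEN s.zoneID - 1 AND s.zoneID + 1   -- neighbor zones
  AND t.ra  BETWEEN s.ra  - @tol AND s.ra  + @tol      -- cos(dec) ignored
  AND t.dec BETWEEN s.dec - @tol AND s.dec + @tol
-- survivors then get the exact spherical-distance test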

28
Ferris-Wheel
  • Sky split into buckets/zones
  • All archives scan in sync
  • Queries enter at bottom
  • Results come back after a full circle
  • Only sequential access => buckets get into the
    cache, then queries are processed

29
Data Access is hitting a wall
  • FTP and GREP are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1PB ≈ 5,000 disks
  • At some point you need indices to limit the
    search, plus parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (= $1/GB)
  • 2 days and $1K
  • 3 years and $1M

30
Smart Data (active databases)
  • If there is too much data to move around,
  • take the analysis to the data!
  • Do all data manipulations at database
  • Build custom procedures and functions in the
    database (see the sketch below)
  • Automatic parallelism guaranteed
  • Easy to build-in custom functionality
  • Databases and procedures are being unified
  • Examples: temporal and spatial indexing
  • Pixel processing
  • Easy to reorganize the data
  • Multiple views, each optimal for certain types of
    analyses
  • Building hierarchical summaries is trivial
  • Scalable to Petabyte datasets
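As a concrete instance of procedures in the database, a sketch of a server-side filter function (the color cut is only an example and the function name is made up):

-- Evaluate the cut inside the server, so only the selected rows ever
-- leave the database.
CREATE FUNCTION dbo.fIsUVExcess(@u real, @g real)
RETURNS int
AS
BEGIN
    IF (@u - @g) < 0.6 RETURN 1
    RETURN 0
END
GO
SELECT objID, ra, dec, u, g, r
FROM PhotoPrimary
WHERE dbo.fIsUVExcess(u, g) = 1
  AND type = 6                         -- point sources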

31
Generic Catalog Access
  • After 2 years of SDSS EDR and 6 months of DR1
    usage, access patterns start to emerge
  • Lots of small users, instant response
  • 1/f distribution of request sizes (tail of the
    lognormal)
  • How to make everybody happy?
  • No clear business model
  • We need a separate interactive and batch server
  • We also need access to full SQL with extensions
  • Users want to access services via browsers
  • Other services will need SOAP access

32
Data Formats
  • Different data formats requested
  • HTML, CSV, FITS binary, VOTABLE, XML, graphics
  • Quick browsing and exploration
  • Small requests, need to be nicely rendered
  • Needs good random access performance
  • Also simple 2D scatter plots or density plots
    required
  • Heavy duty statistical use
  • Aggregate functions on complex joins, lots of
    scans but small output, mostly want CSV
  • Successive Data Filter
  • Multi-step non-indexed filtering of the whole
    database, mostly want FITS binary

33
Data Delivery
  • Small requests (<100MB)
  • Putting data on the stream
  • Medium requests (<1GB)
  • Use DIME attachments to SOAP messages
  • Large requests (>1GB)
  • Save data in a scratch area and use asynchronous
    delivery
  • Only practical for large/long queries
  • Iterative requests
  • Save data in temp tables in user space
  • Let user manipulate via web browser
  • Paradox: if we use a web browser to submit, users
    want an immediate response from batch-size queries

34
How To Provide a UserDB
  • Goal: through several search/filter operations,
    reduce data transfer to manageable sizes
    (1-100MB)
  • Today people download tens of millions of rows,
    and then do their next filtering on client side,
    using F77
  • Could be much better done in the database
  • But users need to create/manage temporary tables
  • DOS attacks, fragmentation, who pays for it
  • Security, who can see my data (group access)?
  • Follow progress of long jobs
  • Who does the cleanup?

35
Query Management Service
  • Enable fast, anonymous access to small requests
  • Enable large queries, with ability to manage
  • Enable creation of temporary tables in user space
  • Create multiple ways to get query output
  • Needs to support multiple mirrors/load balancing
  • Do all this without logging in to Windows
  • Also need to support machine clients
  • Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/
  • Two request categories
  • Quick
  • Batch

36
Queue Management
  • Need to register batch power users
  • Query output goes to MyDB
  • Can be joined with source database
  • Results are materialized from MyDB upon request
  • Users can do
  • Insert, Drop, Create, Select Into, Functions,
    Procedures (see the sketch below)
  • Publish their tables to a group area
  • Data delivery via the CASService (C# WS)
  • http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx
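What the batch workflow looks like from the user's side, as a sketch (the mydb prefix follows the CasJobs convention; the query itself is illustrative):

-- A batch query writes its output into the user's MyDB...
SELECT objID, ra, dec, u, g, r, i, z
INTO mydb.MyGalaxies
FROM PhotoPrimary
WHERE type = 3 AND r < 17.5

-- ...which can later be joined back against the source database and
-- materialized for delivery only when the user asks for it.
SELECT m.objID, s.z AS redshift
FROM mydb.MyGalaxies AS m
JOIN SpecObj AS s ON s.bestObjID = m.objID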

37
Summary
  • Exponential data growth → distributed data
  • Web Services → hierarchical architecture
  • Distributed computing → Grid services
  • Primary access to data is through databases
  • The key to interoperability: metadata, standards
  • Build upon industry standards, commercial tools,
    and collaborate with the rest of the world
  • Put interesting new tools into the hands of
    smart young people
  • they will quickly turn them into cutting-edge
    science

38
http://skyservice.pha.jhu.edu/develop/