Title: Web Services and the VO: Using SDSS DR1
1 Web Services and the VO: Using SDSS DR1
- Alex Szalay and Jim Gray
- with
- Tamas Budavari, Sam Carlisle, Vivek Haridas, Nolan Li, Tanu Malik, Maria Nieto-Santisteban, Wil O'Mullane, Ani Thakar
2Changing Roles
- Exponential growth
- Projects last at least 3-5 years
- Data sent upwards only at the end of the project
- Data will never be centralized
- More responsibility on projects
- Becoming Publishers and Curators
- Larger fraction of budget spent on software
- Lots of development is duplicated or wasted
- More standards are needed
- Easier data interchange, fewer tools
- More templates are needed
- Develop less software on your own
3Standards and Interoperability
- Standards driven by e-business requirements
- Exchange of rich and structured data (XML)
- DB connectivity, Web Services, Grid computing
- Application to astronomy domain
- Data dictionaries (UCDs)
- Data models
- Protocols
- Registries and resource/service discovery
- Provenance, data quality
- Dealing with the astronomy legacy
- FITS data format
- Software systems
- Boundary conditions
4Virtual Observatory
- Many new surveys are coming
- SDSS is a dry run for the next ones
- LSST will be 5TB/night
- All the data will be on the Internet
- But how? FTP? Web services?
- Data and apps will be associated with the instruments
- Distributed worldwide
- Cross-indexed
- Federation is a must, but how?
- Will be the best telescope in the world
- World Wide Telescope
5 SkyServer: SkyServer.SDSS.org or Skyserver.Pha.Jhu.edu/DR1/
- Sloan Digital Sky Survey Data: Pixels + Data Mining
- About 400 attributes per object
- Spectrograms for 1% of objects
- Demo: pixel space, record space, set space, teaching
6Show Cutout Web Service
7 SkyQuery (http://skyquery.net/)
- Distributed query tool using a set of web services
- Four astronomy archives: Pasadena, Chicago, Baltimore, Cambridge (England)
- Feasibility study, built in 6 weeks
- Tanu Malik (JHU CS grad student)
- Tamas Budavari (JHU astro postdoc)
- With help from Szalay, Thakar, Gray
- Implemented in C# and .NET
- Allows queries like
SELECT o.objId, o.r, o.type, t.objId
FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3, -0.76, 6.5)
AND o.type = 3 AND (o.i - t.m_j) > 2
8SkyQuery Structure
- Each SkyNode publishes
- Schema Web Service
- Database Web Service
- Portal is
- Plans Query (2 phase)
- Integrates answers
- Is itself a web service
9National Virtual Observatory
- NSF ITR project, "Building the Framework for the National Virtual Observatory", is a collaboration of 17 funded and 3 unfunded organizations
- Astronomy data centers
- National observatories
- Supercomputer centers
- University departments
- Computer science / information technology specialists
- Trying to build standards, interfaces and prototypes
- Goal: federate datasets, enable new discoveries, make it easier to publish new data
10International Collaboration
- Similar efforts now in more than 12 countries
- USA, Canada, UK, France, Germany, Italy, Japan, Australia, India, China, Korea, Russia, South Africa, Hungary, ...
- Active collaboration among projects
- Standards, common demos
- International VO roadmap being developed
- Regular telecons over 10 timezones
- Formal collaboration
- International Virtual Observatory Alliance (IVOA)
11NVO How Will It Work?
- Huge pressure to build something useful today
- We do not build everything for everybody
- Use the 90-10 rule
- Define the standards and interfaces
- Build the framework
- Build the 10% of services that are used by 90%
- Let the users build the rest from the components
- Define commonly used core services
- Build higher level toolboxes/portals on top
12Core Services
- Metadata: information about resources
- Waveband
- Sky coverage
- Translation of names to a universal dictionary (UCD)
- Simple search patterns on the resources
- Cone Search
- Image mosaic
- Unit conversions
- Simple filtering, counting, histogramming (see the sketch below)
- On-the-fly recalibrations
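As an illustration of the filtering/counting/histogramming service above, a minimal SQL sketch of the kind of query such a service could push into the catalog; the PhotoObj table, the r magnitude column and the type encoding follow the SDSS schema, but the exact names any given service uses are an assumption here:

  -- Histogram of r-band magnitudes for stars, in 0.5-mag bins
  SELECT FLOOR(r / 0.5) * 0.5 AS r_bin,
         COUNT(*)             AS n_objects
  FROM   PhotoObj
  WHERE  type = 6                 -- 6 = star in the SDSS type encoding (assumed here)
    AND  r BETWEEN 14 AND 22
  GROUP BY FLOOR(r / 0.5) * 0.5
  ORDER BY r_bin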
13Higher Level Services
- Built on Core Services
- Perform more complex tasks
- Examples
- Automated resource discovery
- Cross-identifications
- Photometric redshifts
- Outlier detections
- Visualization facilities (connect pixels to objects)
- Expectation
- Build custom portals in a matter of days from existing building blocks (like today in IRAF or IDL)
14Using SDSS DR1 as a Prototype
- SDSS DR1 (Data Release 1) is now publicly available
- http://skyserver.pha.jhu.edu/dr1/
- About 1TB of catalog data
- Using MS SQL Server 2000
- Complex schema (72 Tables)
- About 80 million photometric objects
- Two versions (TARGET/BEST)
- Automated documentation
- Raw data at FNAL file server with URL access
15DR1 SkyServer
- Classic 2-node web server ($20k total)
- 1TB database
- 1M hits per month (180k peak day)
- DSS load killing us: 12M rows per hour downloads
- Answer set size follows a power law
Jim Gray: Quick Performance Measure of SkyServer DR1, 4 June 2003, 15:00 PST
16Loading DR1
- Automated, table-driven workflow system for loading
- Included lots of verification code
- Over 16K lines of SQL code
- Loading process was extremely painful
- Lack of systems engineering for the pipelines
- Poor testing (lots of foreign key mismatches; a typical check is sketched below)
- Detected data bugs even a month ago
- Most of the time spent on scrubbing data
- Fixing corrupted files (RAID5 disk errors)
- Once data was clean, everything loaded in 3 days
- Neighbors calculation took about 10 hours
- Reorganization of data took about 1 week of experiments in partitioning/layouts
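A minimal sketch of the kind of referential-integrity check the verification code runs; the SpecObj/PhotoObj names follow the SDSS schema, but the specific query is illustrative rather than the actual loader code:

  -- Spectra whose bestObjID does not resolve to a photometric object
  SELECT COUNT(*) AS orphanSpectra
  FROM   SpecObj s
  LEFT JOIN PhotoObj p ON p.objID = s.bestObjID
  WHERE  s.bestObjID <> 0
    AND  p.objID IS NULL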
17Public Data Release Versions
- June 2000 EDR
- Early Data Release
- July 2003 DR1
- Contains 30% of final data
- 200 million photo objects
- 4 versions of the data
- Target, best, runs, spectro
- Total catalog volume 1.7TB
- See Terascale sneakernet paper
- Published releases served forever
- EDR, DR1, DR2, ...
- Soon to include email archives, annotations
- O(N^2) only possible because of Moore's Law!
18Organization and Reorganization
- Introduced partitions and filegroups
- Photo, Tag, Neighbors, Spectro, Frame, Other, Profiles
- Keep partitions under 100GB
- Vertical partitioning tried and abandoned
- Both partitioning and index build are now table driven
- Stored procedures to create/drop indices at various granularities (see the sketch below)
- Tremendous improvement in performance when doing this on a large-memory machine (24GB)
- Also much better performance afterwards
- This was nice, but not hard
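A sketch of what table-driven index building can look like: the index definitions live in a metadata table and a stored procedure replays them per table. The IndexMap table and the spBuildIndices name are illustrative stand-ins, not the actual DR1 loader code:

  -- Metadata table of index definitions (illustrative)
  CREATE TABLE IndexMap (
      tableName  varchar(64),
      indexName  varchar(64),
      columnList varchar(256)
  )
  GO
  -- Build every index registered for one table
  CREATE PROCEDURE spBuildIndices @table varchar(64)
  AS
  BEGIN
      DECLARE @idx varchar(64), @cols varchar(256), @cmd nvarchar(1000)
      DECLARE c CURSOR FOR
          SELECT indexName, columnList FROM IndexMap WHERE tableName = @table
      OPEN c
      FETCH NEXT FROM c INTO @idx, @cols
      WHILE @@FETCH_STATUS = 0
      BEGIN
          SET @cmd = N'CREATE INDEX ' + @idx + N' ON ' + @table
                   + N' (' + @cols + N')'
          EXEC sp_executesql @cmd
          FETCH NEXT FROM c INTO @idx, @cols
      END
      CLOSE c
      DEALLOCATE c
  END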
19Spatial Features
- Precomputed Neighbors (query sketch below)
- All objects within 30 arcsec
- Boundaries, Masks and Outlines
- Stored as spatial polygons
- Time Domain
- Precomputed Match
- All objects within 1 arcsec, observed at different times
- Found duplicates due to telescope tracking errors
- Manual fix, recorded in the database
- MatchHead
- The first observation of the linked list used as a unique id for the chain of observations of the same object
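A minimal sketch of how the precomputed Neighbors table gets used; the Neighbors table with objID/neighborObjID/distance columns follows the SkyServer schema, though the example object id and the assumption that distance is stored in arcminutes are illustrative:

  -- All neighbours of one object within 10 arcsec,
  -- served straight from the precomputed table (no on-the-fly search)
  SELECT n.neighborObjID, n.distance
  FROM   Neighbors n
  WHERE  n.objID = 587722984438038552      -- example object id
    AND  n.distance < 10.0 / 60.0          -- 10 arcsec, if distance is in arcmin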
20Spatial Data Access SQL extension
- Szalay, Gray, Kunszt, Fekete, O'Mullane, Brunner: http://www.sdss.jhu.edu/htm
- Added Hierarchical Triangular Mesh (HTM) table-valued functions for spatial joins
- Every object has a 20-deep mesh ID
- Given a spatial definition, the routine returns up to 10 covering triangles
- A spatial query is then up to 10 range queries (see the sketch below)
- Fast: 10,000 triangles / second / CPU
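A sketch of the range-query pattern described above. The fHTM_Cover table-valued function, its input syntax and the fDistanceArcMinEq helper are illustrative stand-ins for the actual HTM routines; the point is that the cover function returns a handful of (HTMIDstart, HTMIDend) pairs, so the spatial predicate becomes a few index range scans plus an exact distance filter:

  -- Objects inside a 1-arcmin circle around (ra, dec) = (181.3, -0.76)
  SELECT o.objID, o.ra, o.dec
  FROM   fHTM_Cover('CIRCLE J2000 181.3 -0.76 1.0') AS c
  JOIN   PhotoObj o
    ON   o.htmID BETWEEN c.HTMIDstart AND c.HTMIDend     -- range scans on the 20-deep mesh id
  WHERE  dbo.fDistanceArcMinEq(181.3, -0.76, o.ra, o.dec) < 1.0  -- exact filter inside the triangles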
21Web Services in Progress
- Registry
- Harvesting and querying, discovery of new services
- Data Delivery
- Query-driven queue management
- MyDB, VOSpace, VOProfile: minimize data movement
- Graphics and visualization
- Query driven vs interactive
- Show spatial objects (Chart/Navi/List)
- Footprint/intersect
- It is a fractal
- Cross-matching
- SkyQuery and SkyNode
- Ferris-wheel
- Distributed vs parallel
22Graphics/Visualization Tools
- Density plot
- Show densities of attributes as a function of sky position
- Chart/Navi/List
- Tie together catalogs and pixels
- Spectrum viewer
- Display spectra of galaxies and stars drawn from the database
- Filter profiles
- Catalog of astronomical filters (optical bandpasses)
- Mirage with VO connections
- Linked multi-pane visualization tool (Bell Labs)
- VO extensions built at JHU
23Other Tools
- Spatial
- Cone Search
- SkyNode
- CrossMatch (SkyQuery)
- Footprint
- Information Management
- Registry services
- Name resolver
- Cosmological calculator
- CASService
- MyDB
- VOSpace
- User Authentication
24Registry Easy Clients
- Just use a SOAP toolkit (T. McGlynn and J. Lee have done a Perl client)
- Easy in Java
- java org.apache.axis.wsdl.WSDL2Java "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"
- Gives a set of classes for accessing the service
- Gives classes for the XML which is returned (i.e. SimpleResource)
- Still need to write a client like
- RegistryLocator loc = new RegistryLocator();
- RegistrySoap reg = loc.getRegistrySoap();
- ArrayOfSimpleResource reses = null;
- reses = reg.queryRegistry(args[0]);
- ...
- http://skyservice.pha.jhu.edu/devel/registry/index.aspx
Demo
25Archive Footprint
- Footprint is a fractal
- Result depends on context
- all sky, degree scale, pixel scale
- Translate to web services
- Footprint(): returns a single region that contains the archive
- Intersection(region, tolerance): feed a region and it returns the intersection with the archive footprint
- Contains(point): returns yes/no (maybe fuzzy) if the point is inside the archive footprint
26Cross-Matching
- SkyQuery + SkyNode
- Currently lots of proprietary features
- Data transmitted via .NET DataSet → VOTable
- Query plan written in MS T-SQL → ADQL
- Spatial operator restricted to a cone → VORegion
- Made-up metadata delivery → VORegistry
- Data delivery in XML/HTML → VOTable
- Catalogs in the near future
- SDSS DR1, FIRST, 2MASS, INT
- POSS-1, GSC-2, HST, ROSAT, 2dF
- GALEX, IRAS, PSCZ
27Spatial Cross-Match
- For small areas HTM is close to optimal, but needs more speed
- For all-sky surveys the zone algorithm is best (see the sketch below)
- Current heuristic is a linear chain of all nodes
- Easy to generalize to include precomputed neighbors
- But for all-sky queries: very large number of random reads instead of sequential
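A minimal sketch of the zone idea, assuming each catalog stores zoneID = floor((dec + 90) / zoneHeight) with its objects and that zoneHeight is at least the match radius; the table and column names are illustrative. The join touches only neighbouring zones and a bounded RA window, so it runs as sequential scans rather than random reads:

  DECLARE @r float
  SET @r = 3.5 / 3600.0                       -- 3.5 arcsec match radius, in degrees
  SELECT s.objID AS sdssId, t.objID AS twomassId
  FROM   SdssObj s
  JOIN   TwomassObj t
    ON   t.zoneID BETWEEN s.zoneID - 1 AND s.zoneID + 1
   AND   t.dec BETWEEN s.dec - @r AND s.dec + @r
   AND   t.ra  BETWEEN s.ra - @r / COS(RADIANS(s.dec))
                   AND s.ra + @r / COS(RADIANS(s.dec))   -- ignores RA wrap at 0/360
  WHERE  2 * DEGREES(ASIN(SQRT(
           POWER(SIN(RADIANS(t.dec - s.dec) / 2), 2) +
           COS(RADIANS(s.dec)) * COS(RADIANS(t.dec)) *
           POWER(SIN(RADIANS(t.ra - s.ra) / 2), 2)))) < @r   -- exact angular distance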
28Ferris-Wheel
- Sky split into buckets/zones
- All archives scan in sync
- Queries enter at bottom
- Results come back after a full circle
- Only sequential access → buckets get into cache, then queries processed
29Data Access is hitting a wall
- FTP and GREP are not adequate
- You can GREP 1 MB in a second
- You can GREP 1 GB in a minute
- You can GREP 1 TB in 2 days
- You can GREP 1 PB in 3 years
- Oh!, and 1 PB ~ 5,000 disks
- At some point you need indices to limit search, and parallel data search and analysis
- This is where databases can help
- You can FTP 1 MB in 1 sec
- You can FTP 1 GB / min (1 $/GB)
- 2 days and 1K$
- 3 years and 1M$
30Smart Data (active databases)
- If there is too much data to move around,
- take the analysis to the data!
- Do all data manipulations at the database
- Build custom procedures and functions in the database (see the sketch below)
- Automatic parallelism guaranteed
- Easy to build in custom functionality
- Databases and procedures being unified
- Example: temporal and spatial indexing
- Pixel processing
- Easy to reorganize the data
- Multiple views, each optimal for certain types of analyses
- Building hierarchical summaries is trivial
- Scalable to Petabyte datasets
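A small sketch of taking the analysis to the data: a scalar user-defined function evaluated inside the database so that only the answer set leaves the server. The fColorGR function is hypothetical; the PhotoObj column names follow the SDSS schema:

  -- Hypothetical UDF: dereddened g-r colour
  CREATE FUNCTION dbo.fColorGR(@g float, @r float, @extg float, @extr float)
  RETURNS float
  AS
  BEGIN
      RETURN (@g - @extg) - (@r - @extr)
  END
  GO
  -- Used directly in a query instead of post-processing downloads
  SELECT objID, dbo.fColorGR(g, r, extinction_g, extinction_r) AS gr
  FROM   PhotoObj
  WHERE  type = 3 AND r < 19
    AND  dbo.fColorGR(g, r, extinction_g, extinction_r) > 1.2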
31Generic Catalog Access
- After 2 years of SDSS EDR and 6 months of DR1 usage, access patterns start to emerge
- Lots of small users, instant response
- 1/f distribution of request sizes (tail of the lognormal)
- How to make everybody happy?
- No clear business model
- We need a separate interactive and batch server
- We also need access to full SQL with extensions
- Users want to access services via browsers
- Other services will need SOAP access
32Data Formats
- Different data formats requested
- HTML, CSV, FITS binary, VOTABLE, XML, graphics
- Quick browsing and exploration
- Small requests, need to be nicely rendered
- Needs good random access performance
- Also simple 2D scatter plots or density plots required
- Heavy-duty statistical use
- Aggregate functions on complex joins, lots of scans but small output; mostly want CSV
- Successive Data Filter
- Multi-step non-indexed filtering of the whole database; mostly want FITS binary
33Data Delivery
- Small requests (<100MB)
- Putting data on the stream
- Medium requests (<1GB)
- Use DIME attachments to SOAP messages
- Large requests (>1GB)
- Save data in scratch area and use asynch delivery
- Only practical for large/long queries
- Iterative requests
- Save data in temp tables in user space
- Let user manipulate via web browser
- Paradox: if we use a web browser to submit, users want immediate response from batch-size queries
34How To Provide a UserDB
- Goal: through several search/filter operations, reduce data transfer to manageable sizes (1-100MB)
- Today people download tens of millions of rows, and then do their next filtering on the client side, using F77
- Could be much better done in the database
- But users need to create/manage temporary tables
- DOS attacks, fragmentation, who pays for it?
- Security, who can see my data (group access)?
- Follow progress of long jobs
- Who does the cleanup?
35Query Management Service
- Enable fast, anonymous access to small requests
- Enable large queries, with ability to manage
- Enable creation of temporary tables in user space
- Create multiple ways to get query output
- Needs to support multiple mirrors/load balancing
- Do all this without logging in to Windows
- Also need to support machine clients
- Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/
- Two request categories
- Quick
- Batch
36Queue Management
- Need to register batch power users
- Query output goes to MyDB
- Can be joined with the source database (see the sketch below)
- Results are materialized from MyDB upon request
- Users can do
- Insert, Drop, Create, Select Into, Functions, Procedures
- Publish their tables to a group area
- Data delivery via the CASService (C# web service)
- http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx
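A sketch of the batch workflow above, assuming the CasJobs convention of addressing the user's database as mydb; the target table name and the query itself are illustrative:

  -- Batch query: the output lands in MyDB instead of streaming back
  SELECT objID, ra, dec, r, g - r AS gr
  INTO   mydb.BrightGalaxies
  FROM   PhotoObj
  WHERE  type = 3 AND r < 17.5
  -- The user can later join mydb.BrightGalaxies back against the source
  -- database, or materialize it as CSV/FITS through the CASService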
37Summary
- Exponential data growth → distributed data
- Web Services → hierarchical architecture
- Distributed computing → Grid services
- Primary access to data is through databases
- The key to interoperability: metadata, standards
- Build upon industry standards and commercial tools, and collaborate with the rest of the world
- Give interesting new tools into the hands of smart young people
- They will quickly turn them into cutting-edge science
38 http://skyservice.pha.jhu.edu/develop/