Web Services and the VO: Using SDSS DR1

1
Web Services and the VO: Using SDSS DR1
  • Alex Szalay and Jim Gray
  • with
  • Tamas Budavari, Sam Carlisle, Vivek Haridas,
    Nolan Li, Tanu Malik, Maria Nieto-Santisteban,
    Wil O'Mullane, Ani Thakar

2
Changing Roles
  • Exponential growth
  • Projects last at least 3-5 years
  • Data sent upwards only at the end of the project
  • Data will never be centralized
  • More responsibility on projects
  • Becoming Publishers and Curators
  • Larger fraction of budget spent on software
  • A lot of development is duplicated and wasted
  • More standards are needed
  • Easier data interchange, fewer tools
  • More templates are needed
  • Develop less software on your own

3
Standards and Interoperability
  • Standards driven by e-business requirements
  • Exchange of rich and structured data (XML)
  • DB connectivity, Web Services, Grid computing
  • Application to astronomy domain
  • Data dictionaries (UCDs)
  • Data models
  • Protocols
  • Registries and resource/service discovery
  • Provenance, data quality
  • Dealing with the astronomy legacy
  • FITS data format
  • Software systems

Boundary conditions
4
Virtual Observatory
  • Many new surveys are coming
  • SDSS is a dry run for the next ones
  • LSST will be 5TB/night
  • All the data will be on the Internet
  • But how? FTP? Web services?
  • Data and apps will be associated with the
    instruments
  • Distributed world wide
  • Cross-indexed
  • Federation is a must, but how?
  • Will be the best telescope in the world
  • World Wide Telescope

5
SkyServer: SkyServer.SDSS.org or
Skyserver.Pha.Jhu.edu/DR1/
  • Sloan Digital Sky Survey: data, pixels, data mining
  • About 400 attributes per object
  • Spectrograms for 1% of objects
  • Demo: pixel space, record space, set space, teaching

6
Show Cutout Web Service
7
SkyQuery (http://skyquery.net/)
  • Distributed Query tool using a set of web
    services
  • Four astronomy archives from Pasadena, Chicago,
    Baltimore, Cambridge (England).
  • Feasibility study, built in 6 weeks
  • Tanu Malik (JHU CS grad student)
  • Tamas Budavari (JHU astro postdoc)
  • With help from Szalay, Thakar, Gray
  • Implemented in C# and .NET
  • Allows queries like

SELECT o.objId, o.r, o.type, t.objId FROM
SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t
WHERE XMATCH(o,t) < 3.5 AND AREA(181.3,-0.76,6.5)
AND o.type = 3 AND (o.i - t.m_j) > 2
8
SkyQuery Structure
  • Each SkyNode publishes
  • Schema Web Service
  • Database Web Service
  • The portal:
  • Plans the query (2 phases; see the sketch below)
  • Integrates answers
  • Is itself a web service
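The two-phase plan is easiest to see at the SQL level. A rough sketch, assuming illustrative table names and a #partial temp table standing in for the result set shipped between nodes (the real portal generates these queries internally):

-- Phase 1 (illustrative): each SkyNode reports how many objects pass
-- its local predicates, so the portal can order the execution chain
-- from the most selective node outward.
SELECT COUNT(*) FROM PhotoPrimary WHERE type = 3

-- Phase 2 (illustrative): a node receives the partial match list from
-- the previous node, cross-matches it against its own catalog, and
-- forwards the joined rows to the next node in the chain.
CREATE TABLE #partial (objId bigint, ra float, dec float)
SELECT m.objId AS upstreamId, p.objId AS localId, p.ra, p.dec
FROM #partial AS m
JOIN PhotoPrimary AS p
  ON  p.ra  BETWEEN m.ra  - 0.001 AND m.ra  + 0.001   -- ~3.5 arcsec box;
  AND p.dec BETWEEN m.dec - 0.001 AND m.dec + 0.001   -- cos(dec) ignored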

9
National Virtual Observatory
  • NSF ITR project, "Building the Framework for the
    National Virtual Observatory", is a collaboration
    of 17 funded and 3 unfunded organizations
  • Astronomy data centers
  • National observatories
  • Supercomputer centers
  • University departments
  • Computer science/information technology
    specialists
  • Trying to build standards, interfaces and
    prototypes
  • Goal: federate datasets, enable new discoveries,
    make it easier to publish new data

10
International Collaboration
  • Similar efforts now in more than 12 countries
  • USA, Canada, UK, France, Germany, Italy, Japan,
    Australia, India, China, Korea, Russia, South
    Africa, Hungary, ...
  • Active collaboration among projects
  • Standards, common demos
  • International VO roadmap being developed
  • Regular telecons over 10 timezones
  • Formal collaboration
  • International Virtual Observatory Alliance (IVOA)

11
NVO: How Will It Work?
  • Huge pressure to build something useful today
  • We do not build everything for everybody
  • Use the 90-10 rule
  • Define the standards and interfaces
  • Build the framework
  • Build the 10% of services that are used by 90%
  • Let the users build the rest from the components
  • Define commonly used core services
  • Build higher level toolboxes/portals on top

12
Core Services
  • Metadata: information about resources
  • Waveband
  • Sky coverage
  • Translation of names to universal dictionary
    (UCD)
  • Simple search patterns on the resources
  • Cone Search (see the sketch below)
  • Image mosaic
  • Unit conversions
  • Simple filtering, counting, histogramming
  • On-the-fly recalibrations
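As an example of how thin a core service can be: a Cone Search request (RA, Dec, radius) maps onto a single database call. A minimal sketch, assuming a SkyServer-style fGetNearbyObjEq table-valued function taking the radius in arcminutes (the function name and units are assumptions here):

-- All primary objects within 1 arcmin of (ra, dec) = (195.0, 2.5);
-- the web service just wraps this query and returns a VOTable.
SELECT p.objID, p.ra, p.dec, p.r, p.type
FROM fGetNearbyObjEq(195.0, 2.5, 1.0) AS n
JOIN PhotoPrimary AS p ON p.objID = n.objID
ORDER BY n.distance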

13
Higher Level Services
  • Built on Core Services
  • Perform more complex tasks
  • Examples
  • Automated resource discovery
  • Cross-identifications
  • Photometric redshifts
  • Outlier detections
  • Visualization facilities (connect pixels to
    objects)
  • Expectation
  • Build custom portals in a matter of days from
    existing building blocks (as is done today in IRAF
    or IDL)

14
Using SDSS DR1 as a Prototype
  • SDSS DR1 (Data Release 1) is now publicly
    available
  • http://skyserver.pha.jhu.edu/dr1/
  • About 1TB of catalog data
  • Using MS SQL Server 2000
  • Complex schema (72 Tables)
  • About 80 million photometric objects
  • Two versions (TARGET/BEST)
  • Automated documentation
  • Raw data at FNAL file server with URL access

15
DR1 SkyServer
  • Classic 2-node web server ($20k total)
  • 1TB database
  • 1M hits per month (180k on the peak day)
  • DSS load killing us: 12M rows per hour downloads
  • Answer set size follows power law

Jim Gray: Quick Performance Measure of SkyServer
DR1, 4 June 2003, 15:00 PST
16
Loading DR1
  • Automated table driven workflow system for
    loading
  • Included lots of verification code (one such check
    is sketched below)
  • Over 16K lines of SQL code
  • Loading process was extremely painful
  • Lack of systems engineering for the pipelines
  • Poor testing (lots of foreign key mismatches)
  • Data bugs were still turning up as recently as a
    month ago
  • Most of the time spent on scrubbing data
  • Fixing corrupted files (RAID5 disk errors)
  • Once data was clean, everything loaded in 3 days
  • Neighbors calculation took about 10 hours
  • Reorganization of data took about 1 week of
    experiments in partitioning/layouts
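A flavor of the verification code, as a sketch (table and column names are illustrative of the DR1 schema, not lifted from the loader):

-- One of the simplest checks: spectra whose photometric counterpart
-- is missing, i.e. the kind of foreign key mismatch that had to be
-- caught and scrubbed before the final load.
SELECT COUNT(*) AS orphanSpectra
FROM SpecObjAll AS s
LEFT JOIN PhotoObjAll AS p ON p.objID = s.targetObjID
WHERE s.targetObjID <> 0
  AND p.objID IS NULL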

17
Public Data Release Versions
  • June 2001: EDR
  • Early Data Release
  • July 2003: DR1
  • Contains 30% of the final data
  • 200 million photo objects
  • 4 versions of the data
  • Target, best, runs, spectro
  • Total catalog volume 1.7TB
  • See the TeraScale SneakerNet paper
  • Published releases served forever
  • EDR, DR1, DR2, ...
  • Soon to include email archives, annotations
  • O(N²) only possible because of Moore's Law!

18
Organization and Reorganization
  • Introduced partitions and filegroups
  • Photo, Tag, Neighbors, Spectro, Frame, Other,
    Profiles
  • Keep partitions under 100GB
  • Vertical partitioning tried and abandoned
  • Both partitioning and index builds are now table
    driven
  • Stored procedures create/drop indices at various
    granularities (see the sketch below)
  • Tremendous improvement in performance when doing
    this on a large memory machine (24GB)
  • Also much better performance afterwards
  • But all this was nice, not hard
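For concreteness, a minimal sketch of the filegroup layout under SQL Server 2000 (database name, file path, and sizes are illustrative; the real loader drives this from its tables):

-- A dedicated filegroup for the big photo tables, kept under ~100GB.
ALTER DATABASE DR1 ADD FILEGROUP PhotoFG
ALTER DATABASE DR1 ADD FILE
  (NAME = PhotoFG_1, FILENAME = 'G:\DR1\PhotoFG_1.ndf', SIZE = 80GB)
  TO FILEGROUP PhotoFG
GO
-- Building the clustered index on that filegroup places the table
-- there; the stored procedures create and drop such indices as the
-- driver tables dictate.
CREATE CLUSTERED INDEX ci_PhotoObj_objID
  ON PhotoObj(objID) ON PhotoFG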

19
Spatial Features
  • Precomputed Neighbors
  • All objects within 30 arcsec (example query below)
  • Boundaries, Masks and Outlines
  • Stored as spatial polygons
  • Time Domain
  • Precomputed Match
  • All objects within 1 arcsec, observed at different times
  • Found duplicates due to telescope tracking errors
  • Manual fix, recorded in the database
  • MatchHead
  • The first observation in the linked list is used
    as the unique id of the chain of observations of
    the same object
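An example of what the precomputed Neighbors table buys you (column names follow the SkyServer schema, where distance is stored in arcminutes):

-- Close galaxy pairs straight from the precomputed table: no spatial
-- search at query time, just an indexed join.
SELECT n.objID, n.neighborObjID, n.distance
FROM Neighbors AS n
JOIN PhotoPrimary AS p ON p.objID = n.objID
WHERE p.type = 3               -- galaxies
  AND n.neighborType = 3
  AND n.distance < 0.1         -- within 6 arcsec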

20
Spatial Data Access SQL extension
  • Szalay, Gray, Kunszt, Fekete, O'Mullane, Brunner:
    http://www.sdss.jhu.edu/htm
  • Added Hierarchical Triangular Mesh (HTM)
    table-valued functions for spatial joins
  • Every object has a 20-deep Mesh ID
  • Given a spatial definition, the routine returns up
    to 10 covering triangles (see the sketch below)
  • Spatial query is then up to 10 range queries
  • Fast: 10,000 triangles / second / CPU
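A minimal sketch of such a query; the cover function's name and its 'CIRCLE J2000 ra dec radius_arcmin' argument format are assumptions (they differ between releases), but the shape of the query is the point:

-- The cover function returns up to ~10 (HTMIDstart, HTMIDend) ranges
-- for the region, so the spatial search becomes that many range scans
-- on the htmID index.
SELECT p.objID, p.ra, p.dec
FROM fHTM_Cover('CIRCLE J2000 195.0 2.5 1.0') AS c
JOIN PhotoObj AS p
  ON p.htmID BETWEEN c.HTMIDstart AND c.HTMIDend
-- an exact distance test then trims the corners of the covering
-- triangles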

21
Web Services in Progress
  • Registry
  • Harvesting and querying, discovery of new
    services
  • Data Delivery
  • Query-driven queue management
  • MyDB, VOSpace, VOProfile: minimize data movement
  • Graphics and visualization
  • Query driven vs interactive
  • Show spatial objects (Chart/Navi/List)
  • Footprint/intersect
  • It is a fractal
  • Cross-matching
  • SkyQuery and SkyNode
  • Ferris-wheel
  • Distributed vs parallel

22
Graphics/Visualization Tools
  • Density plot
  • Show densities of attributes as a function of sky
    position
  • Chart/Navi/List
  • Tie together catalogs and pixels
  • Spectrum viewer
  • Display spectra of galaxies and stars drawn from
    the database
  • Filter profiles
  • Catalog of astronomical filters (optical
    bandpasses)
  • Mirage with VO connections
  • Linked multi-pane visualization tool (Bell Labs)
  • VO extensions built at JHU

23
Other Tools
  • Spatial
  • Cone Search
  • SkyNode
  • CrossMatch (SkyQuery)
  • Footprint
  • Information Management
  • Registry services
  • Name resolver
  • Cosmological calculator
  • CASService
  • MyDB
  • VOSpace
  • User Authentication

24
Registry: Easy Clients
  • Just use a SOAP toolkit (T. McGlynn and J. Lee have
    done a Perl client).
  • Easy in Java
  • java org.apache.axis.wsdl.WSDL2Java
    "http://skyservice.pha.jhu.edu/devel/registry/registry.asmx?wsdl"
  • Gives set of Classes for accessing the service
  • Gives Classes for the XML which is returned (i.e.
    SimpleResource)
  • Still need to write a client, like:
  • RegistryLocator loc = new RegistryLocator();
  • RegistrySoap reg = loc.getRegistrySoap();
  • ArrayOfSimpleResource reses = null;
  • reses = reg.queryRegistry(args[0]);
  • http://skyservice.pha.jhu.edu/devel/registry/index.aspx

Demo
25
Archive Footprint
  • Footprint is a fractal
  • Result depends on context
  • all sky, degree scale, pixel scale
  • Translate to web services
  • Footprint(): returns a single region that contains
    the archive
  • Intersection(region, tolerance): takes a region and
    returns its intersection with the archive footprint
  • Contains(point): returns yes/no (maybe fuzzy)
    whether the point is inside the archive footprint

26
Cross-Matching
  • SkyQuery + SkyNode
  • Currently lots of proprietary features
  • Data transmitted via .NET DataSet => VOTable
  • Query plan written in MS T-SQL => ADQL
  • Spatial operator restricted to a cone => VORegion
  • Made-up metadata delivery => VORegistry
  • Data delivery in XML/HTML => VOTable
  • Catalogs in the near future
  • SDSS DR1, FIRST, 2MASS, INT
  • POSS-1, GSC-2, HST, ROSAT, 2dF
  • GALEX, IRAS, PSCZ

27
Spatial Cross-Match
  • For small areas HTM is close to optimal, but needs
    more speed
  • For all-sky surveys the zone algorithm is best
    (sketched below)
  • Current heuristic is a linear chain of all nodes
  • Easy to generalize to include precomputed
    neighbors
  • But for all-sky queries: a very large number of
    random reads instead of sequential ones
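A minimal sketch of the zone join, assuming each catalog carries a precomputed zoneID = floor((dec + 90) / zoneHeight) and illustrative table names; the point is that the join only touches adjacent declination stripes, so it runs as sequential scans:

DECLARE @tol float
SET @tol = 0.000972                    -- 3.5 arcsec, in degrees

SELECT s.objID AS sdssId, t.objID AS twomassId
FROM SdssZone    AS s                  -- (zoneID, objID, ra, dec)
JOIN TwomassZone AS t
  ON  t.zoneID BETWEEN s.zoneID - 1 AND s.zoneID + 1   -- neighbor zones
  AND t.ra  BETWEEN s.ra  - @tol AND s.ra  + @tol      -- cos(dec) ignored
  AND t.dec BETWEEN s.dec - @tol AND s.dec + @tol
-- survivors then get the exact spherical-distance test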

28
Ferris-Wheel
  • Sky split into buckets/zones
  • All archives scan in sync
  • Queries enter at bottom
  • Results come back after a full circle
  • Only sequential access => buckets get into the
    cache, then queries are processed

29
Data Access is hitting a wall
  • FTP and GREP are not adequate
  • You can GREP 1 MB in a second
  • You can GREP 1 GB in a minute
  • You can GREP 1 TB in 2 days
  • You can GREP 1 PB in 3 years.
  • Oh!, and 1PB ≈ 5,000 disks
  • At some point you need indices to limit the
    search, plus parallel data search and analysis
  • This is where databases can help
  • You can FTP 1 MB in 1 sec
  • You can FTP 1 GB / min (= $1/GB)
  • 2 days and $1K
  • 3 years and $1M

30
Smart Data (active databases)
  • If there is too much data to move around,
  • take the analysis to the data!
  • Do all data manipulations at database
  • Build custom procedures and functions in the
    database (see the sketch below)
  • Automatic parallelism guaranteed
  • Easy to build-in custom functionality
  • Databases and procedures are being unified
  • Examples: temporal and spatial indexing
  • Pixel processing
  • Easy to reorganize the data
  • Multiple views, each optimal for certain types of
    analyses
  • Building hierarchical summaries is trivial
  • Scalable to Petabyte datasets
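As a concrete instance of procedures in the database, a sketch of a server-side filter function (the color cut is only an example and the function name is made up):

-- Evaluate the cut inside the server, so only the selected rows ever
-- leave the database.
CREATE FUNCTION dbo.fIsUVExcess(@u real, @g real)
RETURNS int
AS
BEGIN
    IF (@u - @g) < 0.6 RETURN 1
    RETURN 0
END
GO
SELECT objID, ra, dec, u, g, r
FROM PhotoPrimary
WHERE dbo.fIsUVExcess(u, g) = 1
  AND type = 6                         -- point sources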

31
Generic Catalog Access
  • After 2 years of SDSS EDR and 6 months of DR1
    usage, access patterns start to emerge
  • Lots of small users, instant response
  • 1/f distribution of request sizes (tail of the
    lognormal)
  • How to make everybody happy?
  • No clear business model
  • We need a separate interactive and batch server
  • We also need access to full SQL with extensions
  • Users want to access services via browsers
  • Other services will need SOAP access

32
Data Formats
  • Different data formats requested
  • HTML, CSV, FITS binary, VOTABLE, XML, graphics
  • Quick browsing and exploration
  • Small requests, need to be nicely rendered
  • Needs good random access performance
  • Also simple 2D scatter plots or density plots
    required
  • Heavy duty statistical use
  • Aggregate functions on complex joins, lots of
    scans but small output, mostly want CSV
  • Successive Data Filter
  • Multi-step non-indexed filtering of the whole
    database, mostly want FITS binary

33
Data Delivery
  • Small requests (<100MB)
  • Putting data on the stream
  • Medium requests (<1GB)
  • Use DIME attachments to SOAP messages
  • Large requests (>1GB)
  • Save data in a scratch area and use asynchronous
    delivery
  • Only practical for large/long queries
  • Iterative requests
  • Save data in temp tables in user space
  • Let user manipulate via web browser
  • Paradox: if we use a web browser to submit, users
    want an immediate response from batch-size queries

34
How To Provide a UserDB
  • Goal: through several search/filter operations,
    reduce data transfer to manageable sizes
    (1-100MB)
  • Today people download tens of millions of rows,
    and then do their next filtering on client side,
    using F77
  • Could be much better done in the database
  • But users need to create/manage temporary tables
  • DOS attacks, fragmentation, who pays for it
  • Security, who can see my data (group access)?
  • Follow progress of long jobs
  • Who does the cleanup?

35
Query Management Service
  • Enable fast, anonymous access to small requests
  • Enable large queries, with ability to manage
  • Enable creation of temporary tables in user space
  • Create multiple ways to get query output
  • Needs to support multiple mirrors/load balancing
  • Do all this without logging in to Windows
  • Also need to support machine clients
  • Web Service: http://skyservice.pha.jhu.edu/devel/CasJobs/
  • Two request categories
  • Quick
  • Batch

36
Queue Management
  • Need to register batch power users
  • Query output goes to MyDB
  • Can be joined with source database
  • Results are materialized from MyDB upon request
  • Users can do
  • Insert, Drop, Create, Select Into, Functions,
    Procedures (see the sketch below)
  • Publish their tables to a group area
  • Data delivery via the CASService (C# WS)
  • http://skyservice.pha.jhu.edu/devel/CasService/CasService.asmx
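What the batch workflow looks like from the user's side, as a sketch (the mydb prefix follows the CasJobs convention; the query itself is illustrative):

-- A batch query writes its output into the user's MyDB...
SELECT objID, ra, dec, u, g, r, i, z
INTO mydb.MyGalaxies
FROM PhotoPrimary
WHERE type = 3 AND r < 17.5

-- ...which can later be joined back against the source database and
-- materialized for delivery only when the user asks for it.
SELECT m.objID, s.z AS redshift
FROM mydb.MyGalaxies AS m
JOIN SpecObj AS s ON s.bestObjID = m.objID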

37
Summary
  • Exponential data growth → distributed data
  • Web Services → hierarchical architecture
  • Distributed computing → Grid services
  • Primary access to data is through databases
  • The key to interoperability: metadata, standards
  • Build upon industry standards, commercial tools,
    and collaborate with the rest of the world
  • Put interesting new tools into the hands of
    smart young people
  • they will quickly turn them into cutting-edge
    science

38
http://skyservice.pha.jhu.edu/develop/