Transcript and Presenter's Notes

Title: Massively Scalable, Uniform Data Discovery


1
Massively Scalable, Uniform Data Discovery & Access
Brian Wilson and Zhangfan Xing, Jet Propulsion Laboratory
Secondary Rant: The Importance of Permanent Names
2
Problem
  • How do we index all of the science data in the
    world and provide a uniform interface for search?
  • Including space/time query on Earth, Mars, etc.
  • Use Google's approach?
  • Their goal is to index the WWW of knowledge
  • And Google Earth/KML is starting on the geolocation
    problem
  • Who?
  • Will Google eventually do it?
  • Follow-on to EOS Clearinghouse (ECHO) project?

Carbon Cycle
3
AQUA Client Architecture
(Diagram. Clients: AQUA GUI Clients, Machine Clients.
Services: AQUA REST & SOAP Services (List Collections,
Time-Segmented Granule Query, Query, Order, etc.), with
2nd and 3rd AQUA instances; ECHO SOAP Services (Collection
Discovery, Granule Query, Order Items, etc.).
Data Providers: LARC, NSIDC_ECS, GSFCS4PA, ORNL_DAAC.
Clients fetch data or browse files using URLs.)
4
Data Stewardship Using URIs
(Diagram: Space/Time or Metadata Query → Query Service
(e.g. ECHO) → List of URIs or EchoItemIds → Name Resolver
→ Access Data via List of URLs)
  • Query & URI Resolution
  • Query & Resolution services could be provided by
    separate institutions, replicated, or transferred
    (NASA → NOAA)
  • Name Lookup (URIs → URLs) can 'hide' entire
    process of locating & loading objects from tape
    archive (ordering data)
  • URIs assigned by hierarchical naming authorities
  • Example -- AIRS project permanently names its
    products:
    usgov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd
  • Enabling Long-Term Climate Science
  • Systems must scale to huge queries & orders
  • Support repeatable science analysis over years of
    data
  • AQUA ECHO Client (Automated Query and Access,
    recent ACCESS ECHO grant, PI Wilson)

5
Interfaces, Interfaces, Interfaces
  • You can never be too simple, or have too many
    interfaces.
  • AJAX GUI for Humans
  • Also visible interface for marketing
  • Machine-callable SOAP services
  • For large-scale automation, human out of the loop
  • GeoRegionQuery, OrderItems, etc.
  • Equivalent REST interface for services
  • rest.py?_service=aqua&_method=GeoRegionQuery
  • OpenSearch (REST) interface
  • Everything is search, and all results are feeds.
  • See opensearch.org

6
AQUA Open Search Interface
  • Discover a Collection
  • http://server/aqua/-/collections?q=water+vapor&startIndex=1&count=200&format=atom
  • Returns Atom feed listing collections satisfying
    query keywords
  • Metadata included in XML feed
  • Space/Time Query for Granules
  • http://server/aqua/-/granules/providerId/datasetShortName?time=2006-01-01T00:00:00,2006-02-01T00:00:00&bbox=south,west,north,east&responseGroups=Large&startIndex=1&count=200&format=kml
  • Returns Atom feed (or KML document!)
  • Granule metadata (time, georss:box) and URLs
    included
  • OpenSearch (GData) Features
  • Google Data (GData) standard uses and extends OpenSearch
  • Search aggregators auto-handle Atom feed,
    traverse result sets
  • Ingest data feed into Google Earth or Google Apps
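
The space/time granule query on this slide can be sketched as a small URL builder. This is only an illustration: the parameter names (time, bbox, responseGroups, startIndex, count, format) are read off the example URL on the slide and may not match the deployed service exactly, and the function name granule_query_url and the dataset name used in the usage note are hypothetical.

```python
from urllib.parse import urlencode


def granule_query_url(base, provider_id, dataset, start, end, bbox,
                      response_group="Large", start_index=1, count=200,
                      fmt="atom"):
    """Build a granule query URL in the style of the slide's example.

    bbox is (south, west, north, east), the order shown on the slide.
    All parameter names are assumptions taken from that example.
    """
    south, west, north, east = bbox
    params = {
        "time": f"{start},{end}",
        "bbox": f"{south},{west},{north},{east}",
        "responseGroups": response_group,
        "startIndex": start_index,
        "count": count,
        "format": fmt,
    }
    # Keep commas and colons unescaped so the URL stays readable.
    query = urlencode(params, safe=",:")
    return f"{base}/-/granules/{provider_id}/{dataset}?{query}"
```

For example, granule_query_url("http://server/aqua", "GSFCS4PA", "ExampleDataset", "2006-01-01T00:00:00", "2006-02-01T00:00:00", (-10, 20, 10, 40)) yields one URL per page of results; a client could step start_index by count to traverse a large result set.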

7
AQUA Space/Time Query Result
8
AQUA Space/Time Query Result (2)
  • Granule metadata includes
  • Time range
  • Bounding box
  • Link to browse image
  • DAP URL pointing to data file (if possible),
    otherwise ftp
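
A machine client might pull these fields out of the Atom result roughly as follows. This is a minimal sketch using only the Python standard library. The sample feed, the example.org URLs, and the link rel values ("enclosure" for the data file, "related" for the browse image) are assumptions for illustration, not the actual AQUA response format. The georss:box order (south west north east) follows the GeoRSS Simple convention.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
}

# Hypothetical query result with one granule entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss">
  <title>Granule query result</title>
  <entry>
    <title>AIRS.2003.01.02.004.L2.RetStd</title>
    <georss:box>-10.0 20.0 10.0 40.0</georss:box>
    <link rel="enclosure"
          href="http://example.org/dap/AIRS.2003.01.02.004.L2.RetStd.hdf"/>
    <link rel="related"
          href="http://example.org/browse/AIRS.2003.01.02.004.L2.RetStd.png"/>
  </entry>
</feed>"""


def parse_granules(feed_xml):
    """Return one dict per Atom entry: name, bounding box, and links by rel."""
    root = ET.fromstring(feed_xml)
    granules = []
    for entry in root.findall("atom:entry", NS):
        box_text = entry.findtext("georss:box", default="", namespaces=NS)
        # georss:box is "south west north east" in decimal degrees.
        box = tuple(float(v) for v in box_text.split()) if box_text else None
        links = {link.get("rel"): link.get("href")
                 for link in entry.findall("atom:link", NS)}
        granules.append({
            "name": entry.findtext("atom:title", namespaces=NS),
            "box": box,
            "links": links,
        })
    return granules
```

The time range is left out of the sketch because the slide does not say which element carries it.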

9
AQUA Space/Time Query Result (3)
10
Paradigm Roles
  • Data Provider Role
  • Publish data and metadata files onto the Web
  • Granule metadata includes space/time box to
    support query
  • Done. No push into repository.
  • Provider assigns permanent names to data objects
    and runs name resolver (translate URI/XRI → URL)
  • Indexer Role
  • Crawl and index metadata (maybe only once)
  • Providers and indexer must agree on metadata
    standards
  • Provide uniform search interface (possibly
    OpenSearch protocol returning Atom feed or KML)
  • Permanent Name Resolver
  • Global resolver delegates to proper naming
    authority (provider)
  • End Users
  • Use Google Earth or an app like it to search by
    space/time, keywords
  • Or machine agent uses OpenSearch protocol

11
Global-Scale Data Discovery
  • Publish metadata in KML or some standard format
  • Metadata file published alongside data file
  • Granule metadata includes space/time box, domain
    dataset keywords or ontology terms, variable
    list, etc.
  • Permanent name for data object
  • New metadata advertised via datacast (RSS feed)
  • Crawl and Index
  • Google or someone crawls and indexes metadata
  • Search by time, space, metadata fields, keywords
  • Also ontology-enhanced search term broadening,
    etc.
  • Results are lists of permanent names (URI, XRI)
  • Delegated Name Resolution
  • Resolution delegated to proper naming authority
  • Translate each URI/XRI to one or more URLs
    (preferably DAP).
  • Data Access
  • Browse viz. using KML
  • Slice data using DAP URLs

12
Steps Toward Massive Scaling
  • Publish → Index → Query → Perm. Names → URLs →
    Data Slice → Visualization
  • Transfer of data from tape to disk could be
    hidden inside an asynchronous Name Resolution
    service

13
Importance of Permanent Names
  • URLs rot, URNs not directly resolvable
  • Need scalable, global name resolver
  • Delegated naming authority
  • Also a problem for the semantic web
  • Algorithmic Naming
  • urn:usgov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd
  • XRI Name Resolution
  • XRIs are backward compatible with URIs
  • Submit name to global resolver
  • Resolution delegated down the chain to the entity
    that has naming authority and can translate
    XRI/URIs to URLs.
  • http://eos.nasa.gov/airs/(urn:usgov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd)
  • AIRS project assigns names and runs local name
    resolver
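
The delegation idea can be illustrated with a toy resolver: a global registry maps naming-authority prefixes to local resolvers and hands each name to the longest matching prefix. This is a sketch of the concept only, not the XRI resolution protocol. The class GlobalResolver, the function airs_resolver, and the airs.example.org URL are all hypothetical.

```python
class GlobalResolver:
    """Toy global resolver that delegates by naming-authority prefix."""

    def __init__(self):
        self._authorities = {}

    def register(self, prefix, local_resolver):
        """Delegate every name starting with `prefix` to `local_resolver`."""
        self._authorities[prefix] = local_resolver

    def resolve(self, name):
        """Return the list of URLs produced by the responsible authority."""
        matches = [p for p in self._authorities if name.startswith(p)]
        if not matches:
            raise LookupError(f"no naming authority for {name!r}")
        # The longest prefix wins, so sub-authorities override parents.
        prefix = max(matches, key=len)
        return self._authorities[prefix](name[len(prefix):])


def airs_resolver(local_name):
    """Hypothetical local resolver run by the AIRS project."""
    return [f"http://airs.example.org/data/{local_name}.hdf"]
```

Registering airs_resolver under "urn:usgov:nasa:eos:airs:" and resolving "urn:usgov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd" returns a one-element URL list. A real resolver could return several replicas per name.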

14
Global-Scale Data Discovery
  • Publish metadata in KML or some standard format
  • Metadata file published alongside data file
  • Granule metadata includes space/time box, domain
    dataset keywords or ontology terms, variable
    list, etc.
  • Permanent name for data object
  • New metadata advertised via datacast (RSS feed)
  • Crawl and Index
  • Google or someone crawls and indexes metadata
  • Search by time, space, metadata fields, keywords
  • Also ontology-enhanced search term broadening,
    etc.
  • Results are lists of permanent names (URI, XRI)
  • Delegated Name Resolution
  • Resolution delegated to proper naming authority
  • Translate each URI/XRI to one or more URLs
    (preferably DAP).
  • Data Access
  • Browse viz. using KML
  • Slice data using DAP URLs

15
Questions
  • Metadata standards
  • Embed metadata in modular microformats within
    KML
  • Workable?
  • Crawl and Index
  • Google already provides lat/lon search for KML
    Placemarks
  • Our problem is more technical & more massive
  • Who will do it?
  • Scalable Name Resolution Using XRIs
  • XRI resolvers do scale like DNS
  • But W3C won't accept XRI standard from OASIS
  • Permanent names are very important, but are hardly
    used on the web.
  • Data Access
  • Why can't we just get along?
  • Why doesn't every provider run DAP servers?

16
Algorithmic Naming
  • Transparent, Content-Full Names
  • Not opaque identifiers, like DOI, LSID, UUID
  • Construct name using deterministic algorithm
  • usgov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd
  • Reversible: can extract metadata from the name
  • Using URLs as Permanent Names
  • http://airs.jpl.nasa.gov/data/L2/2003/01/AIRS.2003.01.02.004.L2.RetStd.hdf
  • URL never changes: support this using multi-homed
    hosts and symbolic links
  • Each project designs filenames and directory
    structure
  • Crawler Not Necessary
  • To locate a data file, construct its permanent
    URL using the algorithm and see if it exists
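
A deterministic, reversible name of this kind might be built and parsed as sketched below. The field meanings are an assumption read off the example name (instrument, year, month, day, a three-digit granule number, processing level, product); the slides do not define them, and the names Granule, build_name, and parse_name are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Granule:
    instrument: str
    day: date
    granule_number: int  # assumed meaning of the "004" field
    level: str
    product: str


def build_name(g):
    """Construct the permanent name from metadata (deterministic)."""
    return (f"{g.instrument}.{g.day:%Y.%m.%d}."
            f"{g.granule_number:03d}.{g.level}.{g.product}")


def parse_name(name):
    """Recover the metadata from a permanent name (the reverse step)."""
    instrument, year, month, dom, number, level, product = name.split(".")
    return Granule(instrument, date(int(year), int(month), int(dom)),
                   int(number), level, product)
```

Because build_name and parse_name are inverses, a client can go from a space/time query result to a name, or from a name back to searchable metadata, without consulting a catalog.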

17
Using URLs as Permanent Names
  • Delegated Naming Authority
  • The Real Estate grab has already occurred. Use
    it.
  • http://airs.eos.nasa.gov/data/L2/RetStd/2003/01/AIRS.2003.01.02.004.L2.RetStd.hdf
  • URL never changes: support this by using multi-homed
    hosts, Unix symbolic links, etc.
  • Cultural Problem
  • URLs aren't designed or expected to be permanent
  • Data providers need to think permanently
  • Each project designs algorithmic filenames,
    dataset directory structure, permanent URLs
  • Crawler & Name Resolver Not Necessary
  • To locate a data file, construct its permanent
    URL using the algorithm and see if it exists.
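
"Construct the URL and see if it exists" could look like the sketch below. The directory layout (level/product/year/month/filename) is inferred from the single example URL on this slide, so treat it as an assumption. The function names are hypothetical, and the existence check is a plain HTTP HEAD request with the standard library.

```python
import urllib.error
import urllib.request


def permanent_url(name, base="http://airs.eos.nasa.gov/data"):
    """Map a granule name to its permanent URL.

    Layout assumed from the slide's example:
    <base>/<level>/<product>/<year>/<month>/<name>.hdf
    """
    _instrument, year, month, _day, _number, level, product = name.split(".")
    return f"{base}/{level}/{product}/{year}/{month}/{name}.hdf"


def url_exists(url, timeout=10):
    """Return True if a HEAD request for `url` succeeds with a 2xx status."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors (404, etc.), DNS failures, and timeouts.
        return False
```

A client would call url_exists(permanent_url("AIRS.2003.01.02.004.L2.RetStd")) and fall back to a query service only when the direct lookup fails.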

18
eXtensible Resource Identifiers (XRI)
  • Features
  • Backward compatible with URI/IRI (xri:// →
    https://xri.)
  • Cross-references: any XRI can contain another
    XRI/URI
  • Global context symbols: =, @, +, $, or !
  • I-names, i-numbers used in OpenID
  • Decentralization: private/decentralized root
    authorities possible, peer-to-peer addressing
  • Delegation, Federation: flexible namespace
    authorities
  • Persistence: by adding a new abstract layer of
    addressing/resolving on top of DNS & IP numbers
  • Human and machine-friendly formats: i-names for
    humans, i-numbers for machines
  • Simple, extensible resolution: XRDS description
    docs.
  • Trusted resolution: HTTPS, optional SAML
    assertions
  • Resolution can be independent of DNS (feature?)
  • Fully internationalized, leveraging Unicode & IRI
    specs.
  • Transport independent: not bound to a specific
    protocol (http)
  • XDI: XRI Data Interchange

19
XRI Features (2)
  • Scalable Resolution Algorithm
  • Submit name to global resolver
  • Resolution delegated down the chain to the entity
    that has naming authority and can translate
    XRI/URIs to URLs.
  • Cross-Link XRIs
  • http://eos.nasa.gov/airs/(urn:usgov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd)
  • Global proxy resolver
  • http://xri.net
  • I-name examples
  • =Mary.Jones (people)
  • @Jones.and.Company (organizations)
  • +phone.number (generic concepts)
  • =Mary.Jones/(+phone.number)
  • @Jones.and.Company/(+phone.number)

20
The Handle System
  • Alternative to URIs or URNs
  • Open source software, developed by CNRI
  • Operational system
  • Global name resolver delegates to local resolvers
  • Simple, so scalable
  • Format: <Local Resolver>/<Permanent Names>
  • Possible Handle Scheme
  • us.gov.nasa.eos.airs/AIRS.2003.01.02.004.L2.RetStd
  • AIRS project runs local name resolver
  • Permanent names adapted from unique granule names
    or ECHO ItemIds
  • Resolver translates permanent name to one or more
    URLs
  • Query → Perm. Names → URLs → Data Access
  • Names → URIs done by local name resolver
  • Resolver maintains list of URLs by crawling data
    archives
  • Transfer of data from tape to disk could be
    hidden inside the Name Resolution service
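
How a local resolver might hide tape staging behind name resolution can be sketched as follows. This is a simplified, synchronous illustration and does not use the Handle System software or its API: LocalResolver, resolve_handle, and the stage callback are hypothetical, and a production service would more likely stage asynchronously.

```python
class LocalResolver:
    """Toy local resolver that stages tape-only objects on first request."""

    def __init__(self, online, on_tape, stage):
        self.online = dict(online)   # permanent name -> list of URLs
        self.on_tape = set(on_tape)  # names currently held only on tape
        self.stage = stage           # callable: name -> URL of staged copy

    def resolve(self, name):
        if name not in self.online:
            if name not in self.on_tape:
                raise KeyError(f"unknown permanent name: {name}")
            # The caller never sees this step: staging happens inside
            # the lookup, and the result is cached as an on-line URL.
            self.online[name] = [self.stage(name)]
            self.on_tape.discard(name)
        return self.online[name]


def resolve_handle(handle, resolvers):
    """Split '<Local Resolver>/<Permanent Name>' and delegate the lookup."""
    authority, _, name = handle.partition("/")
    return resolvers[authority].resolve(name)
```

With a resolver registered under "us.gov.nasa.eos.airs", resolving "us.gov.nasa.eos.airs/AIRS.2003.01.02.004.L2.RetStd" triggers staging once; later lookups return the cached URL.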

21
(No Transcript)
22
Datasets in SciFlo
  • A SciFlo Dataset is
  • Specified as a space/time query over collections
    of data products (or retrieved physical
    variables)
  • GeoRegionQuery(DataProduct, TimeRange,
    LatLonRegion)
  • GeoRegionQuery(PhysicalVariable, TimeRange,
    LatLonRegion)
  • Realized as a list of object IDs or URIs
    (permanent names)
  • GeoRegionQuery returns unique objectIds along
    with geolocation metadata
  • Accessed using a list of URLs pointing to
    on-line replicas of the data objects (files).
  • FindDataById(objectIds) → URLs (ftp, http, or
    OpenDAP)
  • Translate unique object IDs into list of on-line
    locations in DataPools or any SciFlo node
  • DataPools & SciFlo P2P network are crawled to
    update distributed translation tables
  • No need to publish presence of data, continuously
    discovered
  • SciFlo network is a distributed cache for
    scientific datasets
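
The two-step pattern (query for object IDs, then translate IDs to URLs) can be mimicked with an in-memory catalog. This is not the SciFlo implementation: the function names echo the slide's GeoRegionQuery and FindDataById, but the signatures, the catalog entries, and the example.org URLs are invented for illustration. Boxes are (south, west, north, east), and regions that cross the antimeridian are not handled.

```python
from datetime import datetime

# Hypothetical granule catalog.
CATALOG = [
    {"id": "granule-001", "product": "MOD04",
     "start": datetime(2006, 1, 1, 0, 0), "end": datetime(2006, 1, 1, 0, 5),
     "box": (10.0, 20.0, 30.0, 45.0)},
    {"id": "granule-002", "product": "MOD04",
     "start": datetime(2006, 3, 1, 0, 0), "end": datetime(2006, 3, 1, 0, 5),
     "box": (10.0, 20.0, 30.0, 45.0)},
    {"id": "granule-003", "product": "MOD04",
     "start": datetime(2006, 1, 1, 1, 0), "end": datetime(2006, 1, 1, 1, 5),
     "box": (-60.0, -100.0, -40.0, -80.0)},
]

# Hypothetical translation table: object ID -> on-line replicas.
REPLICAS = {
    "granule-001": ["ftp://pool.example.org/granule-001.hdf",
                    "http://node.example.org/dap/granule-001.hdf"],
}


def boxes_overlap(a, b):
    """True if two (south, west, north, east) boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def geo_region_query(product, time_range, region):
    """Return IDs of granules overlapping the time range and region."""
    t0, t1 = time_range
    return [g["id"] for g in CATALOG
            if g["product"] == product
            and g["start"] <= t1 and t0 <= g["end"]
            and boxes_overlap(g["box"], region)]


def find_data_by_id(object_ids):
    """Translate object IDs into lists of replica URLs (possibly empty)."""
    return {oid: REPLICAS.get(oid, []) for oid in object_ids}
```

Keeping the query result as a list of IDs rather than URLs is what lets the replica table change (new mirrors, data moved off tape) without invalidating a saved dataset definition.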

23
GeoRegionQuery → Dataset
Space/Time Query for MODIS Dataset
(Diagram: inputs MOD04, L2, VersionId; TimeRange (start,
end); GeoRegion (lat/lon rectangle); and Response Group
(Small, M, or L) feed GeoRegionQuery MODIS, which returns
a List of Object IDs & GeoLocation Metadata. FindDataById
then returns a List of object URLs (multiple URLs for
each object).)
24
GeoRegionQuery → Datasets
Query for Aerosol Optical Depth Variable: Match
Multiple Instruments/Models
(Diagram: GeoRegionQuery for AOD fans out into Query for
AERONET L2, Query for MISR L2, Query for MODIS L2, and
Query for IMPACT Model Grid, producing a Merged AOD
Dataset.)
25
Example Workflow: Locate 4 Datasets
Space/Time Query for GPS, AIRS, MODIS
Plot Data Locations
26
Example Workflow: Locate 4 Datasets
Space/Time Query for GPS, AIRS, MODIS
Plot Data Locations
27
Plot of Granule Bounding Boxes
Space/Time Query for GPS, AIRS, MODIS
28
iEarth Flowchart
Space/Time Query for GPS, AIRS, MODIS
29
SciFlo's Innovative Use of IT
  • Pervasive use of XML metadata
  • Computers are stupid because we don't give them
    enough semantic metadata!
  • XML metadata more important than files of numbers
  • XML Microformats: micro for easy adoption
  • Ontologies: concept maps / taxonomies for
    science domains
  • Intentional Programming
  • Visual Programming (in your web browser!)
  • Declare the workflow in an XML document; minimize
    code
  • Programming at a human (conceptual) level
  • Powerful high-level programming language
    (Python)
  • Data Stewardship thru Permanent Names
  • Permanent names are underutilized on the Internet
  • Product Query → URIs → URLs → Data Access
  • URI: usgov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd
  • ECHO ItemIds can serve as a form of URI

30
Visual Dataflow Programming
GPS AIRS Level-2 Space/Time Matchup
  • Connect a series of services and operators into
    a dataflow
  • Drag services/operators from menu, and drop onto
    the canvas
  • Lay out the flowchart by moving nodes
  • Connect the input/output ports by drawing lines
  • User guided by matching up port names and types

(AJAX, JavaScript, and SVG programming by Gerald
Manipon & Wilson)