Title: Massively Scalable, Uniform Data Discovery
1Massively Scalable, Uniform Data Discovery and Access
Brian Wilson and Zhangfan Xing
Jet Propulsion Laboratory
Secondary Rant: The Importance of Permanent Names
2Problem
- How do we index all of the science data in the world and provide a uniform interface for search?
  - Including space/time query on Earth, Mars, etc.
- Use Google's Approach?
  - Their goal is to index the WWW of knowledge
  - And GEarth/KML is starting on the geolocation problem
- Who?
  - Will Google eventually do it?
  - Follow-on to EOS Clearinghouse (ECHO) project?
Carbon Cycle
3AQUA Client Architecture
Architecture diagram labels:
- Data Providers: LARC, NSIDC_ECS, GSFCS4PA, ORNL_DAAC
- ECHO SOAP Services: Collection Discovery, Granule Query, Order Items
- AQUA REST and SOAP Services: List Collections, Time-Segmented Granule Query, Granule Query, Query Order, Order Items, etc.
- Clients: AQUA GUI Clients, Machine Clients, 2nd AQUA, 3rd AQUA
- Fetch Data or Browse Files Using URLs
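The "Time-Segmented Granule Query" label suggests that a long time range is split into smaller pieces that are queried one at a time. A minimal sketch of that splitting step follows; the function name `segment_time_range` and the 30-day segment length are assumptions for illustration, not part of the AQUA services.

```python
from datetime import datetime, timedelta


def segment_time_range(start, end, days=30):
    """Split [start, end) into consecutive segments of at most `days` days."""
    if end <= start:
        raise ValueError("end must be after start")
    step = timedelta(days=days)
    segments = []
    seg_start = start
    while seg_start < end:
        # The last segment is clipped so it never runs past `end`.
        seg_end = min(seg_start + step, end)
        segments.append((seg_start, seg_end))
        seg_start = seg_end
    return segments


# Each segment would become one granule query against the service.
segments = segment_time_range(datetime(2006, 1, 1), datetime(2006, 4, 1), days=30)
for seg_start, seg_end in segments:
    print(seg_start.isoformat(), "to", seg_end.isoformat())
```

Keeping each request small bounds the size of any one result set, which is one way a client could issue very large queries without overloading the query service.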
4Data Stewardship Using URIs
Space/Time or Metadata Query → Query Service (e.g. ECHO) → List of URIs or EchoItemIds → Name Resolver → Access Data via List of URLs
- Query and URI Resolution
  - Query and Resolution services could be provided by separate institutions, replicated, or transferred (NASA → NOAA)
  - Name Lookup (URIs → URLs) can hide the entire process of locating and loading objects from tape archive (ordering data)
  - URIs assigned by hierarchical naming authorities
  - Example: the AIRS project permanently names its products, e.g. us:gov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd
- Enabling Long-Term Climate Science
  - Systems must scale to huge queries and orders
  - Support repeatable science analysis over years of data
- AQUA ECHO Client (Automated Query and Access, recent ACCESS ECHO grant, PI Wilson)
5Interfaces, Interfaces, Interfaces
- You can never be too simple, or have too many interfaces.
- AJAX GUI for Humans
  - Also visible interface for marketing
- Machine-callable SOAP services
  - For large-scale automation, human out of the loop
  - GeoRegionQuery, OrderItems, etc.
- Equivalent REST interface for services
  - rest.py?_service=aqua&_method=GeoRegionQuery
- OpenSearch (REST) interface
  - Everything is search, and all results are feeds.
  - See opensearch.org
6AQUA Open Search Interface
- Discover a Collection
  - http://server/aqua/-/collections?q=watervapor&startIndex=1&count=200&format=atom
  - Returns Atom feed listing collections satisfying query keywords
  - Metadata included in XML feed
- Space/Time Query for Granules
  - http://server/aqua/-/granules/providerId/datasetShortName?time=2006-01-01T00:00:00,2006-02-01T00:00:00&bbox=south,west,north,east&responseGroups=Large&startIndex=1&count=200&format=kml
  - Returns Atom feed (or KML document!)
  - Granule metadata (time, georss:box) and URLs included
- OpenSearch (GData) Features
  - Google Data standard uses and extends OpenSearch
  - Search aggregators auto-handle Atom feed, traverse result sets
  - Ingest data feed into Google Earth or Google Apps
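A machine client could assemble the granule query URL from its parts. The sketch below assumes the parameter names shown on the slide (`time`, `bbox`, `responseGroups`, `startIndex`, `count`, `format`) joined with `=` and `&`; the helper name `granule_query_url` and the dataset short name `AIRX2RET` are placeholders, and `http://server/aqua` is the slide's own stand-in host.

```python
from urllib.parse import urlencode


def granule_query_url(base, provider_id, dataset, start, end, bbox,
                      response_group="Large", start_index=1, count=200,
                      fmt="atom"):
    """Build a space/time granule query URL in the style shown on the slide."""
    south, west, north, east = bbox  # slide order: south, west, north, east
    params = {
        "time": f"{start},{end}",
        "bbox": f"{south},{west},{north},{east}",
        "responseGroups": response_group,
        "startIndex": start_index,
        "count": count,
        "format": fmt,
    }
    # Keep ':' and ',' readable instead of percent-encoding them.
    query = urlencode(params, safe=":,")
    return f"{base}/-/granules/{provider_id}/{dataset}?{query}"


url = granule_query_url(
    "http://server/aqua", "GSFCS4PA", "AIRX2RET",
    "2006-01-01T00:00:00", "2006-02-01T00:00:00",
    (-10, 100, 10, 120), fmt="kml",
)
print(url)
```

Paging through a large result set would then amount to repeating the request with a larger `start_index`.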
7AQUA Space/Time Query Result
8AQUA Space/Time Query Result (2)
- Granule metadata includes
  - Time range
  - Bounding box
  - Link to browse image
  - DAP URL pointing to data file (if possible), otherwise FTP
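To show how a client might read this granule metadata out of an Atom result, here is a small parsing sketch. The feed layout is an assumption: it uses a GeoRSS `georss:box` element (south west north east), a GData-style `gd:when` element for the time range, and `rel="enclosure"` / `rel="icon"` links for the data and browse URLs. The sample entry, its times, and its URLs are made up; the real AQUA feed may differ.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "georss": "http://www.georss.org/georss",
    "gd": "http://schemas.google.com/g/2005",
}

# Hypothetical one-entry feed, for illustration only.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:georss="http://www.georss.org/georss"
      xmlns:gd="http://schemas.google.com/g/2005">
  <entry>
    <title>AIRS.2003.01.02.004.L2.RetStd</title>
    <gd:when startTime="2003-01-02T00:18:00Z" endTime="2003-01-02T00:24:00Z"/>
    <georss:box>-10.0 100.0 10.0 120.0</georss:box>
    <link rel="enclosure" href="http://server/dap/AIRS.2003.01.02.004.L2.RetStd.hdf"/>
    <link rel="icon" href="http://server/browse/AIRS.2003.01.02.004.L2.RetStd.jpg"/>
  </entry>
</feed>
""".strip()


def parse_granules(feed_xml):
    """Return one dict per Atom entry with time range, bounding box, and links."""
    root = ET.fromstring(feed_xml)
    granules = []
    for entry in root.findall("atom:entry", NS):
        when = entry.find("gd:when", NS)
        box_text = entry.findtext("georss:box", namespaces=NS)
        south, west, north, east = (float(v) for v in box_text.split())
        links = {link.get("rel"): link.get("href")
                 for link in entry.findall("atom:link", NS)}
        granules.append({
            "title": entry.findtext("atom:title", namespaces=NS),
            "time": (when.get("startTime"), when.get("endTime")),
            "bbox": (south, west, north, east),
            "data_url": links.get("enclosure"),
            "browse_url": links.get("icon"),
        })
    return granules


granules = parse_granules(SAMPLE_FEED)
print(granules[0]["title"], granules[0]["bbox"])
```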
9AQUA Space/Time Query Result (3)
10Paradigm Roles
- Data Provider Role
  - Publish data and metadata files onto the Web
  - Granule metadata includes space/time box to support query
  - Done. No push into repository.
  - Provider assigns permanent names to data objects and runs name resolver (translate URI/XRI → URL)
- Indexer Role
  - Crawl and index metadata (maybe only once)
  - Providers and indexer must agree on metadata standards
  - Provide uniform search interface (possibly OpenSearch protocol returning Atom feed or KML)
- Permanent Name Resolver
  - Global resolver delegates to proper naming authority (provider)
- End Users
  - Use GEarth or an app like it to search by space/time, keywords
  - Or machine agent uses OpenSearch protocol
11Global-Scale Data Discovery
- Publish metadata in KML or some standard format
  - Metadata file published alongside data file
  - Granule metadata includes space/time box, domain and dataset keywords or ontology terms, variable list, etc.
  - Permanent name for data object
  - New metadata advertised via datacast (RSS feed)
- Crawl and Index
  - Google or someone crawls and indexes metadata
  - Search by time, space, metadata fields, keywords
  - Also ontology-enhanced search term broadening, etc.
  - Results are lists of permanent names (URI, XRI)
- Delegated Name Resolution
  - Resolution delegated to proper naming authority
  - Translate each URI/XRI to one or more URLs (preferably DAP).
- Data Access
  - Browse viz. using KML
  - Slice data using DAP URLs
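The "Crawl and Index" step can be pictured as a table of granule records that is searched by time and bounding-box overlap, returning permanent names rather than URLs. The following is a toy sketch under those assumptions: the record layout, the linear scan, and the sample granules are illustrative only, and the box test ignores longitude wrap-around at the dateline.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GranuleRecord:
    name: str          # permanent name of the data object
    start: datetime    # start of time coverage
    end: datetime      # end of time coverage
    south: float
    west: float
    north: float
    east: float


def matches(rec, start, end, bbox):
    """True if the record overlaps the query time range and lat/lon box."""
    south, west, north, east = bbox
    time_overlap = rec.start < end and start < rec.end
    box_overlap = (rec.south <= north and south <= rec.north
                   and rec.west <= east and west <= rec.east)
    return time_overlap and box_overlap


def search(index, start, end, bbox):
    """Return the permanent names of all matching granules."""
    return [rec.name for rec in index if matches(rec, start, end, bbox)]


# Two made-up index entries, as a crawler might have collected them.
INDEX = [
    GranuleRecord("urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd",
                  datetime(2003, 1, 2, 0, 18), datetime(2003, 1, 2, 0, 24),
                  -10, 100, 10, 120),
    GranuleRecord("urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.005.L2.RetStd",
                  datetime(2003, 1, 2, 0, 24), datetime(2003, 1, 2, 0, 30),
                  10, 95, 30, 115),
]

hits = search(INDEX, datetime(2003, 1, 2), datetime(2003, 1, 3), (-5, 105, 5, 110))
print(hits)
```

A production indexer would use a spatial and temporal index instead of a scan, but the contract is the same: a space/time query in, a list of permanent names out.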
12Steps Toward Massive Scaling
- Publish → Index → Query → Perm. Names → URLs → Data Slice → Visualization
- Transfer of data from tape to disk could be hidden inside an asynchronous Name Resolution service
13Importance of Permanent Names
- URLs rot, URNs not directly resolvable
  - Need scalable, global name resolver
  - Delegated naming authority
  - Also a problem for the semantic web
- Algorithmic Naming
  - urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd
- XRI Name Resolution
  - XRIs are backward compatible with URIs
  - Submit name to global resolver
  - Resolution delegated down the chain to the entity that has naming authority and can translate XRIs/URIs to URLs.
  - http://eos.nasa.gov/airs/(urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd)
  - AIRS project assigns names and runs local name resolver
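Delegated resolution can be sketched as a global resolver that only knows which naming authority owns which name prefix, and forwards each lookup to that authority. The classes below are hypothetical stand-ins (this is not the XRI resolution protocol), the names are assumed to be colon-separated URNs, and the URL table holds a single example entry.

```python
class NamingAuthority:
    """A local resolver that owns every name under one prefix."""

    def __init__(self, prefix, table):
        self.prefix = prefix
        self._table = dict(table)  # permanent name -> list of URLs

    def resolve(self, name):
        return list(self._table.get(name, []))


class GlobalResolver:
    """Delegates each lookup to the authority with the longest matching prefix."""

    def __init__(self):
        self._authorities = {}

    def register(self, authority):
        self._authorities[authority.prefix] = authority

    def resolve(self, name):
        candidates = [p for p in self._authorities if name.startswith(p)]
        if not candidates:
            raise LookupError(f"no naming authority for {name!r}")
        return self._authorities[max(candidates, key=len)].resolve(name)


NAME = "urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd"
URL = "http://airs.jpl.nasa.gov/data/L2/2003/01/AIRS.2003.01.02.004.L2.RetStd.hdf"

resolver = GlobalResolver()
resolver.register(NamingAuthority("urn:us:gov:nasa:eos:airs:", {NAME: [URL]}))
print(resolver.resolve(NAME))
```

Because the global resolver stores only prefixes, each project can move or replicate its files and update its own table without any change at the top level.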
14Global-Scale Data Discovery
15Questions
- Metadata standards
  - Embed metadata in modular microformats within KML
  - Workable?
- Crawl and Index
  - Google already provides lat/lon search for KML Placemarks
  - Our problem is more technical and more massive
  - Who will do it?
- Scalable Name Resolution Using XRIs
  - XRI resolvers do scale like DNS
  - But W3C won't accept XRI standard from OASIS
  - Permanent names are very important, but are hardly used on the web.
- Data Access
  - Why can't we just get along?
  - Why doesn't every provider run DAP servers?
16Algorithmic Naming
- Transparent, Content-Full Names
  - Not opaque identifiers, like DOI, LSID, UUID
  - Construct name using deterministic algorithm
  - us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd
  - Reversible: can extract metadata from the name
- Using URLs as Permanent Names
  - http://airs.jpl.nasa.gov/data/L2/2003/01/AIRS.2003.01.02.004.L2.RetStd.hdf
  - URL never changes: support using multi-homing hosts and symbolic links
  - Each project designs filenames and directory structure
- Crawler Not Necessary
  - To locate a data file, construct its permanent URL using the algorithm and see if it exists
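The "deterministic and reversible" property can be illustrated with a pair of functions that build a name from metadata and recover the metadata from the name. The field meanings are assumptions read off the example name (year, month, day, granule number, processing level, product type), and the colon-separated prefix is likewise assumed; the AIRS project's actual naming rules may differ.

```python
import re
from typing import NamedTuple

PREFIX = "us:gov:nasa:eos:airs:"  # assumed naming-authority prefix


class Granule(NamedTuple):
    year: int
    month: int
    day: int
    granule: int   # assumed: granule number within the day
    level: str     # e.g. "L2"
    product: str   # e.g. "RetStd"


def build_name(g):
    """Construct the permanent name deterministically from granule metadata."""
    return (f"{PREFIX}AIRS.{g.year:04d}.{g.month:02d}.{g.day:02d}."
            f"{g.granule:03d}.{g.level}.{g.product}")


_PATTERN = re.compile(r"AIRS\.(\d{4})\.(\d{2})\.(\d{2})\.(\d{3})\.(L\d\w*)\.(\w+)$")


def parse_name(name):
    """Reverse build_name: extract the metadata fields from a permanent name."""
    if not name.startswith(PREFIX):
        raise ValueError(f"unknown naming authority in {name!r}")
    match = _PATTERN.match(name[len(PREFIX):])
    if match is None:
        raise ValueError(f"name does not follow the pattern: {name!r}")
    year, month, day, granule, level, product = match.groups()
    return Granule(int(year), int(month), int(day), int(granule), level, product)


name = build_name(Granule(2003, 1, 2, 4, "L2", "RetStd"))
print(name)
print(parse_name(name))
```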
17Using URLs as Permanent Names
- Delegated Naming Authority
  - The Real Estate grab has already occurred. Use it.
  - http://airs.eos.nasa.gov/data/L2/RetStd/2003/01/AIRS.2003.01.02.004.L2.RetStd.hdf
  - URL never changes: support by using multi-homing hosts, Unix symbolic links, etc.
- Cultural Problem
  - URLs aren't designed or expected to be permanent
  - Data providers need to think permanently
  - Each project designs algorithmic filenames, dataset directory structure, permanent URLs
- Crawler and Name Resolver Not Necessary
  - To locate a data file, construct its permanent URL using the algorithm and see if it exists.
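"Construct the URL and see if it exists" could look like the sketch below. The directory layout is inferred from the single example URL on this slide (`/data/<level>/<product>/<year>/<month>/<file>`), the function names are hypothetical, and the existence check is an ordinary HTTP HEAD request; the example host is the slide's and is not assumed to be reachable.

```python
import urllib.error
import urllib.request

BASE = "http://airs.eos.nasa.gov/data"  # host taken from the slide's example


def permanent_url(year, month, day, granule, level="L2", product="RetStd"):
    """Build the permanent URL of a granule from its metadata alone."""
    filename = (f"AIRS.{year:04d}.{month:02d}.{day:02d}."
                f"{granule:03d}.{level}.{product}.hdf")
    return f"{BASE}/{level}/{product}/{year:04d}/{month:02d}/{filename}"


def exists(url, timeout=10.0):
    """Return True if a HEAD request for the URL succeeds."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors such as 404 as well as unreachable hosts.
        return False


url = permanent_url(2003, 1, 2, 4)
print(url)
# exists(url) would perform the network check; it is not called here.
```

With this scheme a client needs neither a crawler nor a resolver: the naming algorithm is the lookup.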
18eXtensible Resource Identifiers (XRI)
- Features
  - Backward compatible with URI/IRI (xri:// → https://xri.)
  - Cross-references: any XRI can contain another XRI/URI
  - Global context symbols: =, @, +, $, or !
  - I-names, i-numbers used in OpenID
  - Decentralization: private/decentralized root authorities possible, peer-to-peer addressing
  - Delegation, Federation: flexible namespace authorities
  - Persistence: by adding a new abstract layer of addressing/resolving on top of DNS and IP numbers
  - Human and machine-friendly formats: i-names for humans, i-numbers for machines
  - Simple, extensible resolution: XRDS description docs.
  - Trusted resolution: HTTPS, optional SAML assertions
  - Resolution can be independent of DNS (feature?)
  - Fully internationalized, leveraging Unicode and IRI specs.
  - Transport independent: not bound to a specific protocol (http)
  - XDI: XRI Data Interchange
19XRI Features (2)
- Scalable Resolution Algorithm
  - Submit name to global resolver
  - Resolution delegated down the chain to the entity that has naming authority and can translate XRIs/URIs to URLs.
- Cross-Link XRIs
  - http://eos.nasa.gov/airs/(urn:us:gov:nasa:eos:airs:AIRS.2003.01.02.004.L2.RetStd)
- Global proxy resolver
  - http://xri.net
- I-name examples
  - =Mary.Jones (people)
  - @Jones.and.Company (organizations)
  - +phone.number (generic concepts)
  - =Mary.Jones/(+phone.number)
  - @Jones.and.Company/(+phone.number)
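The i-name examples can be read mechanically: the leading global context symbol says what kind of entity the name belongs to, and parenthesized path segments are cross-references to other XRIs. The sketch below assumes the standard XRI symbols `=` (people), `@` (organizations), and `+` (generic concepts); the function `describe_iname` is hypothetical and is not a complete XRI parser.

```python
# Only the three context symbols illustrated on the slide are handled.
GLOBAL_CONTEXT = {
    "=": "person",
    "@": "organization",
    "+": "generic concept",
}


def describe_iname(xri):
    """Return (kind, authority, cross_references) for a simple i-name."""
    symbol = xri[:1]
    if symbol not in GLOBAL_CONTEXT:
        raise ValueError(f"unsupported global context symbol in {xri!r}")
    authority, _, path = xri[1:].partition("/")
    # Path segments wrapped in parentheses are cross-references to other XRIs.
    cross_refs = [segment[1:-1] for segment in path.split("/")
                  if segment.startswith("(") and segment.endswith(")")]
    return GLOBAL_CONTEXT[symbol], authority, cross_refs


for example in ("=Mary.Jones", "@Jones.and.Company/(+phone.number)"):
    print(example, "->", describe_iname(example))
```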
20The Handle System
- Alternative to URIs or URNs
  - Open source software, developed by CNRI
  - Operational system
  - Global name resolver delegates to local resolvers
  - Simple, so scalable
  - Format: <Local Resolver>/<Permanent Names>
- Possible Handle Scheme
  - us.gov.nasa.eos.airs/AIRS.2003.01.02.004.L2.RetStd
  - AIRS project runs local name resolver
  - Permanent names adapted from unique granule names or ECHO ItemIds
  - Resolver translates permanent name to one or more URLs
- Query → Perm. Names → URLs → Data Access
  - Names → URIs done by local name resolver
  - Resolver maintains list of URLs by crawling data archives
  - Transfer of data from tape to disk could be hidden inside the Name Resolution service
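One way to picture "hiding the tape transfer inside name resolution" is a resolver that answers either with URLs or with a "staging" status, and quietly requests the transfer in the second case. Everything in this sketch is hypothetical: the class, its status strings, and the `finish_staging` method that stands in for the archive completing its work. It is not the Handle System API.

```python
def split_handle(handle):
    """Split '<local resolver>/<permanent name>' into its two parts."""
    local_resolver, sep, name = handle.partition("/")
    if not sep or not local_resolver or not name:
        raise ValueError(f"not a handle: {handle!r}")
    return local_resolver, name


class StagingResolver:
    """Local resolver that stages tape-resident data on first request."""

    def __init__(self, online, on_tape):
        self._online = dict(online)    # name -> list of URLs, already on disk
        self._on_tape = dict(on_tape)  # name -> URL the file gets once staged
        self._pending = set()

    def resolve(self, name):
        if name in self._online:
            return "ready", list(self._online[name])
        if name in self._on_tape:
            self._pending.add(name)    # stand-in for a tape staging request
            return "staging", []
        raise LookupError(name)

    def finish_staging(self):
        """Simulate the archive completing all pending tape-to-disk copies."""
        for name in sorted(self._pending):
            self._online[name] = [self._on_tape.pop(name)]
        self._pending.clear()


resolver_id, name = split_handle("us.gov.nasa.eos.airs/AIRS.2003.01.02.004.L2.RetStd")
resolver = StagingResolver(
    online={},
    on_tape={name: "http://airs.jpl.nasa.gov/data/L2/2003/01/" + name + ".hdf"},
)
print(resolver.resolve(name))   # first call only triggers staging
resolver.finish_staging()
print(resolver.resolve(name))   # later call returns the URL
```

The client simply asks again later; it never needs to know that an order against a tape archive took place.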
22Datasets in SciFlo
- A SciFlo Dataset is
  - Specified as a space/time query over collections of data products (or retrieved physical variables)
    - GeoRegionQuery(DataProduct, TimeRange, LatLonRegion)
    - GeoRegionQuery(PhysicalVariable, TimeRange, LatLonRegion)
  - Realized as a list of object IDs or URIs (permanent names)
    - GeoRegionQuery returns unique objectIds along with geolocation metadata
  - Accessed using a list of URLs pointing to on-line replicas of the data objects (files).
    - FindDataById(objectIds) → URLs (ftp, http, or OpenDAP)
    - Translate unique object IDs into list of on-line locations in DataPools or any SciFlo node
- DataPools and SciFlo P2P network are crawled to update distributed translation tables
  - No need to publish presence of data, continuously discovered
  - SciFlo network is a distributed cache for scientific datasets
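The two-step pattern above (query for object IDs, then translate IDs to replica URLs) can be sketched with in-memory tables. The functions `geo_region_query` and `find_data_by_id` are local stand-ins named after the slide's GeoRegionQuery and FindDataById, not the SciFlo services themselves; the catalog entries, object IDs, and example.org URLs are invented for illustration.

```python
# Hypothetical catalog: object ID, product, time range, (south, west, north, east).
CATALOG = [
    {"id": "MOD04_L2.granule-0001", "product": "MOD04_L2",
     "time": ("2003-01-02T00:00:00", "2003-01-02T00:05:00"),
     "box": (-10.0, 100.0, 10.0, 120.0)},
    {"id": "MOD04_L2.granule-0002", "product": "MOD04_L2",
     "time": ("2003-01-02T00:05:00", "2003-01-02T00:10:00"),
     "box": (10.0, 95.0, 30.0, 115.0)},
]

# Hypothetical translation table: object ID -> on-line replicas.
REPLICAS = {
    "MOD04_L2.granule-0001": [
        "ftp://datapool.example.org/MOD04_L2/granule-0001.hdf",
        "http://node.example.org/dap/MOD04_L2/granule-0001.hdf",
    ],
}


def geo_region_query(product, time_range, region):
    """Return (object ID, bounding box) pairs overlapping the time and region."""
    t0, t1 = time_range
    south, west, north, east = region
    hits = []
    for granule in CATALOG:
        g_south, g_west, g_north, g_east = granule["box"]
        g_start, g_end = granule["time"]  # ISO strings in one format sort correctly
        if (granule["product"] == product
                and g_start < t1 and t0 < g_end
                and g_south <= north and south <= g_north
                and g_west <= east and west <= g_east):
            hits.append((granule["id"], granule["box"]))
    return hits


def find_data_by_id(object_ids):
    """Translate object IDs into lists of replica URLs (possibly empty)."""
    return {oid: list(REPLICAS.get(oid, [])) for oid in object_ids}


dataset = geo_region_query("MOD04_L2",
                           ("2003-01-02T00:00:00", "2003-01-03T00:00:00"),
                           (-5.0, 105.0, 5.0, 110.0))
urls = find_data_by_id([object_id for object_id, _ in dataset])
print(dataset)
print(urls)
```

The dataset itself stays a list of IDs; the URL table can change as replicas appear or disappear without invalidating the dataset definition.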
23GeoRegionQuery → Dataset
Space/Time Query for MODIS Dataset
- Inputs to GeoRegionQuery (MODIS)
  - MOD04, L2, VersionId
  - TimeRange (start, end)
  - GeoRegion (lat/lon rectangle)
  - Response Group (Small, M, or L)
- GeoRegionQuery → List of Object IDs and GeoLocation Metadata
- FindDataById → List of object URLs (multiple URLs for each object)
24GeoRegionQuery → Datasets
Query for Aerosol Optical Depth Variable: Match Multiple Instruments/Models
- GeoRegionQuery For AOD
  - Query for AERONET L2
  - Query for MISR L2
  - Query for MODIS L2
  - Query for IMPACT Model Grid
- Merged AOD Dataset
25Example Workflow: Locate 4 Datasets
Space/Time Query for GPS, AIRS, MODIS
Plot Data Locations
26Example Workflow: Locate 4 Datasets
27Plot of Granule Bounding Boxes
Space/Time Query for GPS, AIRS, MODIS
28iEarth Flowchart
Space/Time Query for GPS, AIRS, MODIS
29SciFlo's Innovative Use of IT
- Pervasive use of XML metadata
  - Computers are stupid because we don't give them enough semantic metadata!
  - XML metadata more important than files of numbers
  - XML Microformats: micro for easy adoption
  - Ontologies: concept maps / taxonomies for science domains
- Intentional Programming
  - Visual Programming (in your web browser!)
  - Declare the workflow in an XML document: minimize code
  - Programming at a human (conceptual) level
  - Powerful high-level programming language (Python)
- Data Stewardship thru Permanent Names
  - Permanent names are underutilized on the Internet
  - Product Query → URIs → URLs → Data Access
  - URI: us:gov:nasa:eos:AIRS:AIRS.2003.01.02.004.L2.RetStd
  - ECHO ItemIds can serve as a form of URI
30Visual Dataflow Programming
GPS AIRS Level-2 Space/Time Matchup
- Connect a series of services and operators into a dataflow
  - Drag services/operators from menu, and drop onto the canvas
  - Lay out the flowchart by moving nodes
  - Connect the input/output ports by drawing lines
  - User guided by matching up port names and types
(AJAX JavaScript and SVG programming by Gerald Manipon and Wilson)
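The port-matching idea can be sketched without any GUI: each node declares named, typed input ports and an output type, a connection is accepted only when the types agree, and the flow is evaluated by pulling values from upstream nodes. The `Node`, `connect`, and `evaluate` names and the toy operators below are assumptions for illustration; they are not SciFlo's workflow engine.

```python
class Node:
    """A service or operator with typed input ports and one typed output."""

    def __init__(self, name, func, inputs, output_type):
        self.name = name
        self.func = func
        self.inputs = dict(inputs)      # port name -> expected type
        self.output_type = output_type
        self.wires = {}                 # port name -> upstream Node


def connect(src, dst, port):
    """Wire src's output into dst's input port, checking that the types match."""
    if port not in dst.inputs:
        raise KeyError(f"{dst.name} has no input port {port!r}")
    if src.output_type != dst.inputs[port]:
        raise TypeError(f"{src.name} output does not match {dst.name}.{port}")
    dst.wires[port] = src


def evaluate(node, cache=None):
    """Run the dataflow ending at `node`, evaluating each upstream node once."""
    cache = {} if cache is None else cache
    if node.name not in cache:
        missing = set(node.inputs) - set(node.wires)
        if missing:
            raise ValueError(f"{node.name} has unconnected ports: {sorted(missing)}")
        args = {port: evaluate(up, cache) for port, up in node.wires.items()}
        cache[node.name] = node.func(**args)
    return cache[node.name]


# Toy stand-ins for two space/time queries and a matchup operator.
gps = Node("gps_query", lambda: [1, 2, 3], {}, list)
airs = Node("airs_query", lambda: [2, 3, 4], {}, list)
matchup = Node("matchup", lambda gps, airs: sorted(set(gps) & set(airs)),
               {"gps": list, "airs": list}, list)
count = Node("count", lambda items: len(items), {"items": list}, int)

connect(gps, matchup, "gps")
connect(airs, matchup, "airs")
connect(matchup, count, "items")
print(evaluate(matchup), evaluate(count))
```

The same type check is what a visual editor could use to guide the user toward compatible ports while lines are being drawn.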