DiGIR: Distributed Generic Information Retrieval


1
DiGIR: Distributed Generic Information Retrieval
  • Stan Blum, Dave Vieglais, P.J. Schwartz

2
Project Goals
  • To define a protocol for retrieving structured
    data from multiple, heterogeneous databases
  • To build a reference implementation of said
    protocol

3
Design Goals
  • To use open protocols and standards, such as
    HTTP, XML, and UDDI to leverage existing and
    emerging technologies
  • To de-couple the protocol, software and semantics
  • To automate the establishment of a new data
    provider as much as possible

4
High-level Architecture
  • Protocol
  • Provider
  • Portal
  • Registry

5
Protocol
  • Defines request and response message formats for
    communication between Provider and Portal
  • Assumes Providers conform to a known federation
    schema
  • Remains flexible to allow for federation schema
    pluggability

6
Provider
  • Makes structured data available to portals
  • Communicates via protocol compliant messaging
    only
  • Complies with a known federation schema
  • Supplies meta-data to describe data
    classification and availability

7
Portal
  • The entry point for a user
  • Can make requests of any number of providers
  • Communicates via protocol compliant messaging
    only
  • Queries registry for available providers
  • Can determine, based on provider meta-data,
    whether a provider should be queried

8
Project Information
  • The DiGIR project is a collaborative effort
  • DiGIR is currently established as an open source
    project on SourceForge (http://sourceforge.net).
  • Further documentation is available on the
    SourceForge site.
  • Please join us in collaborating!

9
Protocol Details

10
Protocol Details
  • Specified in an XML Schema (.xsd)
  • Intended to work in conjunction with federation
    schemas, also expressed as XML Schemas
  • Actual request and response documents are
    instance documents conforming to both the
    protocol schema and a federation schema

11
  <request xmlns="http://www.namespaceTBD.org/digir"
           xmlns:darwin="http://www.namespaceTBD.org/darwin"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                               http://www.namespaceTBD.org/darwin darwin.xsd">
    <header>
      <requestType>search</requestType>
    </header>
    <search>
      <dbName>myDiggableBipesDB</dbName>
      <filter>
        <and>
          <in>
            <list xsi:type="darwin:list">
              <darwin:Month>11</darwin:Month>
              <darwin:Month>12</darwin:Month>
            </list>
          </in>
          <equals>
            <darwin:Genus>Bipes</darwin:Genus>
          </equals>
        </and>
      </filter>

12
Request Explanation
  • Composed of elements from the protocol namespace
    (default) and the schema namespace
  • <header> contains information about the payload
  • <search> contains dbName, filter, and record
    specification (will also specify result format)
  • <filter> is effectively an XML representation of
    a SQL WHERE clause
  • This search request is for the first 50 specimen
    records that are genus Bipes and were found in
    the months of November or December.
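
The idea that <filter> mirrors a SQL WHERE clause can be illustrated with a minimal Python sketch. It is not the DiGIR reference implementation: the COLUMNS mapping, the column names, and the function names are hypothetical, the namespaces are the placeholders shown on the slides, and <andNot>/<orNot> are left unsupported because the slides do not define their semantics.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces as shown on the slides.
DIGIR = "http://www.namespaceTBD.org/digir"
DARWIN = "http://www.namespaceTBD.org/darwin"

# Hypothetical mapping from federation concepts to local column names.
COLUMNS = {"Month": "month_collected", "Genus": "genus"}

LOGICAL = {"and": "AND", "or": "OR"}
COMPARISON = {
    "equals": "=", "notEquals": "<>", "like": "LIKE",
    "lessThan": "<", "lessThanOrEquals": "<=",
    "greaterThan": ">", "greaterThanOrEquals": ">=",
}

EXAMPLE_FILTER = f"""
<filter xmlns="{DIGIR}" xmlns:darwin="{DARWIN}">
  <and>
    <in>
      <list>
        <darwin:Month>11</darwin:Month>
        <darwin:Month>12</darwin:Month>
      </list>
    </in>
    <equals>
      <darwin:Genus>Bipes</darwin:Genus>
    </equals>
  </and>
</filter>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return element.tag.rsplit("}", 1)[-1]


def to_sql(node, params):
    """Recursively turn one filter node into SQL, collecting bind values."""
    name = local_name(node)
    children = list(node)
    if name in LOGICAL:
        joined = f" {LOGICAL[name]} ".join(to_sql(c, params) for c in children)
        return f"({joined})"
    if name in COMPARISON:
        concept = children[0]
        params.append(concept.text)
        return f"{COLUMNS[local_name(concept)]} {COMPARISON[name]} ?"
    if name == "in":
        values = list(children[0])  # the concept elements inside <list>
        params.extend(v.text for v in values)
        marks = ", ".join("?" for _ in values)
        return f"{COLUMNS[local_name(values[0])]} IN ({marks})"
    raise ValueError(f"unsupported filter element: {name}")


def filter_to_where(filter_xml):
    """Return (where_clause, bind_values) for a <filter> document."""
    root = ET.fromstring(filter_xml)
    params = []
    return to_sql(root[0], params), params


where, values = filter_to_where(EXAMPLE_FILTER)
print(where, values)
```

Values are passed as bind parameters rather than spliced into the string, and concept names are resolved through a fixed mapping, so nothing from the request reaches the SQL text directly.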

13
Filter Building
  • LOPs (logical operators)
      • <and>
      • <or>
      • <andNot>
      • <orNot>
      • Can be nested
  • COPs (comparison operators)
      • <equals>
      • <lessThan>
      • <lessThanOrEquals>
      • <notEquals>
      • <greaterThan>
      • <greaterThanOrEquals>
      • <like>
      • <in> (multi value)
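
How logical and comparison operators nest can be sketched with a small filter builder in Python. The helper names (cop, in_list, lop, build_filter) are hypothetical, the namespaces are the slides' placeholders, and the xsi:type attribute that the request example puts on <list> is omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces as shown on the slides.
DIGIR = "http://www.namespaceTBD.org/digir"
DARWIN = "http://www.namespaceTBD.org/darwin"

LOPS = {"and", "or", "andNot", "orNot"}
COPS = {
    "equals", "notEquals", "like",
    "lessThan", "lessThanOrEquals",
    "greaterThan", "greaterThanOrEquals",
}


def q(namespace, name):
    """Build an ElementTree qualified name: '{namespace}name'."""
    return f"{{{namespace}}}{name}"


def cop(op, concept, value):
    """Single-value comparison, e.g. <equals><darwin:Genus>Bipes</...></equals>."""
    if op not in COPS:
        raise ValueError(f"not a comparison operator: {op}")
    node = ET.Element(q(DIGIR, op))
    ET.SubElement(node, q(DARWIN, concept)).text = str(value)
    return node


def in_list(concept, values):
    """Multi-value comparison: <in><list>...</list></in>."""
    node = ET.Element(q(DIGIR, "in"))
    lst = ET.SubElement(node, q(DIGIR, "list"))
    for value in values:
        ET.SubElement(lst, q(DARWIN, concept)).text = str(value)
    return node


def lop(op, *conditions):
    """Logical operator wrapping comparisons or other logical operators."""
    if op not in LOPS:
        raise ValueError(f"not a logical operator: {op}")
    node = ET.Element(q(DIGIR, op))
    node.extend(conditions)
    return node


def build_filter(condition):
    """Wrap the top-level condition in a <filter> element."""
    root = ET.Element(q(DIGIR, "filter"))
    root.append(condition)
    return root


example = build_filter(
    lop("and", in_list("Month", [11, 12]), cop("equals", "Genus", "Bipes"))
)
print(ET.tostring(example, encoding="unicode"))
```

Because lop accepts other lop results as arguments, arbitrarily deep nesting falls out of ordinary function composition.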

14
What binds the schemas?
  • The protocol schema defines various abstract
    types and elements
      <xsd:element name="searchCondition" abstract="true" />
      <xsd:element name="alphaSearchCondition" abstract="true"
                   substitutionGroup="searchCondition" />
      <xsd:complexType name="listType" abstract="true" />
      <xsd:complexType name="numericListType" abstract="true" />
  • A federation schema must define searchable
    concepts, or groups of them, as substitutable for
    these abstract elements or extensions of the
    abstract types
      <xsd:element name="Species" type="xsd:string"
                   substitutionGroup="digir:alphaSearchCondition" />

15
  <xsd:complexType name="list">
    <xsd:complexContent>
      <xsd:extension base="digir:listType">
        <xsd:sequence>
          <xsd:choice>
            <xsd:element ref="ScientificName" maxOccurs="unbounded" />
            <xsd:element ref="Kingdom" maxOccurs="unbounded" />
            <xsd:element ref="Phylum" maxOccurs="unbounded" />
            <xsd:element ref="Class" maxOccurs="unbounded" />
            <xsd:element ref="Order" maxOccurs="unbounded" />
            <xsd:element ref="Family" maxOccurs="unbounded" />
            <xsd:element ref="Genus" maxOccurs="unbounded" />
            <xsd:element ref="Species" maxOccurs="unbounded" />
            <...>
          </xsd:choice>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>

16
Why bind like this?
  • To provide data-typing (string, numeric, etc.)
    for various concepts within operators at an
    abstract level (e.g. LIKE is only valid for string
    data; IN allows for multiple values, but in a
    controlled fashion)
  • To allow for federation schemas to simply
    classify data as types without having to
    redefine/extend operators
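
The effect of this binding can be sketched in Python: operators are defined once against abstract types, and a federation schema only classifies its concepts. In a real deployment the XML Schema validator would enforce this through substitution groups; the table-driven check below only imitates that behaviour. The name numericSearchCondition is an assumption by analogy with numericListType (the slides show only searchCondition and alphaSearchCondition), and classifying Month as numeric is likewise assumed.

```python
from enum import Enum


class ConceptType(Enum):
    ALPHA = "alphaSearchCondition"      # shown on the slides
    NUMERIC = "numericSearchCondition"  # assumed, not shown on the slides


# Protocol side: which abstract types each operator accepts.
# Only the LIKE restriction is stated on the slides.
ANY_TYPE = frozenset(ConceptType)
OPERATOR_TYPES = {
    "equals": ANY_TYPE,
    "notEquals": ANY_TYPE,
    "lessThan": ANY_TYPE,
    "lessThanOrEquals": ANY_TYPE,
    "greaterThan": ANY_TYPE,
    "greaterThanOrEquals": ANY_TYPE,
    "in": ANY_TYPE,
    "like": frozenset({ConceptType.ALPHA}),
}

# Federation side: concepts are merely classified; no operator is redefined.
FEDERATION_CONCEPTS = {
    "Species": ConceptType.ALPHA,   # as in the Species example on the slides
    "Genus": ConceptType.ALPHA,     # assumed
    "Month": ConceptType.NUMERIC,   # assumed
}


def is_valid(operator, concept):
    """True if the operator may be applied to the concept's declared type."""
    return FEDERATION_CONCEPTS[concept] in OPERATOR_TYPES[operator]


print(is_valid("like", "Genus"), is_valid("like", "Month"))
```

Adding a concept to the federation table requires no change to the operator table, which is the decoupling the slide argues for.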

17
Request Issues
  • Do we need another abstract element such as
    dateSearchCondition?
  • What information will be useful in the header?
  • How should we specify the format of the results?
    What standard formats should be offered (e.g.
    brief, full)?
  • Will tblName be part of the meta-data required of
    providers?
  • What concepts of Darwin Core 2 are searchable?

18
Response Prototype
  <response xmlns="http://www.namespaceTBD.org/digir"
            xmlns:darwin="http://www.namespaceTBD.org/darwin"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.namespaceTBD.org/digir digir.xsd
                                http://www.namespaceTBD.org/darwin darwin.xsd">
    <header>
      <!-- contents TBD -->
    </header>
    <content>
      <record>
      </record>
    </content>
    <diagnostics>
    </diagnostics>
  </response>

19
Response Issues
  • How do we format and validate the response
    content?
  • What elements are needed for the <header>, if
    any?
  • Do we always have diagnostics, or only if there
    is an error?
  • Should a finite set of diagnostics be created and
    maintained in its own XML Schema? Will there
    ever be a diagnostic that is specific to a
    federation schema?

20
Provider Details

21
Provider Details
  • Implemented as a web application that answers
    questions
  • Interface is not specific to a particular
    information domain
  • No state information is recorded
  • Each request is treated as unique and
    uninfluenced by previous requests
  • Must always generate a valid response
  • Consists of four key components
      • Request handler
      • Filter handler
      • Result set cache
      • Response generator

22
Request Handler
  • Receives XML document
  • Validates document
  • Generates internal structures for further
    processing

23
Filter Handler
  • Internal structural representation of filter
    (query) structure
  • Responsible for generating a native query string
    for querying the database
  • Communicates with UDDI to obtain standard
    database definition
  • Custom configured to work with specific database
    implementation

24
Result Set Cache
  • Contains the results of applying a query
  • Responsible for generating the response records
    in the requested format
  • Somewhat directly integrated with the response
    generator

25
Response Generator
  • Generates the response XML document
  • Serializes the response header information
  • Serializes diagnostic information
  • Serializes the requested subset of records
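
A response generator along these lines might look like the following Python sketch, built around the response prototype and the rule that a provider must always generate a valid response. Several details are assumptions: the slides leave the record format and the <header> contents open, so records are serialized here as flat darwin-namespaced elements, the header is left empty, and the <diagnostic> child element and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces as shown on the slides.
DIGIR = "http://www.namespaceTBD.org/digir"
DARWIN = "http://www.namespaceTBD.org/darwin"
ET.register_namespace("digir", DIGIR)
ET.register_namespace("darwin", DARWIN)


def q(namespace, name):
    """Build an ElementTree qualified name: '{namespace}name'."""
    return f"{{{namespace}}}{name}"


def generate_response(records, diagnostics):
    """Serialize header, records, and diagnostics into one <response>."""
    root = ET.Element(q(DIGIR, "response"))
    ET.SubElement(root, q(DIGIR, "header"))  # contents TBD on the slides
    content = ET.SubElement(root, q(DIGIR, "content"))
    for record in records:
        rec = ET.SubElement(content, q(DIGIR, "record"))
        for concept, value in record.items():
            ET.SubElement(rec, q(DARWIN, concept)).text = str(value)
    diag = ET.SubElement(root, q(DIGIR, "diagnostics"))
    for message in diagnostics:
        # <diagnostic> is a hypothetical child element.
        ET.SubElement(diag, q(DIGIR, "diagnostic")).text = message
    return ET.tostring(root, encoding="unicode")


def safe_response(run_query):
    """Always return a response document, even when the query fails."""
    try:
        return generate_response(run_query(), [])
    except Exception as exc:
        return generate_response([], [f"query failed: {exc}"])


print(safe_response(lambda: [{"Genus": "Bipes", "Month": 11}]))
```

Routing failures into <diagnostics> instead of raising keeps the provider stateless from the portal's point of view: every request yields exactly one parseable document.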

26
Provider Configuration

27
Portal Details

28
Portal Details
  • Divided into two distinct components: a
    presentation layer and PortalServices
  • The presentation layer supports the UI and
    translates requests (HTTP requests from forms or
    links) into protocol compliant XML requests
  • The presentation layer also handles all display
    issues involving the responses, such as format,
    sorting, collating, etc.
  • The presentation layer is envisioned to be an
    application server/web server implementation

29
Portal Details
  • PortalServices handles all external network
    activity (UDDI calls, provider calls, etc.)
  • PortalServices limits provider calls to those
    necessary based on provider meta-data
  • PortalServices threads provider calls for
    increased performance (i.e. response time)
  • PortalServices is envisioned to be a webapp and
    supporting classes running within an application
    server, such as Tomcat
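
The two behaviours described for PortalServices, filtering providers by meta-data and threading the remaining calls, can be sketched in Python as follows. The meta-data shape (a set of supported concepts per provider) is an assumption, since the slides list the filtering rules as an open issue, and the send callable stands in for whatever HTTP transport a portal would actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def should_query(provider, requested_concepts):
    """Assumed rule: query only providers that support every requested concept."""
    return requested_concepts <= provider["concepts"]


def query_providers(providers, request_xml, requested_concepts, send, max_workers=8):
    """Send the request to each relevant provider in parallel.

    Returns {provider name: response, or the exception that the call raised}.
    """
    selected = [p for p in providers if should_query(p, requested_concepts)]
    if not selected:
        return {}
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {p["name"]: pool.submit(send, p, request_xml) for p in selected}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing provider should not sink the whole search.
                results[name] = exc
    return results


# Hypothetical provider meta-data and a fake transport for demonstration.
PROVIDERS = [
    {"name": "herps", "concepts": {"Genus", "Month"}},
    {"name": "plants", "concepts": {"Genus"}},
]


def fake_send(provider, request_xml):
    return f"<response from='{provider['name']}'/>"


print(query_providers(PROVIDERS, "<request/>", {"Genus", "Month"}, fake_send))
```

Passing the transport in as a parameter keeps the sketch runnable without a network and separates provider selection from the mechanics of making a call.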

30
PortalServices
  • RegistryAccess
  • ProviderCache
  • PortalConfig
  • PortalServlet
  • PortalRequestHandler
  • ProviderFilterer
  • Marshallers

31
Portal Issues
  • What information will be stored in UDDI about a
    provider?
  • What information will be known for communicating
    with a Provider (e.g. IP address, port, etc.)?
  • What meta-data will be provided and what are the
    rules for using such data for provider filtering?
  • What requirements are there for logging and
    monitoring?