Data Information Service based on Open Archives Initiative Protocols and Apache Lucene

Learn more at: https://www.panfmp.org

1
Data Information Service based on Open Archives
Initiative Protocols and Apache Lucene
  • Uwe Schindler (1), Benny Bräuer (2), Michael
    Diepenbroek (1)
  • (1) PANGAEA Group at MARUM, University of Bremen,
    Bremen, Germany
  • (2) Alfred Wegener Institute for Polar and Marine
    Research, Bremerhaven, Germany

2
Metadata Portals Grid
  • WDC-MARE with its information system PANGAEA
    currently provides data portals for several
    EU/international projects
  • Not all data are stored centrally, so the
    datasets offered in a portal must be consolidated
    from different sources!
  • Features
      • Data stays at the data providers
      • Metadata is harvested by the portal
      • Search queries are handled by the centralized
        catalogue (Google-like search speed!)
      • The scientist gets a link to the data at the
        provider
  • The metadata portal software is sufficient for
    C3-Grid, too!

3
Metadata in C3-Grid
  • Goal: build up an infrastructure for the earth system
    community in Germany
  • Problem: we need an architecture that makes it
    possible to
      • collect metadata files from data providers
      • store them in a central index
      • provide fast, generic access to this data for
        our users

Solution: Data Information Service
4
Open Archives Protocol
  • The Open Archives Initiative Protocol for
    Metadata Harvesting (OAI-PMH) is a protocol
    developed by the Open Archives Initiative.
  • Almost all digital libraries support it (the
    best-known being Fedora, arXiv and the CERN
    Document Server)
  • Portals by Scientific Commons, OAIster, SUB
  • uses it during web crawling (if
    available)
  • Very simple to implement (XML over HTTP, REST style)
  • Repository software for databases or file system
    metadata providers is widely available (C3 mostly
    uses the DLESE jOAI software on the data provider
    side)
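
To make "XML over HTTP" concrete, here is a minimal Python sketch of one harvesting step. The base URL, the record identifiers and the token value are invented for illustration; the `ListRecords` verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace are taken from the OAI-PMH 2.0 specification. It parses a canned response instead of contacting a real repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def list_records_url(base_url, metadata_prefix=None, resumption_token=None):
    """Build a ListRecords request: a plain HTTP GET with query arguments."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all others.
        args = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(args)

def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [e.text for e in
           root.iterfind(".//oai:record/oai:header/oai:identifier", OAI_NS)]
    token = root.findtext(".//oai:resumptionToken", default=None,
                          namespaces=OAI_NS)
    return ids, (token or None)  # an empty token marks the last page

# Invented sample response, shortened to the parts used above.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:ABC123</identifier></header></record>
    <record><header><identifier>oai:example.org:XYZ223</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_list_records(SAMPLE_RESPONSE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would repeat the request with each returned token until a page arrives without one.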

6
Central indexing requirements
  1. Open for any XML metadata format
  2. Any mappings to document fields should be done by
    XPath
  3. Possibility to map incompatible XML schemas
    during harvesting by XSLT on-the-fly
  4. On-the-fly validation of (transformed) documents
    during harvesting
  5. No relational database, only a full-text search
    engine that contains everything needed for
    operation
  6. Range queries on specific fields (date/time or
    numeric)
  7. Web service interface / programming API for the
    end user interface that is accessible from any
    language (Java/JSP, PHP, Perl,...)
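
Requirement 2 can be pictured as a small configuration table that maps index field names to XPath expressions. The Python sketch below is only an illustration: the element names in the sample record and the field names are invented, and the standard library's ElementTree supports only a subset of XPath, so a real system would likely use a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping: index field name -> XPath into the metadata record.
FIELD_XPATHS = {
    "identifier": "./identifier",
    "author": "./citation/author",
    "min_lat": "./coverage/southBound",
}

def extract_fields(xml_text, field_xpaths=FIELD_XPATHS):
    """Evaluate every configured XPath and collect the text values per field."""
    root = ET.fromstring(xml_text)
    doc = {}
    for field, xpath in field_xpaths.items():
        values = [e.text.strip() for e in root.findall(xpath)
                  if e.text and e.text.strip()]
        if values:  # fields without a match are simply left out
            doc[field] = values
    return doc

# Invented metadata record, not one of the formats used in C3-Grid.
SAMPLE_RECORD = """<dataset>
  <identifier>ABC123</identifier>
  <citation><author>A. Author</author><author>B. Author</author></citation>
  <coverage><southBound>-23.23</southBound></coverage>
</dataset>"""

fields = extract_fields(SAMPLE_RECORD)
```

Supporting a new metadata format then only means supplying a different mapping table; the harvesting and indexing code stays the same.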

7
Apache Lucene features
  • Ranked searching: best results returned first
  • Many powerful query types: phrase queries,
    wildcard queries, proximity queries, range
    queries for date/time values and numbers
  • Fielded searching: all fields are searchable as a
    whole, each field separately (e.g. for author,
    parameter), or mixed
  • Any combination of boolean operators between
    search terms (AND, OR, NOT, exact phrase)
  • Sorting by any field
  • Multiple-index searching with merged results
  • Simultaneous searching and updates due to
    high-performance indexing
8
Generic Framework
10
Search Interface
  • Supports all standard Lucene search features
  • Additional support for fast range queries to
    enable bounding boxes, etc.
      • implemented by storing each numerical term
        redundantly at several precisions
      • a range query recursively reduces the number of
        distinct terms it has to visit (every numerical
        value is a term)
      • search time is no longer dependent on index size
  • Accessible via the Java API or an Axis web service
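
The range-query idea above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the authors' implementation: values are non-negative integers, the precisions are decimal digits (chosen only for readability; a binary variant would shift bits instead), and the function names are made up.

```python
def index_terms(value, base=10, max_level=3):
    """Terms stored for one value: the value itself plus coarser prefixes.
    Example: 423 -> (0, 423), (1, 42), (2, 4), (3, 0)."""
    return [(level, value // base ** level) for level in range(max_level + 1)]

def split_range(lo, hi, base=10, max_level=3):
    """Cover the inclusive range [lo, hi] with few (level, prefix) terms.
    A term (level, p) matches every value v with v // base**level == p."""
    terms = []
    level = 0
    while lo <= hi:
        if level == max_level:
            # Coarsest precision available: list the remaining prefixes.
            terms.extend((level, p) for p in range(lo, hi + 1))
            break
        # Lower edge: prefixes that do not start a full block stay at this level.
        while lo <= hi and lo % base != 0:
            terms.append((level, lo))
            lo += 1
        # Upper edge: prefixes that do not end a full block stay at this level.
        while lo <= hi and hi % base != base - 1:
            terms.append((level, hi))
            hi -= 1
        if lo > hi:
            break
        # The rest consists of whole blocks: describe it one level coarser.
        lo //= base
        hi //= base
        level += 1
    return terms

def matches(value, query_terms):
    """A value is in the range if any of its stored terms is a query term."""
    return any(t in query_terms for t in index_terms(value))

query = set(split_range(423, 642))
print(len(query))  # 22 terms instead of 220 distinct values
```

The query for 423 to 642 uses full precision only at the edges (423 to 429 and 640 to 642), tens in between (43 to 49 and 60 to 63) and a single hundreds term (5). The number of terms is bounded by the base and the number of levels, not by how many distinct values are stored in the range.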

12
C3 Implementation
[Figure: C3 implementation architecture. Fig. by T. Langhammer, ZIB]
Labels in the figure:
  • Portal: Google-style and range queries
  • DIS: web service frontend, Apache Lucene index
    (full-text index, indexing of selected fields),
    document cache, harvesting backend
  • OAI-PMH: Metadata1.xml, Metadata2.xml, Metadata3.xml,
    Metadata4.xml, ...
  • Data providers: CERA, PANGAEA, Other Data Provider
Example contents of the Apache Lucene index:
  Field       Term       Document
  identifier  ABC123     2
  identifier  XYZ223     6
  identifier  MI6007     12
  abstract    region     2,23,112
  abstract    pressure   3,23
  abstract    humid      4,33,215
  min_lat     030.43     1
  min_lat     -023.23    2
  data_uri    http://... 4
13
Future
[Figure labels: assemble workflow, data query, processing,
workflow query, metadata of workflows, metadata of data]
14
Thank You!
  • Software will be available soon as open source on
    SourceForge.net!
  • News: http://wiki.pangaea.de/wiki/Portal