Metacat: metadata and data management

1
Metacat: metadata and data management
  • KNB Data Management Tools Workshop
  • Matthew Jones
  • National Center for Ecological Analysis and
    Synthesis
  • University of California, Santa Barbara

2
Metacat
  • Flexible storage system for storing and accessing
    metadata and data

3
Roadmap
  • Part I: Introduction to Metacat capabilities
  • Overview
  • Metacat web interface
  • Registries and Repositories
  • Features
  • Questions
  • Part II: Metacat Design and Architecture
  • Architecture overview
  • Storage subsystem
  • Query subsystem
  • Transformation subsystem
  • Authentication subsystem
  • Validation subsystem
  • Client API

4
Metacat
  • Flexible storage system for metadata and data
  • Stores arbitrary metadata documents (requires
    XML)
  • Supports structured searches
  • Customizable web interface
  • Replication capabilities
  • Works on Linux, Windows, MacOS
  • Oracle, Postgres, MS SQL Server

5
KNB Overview
[Figure: overview diagram with the labels Clients and Server]
6
Building the KNB network
[Figure: network diagram. Key: Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML output filter]
7
Knowledge Network for Biocomplexity
  • LTER Network (24)
  • Organization of Biological Field Stations (180)
  • UC Natural Reserve System (36)
  • Partnership for Interdisciplinary Studies of
    Coastal Oceans (4)
  • Multi-agency Rocky Intertidal Network (60)
[Figure: map. Key: Metacat node, Site-specific node]
8
Metacat web user interface
9
Metacat UI is reconfigurable
10
Data Registries
  • UC NRS Information System
  • NRS Network, NCEAS
  • Resource Discovery Initiative for Field Stations
    (RDIFS)
  • LTER Network, OBFS Network, NCEAS, San Diego
    Supercomputer Center, University of Kansas
  • Use metacat
  • Web-based metadata entry

11
Metacat features
  • Metadata
  • Store and search any XML-formatted metadata
  • Ecological Metadata Language (EML)
  • NBII Biological Data Profile
  • FGDC CSDGM
  • Site specific formats
  • Metadata validation
  • Configure to accept particular metadata formats
  • Enforces access control rules
  • Metadata conversion (using XSLT)
  • To HTML for presentation
  • To other metadata formats (e.g., NBII)
  • Data
  • Storage
  • Access control

12
Metacat implementation
  • Java servlet for portability
  • Linux, Windows 2000, MacOS X
  • HTTP access
  • Standard POST and GET queries
  • Web HTML interface via XSLT transforms
  • Separates content from presentation
  • Interfaces with RDBMS for storage
  • Oracle, PostgreSQL, (SQL Server) backend

13
Advanced features
  • Replication
  • Synchronize content between 2 metacat servers
  • Harvesting
  • Scheduled pull of XML documents from web
    sources

14
Questions
  • Discussion?
  • Questions/Comments?

16
Metacat architecture
18
Storage subsystem
  • Storage
  • XML metadata stored in relational db
  • Oracle, PostgreSQL, (SQL Server)
  • Data object storage on filesystem
  • Data viewed as opaque objects
  • Assigned unique identifier
  • Metadata describes data structure and semantics

19
Versioning and Identifiers
  • Metacat prescribes a format for identifiers
  • Identifiers are a contract regarding uniqueness
  • Incorporates both identity and version
  • Two data streams with the same ID are defined as
    identical
  • insert requires a unique ID
  • update requires an existing ID with a new
    revision
  • Will be adopting Life Science Identifiers (LSID)
  • E.g., urnlsidecoinformatics.org263

Identifier format: scope.identifier.revision
Example: obfs.23.4 (scope obfs, identifier 23, revision 4)
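
The three-part identifier can be taken apart mechanically. The sketch below is a hypothetical helper, not part of the Metacat API. It assumes that the last two dot-separated parts are always the numeric identifier and revision, and that a new revision is simply the old one plus one; the slides only say that an update needs "a new revision".

```java
// Hypothetical helper (not Metacat code): splits scope.identifier.revision.
final class DocIdSketch {
    final String scope;
    final int identifier;
    final int revision;

    private DocIdSketch(String scope, int identifier, int revision) {
        this.scope = scope;
        this.identifier = identifier;
        this.revision = revision;
    }

    static DocIdSketch parse(String docid) {
        // Split from the right so a scope such as "knb-lter-gce" stays intact.
        int lastDot = docid.lastIndexOf('.');
        int prevDot = docid.lastIndexOf('.', lastDot - 1);
        if (lastDot < 0 || prevDot <= 0) {
            throw new IllegalArgumentException(
                    "expected scope.identifier.revision: " + docid);
        }
        String scope = docid.substring(0, prevDot);
        int identifier = Integer.parseInt(docid.substring(prevDot + 1, lastDot));
        int revision = Integer.parseInt(docid.substring(lastDot + 1));
        return new DocIdSketch(scope, identifier, revision);
    }

    // Assumption: an update keeps scope and identifier and adds one to the revision.
    DocIdSketch nextRevision() {
        return new DocIdSketch(scope, identifier, revision + 1);
    }

    @Override
    public String toString() {
        return scope + "." + identifier + "." + revision;
    }

    public static void main(String[] args) {
        DocIdSketch id = parse("obfs.23.4");
        System.out.println(id.scope + " / " + id.identifier + " / " + id.revision);
        System.out.println("next revision: " + id.nextRevision());
    }
}
```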
20
Storage actions
  • Read
  • Download a document
  • Insert
  • Put a new XML document in the database
  • Update
  • Replace an existing XML document with a new
    version, incrementing the revision in the
    identifier
  • Delete
  • Archive a document so that it does not show up in
    searches
  • Upload
  • Put a binary or other non-XML file in the file
    system

21
Reading a document
  • Simple HTTP GET or POST request
  • http://a.com/knb/metacat?action=read&docid=knb-lter-gce.109.5
  • Return document is in XML format by default
  • Login is optional
  • If you don't log in, you have public privileges
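
A read request of this kind can be assembled in a few lines. This is a minimal sketch, not the Metacat client library: the class and method names are made up for illustration, and a.com is the placeholder host from the slide, so the fetch is only attempted when an argument is passed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not Metacat code): builds and sends a read request.
final class ReadRequestSketch {
    // Builds <metacatUrl>?action=read&docid=<docid> with the docid URL-encoded.
    static String readUrl(String metacatUrl, String docid) {
        return metacatUrl + "?action=read&docid="
                + URLEncoder.encode(docid, StandardCharsets.UTF_8);
    }

    // Sends a plain GET without logging in, i.e. with public privileges.
    static String fetch(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        String url = readUrl("http://a.com/knb/metacat", "knb-lter-gce.109.5");
        System.out.println(url);
        // Only contact the server when asked to; the host above is a placeholder.
        if (args.length > 0) {
            System.out.println(fetch(url));
        }
    }
}
```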

22
A simple web client
23
The simple client html code
    <form action="/knb/metacat" target="right" method="POST">
      <strong>1. Choose an action </strong>
      <input type="radio" name="action" value="insert" checked> Insert
      <input type="radio" name="action" value="update"> Update
      <input type="radio" name="action" value="delete"> Delete
      <input type="submit" value="Process Action">
      <br />
      <strong>2. Provide a Document ID </strong>
      <input type="text" name="docid">
      <br />
      <strong>3. Provide XML text </strong> (not needed for Delete)
      <br />
      <textarea name="doctext" cols="65" rows="15"></textarea>
      <br />
      <strong>4. Provide DTD text for upload </strong> (optional; not needed for Delete)
      <br />
      <textarea name="dtdtext" cols="65" rows="15"></textarea>
      <br />
    </form>

24
Schema-independence
  • Most relational dbs support only one data model
  • Makes maintenance expensive as models change
  • Metacat is schema independent
  • Any XML document, regardless of schema, can be
    stored without modifications to metacat
  • Metacat's data model follows the XML Document
    Object Model (DOM)
  • Thus, it models the XML structure rather than the
    data schema

25
DOM
  • DOM models hierarchical element and attribute
    structure of XML

Node types: element, attribute, text

eml (element)
  packageId (attribute): knb.1.2 (text)
  dataset
    title: Soil profiles in 1982
    creator
      individualName
        givenName: Matthew
        surName: Jones
26
Metacat's Data Model
  • Metacat data model is a recursive structure
  • Gains in flexibility offset by performance
    penalties (we're working on this)
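
One way to picture a recursive, schema-independent model is to flatten a DOM tree into rows in which every node points at its parent. The sketch below does that with the standard Java DOM parser. The row layout (id, parentId, type, name, value) is an illustrative assumption and is not taken from Metacat's actual database schema.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical illustration: one row per XML node, linked by parentId.
final class NodeRowsSketch {
    static final class Row {
        final int id;
        final int parentId; // 0 means "no parent" (the root element)
        final String type;
        final String name;
        final String value;

        Row(int id, int parentId, String type, String name, String value) {
            this.id = id;
            this.parentId = parentId;
            this.type = type;
            this.name = name;
            this.value = value;
        }

        @Override
        public String toString() {
            return id + " parent=" + parentId + " " + type + " " + name + " " + value;
        }
    }

    static List<Row> flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<Row> rows = new ArrayList<>();
            visit(doc.getDocumentElement(), 0, rows);
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    // The same method handles every level of the tree; no schema knowledge is needed.
    private static void visit(Node node, int parentId, List<Row> rows) {
        int id = rows.size() + 1;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            rows.add(new Row(id, parentId, "ELEMENT", node.getNodeName(), null));
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                rows.add(new Row(rows.size() + 1, id, "ATTRIBUTE",
                        a.getNodeName(), a.getNodeValue()));
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                visit(c, id, rows);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE
                && !node.getNodeValue().trim().isEmpty()) {
            rows.add(new Row(id, parentId, "TEXT", null, node.getNodeValue().trim()));
        }
    }

    public static void main(String[] args) {
        String xml = "<eml packageId=\"knb.1.2\"><dataset>"
                + "<title>Soil profiles in 1982</title></dataset></eml>";
        for (Row row : flatten(xml)) {
            System.out.println(row);
        }
    }
}
```

Rebuilding a document from such rows means following the parent links level by level, which hints at why the flexibility comes with a performance cost.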

27
Yeah, so what?
  • You can throw whatever you need in metacat
  • (without schema or software changes)
  • And you can query it
    <?xml version="1.0"?>
    <poll>
      <favoriteOS id="1">MacOS</favoriteOS>
      <favoriteOS id="2">Linux</favoriteOS>
      <favoriteOS id="3">Linux</favoriteOS>
      <favoriteOS id="4">Linux</favoriteOS>
      <favoriteOS id="5">WinXP</favoriteOS>
      <favoriteOS id="6">Linux</favoriteOS>
      <favoriteOS id="7">Linux</favoriteOS>
    </poll>

29
Query subsystem
  • Two means of submitting queries
  • Query action (query)
  • Query parameters passed as url-encoded form
    parameters (i.e., an html form)
  • Metacat builds a pathquery document automatically
  • Structured query action (squery)
  • Custom query syntax in xml format (pathquery)

30
HTML form queries
  • Parameters passed as form elements
  • Special fields
  • action, qformat, operator
  • returnfield, returndoctype
  • anyfield
  • Other fields create additional conditions
  • Metacat builds the query from the fields

query
31
Example HTML form
    <form method="POST" action="@servlet-path@" target="_top">
      Search for
      <input name="action" value="query" type="hidden">
      <input name="operator" value="INTERSECT" type="hidden">
      <input name="anyfield" type="text" value="" size="14">
      <input name="organizationName" value="Organization of Biological Field Stations" type="hidden">
      <input name="qformat" value="xml" type="hidden">
      <input name="returnfield" value="creator/individualName/surName" type="hidden">
      <input name="returnfield" value="creator/individualName/givenName" type="hidden">
      <input name="returnfield" value="creator/organizationName" type="hidden">
      <input name="returnfield" value="dataset/title" type="hidden">
      <input name="returnfield" value="keyword" type="hidden">
      <input name="returndoctype" value="eml://ecoinformatics.org/eml-2.0.1" type="hidden">
      <input value="Start Search" type="submit">
    </form>
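
The same fields can be sent without a browser by encoding them as a form body. The following is a sketch under the assumption that the server accepts a standard application/x-www-form-urlencoded POST; the class is hypothetical and only builds the body string. A list is used rather than a map because returnfield appears more than once.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not Metacat code): builds a url-encoded form body.
final class FormQuerySketch {
    private final List<String> pairs = new ArrayList<>();

    // Field names may repeat, so each name=value pair is simply appended.
    FormQuerySketch add(String name, String value) {
        pairs.add(URLEncoder.encode(name, StandardCharsets.UTF_8)
                + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
        return this;
    }

    String body() {
        return String.join("&", pairs);
    }

    public static void main(String[] args) {
        String body = new FormQuerySketch()
                .add("action", "query")
                .add("operator", "INTERSECT")
                .add("anyfield", "soil")
                .add("qformat", "xml")
                .add("returnfield", "dataset/title")
                .add("returnfield", "keyword")
                .body();
        System.out.println(body);
    }
}
```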

32
Query Result Set Structure
    <?xml version="1.0"?>
    <resultset>
      <document>
        <docid>knb.2.1</docid>
        <docname>eml</docname>
        <doctype>eml://ecoinformatics.org/eml-2.0.1</doctype>
        <doctitle>Soil profiles from lower Yosemite Valley</doctitle>
        <createdate>2000-06-10 12:54:07</createdate>
        <updatedate>2000-06-10 12:54:07</updatedate>
        <param name="/eml/dataset/creator/individualName/surName">Levings</param>
        <param name="/eml/dataset/creator/individualName/surName">Shriver</param>
        <param name="/eml/dataset/keywordSet/keyword">strata</param>
        <param name="/eml/dataset/keywordSet/keyword">mineralization</param>
      </document>
      <document>
      </document>
    </resultset>
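
A client can read such a result set with any XML parser. The sketch below uses the standard Java DOM API and assumes only the element and attribute names visible on the slide (resultset, document, docid, param, name); the class itself and its embedded sample string are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper (not Metacat code): pulls values out of a resultset.
final class ResultSetSketch {
    // Shortened sample in the shape shown on the slide.
    static final String SAMPLE = "<resultset><document>"
            + "<docid>knb.2.1</docid>"
            + "<param name=\"/eml/dataset/keywordSet/keyword\">strata</param>"
            + "<param name=\"/eml/dataset/keywordSet/keyword\">mineralization</param>"
            + "</document></resultset>";

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse resultset", e);
        }
    }

    // Returns the text of every docid element, in document order.
    static List<String> docids(String resultsetXml) {
        List<String> ids = new ArrayList<>();
        NodeList nodes = parse(resultsetXml).getElementsByTagName("docid");
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent());
        }
        return ids;
    }

    // Returns the values of all param elements whose name attribute equals path.
    static List<String> paramValues(String resultsetXml, String path) {
        List<String> values = new ArrayList<>();
        NodeList params = parse(resultsetXml).getElementsByTagName("param");
        for (int i = 0; i < params.getLength(); i++) {
            Element param = (Element) params.item(i);
            if (path.equals(param.getAttribute("name"))) {
                values.add(param.getTextContent());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(docids(SAMPLE));
        System.out.println(paramValues(SAMPLE, "/eml/dataset/keywordSet/keyword"));
    }
}
```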

33
Structured queries
  • Pathquery syntax
  • Can build precise queries against arbitrary
    metadata schemas
  • Boolean combinations of conditions (AND, OR)
  • Uses Xpath-like syntax
  • Specify document types to search
  • Specify fields to return in resultset

squery
34
Query Conditions
  • Language independent representation of a query
    structure
  • Transformed into the appropriate native language
    of the data store
  • Example
    <querygroup operator="UNION">
      <queryterm searchmode="contains" casesensitive="false">
        <value>soil</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
      <queryterm searchmode="contains" casesensitive="false">
        <value>soil</value>
        <pathexpr>dataset/title</pathexpr>
      </queryterm>
    </querygroup>

35
Specifying the Resultset
  • Specify the list of fields to be returned in the
    resultset
  • Simple paths used to identify elements or
    document subtrees
  • Effectively flattens the structure of the
    records, but allows generic representation (i.e,
    multiple standards)
  • Example
    <returnfield>dataset/title</returnfield>
    <returnfield>creator/individualName/surName</returnfield>
    <returnfield>keyword</returnfield>

36
Full Query Example
    <?xml version="1.0"?>
    <pathquery version="1.2">
      <querytitle>Soil search</querytitle>
      <returndoctype>eml://ecoinformatics.org/eml-2.0.0</returndoctype>
      <returnfield>creator/individualName/surName</returnfield>
      <returnfield>keyword</returnfield>
      <querygroup operator="UNION">
        <queryterm searchmode="contains" casesensitive="false">
          <value>soil</value>
        </queryterm>
        <queryterm searchmode="contains" casesensitive="false">
          <value>soil</value>
        </queryterm>
      </querygroup>
    </pathquery>

37
Query Result Set Structure
    <?xml version="1.0"?>
    <resultset>
      <document>
        <docid>knb.2.1</docid>
        <docname>eml</docname>
        <doctype>eml://ecoinformatics.org/eml-2.0.1</doctype>
        <doctitle>Soil profiles from lower Yosemite Valley</doctitle>
        <createdate>2000-06-10 12:54:07</createdate>
        <updatedate>2000-06-10 12:54:07</updatedate>
        <param name="/eml/dataset/creator/individualName/surName">Levings</param>
        <param name="/eml/dataset/creator/individualName/surName">Shriver</param>
        <param name="/eml/dataset/keywordSet/keyword">strata</param>
        <param name="/eml/dataset/keywordSet/keyword">mineralization</param>
      </document>
      <document>
      </document>
    </resultset>

39
Transforming a document
  • Used to convert document before returning it
  • Conversion uses XSLT
  • Configuration of which style sheet to use is
    controlled by the skin via the qformat
    parameter
  • http://knb.ecoinformatics.org/knb/metacat?action=read&docid=knb-lter-gce.109.5&qformat=ltss
  • The document is converted before it is returned
  • The skinname.xml file controls the mappings
    that determine which style sheet to use
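
The conversion step itself is ordinary XSLT. As a minimal sketch, the code below applies a stylesheet to a document with the standard javax.xml.transform API. The tiny inline stylesheet is made up for illustration, and the lookup from a qformat value to a stylesheet through the skin configuration is not modeled here.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical illustration (not Metacat code): applies an XSLT stylesheet.
final class XsltSketch {
    // Illustrative stylesheet: renders the dataset title as a heading.
    static final String TITLE_TO_HTML =
            "<xsl:stylesheet version='1.0'"
            + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
            + "<xsl:template match='/'>"
            + "<h1><xsl:value-of select='/eml/dataset/title'/></h1>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    static String transform(String xml, String stylesheet) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(stylesheet)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<eml><dataset><title>Soil profiles in 1982</title></dataset></eml>";
        System.out.println(transform(xml, TITLE_TO_HTML));
    }
}
```

Because the stylesheet is separate from the stored document, the same record can be presented in different ways without touching the stored XML.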

41
Authentication subsystem
  • Actions: login and logout
  • Simple username/password system
  • Successful login creates a session
  • Session ID tracked using an HTTP cookie

42
Authentication plugins
  • Delegates authentication requests to a
    backend-service via a plugin
  • Lightweight Directory Access Protocol (LDAP)
  • Replaceable to interface with other systems
  • Metacat admin can choose which LDAP server to use

43
Ecoinformatics.org LDAP
  • Need for community-wide user identities
  • Distributed system for participating institutions
  • Root LDAP server refers requests to specific
    organizations for authentication

dc=ecoinformatics,dc=org
  o=NCEAS
  o=UCNRS
  o=LTER
  o=unaffiliated
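
User names in this scheme are LDAP distinguished names, so the organization that should handle a login can be read from the name itself. The sketch below assumes names of the standard form uid=jones,o=NCEAS,dc=ecoinformatics,dc=org and uses the JDK's javax.naming.ldap classes; the helper is hypothetical and does not show how Metacat or the root LDAP server actually route requests.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper (not Metacat code): finds the o= component of a DN.
final class LdapDnSketch {
    // Returns the organization named in the DN, or null if there is none.
    static String organization(String dn) {
        try {
            for (Rdn rdn : new LdapName(dn).getRdns()) {
                if (rdn.getType().equalsIgnoreCase("o")) {
                    return rdn.getValue().toString();
                }
            }
            return null;
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("not a valid distinguished name: " + dn, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(organization("uid=jones,o=NCEAS,dc=ecoinformatics,dc=org"));
    }
}
```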
45
Other subsystems
  • action=validate
  • valtext, docid
  • action=setaccess
  • docid, principal, permission, permType,
    permOrder, principal
  • action=getversion
  • action=getlog
  • ipaddress, principal, docid, event, start, end
  • http://68.111.43.225:8080/knb/metacat?action=getlog&event=insert

47
Client API
  • Application Programming Interface (API)
  • Defines language-specific binding for
    communicating with Metacat
  • Available in Java and Perl (partial python)
  • Allows development of new applications
  • Allows integration of metacat with existing
    applications
  • Simple set of method calls

48
Basic Client API
  • public String login(String username, String
    password)
  • public String logout()
  • public Reader read(String docid)
  • public Reader query(Reader xmlQuery)
  • public String insert(String docid, Reader
    xmlDocument, Reader schema)
  • public String update(String docid, Reader
    xmlDocument, Reader schema)
  • public String delete(String docid)
  • public String upload(String docid, File file)
  • public String upload(String docid, String
    fileName, InputStream fileData, int size)

49
Example use of client
    String metacatUrl = "http://foo.com/context/metacat";
    String username = "uid=jones,o=NCEAS,dc=ecoinformatics,dc=org";
    String password = "neverHarcodeAPasswordInCode";
    try {
        // Connect, authenticate, then read one document by its identifier
        Metacat m = MetacatFactory.createMetacatConnection(metacatUrl);
        m.login(username, password);
        Reader r = m.read("testdocument.1.1");
        // Do whatever you want with Reader r
    } catch (MetacatAuthException mae) {
        handleError("Authorization failed\n" + mae.getMessage());
    } catch (MetacatInaccessibleException mie) {
        handleError("Metacat Inaccessible\n" + mie.getMessage());
    } catch (Exception e) {
        handleError("General exception\n" + e.getMessage());
    }

50
Overview
51
Documentation
52
Acknowledgements
This material is based upon work supported by:
  • The National Science Foundation under Grant
    Numbers 9980154, 9904777, 0131178, 9905838,
    0129792, and 0225676
  • The National Center for Ecological Analysis and
    Synthesis, a Center funded by NSF (Grant Number
    0072909), the University of California, and the
    UC Santa Barbara campus
  • The Andrew W. Mellon Foundation
PBI Collaborators: NCEAS, University of New Mexico
(Long Term Ecological Research Network Office), San
Diego Supercomputer Center, University of Kansas
(Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC,
GEON