Title: Metacat: metadata and data management
1Metacat metadata and data management
- KNB Data Management Tools Workshop
- Matthew Jones
- National Center for Ecological Analysis and Synthesis
- University of California, Santa Barbara
2Metacat
- Flexible storage system for storing and accessing
metadata and data
3Roadmap
- Part I Introduction to Metacat capabilities
- Overview
- Metacat web interface
- Registries and Repositories
- Features
- Questions
- Part II Metacat Design and Architecture
- Architecture overview
- Storage subsystem
- Query subsystem
- Transformation subsystem
- Authentication subsystem
- Validation subsystem
- Client API
4Metacat
- Flexible storage system for metadata and data
- Stores arbitrary metadata documents (requires XML)
- Supports structured searches
- Customizable web interface
- Replication capabilities
- Works on Linux, Windows, MacOS
- Oracle, Postgres, MS SQL Server
5KNB Overview
Clients
Server
6Building the KNB network
Key: Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML output filter
7Knowledge Network for Biocomplexity
- LTER Network (24)
- Organization of Biological Field Stations (180)
- UC Natural Reserve System (36)
- Partnership for Interdisciplinary Studies of Coastal Oceans (4)
- Multi-agency Rocky Intertidal Network (60)
Key: Metacat node, Site-specific node
8Metacat web user interface
9Metacat UI is reconfigurable
10Data Registries
- UC NRS Information System
- NRS Network, NCEAS
- Resource Discovery Initiative for Field Stations (RDIFS)
- LTER Network, OBFS Network, NCEAS, San Diego Supercomputer Center, University of Kansas
- Use Metacat
- Web-based metadata entry
11Metacat features
- Metadata
- Store and search any XML-formatted metadata
- Ecological Metadata Language (EML)
- NBII Biological Data Profile
- FGDC CSDGM
- Site specific formats
- Metadata validation
- Configure to accept particular metadata formats
- Enforces access control rules
- Metadata conversion (using XSLT)
- To HTML for presentation
- To other metadata formats (e.g., NBII)
- Data
- Storage
- Access control
12Metacat implementation
- Java servlet for portability
- Linux, Windows 2000, MacOS X
- HTTP access
- Standard POST and GET queries
- Web HTML interface via XSLT transforms
- Separates content from presentation
- Interfaces with RDBMS for storage
- Oracle, PostgreSQL, (SQL Server) backend
13Advanced features
- Replication
- Synchronize content between 2 metacat servers
- Harvesting
- Scheduled pull of XML documents from web
sources
14Questions
- Discussion?
- Questions/Comments?
16Metacat architecture
18Storage subsystem
- Storage
- XML metadata stored in relational db
- Oracle, PostgreSQL, (SQL Server)
- Data object storage on filesystem
- Data viewed as opaque objects
- Assigned unique identifier
- Metadata describes data structure and semantics
19Versioning and Identifiers
- Metacat prescribes a format for identifiers
- Identifiers are a contract regarding uniqueness
- Incorporates both identity and version
- Two data streams with the same ID are defined as identical
- insert requires a unique ID
- update requires an existing ID with a new revision
- Will be adopting Life Science Identifiers (LSID)
- E.g., urn:lsid:ecoinformatics.org:263
- Example identifier: obfs.23.4 (scope = obfs, identifier = 23, revision = 4)
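The scope.identifier.revision layout can be sketched in code. This is a minimal illustration, not Metacat's own implementation: the `DocId` class and its methods are hypothetical names, and splitting from the right (so that a scope such as `knb-lter-gce` stays whole) is an assumption.

```java
// Hypothetical helper, not part of Metacat: models "scope.identifier.revision".
final class DocId {
    final String scope;
    final int identifier;
    final int revision;

    DocId(String scope, int identifier, int revision) {
        this.scope = scope;
        this.identifier = identifier;
        this.revision = revision;
    }

    // Split from the right so the scope is whatever precedes the last two dots
    // (an assumption made for this sketch).
    static DocId parse(String docid) {
        int lastDot = docid.lastIndexOf('.');
        int prevDot = docid.lastIndexOf('.', lastDot - 1);
        if (lastDot < 0 || prevDot < 1) {
            throw new IllegalArgumentException("expected scope.identifier.revision: " + docid);
        }
        String scope = docid.substring(0, prevDot);
        int identifier = Integer.parseInt(docid.substring(prevDot + 1, lastDot));
        int revision = Integer.parseInt(docid.substring(lastDot + 1));
        return new DocId(scope, identifier, revision);
    }

    // An update keeps the identity and moves to a new revision.
    DocId nextRevision() {
        return new DocId(scope, identifier, revision + 1);
    }

    @Override
    public String toString() {
        return scope + "." + identifier + "." + revision;
    }

    public static void main(String[] args) {
        DocId id = DocId.parse("obfs.23.4");
        System.out.println(id.scope + " / " + id.identifier + " / " + id.revision);
        System.out.println("next revision: " + id.nextRevision());
    }
}
```

Under this sketch, an insert would use a `DocId` not seen before, while an update would submit `nextRevision()` of an existing one.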
20Storage actions
- Read
- Download a document
- Insert
- Put a new XML document in the database
- Update
- Replace an existing XML document with a new version, incrementing the revision in the identifier
- Delete
- Archive a document so that it does not show up in searches
- Upload
- Put a binary or other non-XML file in the file system
21Reading a document
- Simple HTTP GET or POST request
- http://a.com/knb/metacat?action=read&docid=knb-lter-gce.109.5
- Return document is in XML format by default
- Login is optional
- If you don't log in, you have public privileges
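A read request is just a URL with `action` and `docid` parameters, so a client can assemble it with the standard library. The sketch below is illustrative only: `ReadUrl` and `build` are hypothetical names, and the host is the placeholder from the slide.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds the URL for a Metacat "read" action.
final class ReadUrl {
    static String build(String metacatUrl, String docid) {
        try {
            // URL-encode the docid so unusual characters cannot break the query string.
            return metacatUrl + "?action=read&docid=" + URLEncoder.encode(docid, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(build("http://a.com/knb/metacat", "knb-lter-gce.109.5"));
    }
}
```

The resulting string can be opened with any HTTP client; without a prior login the request runs with public privileges.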
22A simple web client
23The simple client html code
<form action="/knb/metacat" target="right" method="POST">
  <strong>1. Choose an action </strong>
  <input type="radio" name="action" value="insert" checked> Insert
  <input type="radio" name="action" value="update"> Update
  <input type="radio" name="action" value="delete"> Delete
  <input type="submit" value="Process Action">
  <br />
  <strong>2. Provide a Document ID </strong>
  <input type="text" name="docid">
  <br />
  <strong>3. Provide XML text </strong> (not needed for Delete)
  <br />
  <textarea name="doctext" cols="65" rows="15"></textarea>
  <br />
  <strong>4. Provide DTD text for upload </strong> (optional; not needed for Delete)
  <br />
  <textarea name="dtdtext" cols="65" rows="15"></textarea>
  <br />
</form>
24Schema-independence
- Most relational databases support only one data model
- Makes maintenance expensive as models change
- Metacat is schema independent
- Any XML document, regardless of schema, can be stored without modifications to Metacat
- Metacat's data model follows the XML Document Object Model (DOM)
- Thus, it models the XML structure rather than the data schema
25DOM
- DOM models hierarchical element and attribute
structure of XML
Node types: element (e.g., eml), attribute (e.g., packageId), text (e.g., knb.1.2)
eml (packageId = knb.1.2)
  dataset
    title: Soil profiles in 1982
    creator
      individualName
        givenName: Matthew
        surName: Jones
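To make the three node types concrete, the sketch below parses a small document with the JDK's DOM parser and lists each element, attribute, and text node with its path. The sample XML is a simplified stand-in for the tree on this slide (real EML documents are namespaced), and `DomWalk` is a hypothetical name. It only illustrates the DOM view of a document, not how Metacat lays out its tables.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo: walks a DOM tree and reports node type, path, and value.
final class DomWalk {
    static final String SAMPLE =
        "<eml packageId='knb.1.2'><dataset><title>Soil profiles in 1982</title>"
        + "<creator><individualName><givenName>Matthew</givenName>"
        + "<surName>Jones</surName></individualName></creator></dataset></eml>";

    // Recursive walk: the same routine handles any schema, because it only
    // looks at generic node types, never at specific element names.
    static void walk(Node node, String path, StringBuilder out) {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                String here = path + "/" + node.getNodeName();
                out.append("element ").append(here).append('\n');
                NamedNodeMap attrs = node.getAttributes();
                for (int i = 0; i < attrs.getLength(); i++) {
                    Node a = attrs.item(i);
                    out.append("attribute ").append(here).append("/@")
                       .append(a.getNodeName()).append(" = ")
                       .append(a.getNodeValue()).append('\n');
                }
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    walk(c, here, out);
                }
                break;
            }
            case Node.TEXT_NODE: {
                String text = node.getNodeValue().trim();
                if (!text.isEmpty()) {
                    out.append("text ").append(path).append(" = ").append(text).append('\n');
                }
                break;
            }
            default:
                break; // comments, processing instructions, etc. are ignored here
        }
    }

    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), "", out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(SAMPLE));
    }
}
```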
26Metacat's Data Model
- Metacat data model is a recursive structure
- Gains in flexibility offset by performance penalties (we're working on this)
27Yeah, so what?
- You can throw whatever you need in metacat
- (without schema or software changes)
- And you can query it
<?xml version="1.0"?>
<poll>
  <favoriteOS id="1">MacOS</favoriteOS>
  <favoriteOS id="2">Linux</favoriteOS>
  <favoriteOS id="3">Linux</favoriteOS>
  <favoriteOS id="4">Linux</favoriteOS>
  <favoriteOS id="5">WinXP</favoriteOS>
  <favoriteOS id="6">Linux</favoriteOS>
  <favoriteOS id="7">Linux</favoriteOS>
</poll>
29Query subsystem
- Two means of submitting queries
- Query action (query)
- Query parameters passed as URL-encoded form parameters (i.e., an HTML form)
- Metacat builds a pathquery document automatically
- Structured query action (squery)
- Custom query syntax in XML format (pathquery)
30HTML form queries
- Parameters passed as form elements
- Special fields
- action, qformat, operator
- returnfield, returndoctype
- anyfield
- Other fields create additional conditions
- Metacat builds the query from the fields
query
31Example HTML form
<form method="POST" action="@servlet-path@" target="_top">
  Search for
  <input name="action" value="query" type="hidden">
  <input name="operator" value="INTERSECT" type="hidden">
  <input name="anyfield" type="text" value="" size="14">
  <input name="organizationName" value="Organization of Biological Field Stations" type="hidden">

  <input name="qformat" value="xml" type="hidden">
  <input name="returnfield" value="creator/individualName/surName" type="hidden">
  <input name="returnfield" value="creator/individualName/givenName" type="hidden">
  <input name="returnfield" value="creator/organizationName" type="hidden">
  <input name="returnfield" value="dataset/title" type="hidden">
  <input name="returnfield" value="keyword" type="hidden">
  <input name="returndoctype" value="eml://ecoinformatics.org/eml-2.0.1" type="hidden">

  <input value="Start Search" type="submit">
</form>
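The same query can be sent without a browser by URL-encoding the form fields into a request body. The sketch below is a hedged illustration: `FormQuery` is a hypothetical helper, the field names are the ones shown on these slides, and the search term `soil` is only an example value.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects form fields and encodes them as
// application/x-www-form-urlencoded text. Repeated names (e.g. returnfield)
// are allowed, just as in the HTML form.
final class FormQuery {
    private final List<String> pairs = new ArrayList<String>();

    FormQuery add(String name, String value) {
        try {
            pairs.add(URLEncoder.encode(name, "UTF-8") + "=" + URLEncoder.encode(value, "UTF-8"));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
        return this;
    }

    String encode() {
        return String.join("&", pairs);
    }

    static String example() {
        return new FormQuery()
                .add("action", "query")
                .add("operator", "INTERSECT")
                .add("anyfield", "soil")
                .add("qformat", "xml")
                .add("returnfield", "dataset/title")
                .add("returnfield", "keyword")
                .add("returndoctype", "eml://ecoinformatics.org/eml-2.0.1")
                .encode();
    }

    public static void main(String[] args) {
        System.out.println(example());
    }
}
```

The encoded string would be sent as the body of a POST to the Metacat servlet URL, or appended after `?` for a GET.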
32Query Result Set Structure
<?xml version="1.0"?>
<resultset>
  <document>
    <docid>knb.2.1</docid>
    <docname>eml</docname>
    <doctype>eml://ecoinformatics.org/eml-2.0.1</doctype>
    <doctitle>Soil profiles from lower Yosemite Valley</doctitle>
    <createdate>2000-06-10 12:54:07</createdate>
    <updatedate>2000-06-10 12:54:07</updatedate>
    <param name="/eml/dataset/creator/individualName/surName">Levings</param>
    <param name="/eml/dataset/creator/individualName/surName">Shriver</param>
    <param name="/eml/dataset/keywordSet/keyword">strata</param>
    <param name="/eml/dataset/keywordSet/keyword">mineralization</param>
  </document>
  <document>
    ...
  </document>
</resultset>
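A client that receives this resultset can pull out the document IDs and returned fields with the JDK's DOM parser. The sketch below is illustrative: `ResultSetReader` is a hypothetical name, and the sample string is a shortened copy of the structure shown on this slide.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts values from a Metacat-style <resultset>.
final class ResultSetReader {
    static final String SAMPLE =
        "<?xml version=\"1.0\"?><resultset><document>"
        + "<docid>knb.2.1</docid><docname>eml</docname>"
        + "<param name=\"/eml/dataset/creator/individualName/surName\">Levings</param>"
        + "<param name=\"/eml/dataset/creator/individualName/surName\">Shriver</param>"
        + "<param name=\"/eml/dataset/keywordSet/keyword\">strata</param>"
        + "</document></resultset>";

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse resultset", e);
        }
    }

    // All <docid> values, in document order.
    static List<String> docids(String xml) {
        List<String> ids = new ArrayList<String>();
        NodeList nodes = parse(xml).getElementsByTagName("docid");
        for (int i = 0; i < nodes.getLength(); i++) {
            ids.add(nodes.item(i).getTextContent().trim());
        }
        return ids;
    }

    // All <param> values whose name attribute equals the given path.
    static List<String> params(String xml, String path) {
        List<String> values = new ArrayList<String>();
        NodeList nodes = parse(xml).getElementsByTagName("param");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            if (path.equals(p.getAttribute("name"))) {
                values.add(p.getTextContent().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(docids(SAMPLE));
        System.out.println(params(SAMPLE, "/eml/dataset/creator/individualName/surName"));
    }
}
```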
33Structured queries
- Pathquery syntax
- Can build precise queries against arbitrary metadata schemas
- Boolean combinations of conditions (AND, OR)
- Uses XPath-like syntax
- Specify document types to search
- Specify fields to return in resultset
squery
34Query Conditions
- Language-independent representation of a query structure
- Transformed into the appropriate native language of the data store
- Example
<querygroup operator="UNION">
  <queryterm searchmode="contains" casesensitive="false">
    <value>soil</value>
    <pathexpr>dataset/title</pathexpr>
  </queryterm>
  <queryterm searchmode="contains" casesensitive="false">
    <value>soil</value>
    <pathexpr>dataset/title</pathexpr>
  </queryterm>
</querygroup>
35Specifying the Resultset
- Specify the list of fields to be returned in the resultset
- Simple paths used to identify elements or document subtrees
- Effectively flattens the structure of the records, but allows generic representation (i.e., multiple standards)
- Example
<returnfield>dataset/title</returnfield>
<returnfield>creator/individualName/surName</returnfield>
<returnfield>keyword</returnfield>
36Full Query Example
<?xml version="1.0"?>
<pathquery version="1.2">
  <querytitle>Soil search</querytitle>
  <returndoctype>eml://ecoinformatics.org/eml-2.0.0</returndoctype>
  <returnfield>creator/individualName/surName</returnfield>
  <returnfield>keyword</returnfield>
  <querygroup operator="UNION">
    <queryterm searchmode="contains" casesensitive="false">
      <value>soil</value>
    </queryterm>
    <queryterm searchmode="contains" casesensitive="false">
      <value>soil</value>
    </queryterm>
  </querygroup>
</pathquery>
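A client could assemble such a pathquery document programmatically before submitting it with the squery action. The sketch below is one possible approach, not Metacat's client code: `PathQueryBuilder` and its methods are hypothetical, and only the elements and attributes shown on these slides are emitted.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical builder: emits a pathquery document as a string.
final class PathQueryBuilder {
    private final String title;
    private String returnDoctype;
    private final List<String> returnFields = new ArrayList<String>();
    private final List<String> terms = new ArrayList<String>();

    PathQueryBuilder(String title) {
        this.title = title;
    }

    // Escape characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    PathQueryBuilder returnDoctype(String doctype) {
        this.returnDoctype = doctype;
        return this;
    }

    PathQueryBuilder returnField(String path) {
        returnFields.add(path);
        return this;
    }

    // Adds a case-insensitive "contains" term; pathexpr may be null to omit it.
    PathQueryBuilder contains(String value, String pathexpr) {
        StringBuilder t = new StringBuilder();
        t.append("    <queryterm searchmode=\"contains\" casesensitive=\"false\">\n");
        t.append("      <value>").append(escape(value)).append("</value>\n");
        if (pathexpr != null) {
            t.append("      <pathexpr>").append(escape(pathexpr)).append("</pathexpr>\n");
        }
        t.append("    </queryterm>\n");
        terms.add(t.toString());
        return this;
    }

    // operator is the querygroup operator, e.g. UNION or INTERSECT.
    String build(String operator) {
        StringBuilder q = new StringBuilder();
        q.append("<?xml version=\"1.0\"?>\n<pathquery version=\"1.2\">\n");
        q.append("  <querytitle>").append(escape(title)).append("</querytitle>\n");
        if (returnDoctype != null) {
            q.append("  <returndoctype>").append(escape(returnDoctype)).append("</returndoctype>\n");
        }
        for (String f : returnFields) {
            q.append("  <returnfield>").append(escape(f)).append("</returnfield>\n");
        }
        q.append("  <querygroup operator=\"").append(escape(operator)).append("\">\n");
        for (String t : terms) {
            q.append(t);
        }
        q.append("  </querygroup>\n</pathquery>\n");
        return q.toString();
    }

    public static void main(String[] args) {
        System.out.print(new PathQueryBuilder("Soil search")
                .returnDoctype("eml://ecoinformatics.org/eml-2.0.0")
                .returnField("creator/individualName/surName")
                .returnField("keyword")
                .contains("soil", "dataset/title")
                .build("UNION"));
    }
}
```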
39Transforming a document
- Used to convert document before returning it
- Conversion uses XSLT
- Which style sheet to use is controlled by the skin via the qformat parameter
- http://knb.ecoinformatics.org/knb/metacat?action=read&docid=knb-lter-gce.109.5&qformat=ltss
- The document is transformed and then returned
- The skinname.xml file controls the mappings that determine which style sheet to use
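The XSLT step itself can be reproduced with the JDK's built-in transformer. The sketch below is a stand-alone illustration under stated assumptions: the stylesheet and the simplified sample document are invented for the example, `XsltDemo` is a hypothetical name, and Metacat's own skins and style sheets are not shown.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo: converts an XML metadata document to HTML with XSLT.
final class XsltDemo {
    // Simplified, un-namespaced stand-in for a metadata document.
    static final String SAMPLE_XML =
        "<eml packageId='knb.1.2'><dataset><title>Soil profiles in 1982</title></dataset></eml>";

    // Minimal example stylesheet: renders the dataset title as a heading.
    static final String SAMPLE_XSLT =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html'/>"
        + "<xsl:template match='/eml'>"
        + "<html><body><h1><xsl:value-of select='dataset/title'/></h1></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml, String xslt) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(transform(SAMPLE_XML, SAMPLE_XSLT));
    }
}
```

Keeping the stylesheet separate from the stored XML is what lets one document be presented differently under different qformat values.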
41Authentication subsystem
- Actions: login and logout
- Simple username/password system
- Successful login creates a session
- Session ID tracked using an HTTP cookie
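Because the session is tracked with an HTTP cookie, a non-browser client has to carry that cookie from the login response to later requests. The sketch below shows only that hand-off, using `java.net.HttpCookie`. The cookie name `JSESSIONID` is an assumption (it is the usual servlet-container default, not something stated on the slides), and `SessionCookie` is a hypothetical helper.

```java
import java.net.HttpCookie;
import java.util.List;

// Hypothetical helper: turns a Set-Cookie response header into the value of
// the Cookie request header to send on subsequent requests.
final class SessionCookie {
    static String cookieHeaderFrom(String setCookieHeader) {
        List<HttpCookie> cookies = HttpCookie.parse(setCookieHeader);
        StringBuilder header = new StringBuilder();
        for (HttpCookie c : cookies) {
            if (header.length() > 0) {
                header.append("; ");
            }
            // Only name=value is sent back; attributes such as Path stay client-side.
            header.append(c.getName()).append('=').append(c.getValue());
        }
        return header.toString();
    }

    public static void main(String[] args) {
        // Example header value; the session ID is made up for illustration.
        String fromLogin = "Set-Cookie: JSESSIONID=abc123; Path=/knb";
        System.out.println("Cookie: " + cookieHeaderFrom(fromLogin));
    }
}
```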
42Authentication plugins
- Delegates authentication requests to a backend service via a plugin
- Lightweight Directory Access Protocol (LDAP)
- Replaceable to interface with other systems
- Metacat admin can choose which LDAP server to use
43Ecoinformatics.org LDAP
- Need for community-wide user identities
- Distributed system for participating institutions
- Root LDAP server refers requests to specific
organizations for authentication
dc=ecoinformatics,dc=org
  o=NCEAS
  o=UCNRS
  o=LTER
  o=unaffiliated
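User identities in this tree are LDAP distinguished names, so the organization that should authenticate a user can be read from the name itself. The sketch below uses the JDK's `javax.naming.ldap.LdapName`. The example DN follows the tree above, and `DnParts` is a hypothetical helper rather than part of Metacat.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

// Hypothetical helper: looks up one component of a distinguished name.
final class DnParts {
    // Returns the value of the first component with the given type
    // (scanning from the right-hand end of the DN), or null if absent.
    static String valueOf(String dn, String type) {
        try {
            for (Rdn rdn : new LdapName(dn).getRdns()) {
                if (rdn.getType().equalsIgnoreCase(type)) {
                    return String.valueOf(rdn.getValue());
                }
            }
            return null;
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("not a valid DN: " + dn, e);
        }
    }

    public static void main(String[] args) {
        String dn = "uid=jones,o=NCEAS,dc=ecoinformatics,dc=org";
        System.out.println("user: " + valueOf(dn, "uid"));
        System.out.println("organization: " + valueOf(dn, "o"));
    }
}
```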
45Other subsystems
- action=validate
- valtext, docid
- action=setaccess
- docid, principal, permission, permType, permOrder
- action=getversion
- action=getlog
- ipaddress, principal, docid, event, start, end
- http://68.111.43.225:8080/knb/metacat?action=getlog&event=insert
47Client API
- Application Programming Interface (API)
- Defines language-specific binding for communicating with Metacat
- Available in Java and Perl (partial Python)
- Allows development of new applications
- Allows integration of Metacat with existing applications
- Simple set of method calls
48Basic Client API
- public String login(String username, String password)
- public String logout()
- public Reader read(String docid)
- public Reader query(Reader xmlQuery)
- public String insert(String docid, Reader xmlDocument, Reader schema)
- public String update(String docid, Reader xmlDocument, Reader schema)
- public String delete(String docid)
- public String upload(String docid, File file)
- public String upload(String docid, String fileName, InputStream fileData, int size)
49Example use of client
String metacatUrl = "http://foo.com/context/metacat";
String username = "uid=jones,o=NCEAS,dc=ecoinformatics,dc=org";
String password = "neverHardcodeAPasswordInCode";
try {
    // Obtain a client bound to the given Metacat server
    Metacat m = MetacatFactory.createMetacatConnection(metacatUrl);

    m.login(username, password);
    Reader r = m.read("testdocument.1.1");
    // Do whatever you want with Reader r
} catch (MetacatAuthException mae) {
    handleError("Authorization failed\n" + mae.getMessage());
} catch (MetacatInaccessibleException mie) {
    handleError("Metacat Inaccessible\n" + mie.getMessage());
} catch (Exception e) {
    handleError("General exception\n" + e.getMessage());
}
50Overview
51Documentation
52Acknowledgements
This material is based upon work supported by:
- The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
- The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
- The Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)
Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON