Title: Semantic Mediation in SEEK/Kepler: Exploiting Semantic Annotation for Discovery, Analysis, and Integration of Scientific Data and Workflows
1. Semantic Mediation in SEEK/Kepler: Exploiting Semantic Annotation for Discovery, Analysis, and Integration of Scientific Data and Workflows
- Shawn Bowers
  - UC Davis Genome Center
  - sbowers _at_ ucdavis.edu
- Bertram Ludäscher
  - Dept. of Computer Science, UC Davis
  - UC Davis Genome Center
  - ludaesch _at_ ucdavis.edu
seek.ecoinformatics.org, kepler-project.org, www.sdsc.edu, dbis.ucdavis.edu, genomics.ucdavis.edu
2. Science Environment for Ecological Knowledge
- SEEK is an NSF-funded, multidisciplinary research project to facilitate
  - Access to distributed ecological, environmental, and biodiversity data
    - Enable data sharing and reuse
    - Enhance data discovery at global scales
  - Scalable analysis and synthesis
    - Taxonomic, spatial, temporal, conceptual integration of data, addressing data heterogeneity issues
  - Communication and collaboration for analysis
    - Enable reuse of analytical components
    - Support scientific workflow design and modeling
3. SEEK: data access, analysis, mediation
- Data Access (EcoGrid)
  - Distributed data network for environmental, ecological, and systematics data
  - Interoperate diverse environmental data systems
- Workflow Tools (Kepler)
  - Problem-solving environment for scientific data analysis and visualization → "scientific workflows"
- Semantic Mediation (SMS)
  - Leverage ontologies for "smart" data/component discovery and integration
4. Managing Data Heterogeneity
- Data comes from heterogeneous sources
  - Real-world observations
  - Spatial-temporal contexts
  - Collection/measurement protocols and procedures
  - Many representations for the same information (count, area, density)
  - Data, syntax, schema, and semantic heterogeneity
- Discovery and synthesis (integration) are performed manually
  - Discovery is often based on an intuitive notion of "what is out there"
  - Synthesis of data is very time consuming, and limits use
5. Scientific workflow systems support data analysis
KEPLER
6. A simple Kepler workflow
- Composite Component (Sub-workflow)
- Loops often used in SWFs, e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)
(T. McPhillips)
7. A simple Kepler workflow
- Lists Nexus files to process (project)
- Reads text files
- Parses Nexus format
- Draws phylogenetic trees
- PhylipPars infers trees from discrete, multi-state characters.
- Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
- UniqueTrees discards redundant trees in each collection.
(T. McPhillips)
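The iterate-and-deduplicate pattern described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `unique_trees`, `run_until_stable`, and `toy_infer` are hypothetical names, the tree strings are canned, and nothing here reflects the actual PhylipPars or UniqueTrees actors.

```python
from typing import Callable, Iterable, List


def unique_trees(trees: Iterable[str]) -> List[str]:
    """Discard redundant trees while keeping first-seen order."""
    seen = set()
    unique = []
    for tree in trees:
        if tree not in seen:
            seen.add(tree)
            unique.append(tree)
    return unique


def run_until_stable(infer: Callable[[int], List[str]], max_rounds: int = 10) -> List[str]:
    """Call a (hypothetical) inference actor repeatedly until no new tree appears."""
    found: List[str] = []
    for round_no in range(max_rounds):
        merged = unique_trees(found + infer(round_no))
        if len(merged) == len(found):  # nothing new this round: stop iterating
            break
        found = merged
    return found


def toy_infer(round_no: int) -> List[str]:
    """Stand-in for a tree-inference actor; returns canned Newick-like strings."""
    canned = [["(A,(B,C))", "((A,B),C)"], ["((A,B),C)", "((A,C),B)"], ["(A,(B,C))"]]
    return canned[min(round_no, len(canned) - 1)]


trees = run_until_stable(toy_infer)
```

With the canned input, the loop stops in the third round, once a round contributes no tree that has not been seen before.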
8. A simple Kepler workflow
An example workflow run, executed as a Dataflow Process Network
9. SMS motivation
- Scientific Workflow Life-cycle
  - Resource Discovery
    - discover relevant datasets
    - discover relevant actors or workflow templates
  - Workflow Design and Configuration
    - data ↔ actor (data binding)
    - data ↔ data (data integration / merging / interlinking)
    - actor ↔ actor (actor / workflow composition)
- Challenge: do all this in the presence of
  - 100s of workflows and templates
  - 1000s of actors (e.g. actors for web services, data analytics, ...)
  - 10,000s of datasets
  - 1,000,000s of data items
  - highly complex, heterogeneous data
Price to pay for these resources: lots
Scientists' time wasted: priceless!
10. Approach: SMS capabilities
Ontologies
Iterative Development
Semantic Annotation
Resource Discovery
Workflow Validation
Resource Integration
Workflow Elaboration
11. Approach: SMS capabilities
- SEEK KR group is developing OWL-DL ontologies
  - Various workflow-component ontologies (for categorizing by function, project, scientific discipline, ...)
  - Scientific observation ontology (OBOE), an upper ontology for defining and relating observations, measurements, and units
  - Domain-specific ontologies that extend OBOE (standard and derived units, ecology and biodiversity concepts, ...)
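To make the idea of relating observations, measurements, and units concrete, here is a small Python sketch. It is an assumption-laden illustration, not OBOE's actual class definitions: the `Unit`, `Measurement`, and `Observation` classes and their fields are hypothetical, as are the sample values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Unit:
    name: str
    to_base: float  # multiplier that converts a value into the (assumed) base unit


@dataclass(frozen=True)
class Measurement:
    characteristic: str  # what was measured, e.g. "area"
    value: float
    unit: Unit

    def in_base_unit(self) -> float:
        """Express the value in the base unit of its dimension."""
        return self.value * self.unit.to_base


@dataclass
class Observation:
    entity: str  # the thing observed, e.g. a sampling plot
    measurements: List[Measurement] = field(default_factory=list)


SQUARE_METER = Unit("square meter", 1.0)
HECTARE = Unit("hectare", 10_000.0)  # a unit derived from the square meter

plot = Observation("plot-17", [Measurement("area", 0.5, HECTARE)])
area_m2 = plot.measurements[0].in_base_unit()
```

Keeping the unit attached to each measurement is what lets two datasets that record the same characteristic in different units be compared after conversion.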
12. Approach: SMS capabilities
- Annotations connect resources to ontologies
  - Conceptually describe a resource and/or its data schema
  - Annotations provide the means for ontology-based discovery, integration, ...
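As a rough illustration of how annotations can drive discovery, the sketch below maps dataset attributes to ontology class names and looks datasets up by concept. The dataset names, attribute names, class names, and the `datasets_about` helper are all hypothetical; the real SMS annotation format is not shown in these slides.

```python
from typing import Dict, List, Set

# Hypothetical annotation: dataset attribute name -> ontology class names.
Annotation = Dict[str, Set[str]]

annotations: Dict[str, Annotation] = {
    "plots_2004.csv": {"spp": {"Species"}, "cnt": {"Abundance"}},
    "survey_a.csv": {"taxon": {"Species"}, "biomass_g": {"Biomass"}},
}


def datasets_about(concept: str, catalog: Dict[str, Annotation]) -> List[str]:
    """Return datasets with at least one attribute annotated with `concept`."""
    return sorted(
        name
        for name, annotation in catalog.items()
        if any(concept in classes for classes in annotation.values())
    )


hits = datasets_about("Species", annotations)
```

Note that the two datasets use different column names (`spp`, `taxon`) yet are both found, because the lookup goes through the shared concept rather than the schema.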
13. Hybrid types: Semantic and Structural Typing
14. Semantic Type Annotation in Kepler
- Component input and output port annotation
  - Each port can be annotated with multiple classes from multiple ontologies
  - Annotations are stored within the component metadata
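A port that carries both a structural type and several semantic classes could be modeled roughly as follows. This is a sketch, not Kepler's API: the `Port` class, the `ontology#Class` naming scheme, and the example ontology names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Port:
    name: str
    structural_type: str  # e.g. "double" or "{site: string, count: int}"
    semantic_types: Set[str] = field(default_factory=set)  # qualified class names

    def annotate(self, ontology: str, cls: str) -> None:
        """Attach one ontology class; a port may carry several from several ontologies."""
        self.semantic_types.add(f"{ontology}#{cls}")


out = Port("output", "double")
out.annotate("ecology", "SpeciesRichness")
out.annotate("units", "Count")
```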
15. Component Annotation and Indexing
- Component Annotations
  - New components can be annotated and indexed into the component library (e.g., specializing generic actors)
  - Existing components can also be revised, annotated, and indexed (hiding previous versions)
16. Approach: SMS capabilities
- Ontology-based "smart" search
  - Find components by semantic types
  - Find components by input/output semantic types
- Ontology-based query rewriting for discovery/integration
  - Joint work with GEON project (see SSDBM-04, SWDB-04)
17. Smart Search
- Find a component (here an actor) in different locations (categories)
  - based on the semantic annotation of the component (or its ports)
18. Searching in context
- Search for components with compatible input/output semantic types
  - searches over actor library
  - applies subsumption checking on port annotations
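Subsumption checking over port annotations can be approximated with a walk up a class hierarchy. The sketch below is a simplification under stated assumptions: the ontology fragment, the actor library, and the function names are hypothetical, and a real system would delegate subsumption to an OWL reasoner rather than a single-parent lookup table.

```python
from typing import Dict, List, Optional, Set

# Hypothetical ontology fragment: class -> direct superclass (None for the root).
SUPERCLASS: Dict[str, Optional[str]] = {
    "Measurement": None,
    "Abundance": "Measurement",
    "SpeciesAbundance": "Abundance",
    "Biomass": "Measurement",
}


def is_subsumed_by(cls: str, ancestor: str) -> bool:
    """True if `cls` equals `ancestor` or is a (transitive) subclass of it."""
    current: Optional[str] = cls
    while current is not None:
        if current == ancestor:
            return True
        current = SUPERCLASS.get(current)
    return False


# Hypothetical actor library: actor name -> semantic types accepted on its input port.
LIBRARY: Dict[str, Set[str]] = {
    "PlotAbundance": {"Abundance"},
    "BiomassRegression": {"Biomass"},
    "GenericStats": {"Measurement"},
}


def compatible_consumers(output_type: str) -> List[str]:
    """Actors whose input annotation subsumes the given output semantic type."""
    return sorted(
        actor
        for actor, input_types in LIBRARY.items()
        if any(is_subsumed_by(output_type, t) for t in input_types)
    )


matches = compatible_consumers("SpeciesAbundance")
```

Here an output annotated `SpeciesAbundance` matches actors that accept `Abundance` or `Measurement`, but not one that expects `Biomass`.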
19. Approach: SMS capabilities
- Workflow validation and analysis
  - Check that workflows are semantically and structurally well-typed
  - Infer semantic type annotations of derived data (i.e., type inference)
  - An initial approach and prototype based on mapping composition (see QLQP-05)
- User-oriented provenance
  - Collect and query data-lineage of WF runs (see IPAW-06)
20. Workflow validation in Kepler
- Statically perform semantic and structural type checking
- Navigate errors and warnings within the workflow
- Search for and insert adapters to fix (structural and semantic) errors
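A static check of this kind can be pictured as a pass over the workflow's connections that compares both type layers. The following is a minimal sketch, assuming a precomputed ancestor table and exact-match structural typing; the `Port` class, `check_link`, `validate`, and the sample actors are hypothetical and do not reproduce Kepler's validator.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical ontology: class -> set of all its superclasses (transitively closed).
ANCESTORS: Dict[str, Set[str]] = {
    "Density": {"Measurement"},
    "Count": {"Measurement"},
    "Measurement": set(),
}


@dataclass(frozen=True)
class Port:
    actor: str
    structural: str
    semantic: str


def check_link(src: Port, dst: Port) -> List[str]:
    """Return problems found on one output -> input connection."""
    problems = []
    if src.structural != dst.structural:
        problems.append(f"structural: {src.structural} is not {dst.structural}")
    if src.semantic != dst.semantic and dst.semantic not in ANCESTORS.get(src.semantic, set()):
        problems.append(f"semantic: {src.semantic} is not a {dst.semantic}")
    return problems


def validate(links: List[Tuple[Port, Port]]) -> Dict[str, List[str]]:
    """Map 'source -> target' to its problems, keeping only ill-typed links."""
    report = {}
    for src, dst in links:
        problems = check_link(src, dst)
        if problems:
            report[f"{src.actor} -> {dst.actor}"] = problems
    return report


links = [
    (Port("ReadPlots", "double", "Count"), Port("Stats", "double", "Measurement")),
    (Port("ReadPlots", "int", "Count"), Port("DensityMap", "double", "Density")),
]
report = validate(links)
```

Each reported link is a candidate site for an adapter: a structural mismatch suggests a converter, while a semantic mismatch suggests a transformation such as deriving density from count and area.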
21. Approach: SMS capabilities
- Integrating and transforming data
  - Merge ("smart union") datasets
  - Find mappings between data schemas for transformation
  - data binding, component connections (see DILS-04)
22. Smart (Data) Integration: Merge
- Discover data of interest
- connect to merge actor
- compute merge
  - align attributes via annotations
  - open dialog for user refinement
  - store merge mapping in MOML
- enjoy!
  - your merged dataset
  - almost, can be much more complicated
23. Under the hood of Smart Merge
- Exploits semantic type annotations and ontology definitions to find mappings between sources
- Executing the merge actor results in an integrated data product (via outer union)
[Figure: Merge actor combining source attributes (a1, a3, a4, a6, a8) into merged attributes (a1a8, a3a6, a4)]
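The outer-union merge can be sketched as: rename each source attribute to the concept it is annotated with, then pad missing columns. This is an illustrative sketch only; `smart_merge`, the sample rows, and the annotation dictionaries are hypothetical, and a one-to-one attribute-to-concept mapping is assumed, whereas the slides note that real cases can be much more complicated.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def smart_merge(
    sources: List[List[Row]],
    annotations: List[Dict[str, str]],
) -> List[Row]:
    """Outer union: rename each attribute to its annotated concept, pad gaps with None."""
    renamed: List[Row] = []
    for rows, annotation in zip(sources, annotations):
        for row in rows:
            # Attributes without an annotation keep their original name.
            renamed.append({annotation.get(k, k): v for k, v in row.items()})
    columns = sorted({col for row in renamed for col in row})
    return [{col: row.get(col) for col in columns} for row in renamed]


source_1 = [{"spp": "Picea rubens", "cnt": 12}]
source_2 = [{"taxon": "Abies balsamea", "biomass_g": 340.5}]
merged = smart_merge(
    [source_1, source_2],
    [{"spp": "Species", "cnt": "Abundance"}, {"taxon": "Species", "biomass_g": "Biomass"}],
)
```

Attributes annotated with the same concept (`spp` and `taxon`) end up in one column, while attributes that exist in only one source are kept and filled with `None` for the other.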
24. Approach: SMS capabilities
- Workflow design support
  - (Semi-) automatically combine resource discovery, integration, and validation
  - Abstract → Executable WF
  - ongoing work!
Automated SWF Refinement
25. Summary
- Outlook
  - Ontologies and semantic annotations for WF design and reuse
  - Put ontologies to actual use in Kepler
  - Continue to develop Kepler tools for annotation (KR observation ontology), discovery, integration, design, ...
- Issues and Challenges
  - Tools/approaches for ontology (OWL) management, organization, reasoning
  - Open source (distributed) ontology (OWL) storage and reasoning
  - Tools and techniques for robust ontology versioning and extension
- Acknowledgements
  - Timothy McPhillips, Dave Thau (UC Davis)
  - Mark Schildhauer, Josh Madin, Matt Jones (UCSB)
  - Deana Pennington (UNM)
  - Rich Williams (Microsoft Research)
  - Ferdinando Villa, Sergey Krivov (UVM)