Title: Scientific Data
1 Scientific Data & Workflow Engineering: Preliminary Notes from the Cyberinfrastructure Trenches
Associate Professor, Dept. of Computer Science & Genome Center, University of California, Davis
Fellow, San Diego Supercomputer Center, University of California, San Diego
2 Outline
- Introduction: CI & Sample Architectures
- Scientific Data Integration
- Scientific Workflow Management
- Links & Crystallization Points
- Lessons learnt & Summary
3 Science Environment for Ecological Knowledge (SEEK): Overview
- Domain Science Driver
  - Ecology (LTER), biodiversity, ...
- Analysis & Modeling System
  - Design & execution of ecological models & analysis (scientific workflows)
  - application, upper-ware
  - → Kepler system
- Semantic Mediation System
  - Data integration of hard-to-relate sources and processes
  - Semantic types and ontologies
  - upper middleware
  - → Sparrow Toolkit
- EcoGrid
  - Access to ecology data and tools
  - middle-, under-ware
  - → unified API to SRB/MCAT, MetaCat, DiGIR, ... datasets
sample CS problem: DILS'04
4 Common CI Infrastructure Pieces
- Other CI projects (e.g. GEON, ...) have similar service-oriented architectures
- Seamless and uniform data access (Data-Grid)
  - data & metadata registry
- Distributed and high-performance computing platform (Compute-Grid)
  - service registry
- Federated, integrated, mediated databases
  - often use of semantic extensions (e.g. ontologies)
- User-friendly workbench / problem-solving environment
  - → scientific workflows
- add to this: sensors, observing systems
5 Example: Realtime Environment for Analytical Processing (REAP vision)
6 The Great Unified System
- Many engineering and CS challenges!
  - we'll see some
- Our focus
  - Scientific data integration
    - How to associate, mediate, integrate complex scientific data?
  - Scientific workflows
    - How to devise larger scientific workflows for process automation from individual components (e.g. web services)?
- Disclaimer
  - often just scratching the surface; see references & research literature for details
7 Outline
- Introduction: CI & Sample Architectures
- Scientific Data Integration
- Scientific Workflow Management
- Links & Crystallization Points
- Lessons learnt & Summary
8 An Online Shopper's Information Integration Problem
El Cheapo: Where can I get the cheapest copy (including shipping cost) of Wittgenstein's Tractatus Logico-Philosophicus within a week?
One-World Mediation
Mediator (virtual DB) (vs. Datawarehouse)
NOTE: non-trivial data engineering challenges!
9 A Home Buyer's Information Integration Problem
What houses for sale under $500k have at least 2 bathrooms, 2 bedrooms, a nearby school ranking in the upper third, in a neighborhood with below-average crime rate and diverse population?
Multiple-Worlds Mediation
10 A Neuroscientist's Information Integration Problem
Biomedical Informatics Research Network: http://nbirn.net
What is the cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity? How about other rodents?
Complex Multiple-Worlds Mediation
- Inter-source links
  - unclear for the non-scientists
  - hard for the scientist
12 Interoperability & Integration Challenges
- System aspects: Grid Middleware
  - distributed data & computing, SOA
  - web services, WSDL/SOAP, WSRF, OGSA, ...
  - sources = functions, files, data sets, ...
- Syntax & Structure
  - (XML-Based) Data Mediators
  - wrapping, restructuring
  - (XML) queries and views
  - sources = (XML) databases
- Semantics
  - Model-Based/Semantic Mediators
  - conceptual models and declarative views
  - Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
  - sources = knowledge bases (DB+CMs+ICs)
- Synthesis: Scientific Workflow Design & Execution
  - Composition of declarative and procedural components into larger workflows
  - (re)sources = services, processes, actors, ...
- reconciling S5 heterogeneities
- gluing together resources
- bridging information and knowledge gaps computationally
13 Information Integration Challenges: S4 Heterogeneities
- System aspects
  - platforms, devices, data & service distribution, APIs, protocols, ...
  - → Grid middleware technologies
    - e.g. single sign-on, platform independence, transparent use of remote resources, ...
- Syntax & Structure
  - heterogeneous data formats (one for each tool ...)
  - heterogeneous data models (RDBs, ORDBs, OODBs, XMLDBs, flat files, ...)
  - heterogeneous schemas (one for each DB ...)
  - → Database mediation technologies
    - XML-based data exchange, integrated views, transparent query rewriting, ...
- Semantics
  - descriptive metadata, different terminologies, hidden semantics (context), implicit assumptions, ...
  - → Knowledge representation & semantic mediation technologies
    - smart data discovery & integration
    - e.g. ask about X (mafic), find data about Y (diorite), be happy anyways!
14 Information Integration Challenges: S5 Heterogeneities
- Synthesis of applications, analysis tools, data & query components, ... into scientific workflows
  - How to make use of these wonderful things & put them together to solve a scientist's problem?
  - → Scientific Problem Solving Environments (PSEs)
- Portals, Workbench (scientist's view)
  - ontology-enhanced data registration, discovery, manipulation
  - creation and registration of new data products from existing ones, ...
- Scientific Workflow System (engineer's view)
  - for designing, re-engineering, deploying analysis pipelines and scientific workflows: a tool to make new tools
  - e.g., creation of new datasets from existing ones, dataset registration, ...
15 Information Integration from a Database Perspective
- Information Integration Problem
  - Given: data sources S1, ..., Sk (databases, web sites, ...) and user questions Q1, ..., Qn that can in principle be answered using the information in the Si
  - Find: the answers to Q1, ..., Qn
- The Database Perspective: source = database
  - Si has a schema (relational, XML, OO, ...)
  - Si can be queried
  - define a virtual (or materialized) integrated (or global) view G over local sources S1, ..., Sk using database query languages (SQL, XQuery, ...)
  - questions become queries Qi against G(S1, ..., Sk)
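A minimal sketch of the database perspective, using Python's built-in sqlite3 module. The two source tables, their columns and the sample rows are hypothetical, loosely modeled on the El Cheapo bookshop question; the point is only that a question is posed against the global view G rather than against an individual source Si.

```python
import sqlite3

def build_sources(conn):
    """Create two hypothetical sources S1 and S2 with different local schemas."""
    conn.executescript("""
        CREATE TABLE s1_books (title TEXT, price REAL, shipping REAL);
        CREATE TABLE s2_catalog (name TEXT, total_cost REAL);
        INSERT INTO s1_books VALUES ('Tractatus', 12.0, 4.0), ('Principia', 30.0, 5.0);
        INSERT INTO s2_catalog VALUES ('Tractatus', 14.5), ('Ethics', 9.0);
    """)

def define_global_view(conn):
    """Define the virtual integrated view G over S1 and S2."""
    conn.execute("""
        CREATE VIEW g AS
            SELECT 'S1' AS source, title AS title, price + shipping AS total
            FROM s1_books
            UNION ALL
            SELECT 'S2', name, total_cost FROM s2_catalog
    """)

def cheapest(conn, title):
    """A user question becomes a query against G, not against any single source."""
    return conn.execute(
        "SELECT source, total FROM g WHERE title = ? ORDER BY total LIMIT 1",
        (title,),
    ).fetchone()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_sources(conn)
    define_global_view(conn)
    print(cheapest(conn, "Tractatus"))
```

Here both sources live in one SQLite database for simplicity; a real mediator would have to ship sub-queries to remote, autonomous sources, which is where the engineering challenges start.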
16 Standard (XML-Based) Mediator Architecture
[Figure: mediator architecture. Recoverable labels: USER/Client; step 3: Q1, Q2, Q3; step 4: answers(Q1), answers(Q2), answers(Q3).]
17 Query Planning in Data Integration
- Given
  - Declarative user query Q: answer() ← G ...
  - G ← S: global-as-view (GAV)
  - S ← G: local-as-view (LAV)
  - ic() ← S, G: integrity constraints (ICs)
- Find
  - equivalent (or minimal containing, maximal contained) query plan Q: answer() ← S
  - → query rewriting (logical/calculus, algebraic, physical levels)
- Results
  - A variety of results/algorithms; depending on classes of queries, views, and ICs: P, NP, ..., undecidable
  - hot research area in core CS (database community)
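For the GAV case, query rewriting can be sketched as view unfolding: each atom over the global schema G is replaced by the body of a view definition over the sources S. The representation below (atoms as tuples, variables as capitalized strings, view heads made of distinct variables) and all relation names are illustrative assumptions. LAV rewriting needs different techniques (answering queries using views) and is not shown.

```python
from itertools import count, product

def _rename(term, subst, tag):
    """Apply the head substitution; give body-only variables a fresh name."""
    if term in subst:
        return subst[term]
    if term[:1].isupper():          # existential variable of the view body
        return f"{term}_{tag}"
    return term                     # constant

def unfold_atom(atom, gav, fresh):
    """Return the alternative source-level bodies for one query atom."""
    pred, args = atom
    if pred not in gav:             # already a source relation
        return [[atom]]
    alternatives = []
    for head_vars, body in gav[pred]:
        subst = dict(zip(head_vars, args))
        tag = next(fresh)
        alternatives.append(
            [(p, tuple(_rename(t, subst, tag) for t in ts)) for p, ts in body]
        )
    return alternatives

def unfold_query(body, gav):
    """Unfold a conjunctive query over G into a union of conjunctive plans over S."""
    fresh = count(1)
    per_atom = [unfold_atom(a, gav, fresh) for a in body]
    return [[a for part in combo for a in part] for combo in product(*per_atom)]

# Hypothetical GAV definitions: offer(T, P) is defined over two sources.
GAV = {
    "offer": [
        (("T", "P"), [("s1_books", ("T", "P", "Ship"))]),
        (("T", "P"), [("s2_catalog", ("T", "P"))]),
    ]
}

if __name__ == "__main__":
    for plan in unfold_query([("offer", ("tractatus", "Price"))], GAV):
        print(plan)
```

The result is a union of source-level plans, one per combination of view definitions; checking those plans for equivalence or containment is where the complexity results above come in.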
18 Scientific Data Integration using Semantic Extensions
20 Example: Geologic Map Integration
- Given
  - Geologic maps from different state geological surveys (shapefiles w/ different data schemas)
  - Different ontologies
    - Geologic age ontology (e.g. USGS)
    - Rock classification ontologies
      - Multiple hierarchies (chemical, fabric, texture, genesis) from Geological Survey of Canada (GSC)
      - Single hierarchy from British Geological Survey (BGS)
- Problem
  - Support uniform queries across all maps
    - using different ontologies
  - Support registration w/ ontology A, querying w/ ontology B
21 Schema Integration (registering local schemas to the global schema)
[Figure: state geologic map schemas registered to a global schema. States shown: Arizona, Idaho, Colorado, Utah, Nevada, Wyoming, Montana West, New Mexico, Montana E. Local attribute names include ABBREV, PERIOD, FORMATION, AGE, NAME, LITHOLOGY, TYPE, FMATN, TIME_UNIT. Sample values: Livingston formation, Tertiary-Cretaceous, andesitic sandstone.]
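Registering local schemas to a global schema can be sketched as a per-source attribute mapping. The attribute names below appear on the slide, but which state uses which attribute, and the three-attribute global schema itself, are assumptions made for illustration only.

```python
# Hypothetical global schema; the slide does not spell it out.
GLOBAL_SCHEMA = ("formation", "age", "lithology")

# Hypothetical registrations: local attribute name -> global attribute name.
REGISTRATIONS = {
    "Nevada": {"FMATN": "formation", "TIME_UNIT": "age"},
    "Idaho": {"NAME": "formation", "AGE": "age"},
    "New Mexico": {"NAME": "formation", "PERIOD": "age", "LITHOLOGY": "lithology"},
}

def to_global(state, record):
    """Translate one local record into the global schema."""
    mapping = REGISTRATIONS[state]
    out = dict.fromkeys(GLOBAL_SCHEMA)       # unmapped global attributes stay None
    for local_name, value in record.items():
        if local_name in mapping:
            out[mapping[local_name]] = value
    return out

if __name__ == "__main__":
    print(to_global("Nevada", {"FMATN": "Livingston formation",
                               "TIME_UNIT": "Tertiary-Cretaceous"}))
```

Once every source is registered this way, a single query over the global attributes can be answered uniformly across all state maps.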
22 Multihierarchical Rock Classification Ontology (Taxonomies) for Thematic Queries (GSC)
[Figure: four taxonomies: Genesis, Fabric, Composition, Texture.]
23 Ontology-Enabled Application Example: Geologic Map Integration
24 Querying by Geologic Age
25 Querying by Geologic Age: Results
26 Querying by Chemical Composition (GSC)
27 Semantic Mediation (via semantic registration of schemas and ontology articulations)
- Schema elements and/or data values are associated with concept expressions from the target ontology
  - → conceptual queries through the ontology
- Articulation ontology
  - → source registration to A, querying through B
- Semantic mediation: query rewriting w/ ontologies
[Figure: Database1 and Database2 are attached to Ontology A and Ontology B by semantic registration; ontology articulations link Ontology A and Ontology B; concept-based (semantic) queries are posed through the ontologies.]
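The "register to A, query through B" idea can be sketched in a few lines. Everything named here (the concept identifiers, the is-a hierarchy, the articulation and the registered records) is hypothetical, and the rewriting is reduced to set expansion; real semantic mediators reason over description-logic ontologies instead.

```python
# Hypothetical ontology A: child concept -> parent concept (is-a).
ONTOLOGY_A_ISA = {
    "a:sandstone": "a:sedimentary",
    "a:limestone": "a:sedimentary",
    "a:granite": "a:igneous",
}

# Hypothetical articulation: concept of ontology B -> corresponding concepts of A.
ARTICULATION = {"b:sedimentary_rock": {"a:sedimentary"}}

# Hypothetical source data, semantically registered to concepts of ontology A.
RECORDS = [
    {"id": 1, "concept": "a:sandstone"},
    {"id": 2, "concept": "a:granite"},
    {"id": 3, "concept": "a:limestone"},
]

def subconcepts(concept, isa):
    """Return the concept plus everything below it in the is-a hierarchy."""
    result = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in isa.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

def rewrite(b_concept, articulation, isa):
    """Rewrite a query concept from B into the set of matching A concepts."""
    targets = set()
    for a_concept in articulation.get(b_concept, ()):
        targets |= subconcepts(a_concept, isa)
    return targets

def answer(b_concept, records, articulation, isa):
    """Answer a concept-based query posed in B over data registered to A."""
    targets = rewrite(b_concept, articulation, isa)
    return [r["id"] for r in records if r["concept"] in targets]

if __name__ == "__main__":
    print(answer("b:sedimentary_rock", RECORDS, ARTICULATION, ONTOLOGY_A_ISA))
```

The source never has to know ontology B: the articulation and the is-a closure do the bridging at query time.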
28 Different views on State Geological Maps
29 Sedimentary Rocks: BGS Ontology
30 Sedimentary Rocks: GSC Ontology
31 Implementation in OWL: Not only for the machine
32 Source Contextualization through Ontology Refinement
- sources can register new concepts at the mediator ...
33 Outline
- Introduction: CI & Sample Architectures
- Scientific Data Integration
- Scientific Workflow Management
- Links & Crystallization Points
- Lessons learnt & Summary
34 What is a Scientific Workflow (SWF)?
- Goals
  - automate a scientist's repetitive data management and analysis tasks
- typical phases
  - data access, scheduling, generation, transformation, aggregation, analysis, visualization
  - → design, test, share, deploy, execute, reuse, ... SWFs
35 Promoter Identification Workflow
Source: Matt Coleman (LLNL)
36 Source: NIH BIRN (Jeffrey Grethe, UCSD)
37 Ecology: GARP Analysis Pipeline for Invasive Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
39 Commercial & Open Source Scientific Workflow (well, Dataflow) Systems
Kensington Discovery Edition from InforSense
Triana
Taverna
40 SCIRun: Problem Solving Environments for Large-Scale Scientific Computing
- SCIRun: PSE for interactive construction, debugging, and steering of large-scale scientific computations
- Component model, based on generalized dataflow programming
Steve Parker (cs.utah.edu)
41 Ptolemy II
see! read! try!
Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/ptolemyII/
42 Why Ptolemy II (and thus KEPLER)?
- Ptolemy II Objective
  - The focus is on assembly of concurrent components. The key underlying principle in the project is the use of well-defined models of computation that govern the interaction between components. A major problem area being addressed is the use of heterogeneous mixtures of models of computation.
- Dataflow Process Networks w/ natural support for abstraction, pipelining (streaming), actor-orientation, actor reuse
- User-Orientation
  - Workflow design & exec console (Vergil GUI)
  - Application/Glue-Ware
    - excellent modeling and design support
    - run-time support, monitoring, ...
    - not a middle-/underware (we use someone else's, e.g. Globus, SRB, ...)
    - but middle-/underware is conveniently accessible through actors!
- PRAGMATICS
  - Ptolemy II is mature, continuously extended & improved, well-documented (500pp)
  - open source system
  - Ptolemy II folks actively participate in KEPLER
43 KEPLER/CSP: Contributors, Sponsors, Projects (or loosely coupled Communicating Sequential Persons :-)
- Ilkay Altintas: SDM, Resurgence
- Kim Baldridge: Resurgence, NMI
- Chad Berkley: SEEK
- Shawn Bowers: SEEK
- Terence Critchlow: SDM
- Tobin Fricke: ROADNet
- Jeffrey Grethe: BIRN
- Christopher H. Brooks: Ptolemy II
- Zhengang Cheng: SDM
- Dan Higgins: SEEK
- Efrat Jaeger: GEON
- Matt Jones: SEEK
- Werner Krebs: EOL
- Edward A. Lee: Ptolemy II
- Kai Lin: GEON
- Bertram Ludaescher: SEEK, GEON, SDM, ROADNet, BIRN
- Mark Miller: EOL
- Steve Mock: NMI
- Steve Neuendorffer: Ptolemy II
Ptolemy II
www.kepler-project.org
44 KEPLER: An Open Collaboration
- Initiated by members from NSF SEEK and DOE SDM/SPA; now several other projects (GEON, Ptolemy II, EOL, Resurgence/NMI, ...)
- Open Source (BSD-style license)
- Intensive Communications
  - Web-archived mailing lists
  - IRC (!)
- Co-development
  - via shared CVS repository
  - joining as a new co-developer (currently)
    - get a CVS account (read-only)
    - local development & contribution via existing KEPLER member
    - be voted in as a member/co-developer
- Software & social engineering
  - How to better accommodate new groups/communities?
  - How to better accommodate different usage/contribution models (core dev, special purpose extender, user)?
45 Ptolemy II/KEPLER GUI (Vergil)
Directors define the component interaction & execution semantics
Large, polymorphic component (Actors) and Directors libraries (drag & drop)
46 KEPLER/Ptolemy II GUI refined
Ontology-based actor (service) and dataset search
Result Display
47 Web Services → Actors (WS Harvester)
[Figure: four numbered steps (1 to 4) of harvesting web services as actors.]
- → Minute-made (MM) WS-based application integration
- Similarly: MM workflow design & sharing w/o implemented components
48 Some Recent Actor Additions
49 An early example: Promoter Identification (SSDBM, AD 2003)
- Scientist models application as a workflow of connected components (actors)
- If all components exist, the workflow can be automated/executed
- Different directors can be used to pick appropriate execution model (often pipelined execution: PN director)
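The separation between actors (what is computed) and directors (how execution proceeds) can be imitated with Python generators. This is only an analogy: Kepler and Ptolemy II actors are Java components, and a real PN director runs actors as concurrent processes, whereas the "pipelined" runner below is lazy, pull-based streaming in a single thread. The actor names and their string outputs are made up.

```python
def make_actor(name, fn, trace):
    """Wrap a per-item function as a streaming actor that records when it fires."""
    def actor(stream):
        for item in stream:
            trace.append((name, item))
            yield fn(item)
    return actor

def run_pipelined(actors, inputs):
    """Streaming execution: each item flows through all actors before the next one."""
    stream = iter(inputs)
    for actor in actors:
        stream = actor(stream)          # nothing runs until the stream is consumed
    return list(stream)

def run_staged(actors, inputs):
    """Stage-by-stage execution: each actor finishes all items before the next starts."""
    data = list(inputs)
    for actor in actors:
        data = list(actor(iter(data)))
    return data

def build_workflow(trace):
    """Two hypothetical actors of a promoter-identification-like workflow."""
    fetch = make_actor("fetch", lambda g: f"SEQ({g})", trace)
    promoter = make_actor("promoter", lambda s: f"PROMOTER[{s}]", trace)
    return [fetch, promoter]

if __name__ == "__main__":
    for runner in (run_pipelined, run_staged):
        trace = []
        print(runner.__name__, runner(build_workflow(trace), ["g1", "g2"]), trace)
```

Both runners produce the same results from the same actors; only the firing order recorded in the trace differs, which is the sense in which the execution model is chosen independently of the workflow design.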
50 Reengineering a Geoscientist's Mineral Classification Workflow
51 Job Management (here: NIMROD)
- Job management infrastructure in place
- Results database: under development
- Goal: 1000s of GAMESS jobs (quantum mechanics)
Fall/Winter '04
53 Rapid Web Service-based Prototyping (Here: ROADNet Command & Control Services for LOOKING Kick-Off Mtg)
Source: Ilkay Altintas, SDM, NLADR; ROADNet: Vernon, Orcutt et al.; Web services: Tony Fountain et al.
54 in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
55 in KEPLER (interactive session)
Source: Dan Higgins, Kepler/SEEK
56 Blurring Design (ToDo) and Execution
57 Scientific Workflow Challenges
- Typical Features
  - data-intensive and/or compute-intensive
  - plumbing-intensive (consecutive web services won't fit)
  - dataflow-oriented
  - distributed (remote data, remote processing)
  - user interaction in the middle, ...
    - vs. (C-z; bg; fg)-ing (detach and reconnect)
  - advanced programming constructs (map(f), zip, takewhile, ...)
  - logging, provenance, registering back (intermediate) products
58 [Figure: the original Promoter Identification Workflow (Altintas-et-al-PIW-SSDBM03), annotated with its problems.]
- components designed to fit
- hand-crafted control solution; also forces sequential execution!
- hand-crafted Web-service actor
- No data transformations available
- Complex backward control-flow
59 A Scientific Workflow Problem: Solved (Computer Scientist's view)
- Solution based on declarative, functional dataflow process network
  - (= also a data streaming model!)
- Higher-order constructs: map(f)
  - → no control-flow spaghetti
  - → data-intensive apps
  - → free concurrent execution
  - → free type checking
  - → automatic support to go from piw(GeneId) to PIW = map(piw) over GeneId
Figure annotations:
- map(f)-style iterators
- Powerful type checking
- Generic, declarative programming constructs
- Generic data transformation actors
- Forward-only, abstractable sub-workflow piw(GeneId)
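Lifting a single-item workflow piw(GeneId) to a list-at-a-time workflow with map can be sketched as follows. The body of piw is a hypothetical stand-in for the real sub-workflow, and the thread pool from the standard library is just one way to exploit the fact that the elements of a map are independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor

def piw(gene_id):
    """Hypothetical single-gene sub-workflow; a real one would call remote services."""
    return f"promoter-report({gene_id})"

def map_workflow(f, max_workers=4):
    """Lift a one-item workflow f into the list-at-a-time workflow map(f)."""
    def lifted(items):
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(f, items))   # results keep the input order
    return lifted

# PIW = map(piw): the same sub-workflow applied to every gene id.
PIW = map_workflow(piw)

if __name__ == "__main__":
    print(PIW(["NM_001924", "NM020375"]))
```

No loop or feedback connection has to be drawn in the workflow: the iteration is carried entirely by the higher-order construct, and the concurrency comes for free as long as piw has no side effects shared between items.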
60 Promoter Identification Workflow Redesigned
map(GenbankWS)
Input: NM_001924, NM020375
Output: CAGTAATATGAC, GGGGACAA AGA
61 A Research Problem: Optimization by Rewriting
- Example: PIW as a declarative, referentially transparent functional process
  - → optimization via functional rewriting possible
  - e.g. map(f o g) = map(f) o map(g)
- Technical report & PIW specification in Haskell
Figure annotations: map(f o g) instead of map(f) o map(g); combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
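The rewriting law can be checked on a small example. The functions f and g below are arbitrary pure functions chosen for illustration (the technical report uses Haskell; Python is used here only to keep the sketch short). The fused form map(f o g) traverses the data once and builds no intermediate list, which is why a workflow optimizer may prefer it.

```python
def compose(f, g):
    """Function composition: (f o g)(x) = f(g(x))."""
    return lambda x: f(g(x))

def fmap(f):
    """map(f): lift f to work on lists."""
    return lambda xs: [f(x) for x in xs]

def g(x):
    """Arbitrary pure function, standing in for a first processing step."""
    return x + 1

def f(x):
    """Arbitrary pure function, standing in for a second processing step."""
    return x * 10

unfused = compose(fmap(f), fmap(g))   # map(f) o map(g): two passes, one intermediate list
fused = fmap(compose(f, g))           # map(f o g): one pass, no intermediate list

if __name__ == "__main__":
    data = [1, 2, 3]
    print(unfused(data), fused(data))
```

The equality only holds in general when f and g are referentially transparent; an actor with side effects (logging, remote state) could observe the difference in evaluation order.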
62 A KR+DI+Scientific Workflow Problem
- Services can be semantically compatible, but structurally incompatible
[Figure: a source service with output Ps and a target service with input Pt; the desired connection runs from Ps to Pt. The semantic types of Ps and Pt, taken from ontologies (OWL), are compatible, but the structural types of Ps and Pt are incompatible.]
Source: Bowers-Ludaescher, DILS'04
63 Ontology-Informed Data Transformation (Structure-Shim)
[Figure: the semantic types of Ps and Pt are compatible with respect to ontologies (OWL). Registration mappings (input and output) relate the structural types of Ps and Pt to their semantic types; from these a correspondence between the structural types is derived, and a transformation of Ps is generated that realizes the desired connection from source service to target service.]
Source: Bowers-Ludaescher, DILS'04
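A much simplified sketch of generating a structure shim from registration mappings: each field of the source output and of the target input is registered to a concept, and fields are connected when their concepts match. The field names and concept identifiers are hypothetical, and exact concept equality is an assumption made here for brevity; it stands in for the ontology-based compatibility reasoning described in the cited paper.

```python
# Hypothetical registration mappings: field name -> ontology concept.
SOURCE_REG = {"lat": "geo:Latitude", "lon": "geo:Longitude", "spp": "bio:SpeciesName"}
TARGET_REG = {"species": "bio:SpeciesName", "y": "geo:Latitude", "x": "geo:Longitude"}

def generate_shim(source_reg, target_reg):
    """Derive a field correspondence from shared concepts and return a transformation."""
    by_concept = {concept: field for field, concept in source_reg.items()}
    missing = [f for f, c in target_reg.items() if c not in by_concept]
    if missing:
        raise ValueError(f"no source field registered for target fields: {missing}")
    plan = {t_field: by_concept[concept] for t_field, concept in target_reg.items()}

    def shim(record):
        # Restructure one source record (Ps) into the target structure (Pt).
        return {t_field: record[s_field] for t_field, s_field in plan.items()}

    return shim

if __name__ == "__main__":
    shim = generate_shim(SOURCE_REG, TARGET_REG)
    print(shim({"lat": 38.5, "lon": -121.7, "spp": "Quercus lobata"}))
```

The shim is generated, not hand-written: adding a new service only requires registering its ports, after which connections to already registered services can be derived.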
64 Outline
- Introduction: CI & Sample Architectures
- Scientific Data Integration
- Scientific Workflow Management
- Links & Crystallization Points
- Lessons learnt & Summary
65 Link-Up & Crystallization Points
- Shared (Domain) Science Vision, Goals
  - NVO, SCEC, Human Genome Project, ...
- Technology Waves
  - XML, web services, WSRF, Semantic Web (OWL), Portlets, ...
- Standards for data exchange, metadata, data access protocols, ...
  - GML, EML, netCDF, HDF, ..., ADN, ..., DODS/OpenDAP, ...
  - Organizations: W3C, GGF, ...
- Community ontologies
  - GO (Gene Ontology), ecoinformatics, seismology, geochemistry, ...
  - from Saulus to Paulus
- Shared Community Tools and Tool Co-Development
  - SRB, Globus, ..., Kepler, ...
66 Shared Science Vision, Goals: SCEC/CME
Southern California Earthquake Center / Community Modeling Environment Project
- Simulation of Seismic Wave Propagation of a Magnitude 7.7 Earthquake on San Andreas Fault
  - PIs: Thomas Jordan, Bernard Minster, Reagan Moore, Carl Kesselman
- Simulation
  - 240 processors for 5 days
  - 47 terabytes of data generated
  - SDSC SAC project optimized code on DataStar parallel computer (both MPI I/O management and checkpointing)
- Future simulation: increasing resolution by a factor of 2 implies 1 PB of simulation results, 1000 processors for 20 days
Source: Reagan Moore, SDSC
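The slide does not say how the future figures were derived. One plausible reading, offered purely as an assumption, is that doubling the resolution in three spatial dimensions and in time multiplies cost by about 2^4 = 16. The arithmetic below applies that assumed factor to the numbers quoted on the slide.

```python
CURRENT_TB = 47                   # data generated by the current run (from the slide)
CURRENT_PROC_DAYS = 240 * 5       # 240 processors for 5 days (from the slide)
FUTURE_PROC_DAYS = 1000 * 20      # 1000 processors for 20 days (from the slide)

def cost_factor(resolution_factor, dims=4):
    """Assumed scaling: cost grows with resolution in 3 space dimensions plus time."""
    return resolution_factor ** dims

def projected_output_tb(resolution_factor):
    """Projected data volume in TB under the assumed scaling."""
    return CURRENT_TB * cost_factor(resolution_factor)

def compute_growth():
    """Ratio of future to current processor-days, using the slide's numbers."""
    return FUTURE_PROC_DAYS / CURRENT_PROC_DAYS

if __name__ == "__main__":
    print(cost_factor(2), projected_output_tb(2), round(compute_growth(), 1))
```

Under this assumption the projected output is 752 TB, the same order of magnitude as the quoted 1 PB, and the growth in processor-days (about 16.7 times) is close to the assumed factor of 16.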
67 Example: NVO Community Processes
- created standard data encoding format (FITS image format)
- made accessible common digital holdings (sky survey images)
- defined Uniform Content Descriptors (common metadata attributes)
- created standard services (standard access mechanisms to catalogs and surveys)
- created digital library (manage derived data products)
- created portals (for combining services interactively)
- created processing pipelines (for automated processing)
- created preservation environment
- Broader impact: found a new star!
Source: Reagan Moore, SDSC
68 Semantic Mediation Waterfall
Ontologies
Iterative Development
Semantic Data, Service Annotation
Resource Discovery
Resource Integration
Workflow Analysis
Workflow Planning
Source: Shawn Bowers, SEEK AHM'04
69 GEON Dataset Generation & Registration (a co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt, Chad, Dan et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al. (Ptolemy)
70 KEPLER as a Melting Pot
- A grass-roots project
  - Needed: a coalition of the (really!) willing
- Inter-project links
  - SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM, Ptolemy II, NIH BIRN (coming ...), UK eScience myGrid, ...
- Intra-project links
  - e.g. in SEEK: AMS ↔ SMS ↔ EcoGrid
- Inter-technology links
  - Globus, SRB, JDBC, web services, soaplab services, command line tools, R, GRASS, XSLT, ...
- Interdisciplinary links
  - CS, IT, domain sciences, ... (recently: usability engineer)
71 Outline
- Introduction: CI & Sample Architectures
- Scientific Data Integration
- Scientific Workflow Management
- Links & Crystallization Points
- Lessons learnt & Summary
72 Some Lessons Learnt
- Eat your own dog-food (or at least try)
  - start using your own (CI) tools early
- Collaboration tools
  - CVS repositories (cvsview, webcvs)
  - Mailing lists (e.g. mailman → googlified)
  - Bugzilla (detailed tracking of tech. issues & bugs)
  - Wiki (community authored web resource, e.g. high-level tech. issues)
- Where is the XYZ repository/registry?
  - EcoGrid (SEEK) registry, GEON registry, KEPLER actor & datasets repository, ...
  - UDDI ... what?
- CI Melting Pots: SDSC, ...
  - NCEAS, LTER, NLADR (w/ NCSA), KU Specify, ...
73 Q & A
74 Further Reading
75 Related Publications
- Semantic Data Registration and Integration
  - On Integrating Scientific Resources through Semantic Registration, S. Bowers, K. Lin, and B. Ludäscher, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  - A System for Semantic Integration of Geologic Maps via Ontologies, K. Lin and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  - Towards a Generic Framework for Semantic Registration of Scientific Data, S. Bowers and B. Ludäscher. In Semantic Web Technologies for Searching and Retrieving Scientific Data (SCISW), Sanibel Island, Florida, 2003.
  - The Role of XML in Mediated Data Integration Systems with Examples from Geological (Map) Data Interoperability, B. Brodaric, B. Ludäscher, and K. Lin. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
  - Semantic Mediation Services in Geologic Data Integration: A Case Study from the GEON Grid, K. Lin, B. Ludäscher, B. Brodaric, D. Seber, C. Baru, and K. A. Sinha. In Geological Society of America (GSA) Annual Meeting, volume 35(6), November 2003.
- Query Planning and Rewriting
  - Processing First-Order Queries under Limited Access Patterns, Alan Nash and B. Ludäscher, Proc. 23rd ACM Symposium on Principles of Database Systems (PODS'04), Paris, France, June 2004.
  - Processing Unions of Conjunctive Queries with Negation under Limited Access Patterns, Alan Nash and B. Ludäscher, 9th Intl. Conference on Extending Database Technology (EDBT'04), Heraklion, Crete, Greece, March 2004, LNCS 2992.
  - Web Service Composition Through Declarative Queries: The Case of Conjunctive Queries with Union and Negation, B. Ludäscher and Alan Nash. Research abstract (poster), 20th Intl. Conference on Data Engineering (ICDE'04), Boston, IEEE Computer Society, April 2004.
76 Related Publications
- Scientific Workflows
  - Kepler: An Extensible System for Design and Execution of Scientific Workflows, I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S. Mock, 16th International Conference on Scientific and Statistical Database Management (SSDBM'04), 21-23 June 2004, Santorini Island, Greece.
  - Kepler: Towards a Grid-Enabled System for Scientific Workflows, Ilkay Altintas, Chad Berkley, Efrat Jaeger, Matthew Jones, Bertram Ludäscher, Steve Mock, Workflow in Grid Systems (GGF10), Berlin, March 9th, 2004.
  - An Ontology-Driven Framework for Data Transformation in Scientific Workflows, S. Bowers and B. Ludäscher, Intl. Workshop on Data Integration in the Life Sciences (DILS'04), March 25-26, 2004, Leipzig, Germany, LNCS 2994.
  - A Web Service Composition and Deployment Framework for Scientific Workflows, I. Altintas, E. Jaeger, K. Lin, B. Ludäscher, A. Memon. In the 2nd Intl. Conference on Web Services (ICWS), San Diego, California, July 2004.