Title: Bertram Ludäscher
1. Scientific Workflows: Towards a New Synthesis for Information Integration
- Bertram Ludäscher
- Dept. of Computer Science & Genome Center
- University of California, Davis
- ludaesch@ucdavis.edu
2. Outline
- Demystifying eScience, Cyberinfrastructure (CI)
- Scientific Workflows and CI
- SWF Examples
- Scientific Workflow Design
- Actor-Oriented Modeling & Design (AMAD)
- Semantic types
- Collection-Oriented Modeling & Design (COMAD)
- SWF Provenance
- or: Show me the evidence!
3. The Diversity & Unity of Science
- Natural Sciences: Earth Sciences, Life Sciences, Physical Sciences
- Observations, Measurements, Models, Simulations, Analyses, Hypotheses → Understanding, Prediction
- in vivo, in vitro, in situ, in silico, ...
- compute-intensive, structure- & semantics-intensive, data-intensive, metadata-intensive
4. How we do Science (Src: C. Goble, ISWC'05)
5. e-Science (UK) and Cyberinfrastructure (US)
- "e-Science is about global collaboration in key areas of science and the next generation of computing infrastructure that will enable it." - Sir John Taylor, Director, Office of Science and Technology, UK
- "Cyberinfrastructure is the coordinated aggregate of software, hardware and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering. The challenge of Cyberinfrastructure is to integrate relevant and often disparate resources to provide a useful, usable, and enabling framework for research and discovery characterized by broad access and 'end-to-end' coordination." - Fran Berman, San Diego Supercomputer Center, UCSD
6. Example: Current NSF CI Solicitation (Src: Chris Greer, NSF Program Director)
7. Integrated Cyberinfrastructure System meeting the needs of multiple communities (Source: Dr. Deborah Crawford, Chair, NSF CyberInfrastructure Working Group)
- Applications: Environmental Science, High Energy Physics, Biomedical Informatics, Geoscience
- Development Tools & Libraries
- Education and Training
- Discovery & Innovation
- Grid Services & Middleware
- Hardware
8. CI Tools: Attempt at a simplified definition
- CI = IT that scientists use to get their (science) job done
- Hardware: dedicated cluster, supercomputer time, the Grid, ...
- PubMed, CiteSeer, Google Scholar, ... (digital libraries)
- email, Skype, ... (personal communication)
- Google, Wikipedia, ... (search & browse)
- MySQL, PostgreSQL, Oracle, SQL Server, DB2, ... (databases)
- R, MATLAB, Statistica, LabVIEW, ... (statistics, data mining, ...)
- Web portals, workspaces, workbenches, ...
- Virtual instruments, virtual experiments, ...
- Community aspect: shared/standardized file formats, data, metadata, ontologies, ...
- scientific workflows! (PSEs, LIMS → Scientific Workflow Systems)
9. A Scientist's view on Cyberinfrastructure
- In the beginning there was a lot of middleware
- compute/data-grids, web services, WS-foo, SOA-bar, ...
- running on a widely distributed "underware"
- i.e., distributed, heterogeneous compute platforms on clusters, over LANs & WANs, database back-ends, etc.
- only topped by the all-important "upperware"!
- i.e., the applications that users/scientists use to get stuff done
- Good CI at the middleware level and below: "Thou shalt be invisible!" (unless you are a workflow engineer)
- Good CI at the upperware level: "Thou shalt help me get my job done!"
- → That is what scientific workflow systems shoot for
10. Scientific Workflows = User Applications?
- Q: So is this scientific workflow system "upperware" then a fancy name for tools or user apps?
- A: "Tool", "application", etc. are very generic terms. They do not convey what SWF systems are about, i.e., gluing/plumbing existing data management, analysis, and visualization components together.
- Q: Doh! And I heard that "Ontologies: it's the GLUE, STUPID!" So what is now the glue??
- A: Both! Ontologies are about gluing (defining) your vocabulary, annotating (and thus interlinking) data, etc.
- Scientific workflows are about gluing components (web services, legacy scripts, external tools, e.g. R, ...).
- → SWFs are about application (and data) integration; ontologies are about sharing meaning via controlled vocabularies & logic constraints (capturing some of the semantics); this helps, e.g., in data integration.
11. Data Integration: So much data, so little time
- Integration across (sub-)disciplines, scales, time, ...
- (Figure: geoscience data layers - Tectonics, Earthquakes, Geology, Aquifers, Moho depth, Topography, Gravity, Faults, Mines, Focal Mechanisms, Magnetics, Sediment thickness)
- Src: Dogan Seber (SDSC), Krishna Sinha (VT) - CYBERINFRASTRUCTURE FOR THE GEOSCIENCES
12. Different Types of Information Integration
- Spatial (co-)registration/overlay of different data
- from 2D, 3D, 4D (x,y,z,t), ... (4+n)D
- → GIS!
- Conventional (DB-oriented) integration
- schema-based, view-based, at the data level
- Extended DI approaches using ontologies
- ontologies, controlled vocabularies, metadata/annotations
- Application/process integration
- → scientific workflows
- can include all the others, plus statistics, data mining, visualization, ...
13. Interoperability & Integration Challenges
- System aspects: Grid & Middleware
- distributed data & computing, SOA
- resource discovery, authentication, authorization
- web services, WSDL/SOAP, WSRF, OGSA, ...
- (re-)sources: services, files, data sets, nodes
- Syntax & Structure: (XML-based) Data Mediators
- wrapping, restructuring
- (XML) queries and views
- sources: (XML) databases
- Semantics: Model-Based/Semantic Mediators
- conceptual models and declarative views
- Knowledge Representation: ontologies, description logics (RDF(S), OWL, ...)
- sources: knowledge bases (DBs + CMs + ICs)
- Synthesis: Scientific Workflow Design & Execution
- composition of declarative and procedural components into larger workflows
- (re-)sources: services, processes, actors, ...
- reconciling S5 heterogeneities
- gluing together resources
- bridging information and knowledge gaps computationally
14. Information Integration Challenges: S4 Heterogeneities
- System aspects
- platforms, devices, data & service distribution, APIs, protocols, ...
- → Grid middleware technologies
- e.g. single sign-on, platform independence, transparent use of remote resources, ...
- Syntax & Structure
- heterogeneous data formats (one for each tool ...)
- heterogeneous data models (RDBs, ORDBs, OODBs, XML DBs, flat files, ...)
- heterogeneous schemas (one for each DB ...)
- → Database mediation technologies
- XML-based data exchange, integrated views, transparent query rewriting, ...
- Semantics
- descriptive metadata, different terminologies, hidden semantics (context), implicit assumptions, ...
- → Knowledge representation & semantic mediation technologies
- "smart" data discovery & integration
- e.g. ask about X (mafic), find data about Y (diorite), be happy anyways!
15. Information Integration Challenges: S5 Heterogeneities
- Synthesis of applications, analysis tools, data & query components, ... into scientific workflows
- How to put together components to solve a scientist's problem?
- → Scientific Problem Solving Environments (PSEs)
- Portals, Workbench (scientist's view)
- ontology-enhanced data registration, discovery, manipulation
- creation and registration of new data products from existing ones, ...
- Scientific Workflow System (engineer's view + scientist's view)
- for designing, re-engineering, deploying analysis pipelines and scientific workflows; a tool to make new tools
- e.g., creation of new data products from existing ones, dataset registration, ...
16. Motivation: Scientific Workflows, Pre-Cyberinfrastructure
- Data Federation & Grid "Plumbing"
- access, move, replicate, query data (Data-Grid)
- authenticate: SRB Sget/Sput, OPeNDAP, Antelope/ORBs
- schedule, launch, monitor jobs (Compute-Grid): Globus, Condor, Nimrod, APST, ...
- Data Integration
- conceptual querying & integration, structure & semantics, e.g. mediation w/ SQL, XQuery + OWL (semantics-enabled mediator) → SOQL
- Data Analysis, Mining, Knowledge Discovery
- manual/scripts, Excel, R, simulations, ...
- Visualization
- 3-D (volume), 4-D (spatio-temporal), n-D (conceptual views)
- But: one-of-a-kind custom apps, detached ("island") solutions
- workflows are hard to reproduce, maintain
- no/little workflow design, automation, reuse, documentation
- → need for an integrated scientific workflow environment
17. Scientific Workflow (SWF)
- A model of the way a scientist works with their data and tools
- Mentally coordinate data export, import, analysis via software systems
- Emphasize dataflow (similar to, but different from, business workflows)
- Metadata: automatic data ingestion, analysis, provenance tracking
- Goals
- SWF automation
- SWF component reuse
- SWF design & documentation
- make scientific data analysis and management tasks easier - for the scientist!
18. Types of Scientific Workflows (overlapping)
- What do we use scientific workflow systems for? Short answer: nearly everything
- Types of Scientific Workflows
- Modeling & Design: capture or reverse-engineer processes and information flows at all levels
- Knowledge discovery: automate repetitive data access, retrieval, custom analysis (e.g. BLAST), generic steps (PCA, cluster analysis, ...); Ex: PIW, motif analysis, NDDP, ...
- Plumbing: stage files, submit batch jobs, monitor progress, move files off XT3 to analysis and viz cluster, archive, steer computation; Ex: fusion simulation, astrophysics (supernova simulation)
- (Real-time) analysis pipelines: processing of environmental and earth science data from sensor networks (→ NEON, ORION, EarthScope, ...)
19. Promoter Identification Workflow (PIW) ... or from a napkin drawing (Source: Matt Coleman, LLNL)
20. ... to an executable workflow (here in KEPLER)
21. Retrieving gene sequences via web services
- This entire workflow can be wrapped as a re-usable component so that the details of extracting sequence data are hidden unless needed.
22. Managing complexity: Actor-Oriented Modeling & Design
- Scientific workflows use hierarchy to hide complexity
- Top-level workflows can be a conceptual representation of the science process that is easy to comprehend at a glance
- Drilling down into sub-workflows reveals increasing levels of detail
- Composing models using hierarchy promotes the development of re-usable components that can be shared with other scientists
23. Ex: A Happy Fusion Simulation Workflow - Subspecies/Variety: Plumbing WF (flux-laboris plumbiensis)
- Implements concurrent analysis pipeline (@ secondary cluster)
- Tasks: convert, analyze, copy-to-Web-portal (makes scientists really happy!)
- easy configuration, reuse, ...
- pipeline parallelism!
- Features: reusable actor class; pipelined execution model; inline documentation; inline display; easy-to-edit parameter settings; checkpointing for (semi-smart) restart
- Overall architecture/simulation (physicist): Scott Klasky (ORNL); Workflow design & development: Norbert Podhorszki (UC Davis)
24. (No transcript)
25. Commercial & Open Source Scientific Workflow Systems
26. KEPLER (http://www.kepler-project.org)
- The Kepler Scientific Workflow System
- Extends Ptolemy II (Berkeley), developed by an EECS community (design and simulation of complex systems)
- Open-source, Java
- Computation models, nested WFs, loops
- Graphical workflow interface
- Workflow execution
- Extensible architecture
- Component libraries
- Metadata, discovery, archival
- The Kepler Vision
- End-to-end scientific workflow design and execution environment
- Data- and compute-intensive workflows
- Comprehensive component libraries for a wide range of scientific domains
- Enable collaboration, sharing across disciplines (synergy)
- Contributing projects: Natural Diversity Discovery Project (NDDP, www.nddp.org); Science Environment for Ecological Knowledge (SEEK, seek.ecoinformatics.org); Real-time Observatories, Applications and Data Management Network (ROADNet, roadnet.ucsd.edu); Encyclopedia of Life; GEON (www.geongrid.org); Ptolemy II (ptolemy.eecs.berkeley.edu/ptolemyII); SciDAC SDM (www-casc.llnl.gov/sdm)
27. GEON Dataset Generation & Registration (a co-development in KEPLER)
- Makefile > ant run
- SQL database access (JDBC)
- Contributors: Matt et al. (SEEK); Efrat (GEON); Ilkay (SDM); Yang (Ptolemy); Xiaowen (SDM); Edward et al. (Ptolemy)
28. Web Services → Actors (WS Harvester)
- (screenshot: steps 1-4)
- → "Minute-made" (MM) WS-based application integration
- Similarly: MM workflow design & sharing w/o implemented components
29. GEON Mineral Classification Workflow
- An early example: classification for naming igneous rocks
30. Ex: Pipelined workflow for inferring phylogenetic trees (Src: Timothy McPhillips, UC Davis)
31. Kepler SWF using remote datasets, 3rd-party software
32. Summary: What are Scientific Workflows?
- Maybe the single most important concept you hear about @ CSIG'06
- Attempt at a definition
- SWFs = system designs and/or executable programs/scripts
- aiming to solve complex scientific data integration, analysis, management, & visualization tasks
- in plainer English: doing hard and/or messy stuff
- while doing it in a scientist-friendly way - that is, making it look easy
- with the ultimate goal to do new, more, and better (e-)Science - and faster! (and reproducibler! ;-)
- In short: SWFs are nothing less than MIRACLE-IT to make scientists (biologists, physicists, geoscientists, ...) happy. (Attempto-DL definition)
33. What about scripts instead of SWFs?
- If WF automation and gluing is so important, why not just use
- MIRACLE-Perl, or MIRACLE-Python, or MIRACLE-BPEL4WS ???
- indeed: Perl/Python in the hands of a gifted programmer are hard to beat
- but (MIRACLE-) Scientific Workflows may offer new features
- built-in task parallelism and pipeline parallelism
- data parallelism available upon request ;-)
- built-in distributed execution (Grid/cluster)
- modeling & design of workflows
- actor (component)-oriented workflow design [Bowers-Ludaescher, ER'05]
- component and workflow reuse & repurposing
- semantic extensions (smart search/link/...) [Bowers-Ludaescher, QLQP'06]
- parameter configuration, parameter studies
- comprehensibility, documentation
- data (and workflow) provenance support (→ IPAW'06 papers)
- explain data dependencies/lineage, debug strange results, smart rerun, ...
34. Some KEPLER Actors (oh, the good old days ...)
35. So ...
- A question: If scientific workflows are so great, why haven't they taken over the world??
- A1: just wait ...
- A2: they already have ...
- A3: The problem of creating flexible, reusable, comprehensible, efficient workflows
- is akin to the problem of creating modular, reusable, maintainable software!
- it's complex systems engineering (as in "difficult")
- and using UML, XML, WS-foo, SOA-bar, and BPEL-baz is no substitute for solving your modeling & design problem!
- Google "evolution of language": Descartes, Church, McCarthy, ... W3C
- Tony Hoare (Turing Award winner): "The Emperor's Old Clothes"
36. Modeling & Design of Scientific Workflows (Src: Kristian Stevens, UC Davis)
37. ... the inside may be less pretty (Src: Kristian Stevens, UC Davis)
38. Actor-Oriented Modeling: Ports
- each actor has a set of input and output ports
- denote the actor's signature
- produce/consume data (a.k.a. tokens)
- parameters are special "static" ports
39. Actor-Oriented Modeling: Dataflow Connections
- actor communication channels
- directed (hyper-)edges
- connect output ports with input ports
- merge step, distribute step
40. Actor-Oriented Modeling: Sub-workflows / Composite Actors
- composite actors wrap sub-workflows
- like actors, have signatures (i/o ports of the sub-workflow)
- hierarchical workflows (arbitrary nesting levels)
41. Actor-Oriented Modeling: Directors
- define the execution semantics of workflow graphs
- execute the workflow graph (via some schedule)
- sub-workflows may have different directors
- enables reusability
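The port/channel/director notions above can be sketched in a few lines. This is an illustrative toy model (not the Kepler/Ptolemy II Java API); the `Actor`, `connect`, and schedule names are invented for this sketch:

```python
# Minimal actor-oriented model: actors expose named ports, channels
# (FIFO queues) connect an output port to an input port, and a
# director fires actors according to a schedule.
from collections import deque

class Actor:
    def __init__(self, name, fn):
        self.name, self.fn = name, fn
        self.inputs, self.outputs = {}, {}   # port name -> channel (deque)

    def fire(self):
        # consume one token per input port, produce one per output port
        args = {p: ch.popleft() for p, ch in self.inputs.items()}
        for port, token in self.fn(**args).items():
            self.outputs[port].append(token)

def connect(src, out_port, dst, in_port):
    ch = deque()                             # a dataflow channel
    src.outputs[out_port] = ch
    dst.inputs[in_port] = ch

const = Actor("const", lambda: {"out": 21})
double = Actor("double", lambda x: {"out": 2 * x})
connect(const, "out", double, "x")
result = deque()
double.outputs["out"] = result               # tap the workflow's final output

# A director with a static, SDF-like schedule: fire each actor once.
for actor in [const, double]:
    actor.fire()
print(result[0])  # -> 42
```

Note how execution order lives in the director loop, not in the actors: the same `const`/`double` components could be reused under a different schedule.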
42. Models of Computation
- Directors separate the concerns of WF orchestration from actor execution
- Synchronous Dataflow (SDF)
- Connections have queues for sending/receiving fixed numbers of tokens at each firing. The schedule is statically predetermined. SDF models are highly analyzable and often used in SWFs.
- Process Networks (PN)
- Generalize SDF. Actors execute as separate threads/processes, with queues of unbounded size. Related to Kahn/MacQueen semantics.
- Continuous Time (CT)
- Connections represent the value of a continuous-time signal at some point in time. Often used to model physical processes.
- Discrete Event (DE)
- Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
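The PN flavor is easy to demonstrate: each actor runs as its own thread and blocks on an empty input queue (cf. Kahn/MacQueen). A minimal sketch, not Ptolemy's PN director, with an invented `None` end-of-stream marker:

```python
# Process-network style: one thread per actor, blocking queue reads.
import queue
import threading

def producer(out_q):
    for i in range(5):
        out_q.put(i)
    out_q.put(None)                          # end-of-stream marker

def square(in_q, out_q):
    while (tok := in_q.get()) is not None:   # get() blocks when queue empty
        out_q.put(tok * tok)
    out_q.put(None)

def collect(in_q, sink):
    while (tok := in_q.get()) is not None:
        sink.append(tok)

q1, q2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=f, args=a) for f, a in
           [(producer, (q1,)), (square, (q1, q2)), (collect, (q2, results))]]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # -> [0, 1, 4, 9, 16]
```

All three stages run concurrently, which is exactly the pipeline parallelism the fusion-workflow slide advertises.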
43. Polymorphic Actors: Components Working Across Data Types and Domains
- Actor Data Polymorphism
- Add numbers (int, float, double, Complex)
- Add strings (concatenation)
- Add complex types (arrays, records, matrices)
- Add user-defined types
- Actor Behavioral Polymorphism
- In dataflow, add when all connected inputs have data
- In a time-triggered model, add when the clock ticks
- In discrete-event, add when any connected input has data, and add in zero time
- In process networks, execute an infinite loop in a thread that blocks when reading empty inputs
- In CSP, execute an infinite loop that performs rendezvous on input or output
- In push/pull, ports are push or pull (declared or inferred) and behave accordingly
- In real-time CORBA, priorities are associated with ports and a dispatcher determines when to add
- hey, Ptolemy has been out for long!
- "By not choosing among these when defining the component, we get a huge increment in component re-usability. But how do we ensure that the component will work in all these circumstances?" (Source: Edward Lee et al., http://ptolemy.eecs.berkeley.edu/)
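Data polymorphism of the "add" actor can be illustrated in miniature: one component that adds ints, concatenates strings, and combines lists element-wise, dispatching on token type. A sketch only; Ptolemy's actual mechanism is a static type lattice, not runtime `isinstance` checks:

```python
# One "add" component, polymorphic over several token types.
def add(a, b):
    if isinstance(a, list) and isinstance(b, list):
        return [add(x, y) for x, y in zip(a, b)]  # element-wise on arrays
    return a + b  # numbers add; strings concatenate

print(add(1, 2))            # -> 3
print(add("foo", "bar"))    # -> foobar
print(add([1, 2], [3, 4]))  # -> [4, 6]
```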
44. Supporting WF Design via Semantic Types (Src: NSF/ITR SEEK proposal)
45. (Diagram: Workflows - SMS - Ontologies - Data)
46. Semantic Types: Motivation
- Scientific Workflow Life-cycle
- Resource Discovery
- discover relevant datasets
- discover relevant actors or workflow templates
- Workflow Design and Configuration
- data → actor (data binding)
- data → data (data integration / merging / interlinking)
- actor → actor (actor / workflow composition)
- Challenge: do all this in the presence of
- 100s of workflows and templates
- 1000s of actors (e.g. actors for web services, data analytics, ...)
- 10,000s of datasets
- 1,000,000s of data items
- highly complex, heterogeneous data
- Price to pay for these resources: lots. Scientist's time wasted: priceless!
47. Approach: SMS Capabilities
- Employ semantic extensions (ontologies) for ...
- Smart Search (→ resource discovery)
- Smart Attach (→ data binding)
- Smart Integration (→ transform/merge data)
- Smart Links (→ actor composition, WF design)
- by "smart" we mean these services are informed by metadata and ontology information
- Characteristics of SMS work
- big chunk of basic computer science research (→ references)
- but also: implement this (→ link to Kepler)
- driven by real-world use cases (→ link to BEAM)
- on top of community ontologies (→ link to KR team)
48. Example: Semantic Type Annotation
- (Ontology diagram: Observation hasContext MeasContext, which appliesTo LifeStage Property; Abundance Count with itemMeasured, hasCount, obsProperty; Number Value via hasValue; AccuracyQualifier; 1:1 cardinalities)
- (Workflow land: ports P1-P5 carry semantic types, e.g. semType(P2), semType(P3); S1 = life stage property, S2 = mortality rate for period)
49. Hybrid Types: Semantic & Structural Typing
50. Semantic Type Annotation in Kepler
- Component input and output port annotation
- Each port can be annotated with multiple classes from multiple ontologies
- Annotations are stored within the component metadata
51. Component Annotation and Indexing
- New components can be annotated and indexed into the component library (e.g., specializing generic actors)
- Existing components can also be revised, annotated, and indexed (hiding previous versions)
52. Smart Search
- Find a component (here: an actor) in different locations (categories)
- based on the semantic annotation of the component (or its ports)
- → needs one (or more) ontologies to register against (→ KR)
53. Smart (Data) Integration & Merge
- Discover data of interest
- connect to merge actor
- compute merge
- align attributes via annotations
- open dialog for user refinement
- store merge mapping in MoML
- enjoy! - your merged dataset
54. Under the hood of Smart Merge
- Exploits semantic type annotations and ontology definitions to find mappings between sources
- Executing the merge actor results in an integrated data product (via outer union)
- (Figure: source attributes a1, a3, a4 and a6, a8 with mappings a1~a8, a3~a6, feeding the Merge actor)
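The outer-union step can be sketched concretely: align attribute names across sources using a mapping (hand-written here; in the Smart Merge actor it would be derived from semantic annotations), then union the rows, padding missing attributes. The attribute names and data are invented for illustration:

```python
# Outer union of two record sets under an attribute-alignment mapping.
def outer_union(rows_a, rows_b, mapping):
    # mapping: attribute name in source B -> corresponding name in source A
    renamed_b = [{mapping.get(k, k): v for k, v in r.items()} for r in rows_b]
    attrs = sorted({k for r in rows_a + renamed_b for k in r})
    # pad every row with None for attributes it lacks
    return [{a: r.get(a) for a in attrs} for r in rows_a + renamed_b]

src_a = [{"site": "X1", "abundance": 12}]
src_b = [{"plot": "Y9", "count": 7}]   # 'plot' aligns to 'site', 'count' to 'abundance'
merged = outer_union(src_a, src_b, {"plot": "site", "count": "abundance"})
print(merged)
# -> [{'abundance': 12, 'site': 'X1'}, {'abundance': 7, 'site': 'Y9'}]
```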
55. Smart Linking (Workflow Design)
- Statically perform semantic and structural type checking
- Navigate errors and warnings within the workflow
- Search for and insert "adapters" to fix (structural and semantic) errors
56. Smart Linking (Data-Actor, Actor-Actor) (Source: Bowers-Ludaescher, DILS'04)
57. SEEK SMS Summary
- Employ semantic extensions (ontologies) for ...
- Smart Search (→ resource discovery)
- Smart Attach (→ data binding)
- Smart Integration (→ merge actor)
- Smart Links (→ actor composition)
58. Scientific Workflow Design Challenges
- "And that's why our scientific workflows are much easier to develop, understand, and maintain!"
59. A Simple Motivating Example
- Take the services (actors, components) in (a)
- and chain them together in a scientist-friendly form a la (b)
- considering the following signatures (cf. Haskell, ML, ...)
- (c) BLAST: DNA → {DNA}
- (d) MotifSearch: DNA → Motif
- (e) MotifSearch o BLAST = \x. MotifSearch(BLAST(x))
- oops: (e) is not type correct - note the signatures of (c) and (d)!
- a neat solution: implicit or explicit iteration / map(f)[x1, ..., xn]
- cf. Taverna and Kepler solutions
60. Extended Example: Workflow Evolution
- (a) → (b): replace A: a → b with A': a → {b}
- need to call B iteratively, i.e. wrap B inside a component or add control-flow
- (b) → (c): upstream produces {a, a', ...} instead of a, a', ...
- (d): need to bypass "d" data components since B can't handle d's
- This gets messy quickly ...
61. So how to get from messy to clean & reusable designs?
62. Answer: Collection-Oriented Modeling & Design
- Collection-Oriented Modeling & Design (COMAD)
- starting point: dataflow / actor-oriented modeling & design
- embrace the assembly-line metaphor fully
- → Flow-based Programming (J. Morrison)
- data: tagged nested collections
- e.g. represented as flattened, pipelined (XML) token streams
- → Multi-level Pipeline Parallelism!
63. How does COMAD work?
- Some COMAD principles
- data: tagged, flattened, nested collections (token streams)
- data tokens
- metadata tokens
- inherited downwards into (sub-)collections
- define an actor's read scope via an (X)Path-like expression
- default actor behavior
- not mine? → don't do anything, just "pass the buck"!
- stuff within my scope? → add-only to it (default)
- consume scope & write out result (but remember the bypass!)
- iteration scope is a query involving group-by and further refines the granularity/subtrees that constitute the tokens consumed by an actor firing
- has aspects of implicit iteration (a la Taverna)
- default iteration level to fix signature mismatches
- but also:
- granularity/grouping is definable
- works on "anything" (assuming scope is matched correctly)
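The "pass the buck" / add-only default can be modeled in a few lines. This is a toy encoding (tokens as (path, value) pairs with a string-prefix read scope), not Kepler's COMAD implementation:

```python
# A COMAD-style actor over a flattened token stream: every token is
# forwarded unchanged; tokens inside the actor's read scope additionally
# trigger an add-only insertion of a derived token.
def comad_actor(stream, scope, fn):
    for path, value in stream:
        yield (path, value)                      # pass the buck: never drop
        if path.startswith(scope):               # within my read scope?
            yield (path + "/derived", fn(value)) # add-only

stream = [("/proj/seqs/s1", "ACGT"), ("/proj/notes/n1", "field notes")]
out = list(comad_actor(stream, "/proj/seqs", lambda s: len(s)))
print(out)
# -> [('/proj/seqs/s1', 'ACGT'), ('/proj/seqs/s1/derived', 4),
#     ('/proj/notes/n1', 'field notes')]
```

Because the actor is a generator over the stream, several such actors can be chained and run concurrently over the same token stream, giving the pipeline parallelism the previous slide mentions.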
64. COMAD: What we gained
- from fragile, messy workflow designs
- to more reusable actors
- just change the scopes
- sometimes not even that is needed
- and cleaner workflow design
- Crux: keep the nesting structure of data (pass through, add-only)
- and let it drive the (semi-)implicit iteration
65. Provenance & Scientific Workflows
66. A Scientific Publication
- Title (statement, "theorem")
- Abstract (1st-level expansion)
- Main text (2nd-level expansion)
- ... plus some metadata: Nature 443, 167-172 (14 September 2006), doi:10.1038/nature05113; Received 27 June 2006, Accepted 25 July 2006, Published online 16 August 2006
67. More Evidence
- (annotated figure: data reference, type of evidence, tool reference, "trust me on this one")
- provenance/lineage show the history and evidence
- related to proof trees
- unlike w/ scripts, a SWF system can keep track of what happened
- In the future: deposit your data & workflows in a repository!?
68. Provenance for the WF Engineer / "Plumber"
- A Workflow Engineer's view
- Monitor, benchmark, and optimize workflow performance
- Record resource usage for a workflow execution
- Smart re-run of (variants of) previous executions
- Checkpointing & restart (e.g. for crash recovery, load balancing)
- Debug or troubleshoot a workflow run
- Explain when, where, why a workflow crashed
69. Provenance for Domain Scientists!
- Query the lineage of a data product
- from what data was this computed? (real dependencies, please!)
- Evaluate the results of a workflow
- do I like how this result was computed?
- Reuse data products of one workflow run in another
- (re-)attach prior data products to a new workflow
- Archive scientific results in a repository
- Replicate the results reported by another researcher
- Discover all results derived from a given dataset, i.e. across all runs
- Explain unexpected results
- via parameter-, dataset-, object-dependencies - in the scientist's terms (yes, you may think "ontology" here ...)
70. Observables
- Model of Computation (MoC) M
- specification/algorithm to compute o = M(W, P, i)
- a director or scheduler implements M
- gives rise to formal notions of computation (aka run) R - typically tree models
- Model of Provenance (MoP) M'
- approximation M' of M
- a trace T approximates a run R by inclusion/exclusion of observables
- T = R minus the ignored observables, i.e., R restricted to the model-observables
- Observables (of a MoC M)
- functional observables (may influence output o): token rate, notions of firing, ...
- non-functional observables (not part of M, do not influence o): token timestamp, size, ... (unless the MoC cares about those)
- What is a good model of provenance? What is a good provenance schema?
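The trace-as-projection idea can be made concrete: keep only the model-observables of each run event and drop the ignored ones. A toy encoding with invented event fields, just to illustrate the restriction:

```python
# A trace approximates a run by projecting each recorded event
# onto the model-observables of the chosen model of provenance.
def trace_of(run_events, model_observables):
    return [{k: v for k, v in ev.items() if k in model_observables}
            for ev in run_events]

run_events = [
    {"actor": "align", "token_rate": 1, "timestamp": 101.5, "size_kb": 12},
    {"actor": "infer", "token_rate": 2, "timestamp": 103.2, "size_kb": 30},
]
# timestamps and sizes are non-functional here -> ignored observables
trace = trace_of(run_events, {"actor", "token_rate"})
print(trace)
# -> [{'actor': 'align', 'token_rate': 1}, {'actor': 'infer', 'token_rate': 2}]
```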
71. Pipelined workflow for inferring phylogenetic trees
72. Scientific provenance questions we can ask about a run of this workflow
- What DNA sequences were input to the workflow (this run)?
- What phylogenetic trees were output by the workflow?
- What phylogenetic trees were created (intermediate or final) by the workflow?
- What actor created this phylogenetic tree?
- What sequences input to the workflow does this consensus tree depend on?
- What input sequences were not used to derive any output consensus trees?
- What was the sequence alignment (a key intermediate data product) used in the process of inferring this tree?
- Which actors were involved in creating this tree?
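Several of these questions reduce to transitive closure over a recorded dependency graph. A sketch with hypothetical trace data (the item names are invented, echoing the phylogenetics example):

```python
# Lineage query: which items does a data product (transitively) depend on?
def lineage(deps, item):
    seen, stack = set(), [item]
    while stack:
        for src in deps.get(stack.pop(), []):   # direct sources of this item
            if src not in seen:
                seen.add(src)
                stack.append(src)
    return seen

deps = {  # data item -> items it was directly derived from
    "consensus_tree": ["tree1", "tree2"],
    "tree1": ["alignment"],
    "tree2": ["alignment"],
    "alignment": ["seqA", "seqB", "seqC"],
}
print(sorted(lineage(deps, "consensus_tree")))
# -> ['alignment', 'seqA', 'seqB', 'seqC', 'tree1', 'tree2']
```

The "which inputs were *not* used" question is then just the set difference between all workflow inputs and the lineage of the outputs.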
73. Provenance in the COMAD Framework
- (comparison figure: without provenance vs. with provenance)
74. Workflow Design Paradigms
- Vanilla Process Network
- Functional Programming / Dataflow Network
- XML Transformation Network
- Collection-Oriented Modeling & Design framework
75. Big Picture Summary
- How did we get here?
- the universe started 13.7 Gyr (billion years) ago
- considerable chaos (physicists focusing on the very early universe)
- earth formed 4.5 Gyr ago; first oceans condensed 4.4 Gyr ago
- life arose around 3.8 Gyr ago (carbon isotope data provide evidence for CO2 fixation in sedimentary rocks at that time)
- but to make a rather long and evolved story short,
- a story-telling animal emerged, asking, mostly in its spare time, increasingly complex questions about the environment, itself, and its buddies, eventually even contemplating theories which considered the story-telling animals themselves to be some sort of adaptive, complex systems, maybe even just the survival machines or fruiting bodies of their own internal, somewhat selfish operating system codes (aka genes).
- This of course after having first invented the printing press (thanks, Mr. Gutenberg),
- and then the internet shortly thereafter (thanks, Al Gore ;-),
- thereby effectively wiring their brains together to do complex, distributed, concurrent computations to reveal more and increasingly complex answers to the same old questions.
- e-Science accelerates the production of scientific knowledge using IT
- basic nutrients: raw data, observations, measurements (labeled ordered trees / nested collections); processed foods: rules, equations, theories, ...
- data & information metabolism, i.e., a description of the knowledge discovery "plan": well-defined experiment protocols, aka scientific workflows (in silico experiments)
- pushing the limits of the latter is thus likely the next big jump in evolution.
76. Acknowledgements and Q&A
- NSF/ITR: Science Environment for Ecological Knowledge (SEEK)
- NSF/ITR: Geosciences Network (GEON)
- DOE/SciDAC: Scientific Data Management Center (SDM)
- Data and Knowledge Systems Lab
- Drs. Shawn Bowers, Timothy McPhillips, Norbert Podhorszki
- Dave Thau, Daniel Zinn, Alex Chen
- many Kepler collaborators
77. Contact, additional references, etc.
- LUDAESCH@UCDAVIS.EDU
- daks.ucdavis.edu
78. Some Related Publications
- Semantic Type Annotation
- S. Bowers, B. Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
- S. Bowers, B. Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. Intl. Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
- C. Berkley, S. Bowers, M. Jones, B. Ludaescher, M. Schildhauer, J. Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
- B. Ludaescher, K. Lin, S. Bowers, E. Jaeger-Frank, B. Brodaric, C. Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
- S. Bowers, D. Thau, R. Williams, B. Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
- S. Bowers, K. Lin, B. Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
- S. Bowers, B. Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. Intl. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
- S. Bowers, B. Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. Intl. Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
- Workflow Design and Modeling
- T. McPhillips, S. Bowers, B. Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), 2006, to appear.
- S. Bowers, T. McPhillips, B. Ludaescher, S. Cohen, S.B. Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. Intl. Provenance and Annotation Workshop (IPAW), LNCS, 2006.
- S. Bowers, B. Ludaescher, A.H.H. Ngu, T. Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
- S. Bowers, B. Ludaescher. Actor-Oriented Design of Scientific Workflows. Intl. Conference on Conceptual Modeling (ER), LNCS, 2005.
- T. McPhillips, S. Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
- Kepler
- D. Pennington, D. Higgins, A.T. Peterson, M. Jones, B. Ludaescher, S. Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer-Verlag, to appear.
- W. Michener, J. Beach, S. Bowers, L. Downey, M. Jones, B. Ludaescher, D. Pennington, A. Rajasekar, S. Romanello, M. Schildhauer, D. Vieglais, J. Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
- S. Romanello, W. Michener, J. Beach, M. Jones, B. Ludaescher, A. Rajasekar, M. Schildhauer, S. Bowers, D. Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.