Title: CSE
1 CSE & e-Science
- Bertram Ludäscher
- Dept. of Computer Science
- Genome Center
- University of California, Davis
- ludaesch_at_ucdavis.edu
2 Computational Science & Engineering
- Traditional view:
- computational physics, computational chemistry, big simulations, Teraflops, Petabytes, ...
- Yes, but wait: there is more
- Emergence of e-Science (UK, Europe), cyberinfrastructure (NSF), and NIH programs
- To illustrate this a bit more ...
3 Science has been changing lately
- THEN: "All science is either physics or stamp collecting." (Ernest Rutherford, British chemist and physicist, 1871-1937; quoted in J. B. Birks, "Rutherford at Manchester", 1962)
- i.e., from few data and lots of thinking, to ...
- NOW: lots of data and analysis
- Hence: data-driven scientific discovery!
4 The Diversity & Unity of Science
Natural Sciences
Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations, Analyses, Hypotheses, Understanding, Prediction
in vivo, in vitro, in situ, in silico
Data, Knowledge, and Workflow Management is central to most of them!
compute-intensive
structurally and semantics-intensive
data-intensive
metadata-intensive
5 e-Science (UK) and Cyberinfrastructure (US)
- "e-Science is about global collaboration in key areas of science and the next generation of computing infrastructure that will enable it." (Sir John Taylor, Director, Office of Science and Technology, UK)
- "Cyberinfrastructure is the coordinated aggregate of software, hardware and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering. The challenge of Cyberinfrastructure is to integrate relevant and often disparate resources to provide a useful, usable, and enabling framework for research and discovery characterized by broad access and 'end-to-end' coordination." (Fran Berman, San Diego Supercomputer Center, UCSD)
6 Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
- ... new development at the intersection of computer science and the sciences: a leap from the application of computing to support scientists to do science (i.e. computational science) to the integration of computer science concepts, tools and theorems into the very fabric of science. We believe this development represents the foundations of a new revolution in science
- ... we believe computer science is poised to become as fundamental to biology as mathematics has become to physics
- ... to understand cells and cellular systems requires viewing them as information processing systems, as evidenced by the fundamental similarity between molecular machines of the living cell and computational automata, and by the natural fit between computer process algebras and biological signalling and between computational logical circuits and regulatory systems in the cell
- We highlight that an immediate and important challenge is that of end-to-end scientific data management, from data acquisition and data integration, to data treatment, provenance and persistence.
- ... dramatic in its impact, will be the integration of new conceptual and technological tools from computer science into the sciences.
7 Example: Assembling the Tree of Life (AToL)
All organisms (alive or extinct) are part of one large, genetically connected group: Life on Earth.
Major subgroups: Eubacteria, Archaea, and Eukaryotes, further divided into hierarchically nested subgroups, e.g.:
eukaryotes contains plants, animals, fungi;
animals contains sponges, cnidarians, Bilateria;
Bilateria contains arthropods, molluscs, nematodes, etc.
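The hierarchical nesting described on this slide can be pictured as a tree data structure. The following is only a minimal sketch: the dict-of-dicts layout and the helper name `lineage` are assumptions made for illustration, not part of AToL or any AToL software, and only the groups named on the slide are included.

```python
# Illustrative only: the groups named on the slide as a nested dict.
# An empty dict marks a group whose subgroups the slide does not list.
TREE_OF_LIFE = {
    "Life on Earth": {
        "Eubacteria": {},
        "Archaea": {},
        "Eukaryotes": {
            "plants": {},
            "animals": {
                "sponges": {},
                "cnidarians": {},
                "Bilateria": {
                    "arthropods": {},
                    "molluscs": {},
                    "nematodes": {},
                },
            },
            "fungi": {},
        },
    }
}


def lineage(tree, target, path=()):
    """Return the chain of nested groups from the root to `target`, or None."""
    for name, children in tree.items():
        here = path + (name,)
        if name == target:
            return list(here)
        found = lineage(children, target, here)
        if found is not None:
            return found
    return None


print(" > ".join(lineage(TREE_OF_LIFE, "molluscs")))
```

Asking for the lineage of a group walks the nesting from the root, which is the "contains" relation of the slide read top-down.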
8 Inferring a phylogenetic tree from disparate data
[Workflow figure; legend: Actors, Datasets]
Aligned DNA sequences -> Maximum likelihood tree (DNA)
Discrete morphological data -> Maximum parsimony tree
Continuous characters -> Maximum likelihood tree (continuous characters)
Integrate -> Consensus Tree(s)
NSF Collaborative Research (w/ UPenn): Core Database Technologies to Enable the Integration of AToL Information, $462,000 (2006-2009)
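The slide does not say how the separately inferred trees are integrated into consensus tree(s). As a hedged illustration, the sketch below uses majority-rule consensus, one common choice: keep every clade that appears in more than half of the input trees. The taxa `A` to `D`, the three toy trees, and the function names are invented for this example and do not come from the project.

```python
from collections import Counter


def clades(tree):
    """Collect the leaf set under every internal node of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):          # a leaf is a taxon name
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    leaves(tree)
    return found


def majority_rule_clades(trees):
    """Return the clades present in more than half of the given trees."""
    counts = Counter(c for t in trees for c in clades(t))
    return {c for c, n in counts.items() if n > len(trees) / 2}


# Toy stand-ins for the three inferred trees (invented data).
ml_dna = ((("A", "B"), "C"), "D")
parsimony = (("A", "B"), ("C", "D"))
ml_continuous = ((("A", "C"), "B"), "D")

consensus = majority_rule_clades([ml_dna, parsimony, ml_continuous])
for clade in sorted(consensus, key=len):
    print(sorted(clade))
```

In a workflow setting each inference method would be an actor producing a tree dataset, and a final actor would perform the integration step.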
9 A Real-World Example (ChIP-chip workflow)
collaboration with UCD Genome Center
NSF/SEI(BIO)II: A Collaborative Scientific Workflow Environment for Accelerating Genome-Scale Biological Research, $600,139 (2006-2009)
10 DOE/SciDAC-2 SDM CPES Fusion Simulation (Norbert Podhorszki, UC Davis; Scott Klasky, ORNL)
Monitor
- Plasma physics simulation on 2048 processors on Seaborg at NERSC (LBL)
- Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
- Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30 hour simulation run
- Under workflow control:
- Monitor (watch) simulation progress (via remote scripts)
- Transfer from NERSC to ORNL concurrently with the simulation run
- Convert each file to HDF5 file
- Archive files to 4GB chunks into HPSS
DOE/SciDAC-2 Scientific Data Management Center (w/ LBL), $965,000 (2006-2011)
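The archiving step above groups output files into chunks of at most 4GB. A minimal sketch of such grouping follows; it is not the actual Kepler workflow or its remote scripts. It assumes 3000 files of roughly 267 MB each (3000 x 267 MB is about 801 GB, which matches the 800GB on the slide), reads "4GB" as 4 x 1024 MB, and uses the hypothetical helper name `pack_into_chunks`.

```python
def pack_into_chunks(file_sizes_mb, limit_mb=4 * 1024):
    """Greedily group consecutive files so each chunk stays within limit_mb.

    Returns a list of chunks, each a list of file indices in arrival order.
    """
    chunks, current, used = [], [], 0
    for index, size in enumerate(file_sizes_mb):
        # Close the current chunk if this file would push it over the limit.
        if current and used + size > limit_mb:
            chunks.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        chunks.append(current)
    return chunks


sizes = [267] * 3000              # assumption: 3000 files of ~267 MB each
chunks = pack_into_chunks(sizes)
total_gb = sum(sizes) / 1000
print(len(chunks), "chunks,", total_gb, "GB in total")
```

Because files arrive while the simulation is still running, a greedy in-order grouping like this can archive a chunk as soon as it fills, without waiting for the full run to finish.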
11 Kepler and Sensor Networks
- Just in (new NSF CEOP projects):
- Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
- standardize services for sensor networks, support multiple views, protocols
- COMET, Coast-to-Mountain Environmental Transect: UC Davis, Bodega Marine Lab, Lake Tahoe Research Center
- study how environmental factors affect ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada
CEOP: COMET, Coast-to-Mountain Environmental Transect, $2,158,580 (2006-2009)
CEOP: Management and Analysis of Environmental Observatory Data Using the Kepler Scientific Workflow System, $290,000 (collaborative $2.9M) (2006-2010)
CEOP/REAP
12 Scientific Workflows Cyberinfrastructure
UPPER-WARE
Upperware
Upper Middleware
Middleware
Underware
NSF ITR (w/ SDSC): Science Environment for Ecological Knowledge (SEEK), $2,485,683 (2002-2007)
13 Consilience: The Unity of Knowledge (E. O. Wilson)
- "Literally a jumping together of knowledge by the linking of facts and fact-based theory across disciplines to create a common groundwork for explanation." (E. O. Wilson)
- eScience, Cyberinfrastructure: mechanisms to make progress
- Scientific Workflows: crucial elements to get the most mileage out of CI to fuel eScience, accelerating knowledge discovery
- CSE needs computer scientists, domain scientists, hybrids (e.g. bioinformaticians, computational/simulation scientists)
14 Some Related Publications
- Semantic Type Annotation:
- S Bowers, B Ludaescher. A Calculus for Propagating Semantic Annotations through Scientific Workflow Queries. ICDE Workshop on Query Languages and Query Processing (QLQP), LNCS, 2006.
- S Bowers, B Ludaescher. Towards Automatic Generation of Semantic Types in Scientific Workflows. International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), WISE 2005 Workshop Proceedings, LNCS, 2005.
- C Berkley, S Bowers, M Jones, B Ludaescher, M Schildhauer, J Tao. Incorporating Semantics in Scientific Workflow Authoring. SSDBM, 2005.
- B Ludaescher, K Lin, S Bowers, E Jaeger-Frank, B Brodaric, C Baru. Managing Scientific Data: From Data Integration to Scientific Workflows. GSA Today, Special Issue on Geoinformatics, 2006.
- S Bowers, D Thau, R Williams, B Ludaescher. Data Procurement for Enabling Scientific Workflows: On Exploring Inter-Ant Parasitism. VLDB Workshop on Semantic Web and Databases (SWDB), 2004.
- S Bowers, K Lin, B Ludaescher. On Integrating Scientific Resources through Semantic Registration. SSDBM, 2004.
- S Bowers, B Ludaescher. An Ontology-Driven Framework for Data Transformation in Scientific Workflows. International Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2004.
- S Bowers, B Ludaescher. Towards a Generic Framework for Semantic Registration of Scientific Data. International Semantic Web Conference Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
- Workflow Design and Modeling:
- T McPhillips, S Bowers, B Ludaescher. Collection-Oriented Scientific Workflows for Integrating and Analyzing Biological Data. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2006.
- S Bowers, T McPhillips, B Ludaescher, S Cohen, SB Davidson. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows. International Provenance and Annotation Workshop (IPAW), LNCS, 2006.
- S Bowers, B Ludaescher, AHH Ngu, T Critchlow. Enabling Scientific Workflow Reuse through Structured Composition of Dataflow and Control-Flow. IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.
- S Bowers, B Ludaescher. Actor-Oriented Design of Scientific Workflows. International Conference on Conceptual Modeling (ER), LNCS, 2005.
- T McPhillips, S Bowers. Pipelining Nested Data Collections in Scientific Workflows. SIGMOD Record, 2005.
- Kepler:
- D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer-Verlag, to appear.
- W Michener, J Beach, S Bowers, L Downey, M Jones, B Ludaescher, D Pennington, A Rajasekar, S Romanello, M Schildhauer, D Vieglais, J Zhang. SEEK: Data Integration and Workflow Solutions for Ecology. Workshop on Data Integration in the Life Sciences (DILS), LNCS, 2005.
- S Romanello, W Michener, J Beach, M Jones, B Ludaescher, A Rajasekar, M Schildhauer, S Bowers, D Pennington. Creating and Providing Data Management Services for the Biological and Ecological Sciences: Science Environment for Ecological Knowledge. SSDBM, 2005.
15 Kepler Collaboration
- Open-source
- Builds on Ptolemy II from UC Berkeley
- Contributors from
- SEEK
- SciDAC SDM
- Ptolemy
- GEON
- ROADNet
- Resurgence
- AToL CIPRES, POD
- ...
- Goals:
- Create powerful analytical tools that are useful across disciplines
- Ecology, Biology, Engineering, Geology, Physics, Chemistry, Astronomy, ...
Ptolemy II
Natural Diversity Discovery Project
16 Databases & Information Systems (DBIS)
DBIS.ucdavis.edu
DAKS.ucdavis.edu
- Profs. Michael Gertz, Bertram Ludaescher
- Drs. Shawn Bowers, Timothy McPhillips, Norbert Podhorszki
- 12 graduate students