Title: Databases
1 Databases & Information Systems: Projects & Research Overview
DBIS.ucdavis.edu
- Michael Gertz
- Bertram Ludäscher
- Dept. of Computer Science
- University of California, Davis
- {gertz,ludaesch}@ucdavis.edu
2 Databases and Information Systems (DBIS)
- DBIS.ucdavis.edu @ Dept. of Computer Science (CS)
- DAKS.ucdavis.edu (Data & Knowledge Systems) @ Genome Center (GC)
- Faculty
- Michael Gertz, Bertram Ludäscher
- Researchers
- Drs. Shawn Bowers (GC), Timothy McPhillips (GC), Norbert Podhorszki (CS)
- Current Students
- Omar Alonso, Michael Byrd, Conny Franke, Quinn Hart, Carlos Rueda, Dave Thau, Alex Chen
3 Projects & Research Areas
- Ongoing Collaborations
- NSF/ITR GeoStreams
- NSF/ITR GEON (Geosciences Network)
- NSF/ITR SEEK (Science Environment for Ecological Knowledge)
- DOE/SciDAC SDM (Scientific Data Management Center)
- DOE/CPES (Center for Plasma Edge Simulation)
- New Projects
- NSF/CEOP COMET (Coast-to-Mountain Environmental Transect)
- NSF/CEOP Kepler (Real-time Problem Solving Environment)
- NSF/AToL pPOD (Processing Phylogenetic Data)
- NSF/SEII ChIP-chip (Bioinformatics Workflows)
- Research Areas
- scientific data management, scientific workflows
- streaming data, geospatial data, data security
- knowledge representation, data integration
4 The Diversity & Unity of Science
Natural Sciences
Earth Sciences
Life Sciences
Physical Sciences
Observations, Measurements, Models, Simulations,
Analyses, Hypotheses Understanding, Prediction,
in vivo, in vitro, in situ, in silico,
Data-, Knowledge-, Workflow- Management is
central to most of them!
compute-intensive
structurally & semantics-intensive
data-intensive
metadata-intensive
5 Towards 2020 Science Report (MSR)
http://research.microsoft.com/towards2020science
- new development at the intersection of computer science and the sciences: a leap from the application of computing to support scientists to do science (i.e. computational science) to the integration of computer science concepts, tools and theorems into the very fabric of science. We believe this development represents the foundations of a new revolution in science
- we believe computer science is poised to become as fundamental to biology as mathematics has become to physics
- to understand cells and cellular systems requires viewing them as information processing systems, as evidenced by the fundamental similarity between molecular machines of the living cell and computational automata, and by the natural fit between computer process algebras and biological signalling and between computational logical circuits and regulatory systems in the cell
- We highlight that an immediate and important challenge is that of end-to-end scientific data management, from data acquisition and data integration, to data treatment, provenance and persistence.
- dramatic in its impact, will be the integration of new conceptual and technological tools from computer science into the sciences.
6 Types of Information Integration
- Conventional data integration
- schema-based
- view-based
- at the data-level
- Spatial (co-)registration/overlay of different data
- from 2D, 3D, 4D (x,y,z,t), (4+n)D → GIS
- Extended DI approaches using ontologies
- controlled vocabularies, metadata, annotations
- Scientific Information Integration
- data process/application integration
- Scientific Workflows
- can include all the others and
- statistics, data mining, visualization,
7 e-Science (UK) and Cyberinfrastructure (US)
- "e-Science is about global collaboration in key areas of science and the next generation of computing infrastructure that will enable it."
- Sir John Taylor, Director, Office of Science and Technology, UK
- "Cyberinfrastructure is the coordinated aggregate of software, hardware and other technologies, as well as human expertise, required to support current and future discoveries in science and engineering. The challenge of Cyberinfrastructure is to integrate relevant and often disparate resources to provide a useful, usable, and enabling framework for research and discovery characterized by broad access and 'end-to-end' coordination."
- Fran Berman, San Diego Supercomputer Center, UCSD
8 Integrated Cyberinfrastructure System: meeting the needs of multiple communities
Source: Dr. Deborah Crawford, Chair, NSF CyberInfrastructure Working Group
- Applications
- Environmental Science
- High Energy Physics
- Biomedical Informatics
- Geoscience
Development Tools & Libraries
Education and Training
Discovery & Innovation
Grid Services & Middleware
Hardware
9 Scientific Workflows Cyberinfrastructure
UPPER-WARE
10 Scientific Information Integration
- Conventional Data Integration
- syntactic and structural heterogeneities, schema mappings, schema matching, query rewriting (GAV, GLAV, ...), ...
- dealing with fundamentally the same kind of information
- that happens to be represented differently, incompletely, ...
- find the correct, best way to integrate different representations
- Scientific Information Integration (SII)
- has the traditional II as a small (but very important) piece
- but often deals with combining fundamentally different information
- not a single correct / best way to integrate
- invokes scientific theories or models that cannot be inferred from the data, schema, ontologies
- → joining of data, chaining of tools is in the scientist's head!
- → scientific workflows can provide the end-to-end framework
11 Information Integration: A Tree of Life (AToL)
- many AToL projects
- → need to integrate the integrators' (biologists') data
12 Src: Junhyong Kim, Department of Biology, Penn Center for Bioinformatics, U Penn
13 Inferring a phylogenetic tree from disparate data
Aligned DNA sequences
Maximum likelihood tree (DNA)
Discrete morphological data
Maximum parsimony tree
Integrate
Consensus Tree(s)
Maximum likelihood tree (continuous characters)
Continuous characters
Actors
Datasets
Datasets
14 Scientific Workflow
- Capture how a scientist works with data and analytical tools
- data access, transformation, analysis, visualization
- possible worldview: dataflow-oriented (cf. signal-processing)
- Scientific workflow (wf) benefits (compare w/ script-based approaches)
- wf automation
- wf component reuse
- wf design, documentation
- wf archival, sharing
- built-in concurrency (task-, pipeline-parallelism)
- built-in provenance support
- distributed execution (Grid) support
15 Kepler: Ecological Niche Modeling Pipeline
- Scientific Workflow paradigm
- Reusable components (actors): a scientist's verbs/actions
- Top-level workflows: conceptual representation of the science process, sentences in the scientist's language
- Sub-workflows: increasing levels of detail
- Separation of concerns
- actors: what to do
- parameters: configurable behavior
- channels: dataflow, pipeline composition
- directors: fix execution model, scheduling
- semantic types: smart discovery, linking
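The separation of concerns listed above can be illustrated with a minimal Python sketch. This is not Kepler's API (Kepler is a Java system); the class `Actor`, the function `run_pipeline`, and the toy actors `scale` and `clip` are hypothetical names chosen only for illustration. Actors say what to do, parameters configure them, channels are the iterables passed between them, and the "director" decides how the pipeline is executed.

```python
from typing import Callable, Iterable, Iterator, List


class Actor:
    """A reusable component: 'what to do', configured by parameters."""

    def __init__(self, name: str, fire: Callable[..., object], **params):
        self.name = name
        self.fire = fire        # the action applied to each token
        self.params = params    # configurable behavior

    def process(self, channel: Iterable[object]) -> Iterator[object]:
        # Consume tokens from the input channel and emit results downstream.
        for token in channel:
            yield self.fire(token, **self.params)


def run_pipeline(source: Iterable[object], actors: List[Actor]) -> List[object]:
    """A trivial 'director': fixes the execution model (here a lazy,
    pull-based pipeline) independently of what the actors compute."""
    channel: Iterable[object] = source
    for actor in actors:
        channel = actor.process(channel)   # channels compose the pipeline
    return list(channel)


# Hypothetical actors for a toy analysis step.
scale = Actor("scale", lambda x, factor: x * factor, factor=10)
clip = Actor("clip", lambda x, upper: min(x, upper), upper=25)

result = run_pipeline([1, 2, 3], [scale, clip])
print(result)
```

Swapping `run_pipeline` for a different scheduler (for example one that runs actors in threads) would change the execution model without touching the actors, which is the point of keeping directors separate.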
D Pennington, D Higgins, AT Peterson, M Jones, B
Ludaescher, S Bowers. Ecological Niche Modeling
using the Kepler Workflow System. Workflows for
e-Science, Springer.
16 Kepler and Sensor Networks
- New NSF CEOP projects
- Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System: NCEAS, SDSC, UC Davis, OSU, CENS (UCLA), OPeNDAP
- standardize services for sensor networks, support multiple views, protocols
- COMET (Coast-to-Mountain Environmental Transect): UC Davis, Bodega Marine Lab, Lake Tahoe Research Center
- study how environmental factors affect ecosystems along an elevation gradient from coastal California to the summit of the Sierra Nevada
CEOP/COMET
CEOP/Kepler
17 Simple Kepler workflow using R (a statistics package)
18 Plumbing with Style (Norbert Podhorszki, UC Davis; Scott Klasky, ORNL)
Monitor
- Plasma physics simulation on 2048 processors on Seaborg @ NERSC (LBL)
- Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
- Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30 hour simulation run
- Under workflow control
- Monitor (watch) simulation progress (via wf actors)
- Transfer from NERSC to ORNL concurrently with the simulation run
- Convert each file to an HDF5 file
- Archive files in 4GB chunks into HPSS
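A quick back-of-the-envelope check of the figures on this slide, as a sketch under stated assumptions: decimal units (1 GB = 1000 MB), each of the 3000 files holding one 267 MB output step, and the transfer having to keep pace with the 30 hour run. None of these assumptions is stated in the source.

```python
import math

FILES = 3000
MB_PER_FILE = 267      # assumption: one 267 MB output step per file
RUN_HOURS = 30
TOTAL_GB = 800         # total volume quoted on the slide
CHUNK_GB = 4           # archive chunk size quoted on the slide

# Volume implied by the file count (decimal units assumed).
total_from_files_gb = FILES * MB_PER_FILE / 1000

# Average rate the NERSC-to-ORNL transfer must sustain to keep up with the run.
sustained_mb_per_s = TOTAL_GB * 1000 / (RUN_HOURS * 3600)

# Number of 4 GB archive chunks if the data were packed without waste.
archive_chunks = math.ceil(TOTAL_GB / CHUNK_GB)

print(f"{total_from_files_gb:.0f} GB implied by file count")
print(f"{sustained_mb_per_s:.1f} MB/s sustained transfer rate")
print(f"{archive_chunks} archive chunks")
```

Under these assumptions the file count reproduces the quoted total of roughly 800GB, and the concurrent transfer only needs a modest average rate, which is consistent with running it alongside the simulation.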
19 Some Research Challenges
- Goal: helping scientists and workflow engineers in SII
- to optimize the human resource
- workflow modeling & design
- software engineering, query optimization, type inference
- rich provenance support
- data models, computation models, query languages
- use/exploit semantic information
- logic-based reasoning
- and to optimize system resources
- resource scheduling, distributed execution, ...
- cost models, scheduling, distributed computing
20 Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
Src: Kristian Stevens, ECS-289F, 2006
21 Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
22 Challenge: Modeling & Design Paradigms
- Vanilla Process Network
- Functional Programming Dataflow Network
- XML Transformation Network
- Collection-oriented Modeling & Design framework (COMAD)
"The limitations of my modeling language are the limitations of my design world." (BL)
23 CS Challenge: Hybrid (semantic & structural) Types
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
24 CS Challenge: Propagating Semantic Types
- Creating semantic annotations is difficult
- Potentially large numbers of derived data products
- Thousands of workflow components
- Getting it right can be difficult for the domain scientist
- → Annotation Propagation
[Figure: Forward Propagation: Automatically Derive Annotations]
[Figure: Backward Propagation: Automatically Derive Annotations]
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
25 CS Research Problems in Propagation
- Computing Forward and Backward Propagation
- Under different schema constraint languages
- What can and cannot be computed
- Approximate what cannot be computed
- Algorithms for propagation through a single actor
- Algorithms for propagation through an entire
workflow
join:
Biom1(ob, yr, seas, plt, spp, bm) :- Biom(ob, yr, seas, plt, spp, bm), Sscd(spp).
aggregation:
Biom2(yr, plt, spp, z = sum(bm)) :- Biom1(ob, yr, seas, plt, spp, bm).
union:
Biom3(yr, plt, spp, 1) :- Biom2(yr, plt, spp, bm), bm > 0.
Biom3(yr, plt, spp, 0) :- Biom2(yr, plt, spp, bm), bm < 0.
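The three kinds of rules above (join, aggregation, union) can be sketched over toy data in Python. This is one reading of the rules, not the authors' implementation: the aggregation is assumed to sum `bm` per (`yr`, `plt`, `spp`) group, and all tuples below are hypothetical values invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: (ob, yr, seas, plt, spp, bm).
# The negative biomass value is artificial; it only exercises the second Biom3 rule.
Biom = {
    (1, 2005, "spring", "p1", "sppA", 2.0),
    (2, 2005, "fall", "p1", "sppA", 3.5),
    (3, 2005, "spring", "p1", "sppB", 1.0),
    (4, 2005, "spring", "p2", "sppC", -1.0),
}
Sscd = {"sppA", "sppC"}   # species of interest

# Join: keep Biom tuples whose species also appears in Sscd.
Biom1 = {t for t in Biom if t[4] in Sscd}

# Aggregation (assumed grouping): total biomass per (yr, plt, spp).
totals = defaultdict(float)
for (_ob, yr, _seas, plt, spp, bm) in Biom1:
    totals[(yr, plt, spp)] += bm
Biom2 = {(yr, plt, spp, z) for (yr, plt, spp), z in totals.items()}

# Union of two rules: flag 1 for positive totals, flag 0 for negative totals.
Biom3 = (
    {(yr, plt, spp, 1) for (yr, plt, spp, bm) in Biom2 if bm > 0}
    | {(yr, plt, spp, 0) for (yr, plt, spp, bm) in Biom2 if bm < 0}
)

print(sorted(Biom2))
print(sorted(Biom3))
```

As written in the rules, a group whose total is exactly zero satisfies neither comparison and therefore produces no `Biom3` tuple.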
26 Example queries and annotations
S
R1(o, x, y, t, v)
?
R1, R2
S
Actor A
R2(u, p)
?o,x,y,v
?ud
S(o, x, y, v, u, p)
?tc
q ?o,x,y,v(?tc(R1)) ? ?ud(R2)
R2
R1
- Forward propagation
- ?1 R1(o, x, y, t, v) ? Observation(o) ?
hasVal(o, v) - ?2 R2(u, p) ? Site(u) ? Species(p) ?
observedIn(p, u) - ?? ?(q?) where ? ?1 ? ?2
- Backward propagation
- ?? S(o, x, y, v, u, p) ? Observation(o) ?
hasVal(o, v) ? Species(p) - ? ??(q)
S Bowers, B Ludaescher. A Calculus for
Propagating Semantic Annotations through
Scientific Workflow Queries. ICDE Workshop on
Query Languages and Query Processing (QLQP),
LNCS, 2006.
27 Results on S-T Finite Dependencies (Fagin et al.)
- Full dependencies Lfull (e.g., ?/??, ?, ?/??,
?) ?x ?(x) ? ?(x) - Embedded dependencies Lem (e.g., ??) ?x ?(x) ?
?y ?(x, y) - Skolemized dependencies LSko
- ?f ?x ?(x), ?(x) ? ?(x)
- Composition (we want L?(Lq?) ? L? )
- Lfull(Lfull) ? Lfull Lfull(Lem) ? Lfull
- Lem(Lfull) ? Lem Lem(Lem) ? Lem
- LSko(LSko) ? LSko
- In general, annotations take the form of
embedded (or Skolemized) s-t dependencies
28 A Scientific Publication (the final PROVENANCE frontier)
Title (Statement, Theorem)
Abstract (1st-Level Expansion)
Main Text (2nd-Level Expansion)
Nature 443, 167-172 (14 September 2006), doi:10.1038/nature05113. Received 27 June 2006; Accepted 25 July 2006; Published online 16 August 2006
some metadata
29 More Evidence
data reference
type of evidence
tool reference
trust me on this one
- provenance/data lineage: show the history and evidence
- related to proof trees
- unlike w/ scripts, SWF system can keep track of what happened
- In the future: deposit your data & workflows in a repository
30 SUMMARY (Part 1)
Data Integration
Knowledge Representation
Process Integration
31 Geospatial Data (everything happens somewhere, sometime)
- Spatial data
- Data with a spatial location in a given reference frame, which is a perspective of the viewer to describe physical quantities (e.g., position, velocity); a coordinate system is a way to describe physical quantities in a perspective.
- Geospatial data
- Data whose underlying reference frame is the Earth's surface
- concerns phenomena above, on and below the Earth's surface.
- Sources of (geo)spatial data
- Remotely-sensed data
- Aerial photography
- Digitized maps
- Field surveys
- Sensor networks
- Simulations
32 Geospatial Data (cont.)
- About 80% of all data have a spatial component!
- Interest in geospatial data and Geographic
Information Systems (GIS) is witnessing a
dramatic increase that goes beyond traditional
GIS uses.
- Further applications include
- Economic development
- Geo-marketing
- Mobile services
- Utility management
- Disaster management
- Transportation networks
- Biodiversity
- Climatology
- Earthquake monitoring
- ...
33 Remotely-Sensed Data
34 Remotely-Sensed Data
- Several hundred operational satellites are up there, streaming dozens of terabytes of geospatial data down to Earth per day.
- Example: Geostationary Operational Environmental Satellite (GOES)
- Data is obtained in row-scan order, from north to south and east to west
- Data collection is based on a routine schedule for 11 sectors
- 3,000x3,000 km region is scanned in about 3 minutes
- Downlink rate about 2.1Mbps (approximately 21GB/day)
135° W longitude
75° W longitude
Typical approach to data processing: file-based, i.e., store raster image data and derive some standard data products or let users upload data for further processing; only 6-12% of the data is actually used!
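The daily volume quoted for the GOES downlink follows directly from the stated rate. The short Python sketch below checks the arithmetic, assuming "Mbps" means 10^6 bits per second and a continuous 24 hour downlink; both are assumptions, not statements from the slide.

```python
DOWNLINK_MBPS = 2.1               # megabits per second, as quoted on the slide
SECONDS_PER_DAY = 24 * 60 * 60

bits_per_day = DOWNLINK_MBPS * 1_000_000 * SECONDS_PER_DAY
bytes_per_day = bits_per_day / 8

gb_decimal = bytes_per_day / 10**9   # 1 GB  = 10^9 bytes
gib_binary = bytes_per_day / 2**30   # 1 GiB = 2^30 bytes

print(f"{gb_decimal:.2f} GB/day (decimal), {gib_binary:.2f} GiB/day (binary)")
```

Under these assumptions the result is about 22.7 GB/day in decimal units, or about 21.1 GiB/day in binary units, so the slide's "approximately 21GB/day" matches the binary reading.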
35 The GeoStreams Project
Goal: Process (continuous) user queries over streams of raster images
36 GeoStreams: Research Problems
- Objectives
- (1) Develop stream processing framework for remotely-sensed imagery (RSI)
- (2) Support complex (non-standard DBMS) operations on various forms of streams of RSI
- (3) Exploit techniques and concepts developed for traditional stream systems
- (4) Use real data sets/streams and data product requirements
- Problems
- How does one model streams of raster image data?
- Image algebra, point sets (pixels), value sets (pixel values), ...
- What are operations on streams of raster image data?
- Standard operations
- Spatial, temporal, and value selections (e.g., give me the temperature values for the query box over Davis every day at 1pm)
- Value transforms (e.g., contrast/Gaussian stretch, histogram equalization)
- Spatial transforms (e.g., map re-projection, zooming)
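The three kinds of selection named above can be sketched as non-blocking operators over a stream of frames. This is an illustrative Python sketch, not the GeoStreams implementation: the `Frame` type, the operator names, the pixel-index query box, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Optional, Tuple


@dataclass
class Frame:
    hour: int                            # acquisition time (hour of day)
    grid: List[List[Optional[float]]]    # pixel values, row-major


def temporal_select(stream: Iterable[Frame], hour: int) -> Iterator[Frame]:
    """Keep only frames acquired at the given hour."""
    return (f for f in stream if f.hour == hour)


def spatial_select(stream: Iterable[Frame],
                   box: Tuple[int, int, int, int]) -> Iterator[Frame]:
    """Restrict each frame to rows r0..r1-1 and columns c0..c1-1."""
    r0, r1, c0, c1 = box
    for f in stream:
        yield Frame(f.hour, [row[c0:c1] for row in f.grid[r0:r1]])


def value_select(stream: Iterable[Frame], lo: float, hi: float) -> Iterator[Frame]:
    """Mask (set to None) every pixel whose value lies outside [lo, hi]."""
    for f in stream:
        yield Frame(f.hour, [[v if v is not None and lo <= v <= hi else None
                              for v in row] for row in f.grid])


def toy_stream() -> Iterator[Frame]:
    # Three hypothetical 3x3 frames acquired at hours 12, 13, and 14.
    for k, hour in enumerate((12, 13, 14)):
        yield Frame(hour, [[float(10 + 8 * k + 3 * r + c) for c in range(3)]
                           for r in range(3)])


# "Values in the query box at 1pm, keeping only readings between 20 and 30."
query = value_select(
    spatial_select(temporal_select(toy_stream(), 13), (0, 2, 1, 3)),
    20.0, 30.0,
)
result = list(query)
print(result)
```

Each operator handles one frame at a time and never buffers the stream, so the composed query can run continuously over an unbounded input.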
37 GeoStreams: Research Problems (cont.)
- How to compose streams G1 and G2?
- $G_1 \,\gamma\, G_2 = \{(x,\, G_1(x) \,\gamma\, G_2(x)) : x \in X\}$, $\gamma \in \{+, -, \times, /\}$
- How to build complex queries? Example:
- G1: NIR (near-infrared), G2: VIS (visible)
- $G = \big((f_{val} \circ ((G_1 - G_2) / (G_2 + G_1))) \circ f_{UTM}\big)|_R$
- What about spatio-temporal aggregates?
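The pointwise composition of two streams, and the NIR/VIS example above, can be sketched in Python as follows. The sketch covers only the normalized-difference part, (G1 - G2) / (G1 + G2), and omits the value transform, the UTM re-projection, and the restriction to a region. The function names, the zero-denominator convention, and the pixel values are assumptions made for illustration.

```python
from typing import Callable, Iterable, Iterator, List

Grid = List[List[float]]


def compose(g1: Iterable[Grid], g2: Iterable[Grid],
            op: Callable[[float, float], float]) -> Iterator[Grid]:
    """Pointwise composition: apply op to co-located pixels of paired frames."""
    for a, b in zip(g1, g2):          # frames are assumed to arrive in lockstep
        yield [[op(x, y) for x, y in zip(row_a, row_b)]
               for row_a, row_b in zip(a, b)]


def ndvi(nir: float, vis: float) -> float:
    """(NIR - VIS) / (NIR + VIS); returns 0.0 where the denominator is zero
    (a convention assumed here, not taken from the slides)."""
    total = nir + vis
    return (nir - vis) / total if total != 0 else 0.0


# One hypothetical 2x2 frame per stream.
nir_stream = iter([[[0.75, 0.5], [0.0, 1.0]]])
vis_stream = iter([[[0.25, 0.5], [0.0, 0.0]]])

ndvi_frames = list(compose(nir_stream, vis_stream, ndvi))
print(ndvi_frames)
```

Because `compose` pairs frames with `zip`, it assumes both streams share the same grid and timing; real streams would first need the spatial and temporal alignment that the slides list as open problems.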
38 GeoStreams: Research Problems (cont.)
- How to implement individual operators (blocking vs. non-blocking)?
- What other operators are of practical relevance?
- For example, change detection, combination of stream data with persistent (static) geospatial data, ...
- Optimization issues
- Minimal algebraic query optimization framework (e.g., push down of spatial selection over spatial transform)
- Multiple-query optimization, i.e., how to share data and operators among multiple (continuous) queries?
- How to design a scalable distributed query processing framework?
- cluster computing → operator scheduling techniques
- Web services → each service provides some data products
For more information about this NSF ITR funded project, visit http://geostreams.ucdavis.edu
39 Beyond Streaming Data: The COMET Project
- COMET: Coast-to-Mountain Environmental Transect
- Funded in October 2006 through NSF Cyberinfrastructure for Earth Observatories Program (CEOP) at a level of $2.1M over three years.
- Participants
- M. Gertz (PI), B. Ludäscher (Computer Science)
- G. Schladow (Director, Tahoe Environmental Research Center)
- S. Williams (Director, Bodega Bay Marine Lab), I. Faloona, J. Largier
- S. Ustin (Director, CalSpace, Cstars), Q. Hart
- K.T. Paw U, Shu-Hua Chen (Atmospheric Sciences & Climatology)
- Objective
Develop a practical cyberinfrastructure (CI)
prototype to facilitate the study of the way in
which multiple environmental factors, including
climatic variability, affect major ecosystems
along an elevation gradient from coastal
California to the summit of the Sierra Nevada.
This CI will be based around the integration of
access to distributed and varied data collections
and sensor data streams, registration of data,
models and analysis tools, semantically-aware
data query mechanisms, and an orchestration
system for advanced scientific workflows. Access
to this CI will be provided through a Web-based
portal.
40 Beyond Streaming Data: The COMET Project
- What transect, what data?
Ecological data, NEXRAD (Doppler radar), NOAA
AVHRR, CalTrans sensor data, ....
41 The COMET Project: Vision
CISAME: CyberInfrastructure System for Data Assimilation and Model Management for the Environment
42 The COMET Project: Research Problems
- There are many non-CS questions regarding climate variability, impact on ecosystems, El Niño, coastal marine communities, changes in upwelling strength, carbon flux, particle fluxes and depositions, ...
- Scientific data management issues
- Make all data and data products readily accessible to users and applications at the necessary spatial and temporal resolution
- Provide semantic data registration for streaming sensor data, satellite imagery, various forms of geospatial data, and so on
- Provide data and standard data products in different data formats
- Spatially and temporally synchronize data
- Fully integrate complex climate models and applications and make them readily accessible to many users/scientists through the Portal
- Requires modeling of complex models as scientific workflows that ingest and produce diverse types of data
- Such workflows need to be optimized and coordinated (multiple-workflow optimization)
43 The COMET Project: Research Problems
- A simple example: The Weather Research and Forecasting Model
44 Security and Privacy of Geospatial Data
- Project recently started with UT Dallas and Purdue University
- Objectives
- Improve the security of geospatial data repositories that are managed by different state, county, and municipal organizations and accessed through GIS and applications.
- Introduce and advance concepts, techniques, and architectures for security models and policies for geospatial data, including topographic/thematic maps and aerial/satellite imagery.
- Modular and compositional security policies as a comprehensive framework to model and reason about multi-granular, context-driven, dynamic, and location-aware security and privacy requirements of GIS repositories and applications.
- Trust and integrity management models and techniques for geospatial data
45 S&P of Geospatial Data (cont.)
- What is the problem?
- Numerous government, county, and municipal organizations manage thematic and topographical maps in support of disaster and emergency management, homeland security, and environmental crises; they provide geospatial data for various features of U.S. locations and facilities at very fine-grained levels of detail
- GIS repositories and GIS Web services have no mechanisms for securing geospatial data
- Overlay of GIS layers may reveal sensitive information (inference problem for geospatial data)
- What is an appropriate policy specification framework that combines field-based and feature-based data, active policies (e.g., obfuscation of objects), event-based policies, context-based policies (e.g., location awareness)?
- How to combine developments with OGC standards and technologies?
46 S&P of Geospatial Data (cont.)
- Framework and architecture
DATA PRESENTATION LAYER
Traditional GIS
Open Geospatial Consortium Framework: Core Application Schemas, Geospatial Features, Geography Markup Language, Metadata
GIS Web Services
Wrapper
SECURITY LAYER
Trust & Privacy Management
Policy Specifications
Access Control Mechanisms
Authentic Data Publication
Policy Reasoning Engine
DATA INTEROPERATION & ACCESS LAYER
GIS Interoperation Services
GIS Data Repository Access
GIS Data Repositories
47 S&P of Geospatial Data (cont.)
- Question: What are privacy threats in the context of geospatial data, in particular satellite imagery?
48 Teaching
- We offer several classes related to our research areas and project activities
- ECS 165A (Database Systems)
- ECS 165B (Advanced Database Systems)
- ECS 166 (Scientific Data Management)
- ECS 265 (Distributed Database Systems)
- ECS 289F (Spatial Databases)
- ECS 289F (Topics in Scientific Data Management)
- ECS 289A/F (Logics and Knowledge Representation)
- DBIS Seminar (Fridays 1-2:30pm)
49 Q&A
DBIS.ucdavis.edu
DAKS.ucdavis.edu
kepler-project.org