Scientific Data

1
Scientific Data & Workflow Engineering: Preliminary
Notes from the Cyberinfrastructure Trenches
  • Bertram Ludäscher

Associate Professor, Dept. of Computer Science &
Genome Center, University of California, Davis
Fellow, San Diego Supercomputer Center, University
of California, San Diego
2
Outline
  • Introduction & CI Sample Architectures
  • Scientific Data Integration
  • Scientific Workflow Management
  • Links & Crystallization Points
  • Lessons learnt & Summary

3
Science Environment for Ecological Knowledge
(SEEK) Overview
  • Domain Science Driver
  • Ecology (LTER), biodiversity, ...
  • Analysis & Modeling System
  • Design & execution of ecological models &
    analysis (scientific workflows)
  • application, upper-ware
  • → Kepler system
  • Semantic Mediation System
  • Data Integration of hard-to-relate sources and
    processes
  • Semantic Types and Ontologies
  • upper middleware
  • → Sparrow Toolkit
  • EcoGrid
  • Access to ecology data and tools
  • middle, under-ware
  • → unified API to SRB/MCAT, MetaCat, DiGIR, ...
    datasets

sample CS problem: DILS'04
4
Common CI Infrastructure Pieces
  • Other CI-projects (e.g. GEON, ...) have similar
    service-oriented architectures
  • Seamless and uniform data access (Data-Grid)
  • data & metadata registry
  • distributed and high performance computing
    platform (Compute-Grid)
  • service registry
  • Federated, integrated, mediated databases
  • often use of semantic extensions (e.g.
    ontologies)
  • User-friendly workbench / problem-solving
    environment
  • → scientific workflows
  • add to this: sensors, observing systems

5
Example: Realtime Environment for Analytical
Processing (REAP vision)
6
The Great Unified System
  • Many engineering and CS challenges!
  • we'll see some
  • Our focus
  • Scientific data integration
  • How to associate, mediate, integrate complex
    scientific data?
  • Scientific workflows
  • How to devise larger scientific workflows for
    process automation from individual components
    (e.g. web services)?
  • Disclaimer
  • often only scratching the surface; see references
    and research literature for details

7
Outline
  • Introduction & CI Sample Architectures
  • Scientific Data Integration
  • Scientific Workflow Management
  • Links & Crystallization Points
  • Lessons learnt & Summary

8
An Online Shopper's Information Integration
Problem
El Cheapo: "Where can I get the cheapest copy
(including shipping cost) of Wittgenstein's
Tractatus Logico-Philosophicus within a week?"
One-World Mediation
Mediator (virtual DB) (vs. data warehouse). NOTE:
non-trivial data engineering challenges!
9
A Home Buyer's Information Integration Problem
"What houses for sale under 500k have at least 2
bathrooms, 2 bedrooms, a nearby school ranking
in the upper third, in a neighborhood with
below-average crime rate and diverse population?"
Multiple-Worlds Mediation
10
A Neuroscientist's Information Integration Problem
Biomedical Informatics Research
Network, http://nbirn.net
"What is the cerebellar distribution of rat
proteins with more than 70% homology with human
NCS-1? Any structure specificity? How about other
rodents?"
Complex Multiple-Worlds Mediation
  • Inter-source links
  • unclear for the non-scientists
  • hard for the scientist

12
Interoperability & Integration Challenges
  • System aspects: Grid Middleware
  • distributed data & computing, SOA
  • web services, WSDL/SOAP, WSRF, OGSA, ...
  • sources: functions, files, data sets
  • Syntax & Structure
  • (XML-Based) Data Mediators
  • wrapping, restructuring
  • (XML) queries and views
  • sources: (XML) databases
  • Semantics
  • Model-Based/Semantic Mediators
  • conceptual models and declarative views
  • Knowledge Representation: ontologies, description
    logics (RDF(S), OWL, ...)
  • sources: knowledge bases (DB+CMs+ICs)
  • Synthesis: Scientific Workflow Design & Execution
  • Composition of declarative and procedural
    components into larger workflows
  • (re)sources: services, processes, actors, ...
  • reconciling S5 heterogeneities
  • gluing together resources
  • bridging information and knowledge gaps
    computationally

13
Information Integration Challenges: S4
Heterogeneities
  • System aspects
  • platforms, devices, data & service distribution,
    APIs, protocols, ...
  • → Grid middleware technologies
  • e.g. single sign-on, platform independence,
    transparent use of remote resources, ...
  • Syntax & Structure
  • heterogeneous data formats (one for each tool
    ...)
  • heterogeneous data models (RDBs, ORDBs, OODBs,
    XMLDBs, flat files, ...)
  • heterogeneous schemas (one for each DB ...)
  • → Database mediation technologies
  • XML-based data exchange, integrated views,
    transparent query rewriting, ...
  • Semantics
  • descriptive metadata, different terminologies,
    hidden semantics (context), implicit
    assumptions, ...
  • → Knowledge representation & semantic mediation
    technologies
  • smart data discovery & integration
  • e.g. ask about X (mafic), find data about Y
    (diorite), be happy anyway!

14
Information Integration Challenges: S5
Heterogeneities
  • Synthesis of applications, analysis tools, data
    & query components, ... into scientific workflows
  • How to make use of these wonderful things & put
    them together to solve a scientist's problem?
  • → Scientific Problem Solving Environments (PSEs)
  • Portals, Workbench (scientist's view)
  • ontology-enhanced data registration, discovery,
    manipulation
  • creation and registration of new data products
    from existing ones, ...
  • Scientific Workflow System (engineer's view)
  • for designing, re-engineering, deploying
    analysis pipelines and scientific workflows: a
    tool to make new tools
  • e.g., creation of new datasets from existing
    ones, dataset registration, ...

15
Information Integration from a Database
Perspective
  • Information Integration Problem
  • Given: data sources S1, ..., Sk (databases, web
    sites, ...) and user questions Q1, ..., Qn that
    can in principle be answered using the
    information in the Si
  • Find: the answers to Q1, ..., Qn
  • The Database Perspective: source = database
  • Si has a schema (relational, XML, OO, ...)
  • Si can be queried
  • define virtual (or materialized) integrated (or
    global) view G over local sources S1, ..., Sk
    using database query languages (SQL, XQuery, ...)
  • questions become queries Qi against G(S1, ..., Sk)
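
A minimal sketch of the virtual-view idea, reusing the "El Cheapo" shopper question from the earlier slide. The two sources, their schemas, the prices, and the names wrap_s1, wrap_s2, G and cheapest are all hypothetical; a real mediator would push queries to remote sources rather than scan in-memory lists.

```python
from typing import Dict, Iterator, List

Row = Dict[str, object]

# Two hypothetical local sources S1, S2 with different schemas (made-up data).
S1: List[Row] = [
    {"title": "Tractatus", "price_usd": 12.0, "shipping_usd": 4.0},
    {"title": "Tractatus", "price_usd": 9.5, "shipping_usd": 8.0},
]
S2: List[Row] = [
    {"name": "Tractatus", "total": 15.0},
    {"name": "Other Book", "total": 5.0},
]


def wrap_s1(rows: List[Row]) -> Iterator[Row]:
    """Wrapper: translate S1 rows into the global schema (title, total_usd, source)."""
    for r in rows:
        yield {"title": r["title"],
               "total_usd": r["price_usd"] + r["shipping_usd"],
               "source": "S1"}


def wrap_s2(rows: List[Row]) -> Iterator[Row]:
    """Wrapper: translate S2 rows into the global schema."""
    for r in rows:
        yield {"title": r["name"], "total_usd": r["total"], "source": "S2"}


def G() -> Iterator[Row]:
    """Virtual global view G(S1, S2): recomputed from the sources on every call."""
    yield from wrap_s1(S1)
    yield from wrap_s2(S2)


def cheapest(title: str) -> Row:
    """A user question expressed as a query against G only."""
    return min((r for r in G() if r["title"] == title),
               key=lambda r: r["total_usd"])


best = cheapest("Tractatus")
print(best["source"], best["total_usd"])
```

The query never mentions S1 or S2; only the wrappers know the local schemas. A materialized view (data warehouse) would instead store the output of G() and refresh it periodically.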

16
Standard (XML-Based) Mediator Architecture
[Figure: USER/Client poses queries Q1, Q2, Q3 (step 3)
and receives answers(Q1), answers(Q2), answers(Q3)
(step 4)]
17
Query Planning in Data Integration
  • Given
  • Declarative user query Q: answer(...) ← G ...
  • ... G ← ... S ... global-as-view (GAV)
  • ... S ← ... G ... local-as-view (LAV)
  • ic(...) ← ... S ... G ... integrity constraints
    (ICs)
  • Find
  • equivalent (or minimal containing, maximal
    contained)
  • query plan Q': answer(...) ← S ...
  • → query rewriting (logical/calculus, algebraic,
    physical levels)
  • Results
  • A variety of results/algorithms depending on
    classes of queries, views, and ICs: P, NP, ...,
    undecidable
  • hot research area in core CS (database community)
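
For the GAV case, rewriting a query over G into a plan over S amounts to view unfolding. The sketch below is a toy under strong assumptions: one conjunctive definition per global relation, no unions, no integrity constraints. The relation names rock, s1_unit, s1_age and the helper unfold are hypothetical. LAV rewriting is considerably harder and is not shown.

```python
from itertools import count

# An atom is (predicate, (arg, ...)); args are variable names or quoted constants.
# Hypothetical GAV definition of a global relation as a join over sources:
#   rock(Name, Age) <- s1_unit(Name, Code), s1_age(Code, Age)
GAV = {
    "rock": (("Name", "Age"),
             [("s1_unit", ("Name", "Code")), ("s1_age", ("Code", "Age"))]),
}

_fresh = count()  # used to rename body-only (existential) variables apart


def unfold(query_body):
    """Replace every global atom in the query body by its GAV definition."""
    plan = []
    for pred, args in query_body:
        if pred not in GAV:
            plan.append((pred, args))  # already a source atom
            continue
        head_vars, body = GAV[pred]
        subst = dict(zip(head_vars, args))  # head variable -> query argument
        suffix = next(_fresh)
        for body_pred, body_args in body:
            plan.append((body_pred,
                         tuple(subst.get(a, f"{a}_{suffix}") for a in body_args)))
    return plan


# Q: answer(N) <- rock(N, 'Tertiary')
plan = unfold([("rock", ("N", "'Tertiary'"))])
print(plan)
```

The resulting plan mentions only source relations, so it can be handed to the wrappers for execution.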

18
Scientific Data Integration using Semantic
Extensions
20
Example: Geologic Map Integration
  • Given
  • Geologic maps from different state geological
    surveys (shapefiles w/ different data schemas)
  • Different ontologies
  • Geologic age ontology (e.g. USGS)
  • Rock classification ontologies
  • Multiple hierarchies (chemical, fabric, texture,
    genesis) from Geological Survey of Canada (GSC)
  • Single hierarchy from British Geological Survey
    (BGS)
  • Problem
  • Support uniform queries across all maps
  • using different ontologies
  • Support registration w/ ontology A, querying w/
    ontology B

21
Schema Integration (registering local schemas
to the global schema)
[Figure: local schemas of the state geologic maps
(Arizona, Idaho, Colorado, Utah, Nevada, Wyoming,
Montana West, New Mexico, Montana E.) use differing
column names such as ABBREV, PERIOD, FORMATION,
FMATN, AGE, NAME, LITHOLOGY, TYPE, TIME_UNIT; example
values shown: "Livingston formation",
"Tertiary-Cretaceous", "andesitic sandstone"]
22
Multihierarchical Rock Classification Ontology
(Taxonomies) for Thematic Queries (GSC)
Genesis
Fabric
Composition
Texture
23
Ontology-Enabled Application Example: Geologic
Map Integration
24
Querying by Geologic Age
25
Querying by Geologic Age: Results
26
Querying by Chemical Composition (GSC)
27
Semantic Mediation (via semantic registration
of schemas and ontology articulations)
  • Schema elements and/or data values are associated
    with concept expressions from the target ontology
  • → conceptual queries through the ontology
  • Articulation ontology
  • → source registration to A, querying through B
  • Semantic mediation: query rewriting w/ ontologies

[Figure: Database1 and Database2 are linked to
Ontology A and Ontology B by semantic registration;
ontology articulations connect A and B;
concept-based (semantic) queries are posed through
the ontologies]
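
A toy illustration of "register with ontology A, query with ontology B". Everything here is hypothetical: the concept names, the articulation table, the registered items, and the helpers is_a and query_b. Real articulations are description-logic axioms rather than a lookup table, and a reasoner would replace the parent-pointer walk.

```python
# Hypothetical ontology A as a child -> parent (subclass) table.
ONTOLOGY_A_SUBCLASS = {
    "a:basalt": "a:volcanic_rock",
    "a:andesite": "a:volcanic_rock",
    "a:granite": "a:plutonic_rock",
}

# Hypothetical articulation: ontology B concept -> equivalent ontology A concept.
ARTICULATION = {
    "b:extrusive": "a:volcanic_rock",
    "b:intrusive": "a:plutonic_rock",
}

# Semantic registration: (source, item) annotated with an ontology A concept.
REGISTRATION = {
    ("Database1", "unit-17"): "a:basalt",
    ("Database1", "unit-18"): "a:granite",
    ("Database2", "poly-3"): "a:andesite",
}


def is_a(concept, ancestor):
    """True if concept equals ancestor or is a (transitive) subclass of it in A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = ONTOLOGY_A_SUBCLASS.get(concept)
    return False


def query_b(concept_b):
    """Rewrite a B-concept query into A via the articulation, then match sources."""
    target = ARTICULATION[concept_b]
    return sorted(key for key, c in REGISTRATION.items() if is_a(c, target))


hits = query_b("b:extrusive")
print(hits)
```

Neither database was registered against ontology B, yet the B-level query finds items in both, which is the point of the articulation.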
28
Different views on State Geological Maps
29
Sedimentary Rocks: BGS Ontology
30
Sedimentary Rocks: GSC Ontology
31
Implementation in OWL: Not only for the machine

32
Source Contextualization through Ontology
Refinement
  • sources can register new concepts at the
    mediator ...

33
Outline
  • Introduction CI Sample Architectures
  • Scientific Data Integration
  • Scientific Workflow Management
  • Links Crystallization Points
  • Lessons learnt Summary

34
What is a Scientific Workflow (SWF)?
  • Goals
  • automate a scientist's repetitive data management
    and analysis tasks
  • typical phases
  • data access, scheduling, generation,
    transformation, aggregation, analysis,
    visualization
  • → design, test, share, deploy, execute, reuse,
    ... SWFs

35
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
36
Source: NIH BIRN (Jeffrey Grethe, UCSD)
37
Ecology: GARP Analysis Pipeline for Invasive
Species Prediction
Source: NSF SEEK (Deana Pennington et al., UNM)
39
Commercial & Open Source Scientific Workflow
(well, Dataflow) Systems
Kensington Discovery Edition from InforSense
Triana
Taverna
40
SCIRun: Problem Solving Environments for
Large-Scale Scientific Computing
  • SCIRun: PSE for interactive construction,
    debugging, and steering of large-scale scientific
    computations
  • Component model, based on generalized dataflow
    programming

Steve Parker (cs.utah.edu)
41
Ptolemy II
see!
read!
try!
Source: Edward Lee et al.,
http://ptolemy.eecs.berkeley.edu/ptolemyII/
42
Why Ptolemy II (and thus KEPLER)?
  • Ptolemy II Objective
  • "The focus is on assembly of concurrent
    components. The key underlying principle in the
    project is the use of well-defined models of
    computation that govern the interaction between
    components. A major problem area being addressed
    is the use of heterogeneous mixtures of models of
    computation."
  • Dataflow Process Networks w/ natural support for
    abstraction, pipelining (streaming),
    actor-orientation, actor reuse
  • User-Orientation
  • Workflow design & exec console (Vergil GUI)
  • Application/Glue-Ware
  • excellent modeling and design support
  • run-time support, monitoring, ...
  • not a middle-/underware (we use someone else's,
    e.g. Globus, SRB, ...)
  • but middle-/underware is conveniently accessible
    through actors!
  • PRAGMATICS
  • Ptolemy II is mature, continuously extended &
    improved, well-documented (500 pp)
  • open source system
  • Ptolemy II folks actively participate in KEPLER

43
KEPLER/CSP: Contributors, Sponsors, Projects (or
loosely coupled Communicating Sequential Persons
;-)
  • Ilkay Altintas SDM, Resurgence
  • Kim Baldridge Resurgence, NMI
  • Chad Berkley SEEK
  • Shawn Bowers SEEK
  • Terence Critchlow SDM
  • Tobin Fricke ROADNet
  • Jeffrey Grethe BIRN
  • Christopher H. Brooks Ptolemy II
  • Zhengang Cheng SDM
  • Dan Higgins SEEK
  • Efrat Jaeger GEON
  • Matt Jones SEEK
  • Werner Krebs, EOL
  • Edward A. Lee Ptolemy II
  • Kai Lin GEON
  • Bertram Ludaescher SEEK, GEON, SDM, ROADNet, BIRN
  • Mark Miller EOL
  • Steve Mock NMI
  • Steve Neuendorffer Ptolemy II

Ptolemy II
www.kepler-project.org
44
KEPLER: An Open Collaboration
  • Initiated by members from NSF SEEK and DOE
    SDM/SPA; now several other projects (GEON,
    Ptolemy II, EOL, Resurgence/NMI, ...)
  • Open Source (BSD-style license)
  • Intensive Communications
  • Web-archived mailing lists
  • IRC (!)
  • Co-development
  • via shared CVS repository
  • joining as a new co-developer (currently)
  • get a CVS account (read-only)
  • local development + contribution via existing
    KEPLER member
  • be voted in as a member/co-developer
  • Software & social engineering
  • How to better accommodate new groups/communities?
  • How to better accommodate different
    usage/contribution models (core dev ... special
    purpose extender ... user)?

45
Ptolemy II/KEPLER GUI (Vergil)
Directors define the component interaction &
execution semantics
Large, polymorphic component (Actors) and
Directors libraries (drag & drop)
46
KEPLER/Ptolemy II GUI refined
Ontology based actor (service) and dataset search
Result Display
47
Web Services → Actors (WS Harvester)
  • → Minute-made (MM) WS-based application
    integration
  • Similarly: MM workflow design & sharing w/o
    implemented components

48
Some Recent Actor Additions
49
An early example: Promoter Identification
(SSDBM, AD 2003)
  • Scientist models application as a workflow of
    connected components (actors)
  • If all components exist, the workflow can be
    automated/executed
  • Different directors can be used to pick
    appropriate execution model (often pipelined
    execution: PN director)
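
To give a feel for pipelined execution of connected actors, here is a small sketch using Python threads and FIFO queues. It is only loosely analogous to a process-network director and is not Ptolemy II or Kepler code: the actor helper and the DONE marker are hypothetical, and the two steps (str.strip, str.upper) are placeholders for real analysis components.

```python
import queue
import threading

DONE = object()  # end-of-stream marker passed down the pipeline


def actor(fn, inbox, outbox):
    """Start a thread that applies fn to each token from inbox and forwards it."""
    def run():
        while True:
            token = inbox.get()  # blocking read, as in a process network
            if token is DONE:
                outbox.put(DONE)
                return
            outbox.put(fn(token))
    thread = threading.Thread(target=run)
    thread.start()
    return thread


# Wire up a two-stage pipeline: q0 -> [strip] -> q1 -> [upper] -> q2
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [actor(str.strip, q0, q1), actor(str.upper, q1, q2)]

for gene_id in [" nm_001924 ", " nm020375 "]:
    q0.put(gene_id)
q0.put(DONE)

results = []
while True:
    token = q2.get()
    if token is DONE:
        break
    results.append(token)

for thread in threads:
    thread.join()
print(results)
```

Both stages run concurrently: the second actor can start on the first token while the first actor is still working on the next one. Swapping the scheduling policy without touching the actors is what the director abstraction is meant to allow.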

50
Reengineering a Geoscientist's Mineral
Classification Workflow
51
Job Management (here: NIMROD)
  • Job management infrastructure in place
  • Results database under development
  • Goal: 1000s of GAMESS jobs (quantum mechanics),
    Fall/Winter '04

53
Rapid Web Service-based Prototyping (Here:
ROADNet Command & Control Services for LOOKING
Kick-Off Mtg)
Source: Ilkay Altintas, SDM, NLADR; ROADNet:
Vernon, Orcutt et al.; Web services: Tony Fountain
et al.
54
in KEPLER (w/ editable script)
Source: Dan Higgins, Kepler/SEEK
55
in KEPLER (interactive session)
Source: Dan Higgins, Kepler/SEEK
56
Blurring Design (ToDo) and Execution
57
Scientific Workflow Challenges
  • Typical Features
  • data-intensive and/or compute-intensive
  • plumbing-intensive (consecutive web services
    won't fit)
  • dataflow-oriented
  • distributed (remote data, remote processing)
  • user-interaction in the middle, ...
  • ... vs. (C-z bg fg)-ing (detach and reconnect)
  • advanced programming constructs (map(f), zip,
    takewhile, ...)
  • logging, provenance, registering back
    (intermediate) products

58
designed to fit
hand-crafted control solution also forces
sequential execution!
designed to fit
Altintas-et-al-PIW-SSDBM03
hand-crafted Web-service actor
No data transformations available
Complex backward control-flow
59
A Scientific Workflow Problem: Solved (Computer
Scientist's view)
  • Solution based on declarative, functional
    dataflow process network
  • (also a data streaming model!)
  • Higher-order constructs: map(f)
  • no control-flow spaghetti
  • data-intensive apps
  • free concurrent execution
  • free type checking
  • automatic support to go from piw(GeneId) to
    PIW = map(piw) over a list of GeneIds

map(f)-style iterators
Powerful type checking
Generic, declarative programming constructs
Generic data transformation actors
Forward-only, abstractable sub-workflow
piw(GeneId)
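
The step from piw(GeneId) to PIW = map(piw) can be sketched in a few lines. Here piw is a hypothetical stand-in for the single-gene workflow with a placeholder computation; PIW and PIW_concurrent are illustrative names, not Kepler actors.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def piw(gene_id: str) -> Dict[str, object]:
    """Hypothetical stand-in for the single-gene workflow piw(GeneId)."""
    return {"gene": gene_id, "id_length": len(gene_id)}  # placeholder result


def PIW(gene_ids: List[str]) -> List[Dict[str, object]]:
    """PIW = map(piw): lift the single-gene workflow to a list of gene ids."""
    return list(map(piw, gene_ids))


def PIW_concurrent(gene_ids: List[str], workers: int = 4) -> List[Dict[str, object]]:
    """Same result, elements processed concurrently; Executor.map keeps input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(piw, gene_ids))


ids = ["NM_001924", "NM020375"]
sequential = PIW(ids)
concurrent = PIW_concurrent(ids)
print(sequential == concurrent)
```

Because map(f) has no dependencies between elements, the concurrent version needs no change to piw itself, which is the sense in which concurrency comes "for free".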
60
Promoter Identification Workflow Redesigned
map(GenbankWS): Input "NM_001924", "NM020375";
Output "CAGTAATATGAC", "GGGGACAAAGA"
61
A Research Problem: Optimization by Rewriting
  • Example: PIW as a declarative, referentially
    transparent functional process
  • optimization via functional rewriting possible
  • e.g. map(f o g) = map(f) o map(g)
  • Technical report: PIW specification in Haskell

map(f o g) instead of map(f) o map(g)
Combination of map and zip
http://kbis.sdsc.edu/SciDAC-SDM/scidac-tn-map-constructs.pdf
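
The rewrite rule map(f o g) = map(f) o map(g) can be checked on a toy example. This is a Python sketch, not the Haskell specification from the technical report; compose and fmap are hypothetical helpers, and the identity only holds when f and g are side-effect free (referentially transparent).

```python
from functools import reduce


def compose(*fs):
    """compose(f, g)(x) == f(g(x)); functions are applied right to left."""
    return reduce(lambda f, g: lambda x: f(g(x)), fs)


def fmap(f):
    """map(f) as a list-to-list function."""
    return lambda xs: [f(x) for x in xs]


g = str.strip   # placeholder for a first processing step
f = str.upper   # placeholder for a second processing step

two_pass = compose(fmap(f), fmap(g))  # map(f) o map(g): builds an intermediate list
one_pass = fmap(compose(f, g))        # map(f o g): one traversal, no intermediate list

data = [" acgt ", "ttga"]
print(two_pass(data), one_pass(data))
```

A workflow optimizer could apply this rule left to right to fuse two consecutive mapped actors into one, saving the intermediate data product.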
62
A KR+DI+Scientific Workflow Problem
  • Services can be semantically compatible, but
    structurally incompatible

[Figure: a source service with output port Ps, a
target service with input port Pt, and the desired
connection Ps → Pt. With respect to the ontologies
(OWL), SemanticType(Ps) and SemanticType(Pt) are
compatible, but StructuralType(Ps) and
StructuralType(Pt) are incompatible; a yet unknown
transformation ?(Ps) is needed.]
Source: Bowers-Ludaescher, DILS'04
63
Ontology-Informed Data Transformation
(Structure-Shim)
[Figure: source service (output port Ps), target
service (input port Pt), and the desired connection
Ps → Pt. SemanticType(Ps) and SemanticType(Pt) are
compatible w.r.t. the ontologies (OWL). Registration
mappings (output side for Ps, input side for Pt)
relate StructuralType(Ps) and StructuralType(Pt) to
the ontologies; the resulting correspondence is used
to generate the transformation ?(Ps).]
Source: Bowers-Ludaescher, DILS'04
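
A much simplified sketch of the structure-shim idea: two registration mappings into a shared set of concepts induce a field-level correspondence, from which a transformation is generated. This is not the DILS'04 framework itself. The field names, the concept names and generate_shim are hypothetical, and concepts are matched by exact equality where the real approach would use ontology subsumption.

```python
from typing import Callable, Dict

# Hypothetical registration mappings: structural field -> ontology concept.
SOURCE_OUTPUT_REG = {"sp": "Species", "lat": "Latitude", "lon": "Longitude"}
TARGET_INPUT_REG = {"taxon": "Species", "y": "Latitude", "x": "Longitude"}


def generate_shim(src_reg: Dict[str, str],
                  tgt_reg: Dict[str, str]) -> Callable[[dict], dict]:
    """Derive a record transformation from the two registration mappings."""
    concept_to_src = {concept: field for field, concept in src_reg.items()}
    missing = [c for c in tgt_reg.values() if c not in concept_to_src]
    if missing:
        raise ValueError(f"no source field registered for concepts: {missing}")
    # Correspondence: target field -> source field sharing the same concept.
    field_map = {tgt: concept_to_src[c] for tgt, c in tgt_reg.items()}

    def shim(record: dict) -> dict:
        return {tgt: record[src] for tgt, src in field_map.items()}

    return shim


shim = generate_shim(SOURCE_OUTPUT_REG, TARGET_INPUT_REG)
converted = shim({"sp": "Puma concolor", "lat": 34.4, "lon": -119.7})
print(converted)
```

The shim is generated once per connection and can then sit between the two services like any other actor.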
64
Outline
  • Introduction & CI Sample Architectures
  • Scientific Data Integration
  • Scientific Workflow Management
  • Links & Crystallization Points
  • Lessons learnt & Summary

65
Link-Up Crystallization Points
  • Shared (Domain) Science Vision, Goals
  • NVO, SCEC, Human Genome Project, ...
  • Technology Waves
  • XML, web services, WSRF, Semantic Web (OWL),
    Portlets, ...
  • Standards for data exchange, metadata, data
    access protocols, ...
  • GML, EML, netCDF, HDF, ..., ADN, ...,
    DODS/OpenDAP, ...
  • Organizations: W3C, GGF, ...
  • Community ontologies
  • GO (Gene Ontology), ecoinformatics, seismology,
    geochemistry, ...
  • from Saulus to Paulus
  • Shared Community Tools and Tool Co-Development
  • SRB, Globus, ..., Kepler, ...

66
Shared Science Vision, Goals: SCEC/CME
Southern California Earthquake Center / Community
Modeling Environment Project
  • Simulation of Seismic Wave Propagation of a
    Magnitude 7.7 Earthquake on San Andreas Fault
  • PIs: Thomas Jordan, Bernard Minster, Reagan
    Moore, Carl Kesselman
  • Simulation
  • 240 Processors for 5 days
  • 47 Terabytes of data generated
  • SDSC SAC project optimized code on DataStar
    parallel computer (both MPI I/O management and
    checkpointing)
  • Future simulation: increasing resolution by a
    factor of 2 implies 1 PB of simulation results,
    1000 processors for 20 days

Source: Reagan Moore, SDSC
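
One plausible reading of the jump from 47 TB to about 1 PB, worked through as arithmetic. The scaling law is an assumption and does not come from the slide: if doubling the resolution refines three spatial dimensions and the time step alike, cost grows by 2^4 = 16.

```python
# Figures quoted on the slide for the completed run.
base_output_tb = 47
base_processor_days = 240 * 5          # 240 processors for 5 days

# Assumption (not from the slide): a 3-D grid and the time axis are all
# refined by the same factor, so output and compute grow with factor ** 4.
refinement = 2
growth = refinement ** 4

projected_output_tb = base_output_tb * growth
projected_processor_days = base_processor_days * growth

# Figure quoted on the slide for the future run.
quoted_processor_days = 1000 * 20      # 1000 processors for 20 days

print(projected_output_tb, projected_processor_days, quoted_processor_days)
```

Under that assumption the projection gives 752 TB and 19,200 processor-days, the same order of magnitude as the quoted 1 PB and 20,000 processor-days.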
67
Example: NVO Community Processes
  • created standard data encoding format (FITS
    image format)
  • made accessible common digital holdings (sky
    survey images)
  • defined Uniform Content Descriptors (common
    metadata attributes)
  • created standard services (standard access
    mechanisms to catalogs and surveys)
  • created digital library (manage derived data
    products)
  • created portals (for combining services
    interactively)
  • created processing pipelines (for automated
    processing)
  • created preservation environment
  • Broader impact: found a new star!

Source: Reagan Moore, SDSC
68
Semantic Mediation Waterfall
Ontologies
Iterative Development
Semantic Data, Service Annotation
Resource Discovery
Resource Integration
Workflow Analysis
Workflow Planning
Source: Shawn Bowers, SEEK AHM'04
69
GEON Dataset Generation & Registration (a
co-development in KEPLER)
Makefile > ant run
SQL database access (JDBC)
Matt,Chad, Dan et al. (SEEK)
Efrat (GEON)
Ilkay (SDM)
Yang (Ptolemy)
Xiaowen (SDM)
Edward et al.(Ptolemy)
70
KEPLER as a Melting Pot
  • A grass-roots project
  • Needed a coalition of the (really!) willing
  • Inter-project links
  • SEEK ITR, GEON ITR, ROADNet ITRs, DOE SciDAC SDM,
    Ptolemy II, NIH BIRN (coming ...), UK eScience
    myGrid, ...
  • Intra-project links
  • e.g. in SEEK: AMS ↔ SMS ↔ EcoGrid
  • Inter-technology links
  • Globus, SRB, JDBC, web services, soaplab
    services, command line tools, R, GRASS, XSLT, ...
  • Interdisciplinary links
  • CS, IT, domain sciences, ... (recently: usability
    engineer)

71
Outline
  • Introduction & CI Sample Architectures
  • Scientific Data Integration
  • Scientific Workflow Management
  • Links & Crystallization Points
  • Lessons learnt & Summary

72
Some Lessons Learnt
  • Eat your own dog-food (or at least try)
  • start using your own (CI) tools early
  • Collaboration tools
  • CVS repositories (cvsview, webcvs)
  • Mailing lists (e.g. mailman → googlified)
  • Bugzilla (detailed tracking of tech. issues &
    bugs)
  • Wiki (community authored web resource, e.g.
    high-level tech. issues)
  • Where is the XYZ repository/registry?
  • EcoGrid (SEEK) registry, GEON registry, KEPLER
    actor & datasets repository, ...
  • UDDI what?
  • CI Melting Pots: SDSC, NCEAS, LTER, NLADR (w/
    NCSA), KU Specify, ...

73
Q & A
74
Further Reading
75
Related Publications
  • Semantic Data Registration and Integration
  • On Integrating Scientific Resources through
    Semantic Registration, S. Bowers, K. Lin, and B.
    Ludäscher, 16th International Conference on
    Scientific and Statistical Database Management
    (SSDBM'04), 21-23 June 2004, Santorini Island,
    Greece.
  • A System for Semantic Integration of Geologic
    Maps via Ontologies, K. Lin and B. Ludäscher. In
    Semantic Web Technologies for Searching and
    Retrieving Scientific Data (SCISW), Sanibel
    Island, Florida, 2003.
  • Towards a Generic Framework for Semantic
    Registration of Scientific Data, S. Bowers and B.
    Ludäscher. In Semantic Web Technologies for
    Searching and Retrieving Scientific Data (SCISW),
    Sanibel Island, Florida, 2003.
  • The Role of XML in Mediated Data Integration
    Systems with Examples from Geological (Map) Data
    Interoperability, B. Brodaric, B. Ludäscher, and
    K. Lin. In Geological Society of America (GSA)
    Annual Meeting, volume 35(6), November 2003.
  • Semantic Mediation Services in Geologic Data
    Integration: A Case Study from the GEON Grid, K.
    Lin, B. Ludäscher, B. Brodaric, D. Seber, C.
    Baru, and K. A. Sinha. In Geological Society of
    America (GSA) Annual Meeting, volume 35(6),
    November 2003.
  • Query Planning and Rewriting
  • Processing First-Order Queries under Limited
    Access Patterns, Alan Nash and B. Ludäscher,
    Proc. 23rd ACM Symposium on Principles of
    Database Systems (PODS'04), Paris, France, June
    2004.
  • Processing Unions of Conjunctive Queries with
    Negation under Limited Access Patterns, Alan Nash
    and B. Ludäscher, 9th Intl. Conference on
    Extending Database Technology (EDBT'04),
    Heraklion, Crete, Greece, March 2004, LNCS 2992.
  • Web Service Composition Through Declarative
    Queries: The Case of Conjunctive Queries with
    Union and Negation, B. Ludäscher and Alan Nash.
    Research abstract (poster), 20th Intl. Conference
    on Data Engineering (ICDE'04), Boston, IEEE
    Computer Society, April 2004.

76
Related Publications
  • Scientific Workflows
  • Kepler: An Extensible System for Design and
    Execution of Scientific Workflows, I. Altintas,
    C. Berkley, E. Jaeger, M. Jones, B. Ludäscher, S.
    Mock, 16th International Conference on Scientific
    and Statistical Database Management (SSDBM'04),
    21-23 June 2004, Santorini Island, Greece.
  • Kepler: Towards a Grid-Enabled System for
    Scientific Workflows, Ilkay Altintas, Chad
    Berkley, Efrat Jaeger, Matthew Jones, Bertram
    Ludäscher, Steve Mock, Workflow in Grid Systems
    (GGF10), Berlin, March 9th, 2004.
  • An Ontology-Driven Framework for Data
    Transformation in Scientific Workflows, S. Bowers
    and B. Ludäscher, Intl. Workshop on Data
    Integration in the Life Sciences (DILS'04), March
    25-26, 2004 Leipzig, Germany, LNCS 2994.
  • A Web Service Composition and Deployment
    Framework for Scientific Workflows, I. Altintas,
    E. Jaeger, K. Lin, B. Ludäscher, A. Memon. In
    the 2nd Intl. Conference on Web Services (ICWS),
    San Diego, California, July 2004.