Title: Distributed Software Systems: Cyberinfrastructure and Geoinformatics Chaitan Baru
1 Distributed Software Systems: Cyberinfrastructure and Geoinformatics
Chaitan Baru
- San Diego Supercomputer Center
2 Integrated Cyberinfrastructure System
Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee
- Applications
- Geosciences
- Environmental Sciences
- Neurosciences
- High Energy Physics
-
Development Tools and Libraries
Education and Training
Discovery and Innovation
Middleware Services
Hardware
3 Community Cyberinfrastructure Projects
Friendly Work-Facilitating Portals: Authentication - Authorization - Auditing - Workflows - Visualization - Analysis
Adapted from Prof. Mark Ellisman, UC San Diego
Development Tools and Libraries
Ecological Observatories (NEON)
Biomedical Informatics (BIRN)
High Energy Physics (GriPhyN)
Ocean Observing (ORION)
Geosciences (GEON)
Earthquake Engineering (NEES)
Middleware Services
Hardware
Distributed Computing, Instruments and Data
Resources
4 Data, Tools, Computation
- Data
- Field observations
- Laboratory analyses
- Sensor-based data (land, airborne, satellite)
- Tools
- QA/QC, simple transformations and analyses
- Complex models
- Computation
- Community codes
- Access to high-performance computing
- Data Intensive Computing
5 Variety of Geoinformatics Efforts
- Data collection
- Digital data collection in the field
- When does it become cyberinfrastructure?
- Database curation
- E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc.
- When does it become tools and community codes?
- Software Development
- Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, ...
- Community codes: SCEC-CME, CIG, ...
6 Variety of Geoinformatics Efforts
- High Performance Computing
- LiDAR data management
- Seismic analyses
- Petascale initiative
- Data Integration
- E.g. CUAHSI HIS
- Also, a pressing need in projects like EarthScope
7 Cyberinfrastructure: The Common Platform Across Distributed Projects
Cyberinfrastructure: To provide access to all of these resources and support interoperability among them
Data Management And Curation
Modeling and Integration
Data Collection
Tool Development
8 Example: USArray Data Flow
- Deploy field sensor arrays
- Across US
- Collect data from sensor arrays and perform QA/QC
- One of the sites is SIO, San Diego
- Archive data for community access
- IRIS, Seattle
EarthScope/USArray: Single project, multiple participants.
9 Example: LiDAR Workflow
Courtesy Chris Crosby, ASU
Survey
Point Cloud (x, y, z, ...)
Interpolate / Grid
Analyze / Do Science
D. Harding, NASA
Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, ...
10 GEON Cyberinfrastructure
- Funded by NSF IT Research program
- Multi-institution collaboration between IT and Earth Science researchers
- GEON Cyberinfrastructure provides
- Authenticated access to data and Web services
- Registration of data sets, tools, and services with metadata
- Search for data, tools, and services, using ontologies
- Scientific workflow environment and access to HPC
- Data and map integration capability
- Scientific data visualization and GIS mapping
11 Key Informatics Areas
- Portals
- Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, ...
- Data Integration
- Search, discovery and integration of data from heterogeneous information sources (mediation and semantic integration)
- Use of workflow systems, and access to HPC
- Ability to program at a higher level of abstraction
- Sharing of models, along with provenance information
- Gateways to HPC environments
- Management of Geospatial Information
- Using GIS capabilities, map services, geospatial data integration
- Visualization of 3D, 4D geospatial data and information
12 Distributed System: Definition
- A Distributed System is
- one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet
- A Distributed Database System is
- one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently
13 Distributed System Models
14 Remote Service Invocation
- TCP/IP
- Basic Internet protocol for computer communications
- Platform for building a number of other open or proprietary, higher-level communications protocols
- Communication at a higher level of abstraction
- HTTP
- Open protocol based on TCP/IP for the Web
- Fixed set of verbs (actions) used to transfer HTML documents
- CORBA, Java RMI
- Protocols based on an object model
15 SDSC Storage Resource Broker: Virtualizing storage
User
Resource, Mthd, User
Metadata Extraction
User Defined
Remote Proxies
MCAT
Dublin Core
DataCutter
Application Meta-data
http://www.sdsc.edu/srb
16 SRB Client/Server Model
Data are requested using an SRB ID and a file
abstraction (open, close, read, write)
SRB Client
Network
SRB Server
17 OPeNDAP
18 OPeNDAP
From Peter Cornillon and Jim Gallagher: http://www.opendap.org/support/stennis_tutorial.html
19 OPeNDAP Data Request
- Data are requested with a URL.
- http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst
- URL parts: Protocol, Machine name, OPeNDAP server, Directory, File name
- User can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL
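A minimal sketch of how a constraint expression might be appended to a dataset URL. It assumes DAP2-style conventions (a response suffix such as `.ascii` or `.dds`, followed by `?variable[start:stop]` index ranges). The host, path, variable name `sst`, and the helper `build_dap_url` are hypothetical and used only for illustration.

```python
def build_dap_url(dataset_url, variable=None, index_ranges=(), response="ascii"):
    """Append a response suffix and an optional constraint expression."""
    url = f"{dataset_url}.{response}"
    if variable is None:
        return url  # no constraint: request the whole response
    # Each (start, stop) pair becomes one [start:stop] index range per dimension.
    hyperslab = "".join(f"[{start}:{stop}]" for start, stop in index_ranges)
    return f"{url}?{variable}{hyperslab}"


# Hypothetical dataset URL (not a real server).
base = "http://example.org/opendap/datasets/sst.nc"

# Ask only for a sub-array of the (assumed) variable "sst".
subset_url = build_dap_url(base, "sst", [(0, 0), (10, 20), (30, 40)])
print(subset_url)
```

The point of the constraint is that subsetting happens at the server, so only the requested slice crosses the network.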
20 Remote Service Invocation with Web Services
- A Web Service is a simple protocol for invoking remote services on the Web. It is
- A network endpoint, i.e. server, that implements one or more ports.
- Each port is defined by the message types it accepts and the messages it returns.
- Specified by a Web Services Description Language (WSDL) XML document.
- Given the WSDL for a web service you know all you need to interact with it.
- Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow.
- It is the basis for MS .NET, IBM WebSphere, Sun, Oracle, BEA, HP, ...
- It is the basis for the new Grid standards like WSRF and OGSA.
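To make "exchanging messages with a port" concrete, here is a rough sketch of building and reading a SOAP 1.1 request envelope with the Python standard library. The service namespace `urn:example:rocks`, the operation `GetRockTypes`, and its `Period` parameter are invented for illustration. In practice the message shapes would come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_soap_request(service_ns, operation, params):
    """Wrap one operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_soap_body(xml_text):
    """Return (operation name, {parameter: text}) from a SOAP envelope."""
    root = ET.fromstring(xml_text)
    op = root.find(f"{{{SOAP_NS}}}Body")[0]
    local = lambda tag: tag.split("}", 1)[-1]  # drop the {namespace} prefix
    return local(op.tag), {local(child.tag): child.text for child in op}


# Hypothetical service namespace, operation, and parameter.
request_xml = build_soap_request("urn:example:rocks", "GetRockTypes", {"Period": "Jurassic"})
operation, args = parse_soap_body(request_xml)
print(operation, args)
```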
21 Web Site vs Web Service
From: Building Grid Applications and Portals: An Approach Based on Components, Web Services and Workflow Tools, Gannon et al., Euro-Par 2004
- Web Site
- Designed to pass HTTP get/post/put requests between a browser and a web server.
- Google has a web site.
- Web Service
- Designed for services to talk to other services by exchanging XML messages
- Google also provides a web service so Google may be used in distributed apps
Web Server
Web Service
Web Service
Web Service
22 Grid Services
From: Building Grid Applications and Portals: An Approach Based on Components, Web Services and Workflow Tools, Gannon et al., Euro-Par 2004
- Grid: A distributed, heterogeneous set of resources
- Integrated by a pervasive layer of services
- Goal: allow users to view it as a single system
- More than the Internet (which forms part of the resource layer)
- Builds on the Web by building on web services
Web Services Resource Framework, Web Services Notification
Physical Resource Layer
23 Access Interfaces and Levels of Access
- Web service, native application program
interface, ODBC/JDBC, filesystem
Application can also be wrapped as a Web Service
SOAP server stack
WSDL and SOAP
SRB, OpenDAP, etc
Web Server stack
URLs and http
Application Program
DBMS
Expose ODBC/JDBC interface (and full SQL)
filesystem
Mount remote filesystems
24 Authentication
Network
Server 1
Client A
Client-side authentication
Server-side authentication
?
?
25 Common Authentication
Certificate Authority
Obtain Credentials
Verify Credentials
Client
Server 1
Server 2
Server 3
Invoke with Credentials
26 Grid Account Management Architecture (GAMA)
Single sign-on in GEON (also used in a number of other projects)
Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra
gama
create user
DB
gridportlets
GridSphere
import user
OGSA Grid services wrapper
retrieve credential
Servlet container
Java keystore
Portal server 1
retrieve credential
Portal server 2
Servlet container
Java keystore
GAMA server
Stand-alone applications
27 Systems Issues
- Load Balancing, Failover, Replication
Server 1
Multiple servers for load balancing, failover
Server 2
Client
Server 3
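One way a client could spread requests over replicated servers and fail over when one is down is sketched below. This is an illustrative round-robin scheme under simple assumptions, not a description of any particular GEON component. `RoundRobinClient`, `make_server`, and `ServerDown` are hypothetical names, and the servers are plain functions standing in for remote calls.

```python
import itertools


class ServerDown(Exception):
    """Raised by a (simulated) server that cannot answer."""


class RoundRobinClient:
    def __init__(self, servers):
        self.servers = list(servers)
        self._next = itertools.cycle(range(len(self.servers)))

    def call(self, request):
        # Load balancing: start from the next server in the rotation.
        start = next(self._next)
        # Failover: try each replica at most once before giving up.
        for offset in range(len(self.servers)):
            server = self.servers[(start + offset) % len(self.servers)]
            try:
                return server(request)
            except ServerDown:
                continue
        raise ServerDown("all replicas failed")


def make_server(name, alive=True):
    """Build a stand-in for a remote server."""
    def handle(request):
        if not alive:
            raise ServerDown(name)
        return f"{name} handled {request}"
    return handle


client = RoundRobinClient([
    make_server("server1"),
    make_server("server2", alive=False),  # simulated outage
    make_server("server3"),
])
results = [client.call(f"req{i}") for i in range(3)]
print(results)
```

Failover like this only gives correct answers if the replicas hold the same data, which is why replication appears alongside load balancing and failover on the slide.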
28 Distributed Data Access
- What is the issue?
- Ability to access data stored in multiple, different databases using a single request, e.g.
- Get geologic information from multiple geologic databases
- Get employee information from all branches
- Ability to update data stored in multiple databases, e.g.
- Transfer salary amount from University to my bank account
- Transfer funds from Visa account to vendor's account
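The read side of this problem (one request, many databases) can be sketched with three in-memory SQLite databases standing in for separate geologic databases. The table name `samples`, its columns, and the sample rows are invented for illustration. The update side (transfers across databases) needs coordinated commits and is not shown.

```python
import sqlite3


def make_geology_db(rows):
    """Create one stand-in database with a hypothetical 'samples' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE samples (sample_id TEXT, rock_type TEXT)")
    conn.executemany("INSERT INTO samples VALUES (?, ?)", rows)
    return conn


def query_all(connections, sql, params=()):
    """Send the same query to every database and merge the rows."""
    merged = []
    for name, conn in connections.items():
        for row in conn.execute(sql, params):
            merged.append((name,) + tuple(row))  # remember which source answered
    return merged


databases = {
    "db1": make_geology_db([("A1", "granite"), ("A2", "basalt")]),
    "db2": make_geology_db([("B1", "granite")]),
    "db3": make_geology_db([("C1", "shale")]),
}

# A single request from the client's point of view.
granites = query_all(
    databases, "SELECT sample_id FROM samples WHERE rock_type = ?", ("granite",)
)
print(granites)
```

This only works unchanged because all three databases share one schema. The heterogeneous case is what motivates warehousing and mediation.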
29 Distributed data access
Sources may be data repositories or metadata catalogs
Client
How about creating a cached local copy?
Homogeneous: MySQL, MySQL, MySQL
Heterogeneous: MySQL, Oracle, DB2
Database 1
Database 2
Database 3
30 Data Warehousing
But, warehouse data could be stale, i.e. out of
synch with source data
Client
Data Warehouse (common schema)
1. Load data from sources to warehouse
Data Source 1
Data Source 2
Data Source 3
31 Data integration via middleware
Client
Data integration Middleware (aka Mediator)
Database 1
Database 2
Database 3
32 Warehousing vs Mediation
- Warehousing: Use ETL to massage local data to fit into a common global, warehouse schema
- Mediation: Modify user query to match schemas exported by each source
- But, which schema does the user query?
- The Integrated View Schema
- Sources export a view (the export schema)
- Federated databases
- Local sources belong to different administrative domains, i.e. different owners.
- Local autonomy
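The contrast can be sketched in a few lines: the warehouse copies data into the common schema ahead of time, while the mediator rewrites the query against each source when it is asked. The two sources, their field names (`id`, `lithology`), and the `MAPPINGS` table are hypothetical. Only the view field names `SampleID` and `Rock_Type` follow the later slides.

```python
# Two hypothetical sources with different local schemas.
source1 = [{"SampleID": "A1", "Rock_Type": "granite"}]
source2 = [{"id": "B1", "lithology": "basalt"}]
SOURCES = {"source1": source1, "source2": source2}

# Integrated view field -> local field name, per source (assumed mappings).
MAPPINGS = {
    "source1": {"SampleID": "SampleID", "Rock_Type": "Rock_Type"},
    "source2": {"SampleID": "id", "Rock_Type": "lithology"},
}


def to_view(name, record):
    """Translate one local record into the integrated view schema."""
    return {field: record[local] for field, local in MAPPINGS[name].items()}


def load_warehouse():
    """Warehousing: ETL every source into the common schema ahead of time."""
    return [to_view(name, rec) for name, recs in SOURCES.items() for rec in recs]


def mediate(rock_type):
    """Mediation: rewrite the view-level predicate for each source at query time."""
    results = []
    for name, recs in SOURCES.items():
        local_field = MAPPINGS[name]["Rock_Type"]
        results += [to_view(name, r) for r in recs if r[local_field] == rock_type]
    return results


warehouse = load_warehouse()
source2.append({"id": "B2", "lithology": "basalt"})  # source changes after the load

stale = [r for r in warehouse if r["Rock_Type"] == "basalt"]  # misses B2
fresh = mediate("basalt")                                     # sees B1 and B2
print(len(stale), len(fresh))
```

The warehouse answer is fast but can be stale. The mediator answer is current but depends on every source being reachable at query time.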
33 The Canonical Mediator / Wrapper Architecture
Client Application
Wrapper processes could execute at sources, at
mediator, or elsewhere
Q1
Export view in mediator data model
Local view in local data model
34 Example: A Relational Mediator
Client Application
Mediator (Relational data model)
Wrapper
Wrapper
Shape file
Relational DBMS e.g. PostGIS
35 Example: A Shape-file Based Mediator
Client Application
Mediator (Shape file-based data model)
Wrapper
Wrapper
Shape file
Relational DBMS e.g. PostGIS
36 Example: An XML Mediator
User / Applications
Mediator (XML-based data model, e.g. GML)
Wrapper
Wrapper
Wrapper
Shape file
XML file e.g. ArcXML
Relational DBMS e.g. PostGIS
37 User Authentication and Access Control
How about using GAMA for authentication?
1. User authenticates to system
Client Application
2. User connects to mediator (passes credentials
to mediator)
Mediator
3. Mediator connects to sources
- Using original user credentials
- Or, mapped credentials (role-based access)
Wrapper
Wrapper
4. Need to define users or roles in sources
Data source 1
Data source 2
38 Different types of heterogeneity in data integration
- Platform heterogeneity: different OS platforms
- DBMS heterogeneity: different database systems, e.g. SQL Server, MySQL, DB2
- Data type heterogeneity
- Schema heterogeneity
- Heterogeneity in units, accuracy, resolution
- Semantic heterogeneity
39 Schema Integration
- A long-standing Computer Science problem
- Simple case
- Mediator View
- (SampleID varchar, Rock_Type varchar, Age int)
- In Source2 Table, map Age to int
Wrapper
Source 1 Table
Source 2 Table
Wrapper: converts between int and varchar for Age
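A small sketch of the simple case: the wrapper for Source 2 exposes its varchar Age as an int so both sources fit the mediator view, and converts constants back when a predicate is pushed down. The function names and sample rows are hypothetical.

```python
def wrap_source2_row(row):
    """Source 2 -> mediator: Age arrives as varchar and is exposed as int."""
    sample_id, rock_type, age_text = row
    return (sample_id, rock_type, int(age_text.strip()))


def unwrap_age_constant(age):
    """Mediator -> Source 2: an int constant is sent back as varchar."""
    return str(age)


source1_rows = [("S1-001", "granite", 150)]     # Age already stored as int
source2_rows = [("S2-001", "basalt", " 150 ")]  # Age stored as varchar

# Mediator view: (SampleID varchar, Rock_Type varchar, Age int)
mediator_view = source1_rows + [wrap_source2_row(r) for r in source2_rows]
print(mediator_view)
```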
40 Another integration scenario
Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar)
Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar)
Example Age value in Source 2 Table: Phanerozoic/mesozoicjur
- Mediator View
- (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar)
- In Source 2 Table, parse Age to obtain sub-components of the field
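A sketch of the parsing step, assuming (this is an assumption, not something the slide states) that the Age field concatenates eon, era, and period with a "/" separator and spells the names in full. Abbreviated or run-together values would need an extra lookup table that is not shown here. `parse_age` is a hypothetical helper.

```python
def parse_age(age_text, sep="/"):
    """Split a combined Age string into Eon, Era, and Period components."""
    parts = [p.strip().capitalize() for p in age_text.split(sep)]
    parts += [None] * (3 - len(parts))  # pad when finer levels are missing
    eon, era, period = parts[:3]
    return {"Eon": eon, "Era": era, "Period": period}


parsed = parse_age("Phanerozoic/mesozoic/jurassic")
partial = parse_age("Phanerozoic/mesozoic")
print(parsed)
print(partial)
```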
41 A more advanced integration scenario
Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar)
Example numeric Age value in Source 2 Table: 150
- Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar)
- Same as Source1 table schema
- Query: Get rock types for all rocks from the Jurassic period
42 Doing the integration
- Query sent to mediator
- SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period = 'Jurassic'
- Query to Source 1
- SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period = 'Jurassic'
- For Source2, need to map Period = 'Jurassic' to Age values
43 Query fragment sent to Source 2
- SELECT DISTINCT (S2.Rock_Type)
- FROM
- Source2_Table S2,
- Geologic_Time_Table GT
- WHERE
- GT.Period = 'Jurassic' AND
- (S2.Age > GT.Min) AND
- (S2.Age < GT.Max)
Where is the Geologic_Time_Table stored?
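The query fragment can be tried out with SQLite from Python. This sketch assumes the Geologic_Time_Table lives in the same database as Source2_Table, which is only one possible answer to the question on the slide. The sample rows and the age bounds (approximate, in millions of years) are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Source2_Table (SampleID TEXT, Rock_Type TEXT, Age REAL);
    CREATE TABLE Geologic_Time_Table (Period TEXT, Min REAL, Max REAL);
    -- Illustrative rows; Age and the period bounds are in millions of years.
    INSERT INTO Source2_Table VALUES ('S2-001', 'sandstone', 150),
                                     ('S2-002', 'limestone', 160),
                                     ('S2-003', 'granite', 300);
    INSERT INTO Geologic_Time_Table VALUES ('Jurassic', 145, 201),
                                           ('Triassic', 201, 252);
""")

# Same shape as the query fragment above, with the period passed as a parameter.
rows = conn.execute("""
    SELECT DISTINCT S2.Rock_Type
    FROM Source2_Table S2, Geologic_Time_Table GT
    WHERE GT.Period = ? AND S2.Age > GT.Min AND S2.Age < GT.Max
    ORDER BY S2.Rock_Type
""", ("Jurassic",)).fetchall()

jurassic_rock_types = [r[0] for r in rows]
print(jurassic_rock_types)
```

If the time table were kept at the mediator instead, the mediator would look up the bounds first and send Source 2 a plain range predicate on Age.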
44 Data Integration Carts
- Integrating data sets without explicitly creating views
- An example request
- Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region
- Use GEONsearch to find all gravity and geologic data using bounding box for Rocky Mountain testbed region
- Need gazetteer / spatial ontology to determine Rocky Mountain region
- Need to know classification of datasets (as gravity and geology)
- Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region
- Plot gravity point data that fall within polygons of rocks of given type
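The two spatial steps in this request (intersecting dataset extents with the region, then keeping points that fall inside rock polygons) can be sketched without a GIS library. All coordinates, dataset names, and the region bounding box below are made up for illustration. A real system would take them from the metadata catalog and a gazetteer.

```python
def bbox_intersects(a, b):
    """Boxes are (min_x, min_y, max_x, max_y); True if they overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(x, y, polygon):
    """Ray casting: count crossings of a ray going in +x from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical region extent and dataset extents (lon/lat degrees).
region = (-110, 35, -104, 42)
dataset_extents = {
    "gravity_a": (-109, 36, -105, 41),
    "gravity_far": (-90, 30, -85, 35),
}
selected = [name for name, ext in dataset_extents.items() if bbox_intersects(ext, region)]

# Hypothetical polygon for rocks of the requested type, and gravity points.
rock_polygon = [(-108, 38), (-106, 38), (-106, 40), (-108, 40)]
gravity_points = [(-107, 39), (-105, 39), (-107, 41)]
to_plot = [p for p in gravity_points if point_in_polygon(p[0], p[1], rock_polygon)]

print(selected, to_plot)
```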
45 Ad hoc integration
Search Metadata Catalog: Geologic and gravity data in Rocky Mountains
GEONsearch
Data Integration Cart
46 Data Registration
Item Registration (Schema registration)
Item Detail Registration
48 Another complex query
- Query: Get rock types for all rocks from the Mesozoic era
- Easy to do for Source 1: Era = 'Mesozoic'
- For Source 2
- Need to find numeric age range for Mesozoic
- Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic)
- Select all Source 2 Table records whose age falls within the Mesozoic age range
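A sketch of the Source 2 steps, assuming a small ontology fragment that lists the subclasses of Mesozoic and a table of period age ranges. The age bounds are approximate values in millions of years, and the dictionaries, the helper `age_range`, and the sample rows are illustrative rather than taken from GEON.

```python
# Approximate period bounds in millions of years (illustrative values).
PERIOD_RANGES = {
    "Triassic": (201, 252),
    "Jurassic": (145, 201),
    "Cretaceous": (66, 145),
}
# Hypothetical ontology fragment: era -> its subclasses (periods).
SUBCLASSES = {"Mesozoic": ["Triassic", "Jurassic", "Cretaceous"]}


def age_range(term):
    """Return (min, max) for a period, or the span of all subclasses of an era."""
    if term in PERIOD_RANGES:
        return PERIOD_RANGES[term]
    ranges = [age_range(child) for child in SUBCLASSES[term]]
    return (min(lo for lo, _ in ranges), max(hi for _, hi in ranges))


# Hypothetical Source 2 rows: (SampleID, Rock_Type, numeric Age).
source2 = [
    ("S2-001", "sandstone", 150),
    ("S2-002", "chalk", 70),
    ("S2-003", "granite", 300),
]

lo, hi = age_range("Mesozoic")
mesozoic_rock_types = sorted({rock for _, rock, age in source2 if lo < age < hi})
print((lo, hi), mesozoic_rock_types)
```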