Title: Data Management Practices and Challenges in Geosciences Today
1. Data Management Practices and Challenges in Geosciences Today
- Chaitan Baru
- San Diego Supercomputer Center (SDSC)
- California Institute for Telecommunications and Information Technology (Calit2)
2. Data Management
- DATA COLLECTION
- DATA PUBLICATION
- DATA ACCESS
3. Geosciences Data Management
- Atmospheric Sciences
  - Meteorological data provides community focus: real-time as well as archived data
  - Common field of interest (e.g., weather); continental scale
- Ocean Sciences
  - Ship cruises and real-time data from moorings; increasingly, integration with more diverse data (including biological)
  - Field of interest is regional (e.g., extent of cruises)
- Earth Sciences
  - Broad range of data types: sensor data (e.g., seismic, GPS), field data collections (e.g., geologic data), remote sensing (e.g., LIDAR), analytical data (e.g., geochemistry, geochronology)
  - Broad coverage: from a study area within a small region (e.g., a watershed) to continental and tectonic settings
- Also, managing model outputs
  - Need to manipulate and visualize very large outputs from models
4. GEON: A Platform for Data Integration (Example: GEONsearch)
www.geongrid.org
5. GEONsearch and GEONworkbench
[Architecture diagram: search conditions (spatial, temporal, concept) run through GEONsearch against the GEON Catalog, supported by a log, a gazetteer, geologic-age services, and Web services over information/indexes extracted from the GEON Datasets. A sketch of such a query follows.]
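To make the three search conditions concrete, here is a minimal sketch of what a client-side catalog query might look like. The endpoint, parameter names, and response fields are hypothetical placeholders, not the actual GEON API.

```python
# Sketch of a GEONsearch-style catalog query by spatial, temporal, and
# concept conditions. Endpoint and parameter names are hypothetical.
import requests

def catalog_search(bbox, age_range_ma, concept):
    """Query a metadata catalog by bounding box, geologic age, and concept."""
    params = {
        "bbox": ",".join(map(str, bbox)),              # minlon,minlat,maxlon,maxlat
        "age": f"{age_range_ma[0]}-{age_range_ma[1]}", # geologic age range in Ma
        "concept": concept,                            # term from a registered ontology
    }
    resp = requests.get("https://example.org/geon/search", params=params)
    resp.raise_for_status()
    return resp.json()["datasets"]

# e.g., igneous-rock datasets in the southwestern US, 50-200 Ma
for ds in catalog_search((-120, 32, -105, 42), (50, 200), "igneous rock"):
    print(ds["title"], ds["url"])
```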
6. GEON Registration
- Ontology Registration
- Dataset Registration (hosted)
- Data Item (Schema) Registration (hosted / non-hosted)
- Data Item Detail Registration (values)
- Service Registration
- Resource Registration
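For concreteness, a hosted-dataset registration of this kind might carry metadata like the following. Every field name here is an illustrative assumption for the sketch, not GEON's actual registration schema.

```python
# Illustrative registration record for a hosted dataset; all field names
# are assumptions, not GEON's actual registration schema.
dataset_registration = {
    "type": "dataset",
    "hosted": True,
    "title": "Geologic map of the Rocky Mountain region",
    "schema": {                      # data item (schema) registration
        "columns": ["unit_name", "age", "lithology", "geometry"],
    },
    "detail": {                      # data item detail registration (values)
        "lithology_values": ["granite", "basalt", "limestone"],
    },
    "ontology_terms": ["lithology", "geologic age"],  # links into registered ontologies
    "spatial_extent": (-110.0, 35.0, -102.0, 45.0),
}
```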
7. CUAHSI Hydrologic Information System (cuahsi.sdsc.edu)
- Integrated access to federal data sources
  - Web services for accessing each source
  - Need to map to common metadata (ontology); see the mapping sketch below
- Private workspace
  - Ability to store data and derived products in a personal digital library
- Integrated search
  - Ability to search federal data sources as well as the digital library with a single search command
- Scientific workflows
  - Access to modeling and analysis tools via scientific workflow software, e.g., Kepler, ModelBuilder, D2K
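A minimal sketch of the metadata-mapping step: per-source field names are renamed onto a common vocabulary so that one search command can run against every source. The source names, field names, and mapping table below are illustrative assumptions, not the actual HIS ontology.

```python
# Map source-specific metadata fields onto common terms (illustrative only).
SOURCE_FIELD_MAP = {
    "nwis":   {"site_no": "site_id", "parm_cd": "variable", "datetime": "time"},
    "storet": {"StationID": "site_id", "CharacteristicName": "variable",
               "ActivityStartDate": "time"},
}

def to_common(source: str, record: dict) -> dict:
    """Rename source-specific fields to the common metadata vocabulary."""
    mapping = SOURCE_FIELD_MAP[source]
    return {common: record[native] for native, common in mapping.items()}

# One integrated search can then treat every source uniformly:
print(to_common("nwis", {"site_no": "08158000", "parm_cd": "00060",
                         "datetime": "2005-09-15T00:00:00"}))
```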
8. Data Integration in CUAHSI HIS
From Chapter 4, "System Architecture," by Chaitan Baru, Ilya Zaslavsky, and Reza Wahadj, in Hydrologic Information Systems: A Status Report, edited by David Maidment, http://www.ce.utexas.edu/prof/maidment/CUAHSI/HISStatusSept15.pdf
9. ROADNet: Real-time Observatories, Applications, and Data management Networks (courtesy John Orcutt, Frank Vernon, SIO)
10. SDSC Storage Resource Broker
12. USArray Background
- Overview
  - 12-year project, part of EarthScope
  - Continental-scale seismic observatory for lithosphere and deep-Earth structure
  - Record local, regional, and teleseismic earthquakes
- Major Components
  - A transportable array of 400 portable, unmanned three-component broadband seismometers deployed on a uniform grid that will systematically cover the US
  - A flexible component of 400 portable, three-component, short-period and broadband seismographs and 2000 single-channel high-frequency recorders
  - A permanent array of high-quality, three-component seismic stations, coordinated as part of the US Geological Survey's Advanced National Seismic System (ANSS), to provide a reference array spanning the contiguous United States and Alaska
- URLs
  - http://www.earthscope.org/usarray/
  - http://anf.ucsd.edu/index.html
  - http://www.earthscope.org/usarray/usarray_assets/USArray6.mov
Courtesy Frank Vernon, SIO, Tony Fountain, SDSC
13. USArray Existing Infrastructure
- Infrastructure / Data Flow
  - Seismic sensors connected to dataloggers
  - Dataloggers stream data to a central collection facility at SIO via IP-based (and other) networking
  - New sites initially stream data into a preliminary ORB (object ring buffer) for QA/QC; a toy ORB sketch follows below
  - Operational sites stream data into the production ORB
  - Production data streams are sent from SIO to IRIS for archiving and dissemination (www.iris.edu)
  - Uses BRTT's Antelope sensornet middleware throughout
- Scale
  - Up to 400 sites deployed at any given time
  - Thousands of channels of real-time streaming data
- Status
  - Currently in first-wave deployment (~100 sites)
  - Between 5 and 20 new sites physically installed per week
  - Transportable-array sites will move every 18 months
Courtesy Frank Vernon, SIO, Tony Fountain, SDSC
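The toy illustration below shows the ORB pattern in this data flow: producers append packets, consumers read at their own pace, and old packets are overwritten once the buffer wraps. It is a from-scratch sketch, not BRTT's Antelope orb API.

```python
# Minimal object-ring-buffer sketch (not the Antelope orb API).
from collections import deque

class RingBuffer:
    def __init__(self, capacity: int):
        self.packets = deque(maxlen=capacity)  # oldest packets drop off automatically
        self.next_seq = 0

    def put(self, payload: bytes) -> None:
        """Producer side: append a packet with a monotonically increasing sequence."""
        self.packets.append((self.next_seq, payload))
        self.next_seq += 1

    def get_since(self, seq: int):
        """Consumer side: packets not yet seen (with gaps if the consumer lagged)."""
        return [(s, p) for s, p in self.packets if s >= seq]

orb = RingBuffer(capacity=4)
for sample in [b"pkt0", b"pkt1", b"pkt2", b"pkt3", b"pkt4"]:
    orb.put(sample)
print(orb.get_since(0))  # pkt0 was overwritten; a slow consumer sees a gap
```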
14. SOA Architecture Instantiated for USArray
Courtesy Tony Fountain, Neil Cotofana, Cyberinfrastructure Lab for Environmental Observing Systems (CLEOS), SDSC
15. KEPLER / ROADNet: Real-Time Scientific Workflows
[Architecture diagram: an ORBserver (real-time packet buffer, near-real-time database) feeds seismic waveforms, laser strainmeter channels, images, and other types of data into a scientific workflow. Straightforward example: laser strainmeter channels in, earth-tide signal out; a toy sketch of that step follows.]
Courtesy John Orcutt, SIO
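As a toy version of the "strainmeter channels in, earth-tide signal out" step, the sketch below least-squares-fits the dominant tidal constituents to a strain time series. Real ROADNet/Kepler workflows operate on streaming data with proper tidal models; this only shows the shape of the computation.

```python
# Extract an earth-tide signal by fitting sin/cos pairs at the main tidal
# constituent frequencies (toy version of the workflow step).
import numpy as np

TIDAL_PERIODS_H = {"M2": 12.4206, "S2": 12.0, "O1": 25.8193, "K1": 23.9345}

def earth_tide_signal(t_hours: np.ndarray, strain: np.ndarray) -> np.ndarray:
    """Least-squares fit of tidal sinusoids; return the fitted tidal component."""
    cols = [np.ones_like(t_hours)]                 # constant offset term
    for period in TIDAL_PERIODS_H.values():
        w = 2 * np.pi / period
        cols += [np.sin(w * t_hours), np.cos(w * t_hours)]
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, strain, rcond=None)
    return A @ coeffs

t = np.arange(0, 96, 0.5)                          # four days, 30-minute samples
raw = np.sin(2 * np.pi * t / 12.4206) + 0.3 * np.random.randn(t.size)
tide = earth_tide_signal(t, raw)                   # smooth tidal component out
```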
16. LOOKING: Laboratory for Ocean Observatory Knowledge INtegration Grid
- NSF ITR grant
- Cyberinfrastructure for the Ocean Observatories Initiative
Courtesy John Orcutt, SIO
17. CHRONOS Federated Databases
- Create a dynamic, interactive, and time-calibrated framework for Earth history
- Develop a network of chronostratigraphy databases
- Federated Database Design
  - The following databases are part of the CHRONOS federated database at SDSC, based on IBM's DB2 Information Integrator:
    - Neptune
    - PaleoStrat
    - PaleoBiology
    - Janus
    - TimeScale
    - FAUNMAP
    - MIOMAP
Courtesy Doug Greer, SDSC
18. Top-Level View of a Federated Database
[Diagram: applications issue queries against a single federated database, which mediates access to data sources A, B, C, and D.]
19. Federated Data Sources
- Geographically distributed
- Heterogeneous
  - Relational databases most common
  - Spreadsheets
  - Non-relational sources
  - Web pages / Web services
  - Flat files
- Global views
  - Views may be virtual, or contain data (materialized views)
  - Views define data in a uniform way across the data sources
  - Applications can access data through these global views using SQL; see the sketch after this list
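The sketch below shows how a global view presents heterogeneous sources uniformly: two tables with different column names are unified behind one view. DB2 Information Integrator does this across remote sources via nicknames; this local SQLite sketch only mimics the idea.

```python
# A global view over two differently-named source tables (SQLite stand-in
# for a federated system such as DB2 Information Integrator).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source_a (hole TEXT, elev_m REAL);
    CREATE TABLE source_b (hole_name TEXT, elevation REAL);
    INSERT INTO source_a VALUES ('ODP-1218A', -4827.2);
    INSERT INTO source_b VALUES ('DSDP-574', -4561.0);

    -- The global view: one uniform schema over both sources.
    CREATE VIEW holes AS
        SELECT hole      AS hole_id, elev_m    AS elevation FROM source_a
        UNION ALL
        SELECT hole_name AS hole_id, elevation AS elevation FROM source_b;
""")
for row in con.execute("SELECT hole_id, elevation FROM holes"):
    print(row)   # applications see only the uniform view
```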
20. Example: CHRONOS Hole_Desc View
- Uniform global view of hole/taxa descriptions for the Age-Depth Plots application (queried in the sketch below)
- CHRONOS Hole_Desc columns:
  - Database Name
  - Hole_ID
  - Elevation
  - Meters_of_Section
  - Taxa_Count
Courtesy Doug Greer, SDSC
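An illustrative query of the kind the Age-Depth Plots application might issue against this view. The column names come from the slide (the underscore spelling Database_Name is assumed); the SQLite setup and sample row merely stand in for the federated DB2 source.

```python
# Query the Hole_Desc global view (SQLite stand-in; sample row is invented).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Hole_Desc (
    Database_Name TEXT, Hole_ID TEXT, Elevation REAL,
    Meters_of_Section REAL, Taxa_Count INTEGER)""")
con.execute("INSERT INTO Hole_Desc VALUES "
            "('Neptune', 'ODP-1218A', -4827.2, 275.0, 412)")

for row in con.execute("""
        SELECT Hole_ID, Elevation, Meters_of_Section, Taxa_Count
        FROM Hole_Desc WHERE Taxa_Count > 0 ORDER BY Hole_ID"""):
    print(row)   # feeds the age-depth plotting step
```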
21. Challenges
- Efficient access to remote data
  - Service interfaces to allow subsetting of data at the remote end (sketched below)
- Efficient access to very large data
  - Parallel I/O; manipulation of 10s of TB of visualization output; long-term storage of 100s of TB of model output
- Versioning of data and metadata, and providing provenance
- Managing access for "regular" users vs. "power" (or privileged) users
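A minimal sketch of the subsetting idea: the client ships the subset predicate to the remote end so that only the subset crosses the network, rather than pulling whole multi-terabyte datasets. The endpoint, parameter names, and dataset name are hypothetical.

```python
# Server-side subsetting request (hypothetical endpoint and parameters).
import requests

def fetch_subset(dataset: str, bbox, t0: str, t1: str, variables):
    params = {
        "dataset": dataset,
        "bbox": ",".join(map(str, bbox)),   # subset spatially at the server...
        "time": f"{t0}/{t1}",               # ...and temporally...
        "vars": ",".join(variables),        # ...and by variable
    }
    resp = requests.get("https://example.org/subset", params=params)
    resp.raise_for_status()
    return resp.content                     # only the subset crosses the network

# e.g., one variable, one week, one region, instead of 10s of TB
chunk = fetch_subset("model_run_42", (-125, 30, -110, 45),
                     "2005-09-01", "2005-09-08", ["velocity"])
```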
22. More Challenges
- Distributed versus centralized storage
- Warehousing vs. federation
  - Or should it really be distributed curation and centralized storage?
- Long-term preservation of digital data
23. Opportunities
- Standardize on Web service interfaces for tools, applications, and data
  - E.g., Web Mapping Services for map image services; services for accessing geologic maps, gravity data, sensor data, ... (see the WMS sketch below)
- Develop community standards for knowledge representation
  - Schemas, controlled vocabularies, ontologies
  - Choose a common representation system, e.g., OWL
- Meta-workflow frameworks
  - Support inter-operation among different scientific workflow systems
- There may be an opportunity to work through the new GSA Division on Geoinformatics and the AGU working group on IT
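To show what standardizing on such interfaces buys, here is an OGC Web Map Service GetMap request. The parameter names are standard WMS 1.1.1 and work against any compliant server; the server URL and layer name below are placeholders.

```python
# Standard WMS 1.1.1 GetMap request (placeholder URL and layer name).
import requests

params = {
    "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
    "LAYERS": "geologic_units",            # placeholder layer name
    "STYLES": "",
    "SRS": "EPSG:4326",                    # lat/lon coordinate reference system
    "BBOX": "-125,30,-110,45",             # minx,miny,maxx,maxy
    "WIDTH": "800", "HEIGHT": "800",
    "FORMAT": "image/png",
}
resp = requests.get("https://example.org/wms", params=params)  # placeholder URL
open("geologic_map.png", "wb").write(resp.content)
```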
24. Thank You!
- Chaitan Baru
- baru@sdsc.edu