Title: STORM
1STORM
- Umit V Catalyurek
- Multiscale Computing Lab
- Biomedical Informatics Department
- The Ohio State University
2Roadmap
- Motivating Applications
- Oil Reservoir Management and Optimization
- Characteristics, Goals, and Challenges
- Middleware Systems
- STORM
- System Design
- Automatic Data Virtualization
- Results
3Applications associated with Large Datasets
- Satellite Data Processing
- Digital Pathology
- Managing Oilfields, Contaminant Transport
- Derivation of macroscopic materials properties from MD simulations
- DCE-MRI Analysis
4Analysis of Confocal Microscopy Images
- Solving aggregate queries involving Sum or Count operations on spatial data
- Application domains
- OLAP (On-Line Analytical Processing)
- Geographic data
- Image datasets
- Sample query
SELECT Add(Value(x,y)) FROM Image WHERE (x,y) in POLYGON <(10,20),(300,400)>
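As a rough illustration of what such an aggregate query computes, the sketch below sums pixel values whose coordinates fall inside a polygon, using the standard even-odd (ray casting) test. The image contents, the square region, and the helper names `point_in_polygon` and `polygon_sum` are assumptions made for this example and do not come from the slides.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, poly: Sequence[Point]) -> bool:
    """Even-odd rule: cast a ray toward +x and count edge crossings."""
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        # Only edges that straddle the horizontal line through y can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_sum(image: List[List[float]], poly: Sequence[Point]) -> Tuple[float, int]:
    """Return (sum, count) over pixels whose (x, y) lies inside poly."""
    height, width = len(image), len(image[0])
    # Scan only the polygon's bounding box, clipped to the image extent.
    x_lo = max(0, math.floor(min(p[0] for p in poly)))
    x_hi = min(width - 1, math.ceil(max(p[0] for p in poly)))
    y_lo = max(0, math.floor(min(p[1] for p in poly)))
    y_hi = min(height - 1, math.ceil(max(p[1] for p in poly)))
    total, count = 0, 0
    for y in range(y_lo, y_hi + 1):
        for x in range(x_lo, x_hi + 1):
            if point_in_polygon(x, y, poly):
                total += image[y][x]
                count += 1
    return total, count


# Hypothetical 6x6 image with Value(x, y) = x + 10 * y, and a square query region.
image = [[x + 10 * y for x in range(6)] for y in range(6)]
square = [(0.5, 0.5), (3.5, 0.5), (3.5, 3.5), (0.5, 3.5)]
total, count = polygon_sum(image, square)
```

Returning the count alongside the sum covers both aggregate types named on the slide with a single scan.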
5Applications: Oil Reservoir Management
- Oil Reservoir Simulations
- Seismic Data Analysis
6Implementing effective oil and gas production
- Detect and track changes in data during production
- Invert data for reservoir properties
- Detect and track reservoir changes
- Assimilate data and reservoir properties into the evolving reservoir model
- Use simulation and optimization to guide future production
7Data Querying and Processing
Seismic Data
Reservoir Simulations
8Characteristics, Commonalities
- Spatio-temporal datasets (generally low dimensional); datasets describe physical scenarios
- Multi-dimensional, Multi-resolution, Multi-scale
- Very large file-based datasets
- Tens of gigabytes to 100 TB data
- Data is stored in a distributed collection of files
- Lots of datasets, lots of files
- Data products often involve results from ensembles of spatio-temporal datasets
- Some applications require interactive exploration of datasets
- Common operations: subsetting, filtering, interpolations, projections, comparisons, frequency counts
- Modeling and management of data analysis workflows
9Data Services
- Distributed data processing support
- Grid based data virtualization, data management, query, on demand data product generation
- Distributed metadata and data management
- Track metadata associated with data and data analysis workflows
10Middleware Support
- Data Virtualization: STORM
- Large data querying capabilities, layered on DataCutter
- Distributed data virtualization
- Indexing, Data Cluster/Decluster, Parallel Data Transfer
- Data Analysis/Processing Workflows: DataCutter
- Component Framework for Combined Task/Data Parallelism
- Filtering/Program coupling Service: Distributed C++ component framework
- On demand data product generation
- Distributed Metadata and Data Management: Mobius
- Create, manage, version data definitions
- Management of metadata and data instances
- Data integration
- Multiple Query Workloads: Active Proxy-G
- Active Semantic Data Cache
- Employ user semantics to cache and retrieve data
- Store and reuse results of computations
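The idea of a semantic cache can be sketched as follows: instead of matching queries by exact text, the cache reasons about what a query covers, so a new query whose region lies inside a cached region is answered from the stored result. This is only a toy illustration; the `SemanticCache` class, the bounding-box query form, and the `fetch` backend are assumptions for the example and are not Active Proxy-G's interface.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive
Point = Tuple[int, int]


def contains(outer: Box, inner: Box) -> bool:
    """True if the inner box lies entirely within the outer box."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[2] and box[1] <= p[1] <= box[3]


class SemanticCache:
    """Hypothetical cache keyed by query region rather than query text."""

    def __init__(self, fetch: Callable[[Box], List[Point]]):
        self._fetch = fetch
        self._entries: Dict[Box, List[Point]] = {}
        self.backend_calls = 0

    def query(self, box: Box) -> List[Point]:
        for cached_box, points in self._entries.items():
            if contains(cached_box, box):
                # Reuse: subset the stored result instead of refetching.
                return [p for p in points if in_box(p, box)]
        self.backend_calls += 1
        points = self._fetch(box)
        self._entries[box] = points
        return points


# Hypothetical backend: a 10x10 grid of points.
GRID = [(x, y) for x in range(10) for y in range(10)]
cache = SemanticCache(lambda box: [p for p in GRID if in_box(p, box)])
first = cache.query((0, 0, 5, 5))    # miss: goes to the backend
second = cache.query((1, 1, 2, 2))   # hit: contained in the first region
third = cache.query((4, 4, 8, 8))    # miss: only partially overlaps
```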
11Data Virtualization
- Application developers generally prefer storing data in files
- Support high level queries on multi-dimensional distributed datasets
- Many possible data abstractions, query interfaces
- Grid virtualized object relational database or XML database
- Grid virtualized objects with user defined methods invoked to access and process data
Virtual Tables
Data Virtualization
Data Service
Scientific Datasets
12Our Approach
- Front-end
- Support a basic SQL Select query with a virtual relational table view or a virtual XML database view
- A lightweight layer on top of datasets
- STORM runtime middleware: STORM carries out query execution, query planning
- Compiler front end customizes runtime support
- Automatic customization and configuration of runtime query support middleware
13STORM
- Support efficient selection of the data of interest from distributed scientific datasets and transfer of data from storage clusters to compute clusters
- Data Subsetting Model
- Virtual Tables
- Select Queries
- Distributed Arrays
SELECT <DataElements>
FROM Dataset-1, Dataset-2, ..., Dataset-n
WHERE <Expression> AND <Filter(<DataElement>)>
GROUP-BY-PROCESSOR ComputeAttribute(<DataElement>)
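The query form above can be read as three steps: scan the rows of each dataset, keep those passing both the range expression and the user-defined filter, then assign each surviving row to a destination processor. The sketch below mimics that flow in memory. It is a minimal illustration under assumed names (`run_select`, the row dictionaries, the modulo mapping to processors) and says nothing about how STORM itself plans or executes queries.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]


def run_select(
    datasets: Iterable[Iterable[Row]],
    columns: List[str],
    where: Callable[[Row], bool],
    user_filter: Callable[[Row], bool],
    compute_attribute: Callable[[Row], int],
    num_processors: int,
) -> Dict[int, List[Row]]:
    """Select rows, project columns, and group the result by destination processor."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for dataset in datasets:
        for row in dataset:
            # WHERE <Expression> AND <Filter(<DataElement>)>
            if where(row) and user_filter(row):
                # GROUP-BY-PROCESSOR: the computed attribute picks the target processor.
                proc = compute_attribute(row) % num_processors
                partitions[proc].append({c: row[c] for c in columns})
    return dict(partitions)


# Hypothetical rows from two datasets of an oil reservoir study.
dataset_1 = [
    {"X": 0.0, "TIME": 1, "POIL": 0.9},
    {"X": 1.0, "TIME": 1, "POIL": 0.2},
    {"X": 2.0, "TIME": 5, "POIL": 0.8},
]
dataset_2 = [{"X": 3.0, "TIME": 1, "POIL": 0.7}]

result = run_select(
    datasets=[dataset_1, dataset_2],
    columns=["X", "POIL"],
    where=lambda r: r["TIME"] <= 2,
    user_filter=lambda r: r["POIL"] > 0.5,
    compute_attribute=lambda r: int(r["X"]),
    num_processors=2,
)
```

Grouping by processor at selection time means each compute node receives only the partition it will work on, which is the point of coupling subsetting with data transfer.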
14STORM Services
- Query
- Meta-data
- Indexing
- Data Source
- Filtering
- Partition Generation
- Data Mover
15STORM Query Planning
16STORM Query Execution
17STORM Results Selection in Seismic Data
18STORM Results
19OSC Mass Storage System
- 50 TB of performance storage
- home directories, project storage space, and long-term frequently accessed files.
- 420 TB of performance/capacity storage
- Active Disk Cache - compute jobs that require directly connected storage
- parallel file systems, and scratch space.
- Large temporary holding area
- 128 TB tape library
- Backups and long-term "offline" storage
20STORM Results
Seismic datasets: 10-25 GB per file, about 30 TB of data.
21Compiler Support
22Design Overview
- Dataset Schema Description Component
- Dataset Storage Description Component
- Dataset Layout Description Component
23
Description file
Group ROOT
DATASET bh
DATATYPE IPARS
DATASPACE RANK 3
DATAINDEX RID, TIME
PARTS 9503, 9503, 9537, 9554, 9503, 9707, 9520, 9520
DATA DATASET SPACIAL, DATASET POIL, DATASET PWAT,
Group SUBGROUP
DATASET SPACIAL
DATATYPE
DATASPACE SKIP 4 LINES LOOP PARTS X SPACE Y SPACE Z SKIP 1 LINE
DATA PART in (0,1,2,3,4,5,6,7) .0.PART.5.init
DATASET POIL
DATATYPE
DATASPACE LOOP TIME SKIP 1 double LOOP PARTS POIL
DATA PART in (0,1,2,3,4,5,6,7) .0.PART.5.0
Meta-data
IPARS
RID INT2
TIME INT4
X FLOAT
Y FLOAT
Z FLOAT
POIL FLOAT
PWAT FLOAT
Data list file
bh
DatasetDescription IPARS
io file
Dim 17x65x65
Npart 8
Osumed1 osumed01.epn.osc.edu, osumed02.epn.osc.edu,
0 bh-10-1 osumed1 /scratch1/bh-10-1
1 bh-10-2 osumed1 /scratch1/bh-10-2
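To make the role of the schema description concrete, the sketch below turns the IPARS attribute list (RID INT2, TIME INT4, and so on) into a binary record layout and decodes a buffer into virtual-table rows. It is only an illustration: the mapping of INT2, INT4, and FLOAT to 16-bit, 32-bit, and single-precision fields, the little-endian packed layout, and the function names are assumptions for this example. The real datasets use the file layouts given in the description file, not this flat record format.

```python
import struct
from typing import Dict, List, Tuple

# Assumed mapping from schema type names to struct format codes.
TYPE_CODES = {"INT2": "h", "INT4": "i", "FLOAT": "f"}

IPARS_SCHEMA = "RID INT2 TIME INT4 X FLOAT Y FLOAT Z FLOAT POIL FLOAT PWAT FLOAT"


def parse_schema(text: str) -> List[Tuple[str, str]]:
    """Split 'NAME TYPE NAME TYPE ...' into (name, type) pairs."""
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("schema must be a sequence of NAME TYPE pairs")
    return [(tokens[i], tokens[i + 1]) for i in range(0, len(tokens), 2)]


def record_struct(schema: List[Tuple[str, str]]) -> struct.Struct:
    """Build a packed little-endian record layout (an assumption, see lead-in)."""
    return struct.Struct("<" + "".join(TYPE_CODES[t] for _, t in schema))


def decode_records(buf: bytes, schema: List[Tuple[str, str]]) -> List[Dict[str, float]]:
    """Decode a buffer of fixed-size records into rows keyed by attribute name."""
    layout = record_struct(schema)
    names = [name for name, _ in schema]
    return [dict(zip(names, values)) for values in layout.iter_unpack(buf)]


schema = parse_schema(IPARS_SCHEMA)
layout = record_struct(schema)

# Two made-up records, packed and then read back as virtual-table rows.
raw = layout.pack(0, 1, 1.5, 2.0, 3.25, 0.5, 0.25) + layout.pack(1, 1, 4.0, 5.0, 6.0, 0.75, 0.125)
rows = decode_records(raw, schema)
```

Driving the decoder from the schema text rather than hard-coding field offsets is the same separation the three description components aim for: the query layer sees named attributes, while layout details stay in the descriptor.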
24Test the ability of our code generation tool
- Oil Reservoir Management
- The performance difference is within 4-10% as for Layout 0.
- Correctly and efficiently handle a variety of different layouts for the same data
25Distributed Execution: DataCutter
- Pipe-and-filter metaphor of data processing
- Data is streamed from producer to consumer filters
- Framework for task- and data-parallel manipulation of large scientific data
- Transparent copies of filters
- Provide grid-based distributed computation and application-specific storage access
- XML description of data and task flow
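The pipe-and-filter idea can be sketched with ordinary Python generators: a producer streams buffers downstream, a middle stage processes each buffer independently, and a consumer aggregates the results. Running several copies of the middle stage is imitated here with a thread pool. This is a toy analogy only; DataCutter is a separate component framework, and the names `read_chunks`, `threshold`, `run_copies`, and `total` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]


def read_chunks(values: List[float], chunk_size: int) -> Iterator[Chunk]:
    """Producer filter: stream fixed-size buffers downstream."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def threshold(chunk: Chunk, cutoff: float = 0.5) -> Chunk:
    """Middle filter: keep only values above the cutoff."""
    return [v for v in chunk if v > cutoff]


def run_copies(stage: Callable[[Chunk], Chunk], stream: Iterable[Chunk], copies: int) -> Iterator[Chunk]:
    """Run several copies of one filter; callers see a single ordered stream."""
    with ThreadPoolExecutor(max_workers=copies) as pool:
        # map() hands buffers to whichever copy is free and yields results in order.
        yield from pool.map(stage, stream)


def total(stream: Iterable[Chunk]) -> float:
    """Consumer filter: aggregate everything that arrives."""
    return sum(sum(chunk) for chunk in stream)


values = [0.25, 1.0, 0.75, 0.125, 2.0, 0.5]
result = total(run_copies(threshold, read_chunks(values, 2), copies=3))
```

Because the copies are hidden behind `run_copies`, the producer and consumer are unchanged whether one or many copies run, which is the sense in which the copies are transparent.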
26STORM Seismic Image Reconstruction
27STORM Seismic Image Reconstruction
28STORM Data Resource
GDS
JDBC Driver
Data Resource
Storm Daemon
Data Mover
STORM instance
Filter
Extractor
29Experiment Setup
Cluster  Nodes  CPU                       Memory  Disk
Mob      8      Dual 1.4 GHz AMD Opteron  8 GB    1.5 TB local disk
Xio      16     2 Xeon 2.4 GHz            4 GB    7.3 TB FAStT600 disk array

Dataset        Attributes  Record Size  Records (millions)  Dataset (GB)  Cluster, Num nodes
Oil Reservoir  21          84 bytes     3,840               315           Mob,03
Seismic        16          4240 bytes   247                 1,056         Xio,16
TXm            6           24 bytes     X                   24 X / 1M     Mob,01
- All nodes running linux
- Gigabit switch
30Comparison with MySQL - 1
- Varying table size.
- Per-tuple cost is lower
31Comparison with MySQL - 2
- Varying query size
- Also compare them as data resources
32Future Work Scenario 1: Data Management, Access, Integration
- Grid-level data services via OGSA-DAI
- Management of data definitions and metadata, XML virtualization via Mobius
- Object-relational virtualization and subsetting of file based datasets via STORM
- On-demand data product generation via DataCutter
- STORM, Mobius, DataCutter support data operations on heterogeneous collections of storage and compute clusters
[Figure labels: OGSA-DAI, Grid Protocols]
33Data Management, Access, and Integration
[Figure labels: Grid Service Protocols, Grid-data Service (OGSA-DAI), Simulation Data, Seismic/Simulation Data, Seismic Data]
34Scenario 2: Refactor STORM
- Refactor to handle
- XML databases
- Relational databases
- Object databases
- We should be able to reuse the following services
- Query planning
- Data partitioning
- Data transfer
35For more info
- Multiscale Computing Lab
- http://www.multiscalecomputing.org or http://msc.osu.edu
- STORM project web site
- http://storm.bmi.ohio-state.edu
- http://www.nsf-middleware.org
STORM has been part of the NSF Middleware Initiative since Release 5