STORM - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
STORM
  • Umit V Catalyurek
  • Multiscale Computing Lab
  • Biomedical Informatics Department
  • The Ohio State University

2
Roadmap
  • Motivating Applications
  • Oil Reservoir Management and Optimization
  • Characteristics, Goals, and Challenges
  • Middleware Systems
  • STORM
  • System Design
  • Automatic Data Virtualization
  • Results

3
Applications associated with Large Datasets
Satellite Data Processing
Digital Pathology
Managing Oilfields, Contaminant Transport
Derivation of macroscopic materials properties
from MD simulations
DCE-MRI Analysis
4
Analysis of Confocal Microscopy Images
  • Solving aggregate queries involving Sum or Count
    operations on spatial data
  • Application domains
  • OLAP (On-Line Analytical Processing)
  • Geographic data
  • Image datasets
  • Sample query

SELECT Add(Value(x,y)) FROM Image WHERE (x,y)
IN POLYGON <(10,20),(300,400)>
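
The sample query above sums pixel values over a polygonal region. A minimal sketch of that kind of aggregate, assuming the image is a small in-memory 2D list and the polygon is a list of vertices; the names point_in_polygon and sum_in_polygon are illustrative and not part of any system described in these slides:

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: toggle 'inside' at every edge crossed by a ray in +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def sum_in_polygon(image: List[List[float]], polygon: Sequence[Point]) -> float:
    """Sum image[y][x] over all pixels whose (x, y) falls inside the polygon."""
    total = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            if point_in_polygon(x, y, polygon):
                total += value
    return total


# Toy data (assumed): a 6x6 image of ones and a square covering x, y in 1..4.
image = [[1.0] * 6 for _ in range(6)]
square = [(0.5, 0.5), (4.5, 0.5), (4.5, 4.5), (0.5, 4.5)]
total = sum_in_polygon(image, square)
```

Scanning every pixel is only workable for toy inputs; for large images one would restrict the scan to the polygon's bounding box or use precomputed aggregates.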
5
Applications: Oil Reservoir Management
  • Oil Reservoir Simulations
  • Seismic Data Analysis

6
Implementing effective oil and gas production
  • Detect and track changes in data during production
  • Invert data for reservoir properties
  • Detect and track reservoir changes
  • Assimilate data and reservoir properties into the
    evolving reservoir model
  • Use simulation and optimization to guide future
    production
7
Data Querying and Processing
Seismic Data
Reservoir Simulations
8
Characteristics, Commonalities
  • Spatio-temporal datasets (generally low
    dimensional) that describe physical scenarios
  • Multi-dimensional, Multi-resolution, Multi-scale
  • Very large file-based datasets
  • Tens of gigabytes to 100 TB of data
  • Data is stored in a distributed collection of
    files
  • Lots of datasets, lots of files
  • Data products often involve results from ensemble
    of spatio-temporal datasets
  • Some applications require interactive exploration
    of datasets
  • Common operations: subsetting, filtering,
    interpolations, projections, comparisons,
    frequency counts
  • Modeling and management of data analysis workflows

9
Data Services
  • Distributed data processing support
  • Grid based data virtualization, data management,
    query, on demand data product generation
  • Distributed metadata and data management
  • Track metadata associated with data and data
    analysis workflows

10
Middleware Support
  • Data Virtualization: STORM
  • Large data querying capabilities, layered on
    DataCutter
  • Distributed data virtualization
  • Indexing, Data Cluster/Decluster, Parallel Data
    Transfer
  • Data Analysis/Processing Workflows: DataCutter
  • Component Framework for Combined Task/Data
    Parallelism
  • Filtering/Program coupling Service: Distributed
    C++ component framework
  • On demand data product generation
  • Distributed Metadata and Data Management: Mobius
  • Create, manage, version data definitions
  • Management of metadata and data instances
  • Data integration
  • Multiple Query Workloads: Active Proxy-G
  • Active Semantic Data Cache
  • Employ user semantics to cache and retrieve data
  • Store and reuse results of computations
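
The "Active Semantic Data Cache" items above describe caching by query meaning rather than by exact request. A minimal sketch of that idea, assuming one-dimensional inclusive range queries over an in-memory list; the class SemanticRangeCache is hypothetical and does not reflect the Active Proxy-G interface:

```python
from typing import Dict, List, Optional, Tuple

Range = Tuple[int, int]  # inclusive [lo, hi] interval on one attribute


class SemanticRangeCache:
    """Caches results by query range and answers any query a cached range covers."""

    def __init__(self, data: List[int]) -> None:
        self.data = data                      # stand-in for the backing dataset
        self.cache: Dict[Range, List[int]] = {}
        self.backend_scans = 0                # how often the dataset was scanned

    def _find_covering(self, lo: int, hi: int) -> Optional[Range]:
        """Return a cached range that fully contains [lo, hi], if any."""
        for clo, chi in self.cache:
            if clo <= lo and hi <= chi:
                return (clo, chi)
        return None

    def query(self, lo: int, hi: int) -> List[int]:
        covering = self._find_covering(lo, hi)
        if covering is not None:
            # Reuse: filter the cached superset instead of scanning the dataset.
            return [v for v in self.cache[covering] if lo <= v <= hi]
        self.backend_scans += 1
        result = [v for v in self.data if lo <= v <= hi]
        self.cache[(lo, hi)] = result
        return result


cache = SemanticRangeCache(list(range(100)))
first = cache.query(10, 50)    # scans the dataset and caches the result
second = cache.query(20, 30)   # answered from the cached [10, 50] result
```

A real semantic cache would also handle partial overlaps and eviction; this sketch covers containment only.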

11
Data Virtualization
  • Application developers generally prefer storing
    data in files
  • Support high level queries on multi-dimensional
    distributed datasets
  • Many possible data abstractions, query interfaces
  • Grid virtualized object relational database or
    XML database
  • Grid virtualized objects with user defined
    methods invoked to access and process data
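
To make the virtual-table idea concrete, the following sketch exposes a set of flat binary files as a stream of rows that can be filtered like a table. The record layout (one int and two floats), the column names, and the file names are assumptions made for the example, not the format of any dataset in the slides:

```python
import os
import struct
import tempfile
from typing import Dict, Iterable, Iterator, List, Tuple

RECORD = struct.Struct("<iff")          # assumed layout: time, x, value
COLUMNS = ("time", "x", "value")


def write_chunk(path: str, rows: Iterable[Tuple[int, float, float]]) -> None:
    """Write rows as fixed-size binary records (helper to create toy files)."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))


def virtual_table(paths: List[str]) -> Iterator[Dict[str, float]]:
    """Present several files as one table by yielding each record as a row."""
    for path in paths:
        with open(path, "rb") as f:
            while True:
                raw = f.read(RECORD.size)
                if len(raw) < RECORD.size:
                    break
                yield dict(zip(COLUMNS, RECORD.unpack(raw)))


with tempfile.TemporaryDirectory() as tmp:
    paths = [os.path.join(tmp, "part0.bin"), os.path.join(tmp, "part1.bin")]
    write_chunk(paths[0], [(0, 1.0, 0.5), (1, 2.0, 1.5)])
    write_chunk(paths[1], [(2, 3.0, 2.5)])
    # Equivalent in spirit to: SELECT time FROM table WHERE value > 1.0
    selected = [r["time"] for r in virtual_table(paths) if r["value"] > 1.0]
```

The data stays in its files; only the reader knows the layout, which is what lets the query side treat the files as a table.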

Virtual Tables
Data Virtualization
Data Service
Scientific Datasets
12
Our Approach
  • Front-end
  • Support a basic SQL Select query with a virtual
    relational table view or a virtual XML database
    view
  • A lightweight layer on top of datasets
  • STORM runtime middleware: STORM carries out query
    execution and query planning
  • Compiler front end customizes runtime support
  • Automatic customization and configuration of
    runtime query support middleware

13
STORM
  • Support efficient selection of the data of
    interest from distributed scientific datasets and
    transfer of data from storage clusters to compute
    clusters
  • Data Subsetting Model
  • Virtual Tables
  • Select Queries
  • Distributed Arrays

SELECT <DataElements> FROM Dataset-1,
Dataset-2, ..., Dataset-n WHERE <Expression> AND
<Filter(<DataElement>)> GROUP-BY-PROCESSOR
ComputeAttribute(<DataElement>)
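
The query form above combines a range expression, a user-defined filter, and a GROUP-BY-PROCESSOR clause that decides which processor receives each selected element. A minimal in-memory sketch of those semantics follows; run_select, the row fields, and the modulo partitioning rule are assumptions for illustration, not STORM's interface:

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]


def run_select(
    tables: Iterable[Iterable[Row]],
    columns: List[str],
    where: Callable[[Row], bool],
    row_filter: Callable[[Row], bool],
    compute_processor: Callable[[Row], int],
) -> Dict[int, List[Row]]:
    """Select rows passing both predicates and group them by target processor."""
    partitions: Dict[int, List[Row]] = defaultdict(list)
    for table in tables:
        for row in table:
            if where(row) and row_filter(row):
                projected = {c: row[c] for c in columns}
                partitions[compute_processor(row)].append(projected)
    return dict(partitions)


# Toy datasets (assumed): grid position x, time step, oil pressure poil.
dataset_1 = [
    {"x": 0, "time": 1, "poil": 0.9},
    {"x": 1, "time": 1, "poil": 0.2},
    {"x": 2, "time": 2, "poil": 0.8},
]
dataset_2 = [
    {"x": 3, "time": 2, "poil": 0.7},
    {"x": 4, "time": 9, "poil": 0.9},
]

result = run_select(
    tables=[dataset_1, dataset_2],
    columns=["x", "poil"],
    where=lambda r: 1 <= r["time"] <= 2,           # the WHERE expression
    row_filter=lambda r: r["poil"] > 0.5,          # the user-defined filter
    compute_processor=lambda r: int(r["x"]) % 2,   # GROUP-BY-PROCESSOR rule
)
```

Here result maps each processor id to the rows it would receive; in a distributed setting each list would be shipped to a different compute node.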
14
STORM Services
  • Query
  • Meta-data
  • Indexing
  • Data Source
  • Filtering
  • Partition Generation
  • Data Mover
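
The slide only names the services. As one illustration of what an indexing service can contribute, the sketch below keeps a per-file time range and uses it to skip files that cannot match a query. ChunkInfo, its fields, and the file names are hypothetical; the slides do not describe how STORM's index is organized:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkInfo:
    path: str    # hypothetical file name
    t_min: int   # smallest time step stored in the file
    t_max: int   # largest time step stored in the file


def chunks_for_range(index: List[ChunkInfo], lo: int, hi: int) -> List[str]:
    """Return only the files whose [t_min, t_max] overlaps the query [lo, hi]."""
    return [c.path for c in index if c.t_max >= lo and c.t_min <= hi]


index = [
    ChunkInfo("chunk0.dat", 0, 99),
    ChunkInfo("chunk1.dat", 100, 199),
    ChunkInfo("chunk2.dat", 200, 299),
]
needed = chunks_for_range(index, 150, 220)
```

Only the files in needed would then be handed to the data source and filtering stages.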

15
STORM Query Planning
16
STORM Query Execution
17
STORM Results: Selection in Seismic Data
18
STORM Results
19
OSC Mass Storage System
  • 50 TB of performance storage
  • home directories, project storage space, and
    long-term frequently accessed files.
  • 420 TB of performance/capacity storage
  • Active Disk Cache - compute jobs that require
    directly connected storage
  • parallel file systems, and scratch space.
  • Large temporary holding area
  • 128 TB tape library
  • Backups and long-term "offline" storage

20
STORM Results
Seismic datasets: 10-25 GB per file, about 30 TB
of data.
21
Compiler Support
22
Design Overview
  • Dataset Schema Description Component
  • Dataset Storage Description Component
  • Dataset Layout Description Component

23
Group ROOT
  DATASET bh
  DATATYPE IPARS
  DATASPACE RANK 3
  DATAINDEX RID, TIME
  PARTS 9503, 9503, 9537, 9554, 9503, 9707, 9520, 9520
  DATA DATASET SPACIAL, DATASET POIL, DATASET PWAT,
Group SUBGROUP
  DATASET SPACIAL
  DATATYPE
  DATASPACE SKIP 4 LINES LOOP PARTS X SPACE Y SPACE Z SKIP 1 LINE
  DATA PART in (0,1,2,3,4,5,6,7) .0.PART.5.init
  DATASET POIL
  DATATYPE
  DATASPACE LOOP TIME SKIP 1 double LOOP PARTS POIL
  DATA PART in (0,1,2,3,4,5,6,7) .0.PART.5.0

Description file
IPARS
  RID  INT2
  TIME INT4
  X    FLOAT
  Y    FLOAT
  Z    FLOAT
  POIL FLOAT
  PWAT FLOAT
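
The schema above lists the attributes of one IPARS virtual-table row (RID INT2, TIME INT4, and five FLOAT fields). A sketch of how such a row could be packed and unpacked, assuming a little-endian binary encoding with no padding and 4-byte floats; the slides do not specify a wire format, so the encoding and the helper names are assumptions:

```python
import struct
from typing import NamedTuple


class IparsRow(NamedTuple):
    RID: int      # INT2
    TIME: int     # INT4
    X: float      # FLOAT
    Y: float
    Z: float
    POIL: float
    PWAT: float


# "<" = little-endian without padding, h = 2-byte int, i = 4-byte int,
# 5f = five 4-byte floats (assumed encoding).
ROW_FORMAT = struct.Struct("<hi5f")


def encode_row(row: IparsRow) -> bytes:
    return ROW_FORMAT.pack(*row)


def decode_row(raw: bytes) -> IparsRow:
    return IparsRow(*ROW_FORMAT.unpack(raw))


row = IparsRow(3, 120, 1.0, 2.0, 3.0, 0.25, 0.75)
round_trip = decode_row(encode_row(row))
```

Under these assumptions a row occupies 26 bytes (2 + 4 + 5 × 4).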
Meta-data
Data list file
bh
DatasetDescription IPARS
io file
Dim 17x65x65
Npart 8
Osumed1 osumed01.epn.osc.edu, osumed02.epn.osc.edu,
0 bh-10-1 osumed1 /scratch1/bh-10-1
1 bh-10-2 osumed1 /scratch1/bh-10-2
24
Test the ability of our code generation tool
  • Oil Reservoir Management
  • The performance difference is within 4-10%, as for
    Layout 0
  • Correctly and efficiently handle a variety of
    different layouts for the same data
25
Distributed Execution: DataCutter
  • Pipe-and-filter metaphor of data processing
  • Data is streamed from producer to consumer
    filters
  • Framework for task- and data-parallel
    manipulation of large scientific data
  • Transparent copies of filters
  • Provide grid-based distributed computation and
    application-specific storage access
  • XML description of data and task flow
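
The pipe-and-filter model above streams data from producer to consumer filters and can run several copies of a filter transparently. A minimal sketch of that pattern using Python threads and queues; this is not DataCutter's API (DataCutter is a separate component framework), and every name here is illustrative:

```python
import queue
import threading
from typing import Callable, Iterable, List

DONE = object()  # end-of-stream marker passed through the queues


def producer(out_q: queue.Queue, items: Iterable[int], n_consumers: int) -> None:
    """Stream items downstream, then send one end marker per filter copy."""
    for item in items:
        out_q.put(item)
    for _ in range(n_consumers):
        out_q.put(DONE)


def filter_copy(in_q: queue.Queue, out_q: queue.Queue,
                fn: Callable[[int], int]) -> None:
    """One copy of a filter: apply fn to each item until the end marker."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)
            return
        out_q.put(fn(item))


def run_pipeline(items: Iterable[int], fn: Callable[[int], int],
                 copies: int = 2) -> List[int]:
    q1: queue.Queue = queue.Queue()
    q2: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=producer, args=(q1, items, copies))]
    threads += [threading.Thread(target=filter_copy, args=(q1, q2, fn))
                for _ in range(copies)]
    for t in threads:
        t.start()
    results: List[int] = []
    finished = 0
    while finished < copies:      # consumer: collect until every copy is done
        item = q2.get()
        if item is DONE:
            finished += 1
        else:
            results.append(item)
    for t in threads:
        t.join()
    return results


# Output order depends on thread scheduling, so sort before comparing.
squares = sorted(run_pipeline(range(10), lambda v: v * v, copies=3))
```

The consumer does not know how many filter copies exist beyond counting end markers, which is the sense in which the copies are transparent to the rest of the pipeline.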

26
STORM Seismic Image Reconstruction
27
STORM Seismic Image Reconstruction
28
STORM Data Resource
Diagram components: GDS, JDBC Driver, Data Resource,
Storm Daemon, Data Mover, STORM instance, Filter,
Extractor
29
Experiment Setup
Cluster  Nodes  CPU                       Memory  Storage
mob      8      Dual 1.4 GHz AMD Opteron  8 GB    1.5 TB local disk
Xio      16     2 Xeon 2.4 GHz            4 GB    7.3 TB FAStT600 disk array

Dataset        Attributes  Record Size  Records (millions)  Dataset (GB)  Cluster, Num nodes
Oil Reservoir  21          84 bytes     3,840               315           Mob, 03
Seismic        16          4240 bytes   247                 1,056         Xio, 16
TXm            6           24 bytes     X                   24 X / 1M     Mob, 01
  • All nodes running linux
  • Gigabit switch

30
Comparison with MySQL - 1
  • Varying table size.
  • Per-tuple cost is lower

31
Comparison with MySQL - 2
  • Varying query size
  • Also compare them as data resources

32
Future Work, Scenario 1: Data Management, Access,
Integration
  • Grid-level data services via OGSA-DAI
  • Management of data definitions and metadata, XML
    virtualization via Mobius
  • Object-relational virtualization and subsetting
    of file based datasets via STORM
  • On-demand data product generation via DataCutter
  • STORM, Mobius, DataCutter support data operations
    on heterogeneous collections of storage and
    compute clusters

Diagram labels: OGSA-DAI (four instances), Grid Protocols
33
Data Management, Access, and Integration
Diagram labels: Grid Service Protocols; Grid-data
Service (OGSA-DAI), four instances; Simulation Data;
Seismic/Simulation Data; Seismic Data
34
Scenario 2: Refactor STORM
  • Refactor to handle
  • XML databases
  • Relational databases
  • Object databases
  • We should be able to reuse the following services
  • Query planning
  • Data partitioning
  • Data transfer

35
For more info
  • Multiscale Computing Lab
  • http://www.multiscalecomputing.org or
    http://msc.osu.edu
  • STORM project web site
  • http://storm.bmi.ohio-state.edu
  • http://www.nsf-middleware.org

STORM has been part of the NSF Middleware Initiative
since Release 5