2. Middleware Infrastructure for Large-Scale Data Management
- Umit Catalyurek, Tahsin Kurc, Joel Saltz
- Department of Biomedical Informatics
- The Ohio State University
VIEWS Alliance Forum, July 19-20, 2004
3. Goals: What will the world look like? (IMAGE Data Management Working Group)
- Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets. Should be able to carry out entirely new analyses or to incrementally modify other scientists' data analyses. Should not have to worry about the physical location of data or processing. Should have excellent tools available to examine data, including a mechanism to authenticate potential users, control access to data, and log the identity of those accessing data.
4. Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003
- Organisers: Malcolm Atkinson, Richard Ansorge, Richard Baldock, Dave Berry, Mike Brady, Vincent Breton, Frederica Darema, Mark Ellisman, Cecile Germain-Renaud, Derek Hill, Robert Hollebeek, Chris Johnson, Michael Knopp, Alan Rector, Joel Saltz, Chris Taylor, Bonnie Webber
5. Image/Data Application Areas
Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003, e-Science Institute, 15 South College Street, Edinburgh
- Satellite Data Processing
- Digital Pathology
- Managing Oilfields, Contaminant Transport
- Biomedical Image Analysis
- DCE-MRI Analysis
6. Dataset Analysis and Visualization
- Spatio-temporal datasets (generally low dimensional) describe physical scenarios
- Data products often involve results from an ensemble of spatio-temporal datasets
- Some applications require interactive exploration of datasets
- Common operations: subsetting, filtering, interpolation, projection, comparison, frequency counts
- Optimizations: semantic caching, caching of intermediate results, multiple-query optimization
7. Molecular Data and OSU Testbed Effort
- Data sharing in OSU shared resource
- Support for all data sharing in hundreds of research studies in the OSU comprehensive cancer center
- State of Ohio BRTT
- Integration of clinical, genotype, proteomic, histological, and gene regulatory data in the context of 4 translational research projects
- $2M per year to fund development of bioinformatics data sharing infrastructure
8. Examples of Data Sources to be Integrated
Examples of data types that are generated or
referenced by OSUCCC Shared Resources
9. Center for Grid Enabled Image Analysis
- Biomedical research
- Ensure success of biventricular pacing; mechanism of ischemic cardiac injury; characterization of the relationship between angiogenesis and breast and bone cancer; mouse models of tumorigenesis; role of expression of oncogenes in placental development (virtual slides, mouse placenta)
- Radiology/cardiac imaging research
- Capture and analyze time-dependent cardiac imagery; EPR technology development; quantify treatment efficacy through analysis of diffusion contrast imagery; automated detection of mitoses; deconvolution, 3-D reconstruction, segmentation, and shape characterization in microscopy imagery
- Computer Science
- Middleware for large-scale multi-scale data, grid metadata management, feature detection, parallel visualization, on-demand and interactive computing
10. Data Storage
- Clusters provide inexpensive and scalable storage
- $50K-$100K clusters at Ohio State, OSC, and U. Maryland range from 16 to 50 processors and 7.5 TB to 15 TB of IDE disk storage
- Data declustered across cluster nodes to promote parallel I/O
- Uses DataCutter and IP4G toolkits for data preprocessing and declustering
- R-tree indexing of declustered data
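The declustering and indexing steps above can be sketched in a few lines. This is an illustrative toy (the names `Chunk`, `decluster`, and `query` are not DataCutter or IP4G APIs): spatial chunks are assigned to cluster nodes round-robin, and a simple bounding-box index tells a range query which nodes must participate, so the nodes can read in parallel.

```python
# Hypothetical sketch of declustering + indexing; not the DataCutter/IP4G API.
from dataclasses import dataclass

@dataclass
class Chunk:
    min_x: float          # bounding box of the chunk (1-D for simplicity)
    max_x: float
    node: int = -1        # cluster node holding the chunk

def decluster(chunks, n_nodes):
    """Assign chunks to nodes round-robin (a simple declustering policy)."""
    for i, c in enumerate(chunks):
        c.node = i % n_nodes
    return chunks

def query(chunks, lo, hi):
    """Index lookup: which nodes must serve the range query [lo, hi]."""
    return sorted({c.node for c in chunks if c.max_x >= lo and c.min_x <= hi})

chunks = decluster([Chunk(i * 10.0, i * 10.0 + 10.0) for i in range(8)], 4)
print(query(chunks, 15.0, 35.0))   # chunks 1..3 live on nodes 1, 2, 3
```

A production system would use an R-tree rather than a linear scan, but the contract is the same: the index maps a query region to the nodes and chunks that intersect it.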
11. Ohio Supercomputing Center Mass Storage Testbed
- 50 TB of performance storage
- Home directories, project storage space, and long-term frequently accessed files
- 420 TB of performance/capacity storage
- Active Disk Cache: compute jobs that require directly connected storage
- Parallel file systems and scratch space
- Large temporary holding area
- 128 TB tape library
- Backups and long-term "offline" storage
IBM's Storage Tank technology combined with TFN connections will allow large data sets to be seamlessly moved throughout the state with increased redundancy and seamless delivery.
12. Services
- Filter-stream based distributed execution middleware (DataCutter, STORM)
- Grid-based dataset management, query, and on-demand data product generation (STORM, Active Proxy-G, Mako)
- Supports distributed storage of XML schemas through virtualized databases and file systems
- Distributed metadata management (Mobius Global Model Exchange)
- Tracks metadata associated with workflows, input image datasets, and checkpointed intermediate results
13. Underlying Technologies
- DataCutter
- Component-based middleware for processing of large distributed datasets
- Enables execution of user-defined data processing components in a distributed environment
- STORM
- Basic database support for large file-based scientific datasets in the Grid
- Implemented on the DataCutter framework
- Efficient subsetting and user-defined filtering of large, distributed datasets
- Object-relational SQL front end
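The subsetting-plus-filtering pattern STORM provides can be illustrated with a toy tuple stream. This is a hedged sketch, not the STORM API: a declarative range predicate (what the SQL front end would produce) is combined with an optional user-defined filter, and matching tuples are streamed to the consumer.

```python
# Illustrative sketch of STORM-style subsetting; names are not STORM APIs.
def subset(tuples, predicate, user_filter=None):
    """Yield tuples matching the declarative predicate and optional UDF."""
    for t in tuples:
        if predicate(t) and (user_filter is None or user_filter(t)):
            yield t

# Toy file-based dataset: one record per time step.
dataset = [{"time": t, "pressure": 90 + t} for t in range(5)]

# Declarative part: time >= 2; user-defined part: pressure below 94.
result = list(subset(dataset, lambda t: t["time"] >= 2,
                     lambda t: t["pressure"] < 94))
print(result)   # [{'time': 2, 'pressure': 92}, {'time': 3, 'pressure': 93}]
```

Because `subset` is a generator, results stream to the client as they are produced rather than being materialized first, matching the filter-stream model.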
14. Underlying Technologies
- IP4G (Image Processing for the Grid)
- Toolkit to create parallel, distributed image processing applications
- Built on the DataCutter framework
- VTK, ITK in a grid-based computation environment
- Active Proxy-G: Active Semantic Data Cache
- Employs user semantics to cache and retrieve data
- Stores and reuses results of computations
- Compilation support
- Thesis work of Henrique Andrade
- Mobius
- Services for managing metadata definitions and metadata on the Grid
- Controlled metadata definitions and metadata versioning
- Federated XML-based data and metadata storage
15. DataCutter Support for Demand-Driven Workflows
- Many data analysis queries have multiple stages
- Decompose into parallel components
- Strategically place components
- Create GT4-compliant services
Virtual Microscope
Iso-surface Rendering
http://www.datacutter.org/
16. Integrating DataCutter with Existing Grid Toolkits: SRB, Globus, NWS
- SRB integration: subset and filter datasets
- Globus integration: DataCutter uses Globus resource discovery, resource allocation, authentication, and authorization services
- Network Weather Service (NWS) integration: NWS is used for system monitoring
- DataCutter will be used as a toolkit to assemble GT4-compliant services
17. STORM Query Planning
http://storm.bmi.ohio-state.edu/
18. STORM Query Execution
19. Data Returned as a Stream of Tuples
- Stand-alone client
- MPI program
- MPI program provides a partitioning function
- Partitioning service generates the mapping
- Data mover sends data to the appropriate MPI process
- Single or replicated copies of a DataCutter filter group
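The partitioning step above can be sketched as follows. All names here are illustrative (this is not the STORM interface): the client, e.g. an MPI program, supplies a partition function; the partitioning service uses it to map each result tuple to a destination rank, and the data mover then ships each bucket to that process.

```python
# Hedged sketch of tuple partitioning for an MPI consumer; names are illustrative.
def build_mapping(tuples, partition_fn, n_procs):
    """Group tuples by destination rank using the client-supplied function."""
    buckets = {rank: [] for rank in range(n_procs)}
    for t in tuples:
        buckets[partition_fn(t, n_procs)].append(t)
    return buckets

# Client-supplied partitioning: block-partition on a grid-cell id (8 cells).
tuples = [{"cell": c, "val": c * 0.5} for c in range(8)]
mapping = build_mapping(tuples, lambda t, n: t["cell"] * n // 8, 4)

print({r: [t["cell"] for t in ts] for r, ts in mapping.items()})
# {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```

The key design point is that the mapping logic stays with the client (who knows the consumer's data distribution), while the data movement stays with the middleware.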
20. Digital Microscopy: NPACI Telescience, BIRN
- Goal
- Remote access and processing of subsets of large, distributed virtual slides (e.g., 40,000 x 40,000 pixels)
- DataCutter, IP4G
- Indexing, querying, caching, and subsetting
- Image processing by custom routines, VTK and ITK layered on DataCutter
- Use of heterogeneous, distributed clusters for data processing
[Diagram: a query flows from the Telescience Portal through DataCutter to subsets of a 40,000 x 40,000 pixel virtual slide.]
21. Telescience Portal and DataCutter
22. Virtual Slide Cooperative Study Support
- Children's Oncology Group, CALGB cooperative studies
- 60 slides/day: 120 GB/day compressed, 3 TB/day uncompressed
- Remote review of slides
- Computer-assisted tumor grading
- 3-D reconstruction
- Tissue microarray support
- CALGB began November 2003; Children's Oncology Group in Spring 2004
23. Distributed, Federated and Integrated
- Consortia group support
- Virtual slides
- OSC 420-Terabyte on-line storage system
- CALGB, COG
- OSU's Virtual Placenta Project
- Embryonic development and gene expression
- BIRN OSU/UCSD multi-photon project
- Detect rare mitoses in a mouse brain tumor model
24. Prototype Multiscale Data Analysis Pipeline
- Disk-based multi-scale dataset
- Total of 1 TB of data generated by super-sampling; the Visible Woman used as the coarse reference dataset
- Preparation for a multi-scale Visible Mouse project that will synthesize multiple imaging modalities, microscopy, and high-throughput molecular data
- Filters use indices to extract data subsets
- Interpolation to derive a single structured mesh from the results of a multi-scale query
- Results streamed to the Ohio Supercomputer Center parallel renderer
- Demand-driven processing under client control
25. Multiscale Pipeline
26. On-Demand Data Analysis: The Instrumented Oilfield
27. [Diagram: the instrumented oilfield feedback loop]
- Production simulation via reservoir modeling (Model 1, Model 2, ..., Model N)
- Monitor production by acquiring time-lapse observations of seismic data
- Revise knowledge of the reservoir model via imaging and inversion of seismic data
- Modify the production strategy using an optimization criterion
- Data analysis tools (e.g., visualization) and data management and manipulation tools feed new models or parameters back into the loop
28. Analysis of Oil Reservoir Simulation Data
- Datasets
- A 1.5 TB dataset: 207 simulations, selected from several geostatistics models and well patterns; each simulation is 6.9 GB (10,000 time steps, 9,000 grid elements, 8 scalars + 3 vectors = 17 variables)
- A 5 TB dataset: 500 simulations, selected from several geostatistics models and well patterns; each simulation is 10 GB (2,000 time steps, 65K grid elements, 8 scalars + 3 vectors = 17 variables)
- Stored at
- SDSC: HPSS and a 30 TB Storage Area Network system
- UMD: 9 TB of disk on 50 nodes (PIII-650, 768 MB, switched Ethernet)
- OSU: 7.2 TB of disk on 24 nodes (PIII-900, 512 MB, switched Ethernet)
- Data analysis
- Economic model assessment
- Bypassed oil regions
- Representative realization selection for further simulations
29. Example: Bypassed Oil
- Query: Find all the datasets in D that have bypassed oil pockets with at least Tcc grid cells.
- RD -- Read data filter: access datasets.
- CC -- Connected component filter: perform connected component analysis to find oil regions per time step.
- MT -- Merge over time: combine over multiple time steps for bypassed oil.
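The RD, CC, and MT filters above can be sketched on a toy 1-D grid (the real filters operate on 3-D simulation output, and the saturation/speed thresholds below are illustrative, not from the source). A cell is treated as "bypassed" when oil saturation is high but flow speed is low; CC finds connected runs of such cells, and MT keeps only cells bypassed in every time step.

```python
# Hedged sketch of the RD -> CC -> MT pipeline on a toy 1-D grid.
def bypassed_mask(sat, speed, sat_min=0.7, speed_max=0.1):
    """A cell is 'bypassed' when saturation is high but speed is low."""
    return [s >= sat_min and v <= speed_max for s, v in zip(sat, speed)]

def connected_components(mask):
    """CC filter: sizes of maximal runs of True cells in a 1-D mask."""
    comps, run = [], 0
    for m in mask + [False]:        # sentinel closes the final run
        if m:
            run += 1
        elif run:
            comps.append(run)
            run = 0
    return comps

def merge_over_time(masks):
    """MT filter: keep cells that are bypassed in every time step."""
    merged = masks[0]
    for m in masks[1:]:
        merged = [a and b for a, b in zip(merged, m)]
    return merged

# Two time steps over 10 grid cells (RD would read these from disk).
sat = [[0.9, 0.9, 0.2, 0.8, 0.8, 0.8, 0.1, 0.9, 0.9, 0.9],
       [0.9, 0.8, 0.3, 0.8, 0.8, 0.8, 0.2, 0.9, 0.1, 0.9]]
speed = [[0.0] * 10, [0.0] * 10]
pockets = connected_components(merge_over_time(
    [bypassed_mask(s, v) for s, v in zip(sat, speed)]))
print([p for p in pockets if p >= 2])   # pockets with at least Tcc=2 cells
```

A dataset in D would be reported if this final list is non-empty for its grids.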
30. Seismic Modeling of Reservoirs
[Diagram: reservoir datasets flow through DataCutter.]
31. Seismic Modeling of Reservoirs
[Diagram: reservoir datasets flow through STORM.]
32. Seismic Data Analysis: STORM On-Demand Processing of a 1.5 TB Seismic Dataset
[Diagram: seismic data hierarchy -- survey, line, source position (SP or CDP), traces, array, receiver group position, component.]
33. Multi-Query Optimization: Active Proxy-G
- Goal: minimize the total cost of processing a series of queries by creating an optimized access plan for the entire sequence (Kang, Dietz, and Bhargava)
- Approach: minimize the total cost of processing a series of queries through data and computation reuse
- IPDPS 2002, SC 2002, ICS 2002
[Diagram: queries q1, q2, q3 over a volume -- the blue slab of q2 is the same as in q1, and the pieces of q3 were computed for other queries in the past.]
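The data-reuse idea can be shown with a minimal semantic-cache sketch (illustrative, not the Active Proxy-G implementation): each cached entry records the region it covers, a new query reuses the overlapping cached slabs, and only the uncovered remainder is computed.

```python
# Minimal sketch of semantic-cache query planning over 1-D intervals.
def overlap(a, b):
    """Intersection of two half-open intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def plan(query, cache):
    """Split a query interval into reused pieces and pieces to compute."""
    reused = [r for r in (overlap(query, c) for c in cache) if r]
    remaining, lo = [], query[0]
    for s, e in sorted(reused):     # subtract reused pieces left to right
        if s > lo:
            remaining.append((lo, s))
        lo = max(lo, e)
    if lo < query[1]:
        remaining.append((lo, query[1]))
    return reused, remaining

cache = [(0, 40), (60, 80)]        # slabs computed for earlier queries
print(plan((20, 70), cache))       # reuse (20,40) and (60,70); compute (40,60)
```

Real queries cover multi-dimensional regions and cached entries also carry the computation (e.g. the processing chain) that produced them, but the overlap-and-remainder plan is the core of the optimization.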
34. What does it buy? (Digital microscopy)
- 12 clients
- 4 x 16 VR queries
- 8 x 32 VM queries
- 4 processors (up to 4 queries simultaneously)
[Chart: average execution time per query (s) vs. PDSS size (128M-320M), comparing reuse of results of identical queries, reuse disabled, and the active semantic cache.]
35. Active Proxy-G Functional Components
- Query Server
- Lightweight Directory Service
- Workload Monitor Service
- Persistent Data Store Service
[Diagram: clients 1..k submit a query workload to Active Proxy-G, whose Query Server coordinates the Lightweight Directory, Workload Monitor, and Persistent Data Store Services; directory and workload updates flow in, and subqueries are dispatched to replicated application servers I..n.]
36. Automatic Data Virtualization
- Scientific and engineering applications require interactive exploration and analysis of datasets
- Application developers generally prefer storing data in files
- Support high-level queries on multi-dimensional distributed datasets
- Many possible data abstractions and query interfaces
- Grid-virtualized object-relational database
- Grid-virtualized objects with user-defined methods invoked to access and process data
- A virtual relational table view
- Large distributed scientific datasets
[Diagram: a data virtualization layer over a data service.]
37. System Architecture
SELECT * FROM IPARS WHERE RID IN (0,6,26,27) AND TIME>1000 AND SPEED(OILVX, OILVY, OILVZ)<0.7
- Operations: subsetting, filtering, user-defined filtering
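The query above is declarative: the data virtualization layer must map the requested realizations (RID) to physical files using the dataset descriptor, and evaluate derived attributes such as SPEED on the fly. The sketch below is hedged and illustrative; the descriptor contents and function names are assumptions, not the IPARS or STORM interfaces.

```python
# Illustrative sketch of the virtual-table layer behind the IPARS query.
import math

# Toy dataset descriptor: realization id -> file holding its data.
descriptor = {0: "sim0.dat", 6: "sim6.dat", 26: "sim26.dat"}

def plan_files(rids):
    """Map requested realizations to the physical files that hold them."""
    return [descriptor[r] for r in rids if r in descriptor]

def speed(vx, vy, vz):
    """Derived attribute: magnitude of the oil velocity vector."""
    return math.sqrt(vx * vx + vy * vy + vz * vz)

print(plan_files([0, 6, 26, 27]))        # RID 27 has no file in this toy descriptor
print(round(speed(3.0, 4.0, 0.0), 1))    # 5.0
```

The point of the layer is exactly this split: the user writes SQL against the virtual table, while the service resolves files, layout, and derived attributes from metadata.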
38. Comparison with Hand-Written Codes
Dataset stored on 16 nodes: performance difference is within 17%, with an average difference of 14%.
Dataset stored on a single node: performance difference is within 4%.
39. Components of the Meta-data Descriptor
- Describes attributes, locations of files, layout of data in files, and indices
40. Dataset Descriptor Example
41. Active Projects/Funding
- National Science Foundation: National Middleware Infrastructure
- National Science Foundation: ITR on the Instrumented Oilfield (Dynamic Data Driven Application Systems)
- National Science Foundation: NGS, An Integrated Middleware and Language/Compiler for Data Intensive Applications in Grid Environments
- Center for Grid Enabled Biomedical Image Analysis (NIH, NIBIB, NIGMS)
- Biomedical Research Technology Transfer Partnership Award, Biomedical Informatics Synthesis Platform (State of Ohio)
- Department of Energy: DataCutter, Software Support for Generating Data Products from Very Large Datasets
- NCI: Overcoming Barriers to Clinical Trial Accrual
- OSU Cancer Center Shared Resource
42. Mobius
- Middleware system that provides support for management of metadata definitions (defined as XML schemas) and efficient storage and retrieval of data instances in a distributed environment
- Mechanism for data-driven applications to cache, share, and asynchronously communicate data in a distributed environment
- Grid-based distributed, searchable, and shareable persistent storage
- Infrastructure for a grid coordination language
http://projectmobius.osu.edu/
43. Global Model Exchange
- Stores and links data models defined inside namespaces in the grid
- Enables other services to publish, retrieve, discover, remove, and version metadata definitions
- Services composed in a DNS-like architecture representing the parent-child namespace hierarchy
- When a schema is registered in GME, it is stored under the name and namespace specified by the application, and the schema is assigned a version number
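The registration-and-versioning behavior can be sketched with a toy registry. This is illustrative only (the class and method names are assumptions, not the Mobius GME interface): each schema registered under a (namespace, name) pair receives the next version number, and lookups return either a specific version or the latest.

```python
# Toy sketch of GME-style schema registration and versioning.
class SchemaRegistry:
    def __init__(self):
        self.store = {}            # (namespace, name) -> list of schema versions

    def register(self, namespace, name, schema):
        """Store a schema under its namespace/name; return the version assigned."""
        versions = self.store.setdefault((namespace, name), [])
        versions.append(schema)
        return len(versions)       # versions are numbered 1, 2, ...

    def retrieve(self, namespace, name, version=None):
        """Fetch a specific version, or the latest if none is given."""
        versions = self.store[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

gme = SchemaRegistry()
gme.register("edu.osu.bmi", "Slide", "<schema v1/>")
v = gme.register("edu.osu.bmi", "Slide", "<schema v2/>")
print(v, gme.retrieve("edu.osu.bmi", "Slide"))   # 2 <schema v2/>
```

In the real system the registry is distributed: each namespace is served by a GME node, and lookups for other namespaces are delegated up and down the DNS-like hierarchy.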
48. Functioning Prototype: Cost Characterization
- System prototype constructed
- Versioned grid schema management, database creation, insertion, and query implemented
- Benchmarks carried out involving different schema shapes and sizes
49. System Architecture
50. Image Processing Pipeline with Checkpointing
51. Related Work
- GGF
- Grid middleware: Globus, Network Weather Service, GridSolve, Storage Resource Broker, Cactus, Condor
- Common Component Architecture
- Query and indexing of very large databases: Jim Gray (Microsoft), keyhole.com
- Close relationship to much visualization work
52. Multiscale Laboratory Research Group
Ohio State University: Joel Saltz, Gagan Agrawal, Umit Catalyurek, Tahsin Kurc, Shannon Hastings, Steve Langella, Scott Oster, Tony Pan, Benjamin Rutt, Sivaramakrishnan (K2), Michael Zhang, Dan Cowden, Mike Gray
The Ohio Supercomputer Center: Don Stredney, Dennis Sessanna, Jason Bryan
University of Maryland: Alan Sussman, Henrique Andrade, Christian Hansen
53. Center on Grid Enabled Image Processing
- Joel Saltz
- Michael Caligiuri
- Charis Eng
- Mike Knopp
- DK Panda
- Steve Qualman
- Jay Zweier
54. Instrumented Oilfield Collaborators
Rutgers University, Electrical and Computer Eng. Dept.: Manish Parashar, Manish Agarwal
University of Maryland, Computer Science Department: Alan Sussman, Christian Hansen
The Ohio State University, Biomedical Informatics Department: Joel Saltz, Umit Catalyurek, Mike Gray, Tahsin Kurc, Shannon Hastings, Steve Langella, Krishnan Sivaramakrishnan, Tyler Gingrich
San Diego Supercomputer Center: Arcot Rajasekar, Mike Wan
University of Texas at Austin: Mary Wheeler, Hector Klie, Malgorzata Peszynska, Ryan Martino