2. Middleware Infrastructure for Large-Scale Data Management
- Umit Catalyurek, Tahsin Kurc, Joel Saltz
- Department of Biomedical Informatics
- The Ohio State University
VIEWS Alliance Forum, July 19-20, 2004
3. Goals: What will the world look like? (IMAGE Data Management Working Group)
- Identify, query, retrieve, and carry out on-demand data product generation directed at collections of data from multiple sites/groups on a given topic; reproduce each group's data analysis and carry out new analyses on all datasets. Should be able to carry out entirely new analyses or to incrementally modify other scientists' data analyses. Should not have to worry about the physical location of data or processing. Should have excellent tools available to examine data, including a mechanism to authenticate potential users, control access to data, and log the identity of those accessing data.
4. Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003
- Organisers: Malcolm Atkinson, Richard Ansorge, Richard Baldock, Dave Berry, Mike Brady, Vincent Breton, Frederica Darema, Mark Ellisman, Cecile Germain-Renaud, Derek Hill, Robert Hollebeek, Chris Johnson, Michael Knopp, Alan Rector, Joel Saltz, Chris Taylor, Bonnie Webber
5. Image/Data Application Areas
Imaging, Medical Analysis and Grid Environments (IMAGE), September 16-18, 2003, e-Science Institute, 15 South College Street, Edinburgh
- Satellite Data Processing
- Digital Pathology
- Managing Oilfields, Contaminant Transport
- Biomedical Image Analysis
- DCE-MRI Analysis
6. Dataset Analysis and Visualization
- Spatio-temporal datasets (generally low dimensional) describe physical scenarios
- Data products often involve results from an ensemble of spatio-temporal datasets
- Some applications require interactive exploration of datasets
- Common operations: subsetting, filtering, interpolation, projection, comparison, frequency counts
- Optimizations: semantic caching, caching of intermediate results, multiple-query optimization
7. Molecular Data and OSU Testbed Effort
- Data sharing in OSU shared resource
- Support for all data sharing in hundreds of research studies in the OSU comprehensive cancer center
- State of Ohio BRTT
- Integration of clinical, genotype, proteomic, histological, and gene regulatory data in the context of 4 translational research projects
- $2M per year to fund development of bioinformatics data sharing infrastructure
8. Examples of Data Sources to be Integrated
Examples of data types that are generated or
referenced by OSUCCC Shared Resources
9. Center for Grid Enabled Image Analysis
- Biomedical research
- Ensure success of biventricular pacing; mechanism of ischemic cardiac injury; characterization of the relationship between angiogenesis and breast and bone cancer; mouse models of tumorigenesis; role of expression of oncogenes in placental development (virtual slides, mouse placenta)
- Radiology/cardiac imaging research
- Capture and analyze time-dependent cardiac imagery; EPR technology development; quantify treatment efficacy through analysis of diffusion contrast imagery; automated detection of mitoses; deconvolution, 3-D reconstruction, segmentation, and shape characterization in microscopy imagery
- Computer Science
- Middleware for large-scale multi-scale data, grid metadata management, feature detection, parallel visualization, on-demand and interactive computing
10. Data Storage
- Clusters provide inexpensive and scalable storage
- $50K-$100K clusters at Ohio State, OSC, and U. Maryland range from 16 to 50 processors and 7.5 TB to 15 TB of IDE disk storage
- Data declustered across cluster nodes to promote parallel I/O
- Uses DataCutter and IP4G toolkits for data preprocessing and declustering
- R-tree indexing of declustered data
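The declustering and indexing steps above can be sketched in a few lines. This is an illustrative toy (the names `Chunk`, `decluster`, and `query` are not DataCutter or IP4G APIs): spatial chunks are assigned to cluster nodes round-robin, and a simple bounding-box index tells a range query which nodes must participate, so the nodes can read in parallel.

```python
# Hypothetical sketch of declustering + indexing; not the DataCutter/IP4G API.
from dataclasses import dataclass

@dataclass
class Chunk:
    min_x: float          # bounding box of the chunk (1-D for simplicity)
    max_x: float
    node: int = -1        # cluster node holding the chunk

def decluster(chunks, n_nodes):
    """Assign chunks to nodes round-robin (a simple declustering policy)."""
    for i, c in enumerate(chunks):
        c.node = i % n_nodes
    return chunks

def query(chunks, lo, hi):
    """Index lookup: which nodes must serve the range query [lo, hi]."""
    return sorted({c.node for c in chunks if c.max_x >= lo and c.min_x <= hi})

chunks = decluster([Chunk(i * 10.0, i * 10.0 + 10.0) for i in range(8)], 4)
print(query(chunks, 15.0, 35.0))   # chunks 1..3 live on nodes 1, 2, 3
```

A production system would use an R-tree rather than a linear scan, but the contract is the same: the index maps a query region to the nodes and chunks that intersect it.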
11. Ohio Supercomputing Center Mass Storage Testbed
- 50 TB of performance storage
- Home directories, project storage space, and long-term frequently accessed files
- 420 TB of performance/capacity storage
- Active Disk Cache: compute jobs that require directly connected storage
- Parallel file systems and scratch space
- Large temporary holding area
- 128 TB tape library
- Backups and long-term "offline" storage
IBM's Storage Tank technology combined with TFN connections will allow large data sets to be seamlessly moved throughout the state with increased redundancy and seamless delivery.
12. Services
- Filter-stream based distributed execution middleware (DataCutter, STORM)
- Grid-based dataset management, query, and on-demand data product generation (STORM, Active Proxy-G, Mako)
- Supports distributed storage of XML schemas through virtualized databases and file systems
- Distributed metadata management (Mobius Global Model Exchange)
- Tracks metadata associated with workflows, input image datasets, and checkpointed intermediate results
13. Underlying Technologies
- DataCutter
- Component-based middleware for processing of large distributed datasets
- Enables execution of user-defined data processing components in a distributed environment
- STORM
- Basic database support for large file-based scientific datasets in the Grid
- Implemented on the DataCutter framework
- Efficient subsetting and user-defined filtering of large, distributed datasets
- Object-relational SQL front end
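The subsetting-plus-filtering pattern STORM provides can be illustrated with a toy tuple stream. This is a hedged sketch, not the STORM API: a declarative range predicate (what the SQL front end would produce) is combined with an optional user-defined filter, and matching tuples are streamed to the consumer.

```python
# Illustrative sketch of STORM-style subsetting; names are not STORM APIs.
def subset(tuples, predicate, user_filter=None):
    """Yield tuples matching the declarative predicate and optional UDF."""
    for t in tuples:
        if predicate(t) and (user_filter is None or user_filter(t)):
            yield t

# Toy file-based dataset: one record per time step.
dataset = [{"time": t, "pressure": 90 + t} for t in range(5)]

# Declarative part: time >= 2; user-defined part: pressure below 94.
result = list(subset(dataset, lambda t: t["time"] >= 2,
                     lambda t: t["pressure"] < 94))
print(result)   # [{'time': 2, 'pressure': 92}, {'time': 3, 'pressure': 93}]
```

Because `subset` is a generator, results stream to the client as they are produced rather than being materialized first, matching the filter-stream model.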
14. Underlying Technologies
- IP4G (Image Processing for the Grid)
- Toolkit to create parallel, distributed image processing applications
- Built on the DataCutter framework
- VTK, ITK in a grid-based computation environment
- Active Proxy-G: Active Semantic Data Cache
- Employs user semantics to cache and retrieve data
- Stores and reuses results of computations
- Compilation support
- Thesis work of Henrique Andrade
- Mobius
- Services for managing metadata definitions and metadata on the Grid
- Controlled metadata definitions and metadata versioning
- Federated XML-based data and metadata storage
15. DataCutter Support for Demand-Driven Workflows
- Many data analysis queries have multiple stages
- Decompose into parallel components
- Strategically place components
- Create GT4-compliant services
Virtual Microscope
Iso-surface Rendering
http://www.datacutter.org/
16. Integrating DataCutter with Existing Grid Toolkits: SRB, Globus, NWS
- SRB integration: subset and filter datasets
- Globus integration: DataCutter uses Globus resource discovery, resource allocation, authentication, and authorization services
- Network Weather Service (NWS) integration: NWS is used for system monitoring
- DataCutter will be used as a toolkit to assemble GT4-compliant services
17. STORM Query Planning
http://storm.bmi.ohio-state.edu/
18. STORM Query Execution
19. Data Returned as a Stream of Tuples
- Stand-alone client
- MPI program
- MPI program provides a partitioning function
- Partitioning service generates the mapping
- Data mover sends data to the appropriate MPI process
- Single or replicated copies of a DataCutter filter group
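The partitioning step above can be sketched as follows. All names here are illustrative (this is not the STORM interface): the client, e.g. an MPI program, supplies a partition function; the partitioning service uses it to map each result tuple to a destination rank, and the data mover then ships each bucket to that process.

```python
# Hedged sketch of tuple partitioning for an MPI consumer; names are illustrative.
def build_mapping(tuples, partition_fn, n_procs):
    """Group tuples by destination rank using the client-supplied function."""
    buckets = {rank: [] for rank in range(n_procs)}
    for t in tuples:
        buckets[partition_fn(t, n_procs)].append(t)
    return buckets

# Client-supplied partitioning: block-partition on a grid-cell id (8 cells).
tuples = [{"cell": c, "val": c * 0.5} for c in range(8)]
mapping = build_mapping(tuples, lambda t, n: t["cell"] * n // 8, 4)

print({r: [t["cell"] for t in ts] for r, ts in mapping.items()})
# {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
```

The key design point is that the mapping logic stays with the client (who knows the consumer's data distribution), while the data movement stays with the middleware.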
20. Digital Microscopy: NPACI Telescience, BIRN
- Goal
- Remote access and processing of subsets of large, distributed virtual slides (e.g., 40,000 x 40,000 pixels)
- DataCutter, IP4G
- Indexing, querying, caching, and subsetting
- Image processing by custom routines, VTK and ITK layered on DataCutter
- Use of heterogeneous, distributed clusters for data processing
[Diagram: a query flows from the Telescience Portal through DataCutter to subsets of a 40,000 x 40,000 pixel virtual slide.]
21. Telescience Portal and DataCutter
22. Virtual Slide Cooperative Study Support
- Children's Oncology Group, CALGB cooperative studies
- 60 slides/day: 120 GB/day compressed, 3 TB/day uncompressed
- Remote review of slides
- Computer-assisted tumor grading
- 3-D reconstruction
- Tissue microarray support
- CALGB began November 2003; Children's Oncology Group in Spring 2004
23. Distributed, Federated and Integrated
- Consortia group support
- Virtual slides
- OSC 420-Terabyte on-line storage system
- CALGB, COG
- OSU's Virtual Placenta Project
- Embryonic development and gene expression
- BIRN OSU/UCSD multi-photon project
- Detect rare mitoses in a mouse brain tumor model
24. Prototype Multiscale Data Analysis Pipeline
- Disk-based multi-scale dataset
- Total of 1 TB of data generated by super-sampling; the Visible Woman used as the coarse reference dataset
- Preparation for a multi-scale Visible Mouse project that will synthesize multiple imaging modalities, microscopy, and high-throughput molecular data
- Filters use indices to extract data subsets
- Interpolation to derive a single structured mesh from the results of a multi-scale query
- Results streamed to the Ohio Supercomputer Center parallel renderer
- Demand-driven processing under client control
25. Multiscale Pipeline
26. On-Demand Data Analysis: The Instrumented Oilfield
27. [Diagram: the instrumented oilfield feedback loop]
- Production simulation via reservoir modeling (Model 1, Model 2, ..., Model N)
- Monitor production by acquiring time-lapse observations of seismic data
- Revise knowledge of the reservoir model via imaging and inversion of seismic data
- Modify the production strategy using an optimization criterion
- Data analysis tools (e.g., visualization) and data management and manipulation tools feed new models or parameters back into the loop
28. Analysis of Oil Reservoir Simulation Data
- Datasets
- A 1.5 TB dataset: 207 simulations, selected from several geostatistics models and well patterns; each simulation is 6.9 GB (10,000 time steps, 9,000 grid elements, 8 scalars + 3 vectors = 17 variables)
- A 5 TB dataset: 500 simulations, selected from several geostatistics models and well patterns; each simulation is 10 GB (2,000 time steps, 65K grid elements, 8 scalars + 3 vectors = 17 variables)
- Stored at
- SDSC: HPSS and a 30 TB Storage Area Network system
- UMD: 9 TB of disk on 50 nodes (PIII-650, 768 MB, switched Ethernet)
- OSU: 7.2 TB of disk on 24 nodes (PIII-900, 512 MB, switched Ethernet)
- Data analysis
- Economic model assessment
- Bypassed oil regions
- Representative realization selection for further simulations
29. Example: Bypassed Oil
- Query: Find all the datasets in D that have bypassed oil pockets with at least Tcc grid cells.
- RD -- Read data filter: access datasets.
- CC -- Connected component filter: perform connected component analysis to find oil regions per time step.
- MT -- Merge over time: combine over multiple time steps for bypassed oil.
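The RD, CC, and MT filters above can be sketched on a toy 1-D grid (the real filters operate on 3-D simulation output, and the saturation/speed thresholds below are illustrative, not from the source). A cell is treated as "bypassed" when oil saturation is high but flow speed is low; CC finds connected runs of such cells, and MT keeps only cells bypassed in every time step.

```python
# Hedged sketch of the RD -> CC -> MT pipeline on a toy 1-D grid.
def bypassed_mask(sat, speed, sat_min=0.7, speed_max=0.1):
    """A cell is 'bypassed' when saturation is high but speed is low."""
    return [s >= sat_min and v <= speed_max for s, v in zip(sat, speed)]

def connected_components(mask):
    """CC filter: sizes of maximal runs of True cells in a 1-D mask."""
    comps, run = [], 0
    for m in mask + [False]:        # sentinel closes the final run
        if m:
            run += 1
        elif run:
            comps.append(run)
            run = 0
    return comps

def merge_over_time(masks):
    """MT filter: keep cells that are bypassed in every time step."""
    merged = masks[0]
    for m in masks[1:]:
        merged = [a and b for a, b in zip(merged, m)]
    return merged

# Two time steps over 10 grid cells (RD would read these from disk).
sat = [[0.9, 0.9, 0.2, 0.8, 0.8, 0.8, 0.1, 0.9, 0.9, 0.9],
       [0.9, 0.8, 0.3, 0.8, 0.8, 0.8, 0.2, 0.9, 0.1, 0.9]]
speed = [[0.0] * 10, [0.0] * 10]
pockets = connected_components(merge_over_time(
    [bypassed_mask(s, v) for s, v in zip(sat, speed)]))
print([p for p in pockets if p >= 2])   # pockets with at least Tcc=2 cells
```

A dataset in D would be reported if this final list is non-empty for its grids.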
30. Seismic Modeling of Reservoirs
[Diagram: reservoir datasets flow through DataCutter.]
31. Seismic Modeling of Reservoirs
[Diagram: reservoir datasets flow through STORM.]
32. Seismic Data Analysis: STORM On-Demand Processing of a 1.5 TB Seismic Dataset
[Diagram: seismic data hierarchy -- survey, line, source position (SP or CDP), traces, array, receiver group position, component.]
33. Multi-Query Optimization: Active Proxy-G
- Goal: minimize the total cost of processing a series of queries by creating an optimized access plan for the entire sequence (Kang, Dietz, and Bhargava)
- Approach: minimize the total cost of processing a series of queries through data and computation reuse
- IPDPS 2002, SC 2002, ICS 2002
[Diagram: queries q1, q2, q3 over a volume -- the blue slab of q2 is the same as in q1, and the pieces of q3 were computed for other queries in the past.]
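The data-reuse idea can be shown with a minimal semantic-cache sketch (illustrative, not the Active Proxy-G implementation): each cached entry records the region it covers, a new query reuses the overlapping cached slabs, and only the uncovered remainder is computed.

```python
# Minimal sketch of semantic-cache query planning over 1-D intervals.
def overlap(a, b):
    """Intersection of two half-open intervals, or None if disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def plan(query, cache):
    """Split a query interval into reused pieces and pieces to compute."""
    reused = [r for r in (overlap(query, c) for c in cache) if r]
    remaining, lo = [], query[0]
    for s, e in sorted(reused):     # subtract reused pieces left to right
        if s > lo:
            remaining.append((lo, s))
        lo = max(lo, e)
    if lo < query[1]:
        remaining.append((lo, query[1]))
    return reused, remaining

cache = [(0, 40), (60, 80)]        # slabs computed for earlier queries
print(plan((20, 70), cache))       # reuse (20,40) and (60,70); compute (40,60)
```

Real queries cover multi-dimensional regions and cached entries also carry the computation (e.g. the processing chain) that produced them, but the overlap-and-remainder plan is the core of the optimization.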
34. What does it buy? (Digital microscopy)
- 12 clients
- 4 x 16 VR queries
- 8 x 32 VM queries
- 4 processors (up to 4 queries simultaneously)
[Chart: average execution time per query (s) vs. PDSS size (128M-320M), comparing reuse of results of identical queries, reuse disabled, and the active semantic cache.]
35. Active Proxy-G Functional Components
- Query Server
- Lightweight Directory Service
- Workload Monitor Service
- Persistent Data Store Service
[Diagram: clients 1..k submit a query workload to Active Proxy-G, whose Query Server coordinates the Lightweight Directory, Workload Monitor, and Persistent Data Store Services; directory and workload updates flow in, and subqueries are dispatched to replicated application servers I..n.]
36. Automatic Data Virtualization
- Scientific and engineering applications require interactive exploration and analysis of datasets
- Application developers generally prefer storing data in files
- Support high-level queries on multi-dimensional distributed datasets
- Many possible data abstractions and query interfaces
- Grid-virtualized object-relational database
- Grid-virtualized objects with user-defined methods invoked to access and process data
- A virtual relational table view
- Large distributed scientific datasets
[Diagram: a data virtualization layer over a data service.]
37. System Architecture
SELECT * FROM IPARS WHERE RID IN (0,6,26,27) AND TIME>1000 AND SPEED(OILVX, OILVY, OILVZ)<0.7
- Operations: subsetting, filtering, user-defined filtering
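The query above is declarative: the data virtualization layer must map the requested realizations (RID) to physical files using the dataset descriptor, and evaluate derived attributes such as SPEED on the fly. The sketch below is hedged and illustrative; the descriptor contents and function names are assumptions, not the IPARS or STORM interfaces.

```python
# Illustrative sketch of the virtual-table layer behind the IPARS query.
import math

# Toy dataset descriptor: realization id -> file holding its data.
descriptor = {0: "sim0.dat", 6: "sim6.dat", 26: "sim26.dat"}

def plan_files(rids):
    """Map requested realizations to the physical files that hold them."""
    return [descriptor[r] for r in rids if r in descriptor]

def speed(vx, vy, vz):
    """Derived attribute: magnitude of the oil velocity vector."""
    return math.sqrt(vx * vx + vy * vy + vz * vz)

print(plan_files([0, 6, 26, 27]))        # RID 27 has no file in this toy descriptor
print(round(speed(3.0, 4.0, 0.0), 1))    # 5.0
```

The point of the layer is exactly this split: the user writes SQL against the virtual table, while the service resolves files, layout, and derived attributes from metadata.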
38. Comparison with Hand-Written Codes
Dataset stored on 16 nodes: performance difference is within 17%, with an average difference of 14%.
Dataset stored on a single node: performance difference is within 4%.
39. Components of the Meta-data Descriptor
- Describes attributes, locations of files, layout of data in files, and indices
40. Dataset Descriptor Example
41. Active Projects/Funding
- National Science Foundation: National Middleware Infrastructure
- National Science Foundation: ITR on the Instrumented Oilfield (Dynamic Data Driven Application Systems)
- National Science Foundation: NGS, An Integrated Middleware and Language/Compiler for Data Intensive Applications in Grid Environments
- Center for Grid Enabled Biomedical Image Analysis (NIH, NIBIB, NIGMS)
- Biomedical Research Technology Transfer Partnership Award, Biomedical Informatics Synthesis Platform (State of Ohio)
- Department of Energy: DataCutter, Software Support for Generating Data Products from Very Large Datasets
- NCI: Overcoming Barriers to Clinical Trial Accrual
- OSU Cancer Center Shared Resource
42. Mobius
- Middleware system that provides support for management of metadata definitions (defined as XML schemas) and efficient storage and retrieval of data instances in a distributed environment
- Mechanism for data-driven applications to cache, share, and asynchronously communicate data in a distributed environment
- Grid-based distributed, searchable, and shareable persistent storage
- Infrastructure for a grid coordination language
http://projectmobius.osu.edu/
43. Global Model Exchange
- Stores and links data models defined inside namespaces in the grid
- Enables other services to publish, retrieve, discover, remove, and version metadata definitions
- Services composed in a DNS-like architecture representing the parent-child namespace hierarchy
- When a schema is registered in GME, it is stored under the name and namespace specified by the application, and the schema is assigned a version number
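The registration-and-versioning behavior can be sketched with a toy registry. This is illustrative only (the class and method names are assumptions, not the Mobius GME interface): each schema registered under a (namespace, name) pair receives the next version number, and lookups return either a specific version or the latest.

```python
# Toy sketch of GME-style schema registration and versioning.
class SchemaRegistry:
    def __init__(self):
        self.store = {}            # (namespace, name) -> list of schema versions

    def register(self, namespace, name, schema):
        """Store a schema under its namespace/name; return the version assigned."""
        versions = self.store.setdefault((namespace, name), [])
        versions.append(schema)
        return len(versions)       # versions are numbered 1, 2, ...

    def retrieve(self, namespace, name, version=None):
        """Fetch a specific version, or the latest if none is given."""
        versions = self.store[(namespace, name)]
        return versions[-1] if version is None else versions[version - 1]

gme = SchemaRegistry()
gme.register("edu.osu.bmi", "Slide", "<schema v1/>")
v = gme.register("edu.osu.bmi", "Slide", "<schema v2/>")
print(v, gme.retrieve("edu.osu.bmi", "Slide"))   # 2 <schema v2/>
```

In the real system the registry is distributed: each namespace is served by a GME node, and lookups for other namespaces are delegated up and down the DNS-like hierarchy.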
48. Functioning Prototype: Cost Characterization
- System prototype constructed
- Versioned grid schema management, database creation, insertion, and query implemented
- Benchmarks carried out involving different schema shapes and sizes
49. System Architecture
50. Image Processing Pipeline with Checkpointing
51. Related Work
- GGF
- Grid middleware: Globus, Network Weather Service, GridSolve, Storage Resource Broker, Cactus, Condor
- Common Component Architecture
- Query and indexing of very large databases: Jim Gray (Microsoft), keyhole.com
- Close relationship to much visualization work
52. Multiscale Laboratory Research Group
Ohio State University: Joel Saltz, Gagan Agrawal, Umit Catalyurek, Tahsin Kurc, Shannon Hastings, Steve Langella, Scott Oster, Tony Pan, Benjamin Rutt, Sivaramakrishnan (K2), Michael Zhang, Dan Cowden, Mike Gray
The Ohio Supercomputer Center: Don Stredney, Dennis Sessanna, Jason Bryan
University of Maryland: Alan Sussman, Henrique Andrade, Christian Hansen
53. Center on Grid Enabled Image Processing
- Joel Saltz
- Michael Caligiuri
- Charis Eng
- Mike Knopp
- DK Panda
- Steve Qualman
- Jay Zweier
54. Instrumented Oilfield Collaborators
Rutgers University, Electrical and Computer Eng. Dept.: Manish Parashar, Manish Agarwal
University of Maryland, Computer Science Department: Alan Sussman, Christian Hansen
The Ohio State University, Biomedical Informatics Department: Joel Saltz, Umit Catalyurek, Mike Gray, Tahsin Kurc, Shannon Hastings, Steve Langella, Krishnan Sivaramakrishnan, Tyler Gingrich
San Diego Supercomputer Center: Arcot Rajasekar, Mike Wan
University of Texas at Austin: Mary Wheeler, Hector Klie, Malgorzata Peszynska, Ryan Martino