Title: Scientific Data Management
1. Scientific Data Management Center (Integrated Software Infrastructure Center, ISIC)
Arie Shoshani
All Hands Meeting, March 26-27, 2002
http://sdm.lbl.gov/sdmcenter (http://sdmcenter.lbl.gov)
3. Original Goals and Framework
- Goal: a coordinated framework for the unification, development, deployment, and reuse of scientific data management software
- Framework
  - 4 areas (plus glue): very large, distributed, heterogeneous, data mining (plus agent technology)
  - 4 tier levels: storage, file, dataset, federated data
4. Task Diagram
Areas
- 1) Storage and retrieval of very large datasets
- 2) Access optimization of distributed data
- 3) Data mining and discovery of access patterns
- 4) Distributed, heterogeneous data access
- 5) Agent technology: enabling communication among tools and data (ORNL, NCSU)
Tier levels
- a) Storage Level
- b) File Level
- c) Dataset Level
- d) Dataset Federation Level
Tasks
- Multi-tier metadata system for querying heterogeneous data sources (LLNL, Georgia Tech)
- Knowledge-based federation of heterogeneous databases (SDSC)
- Analysis of application-level query patterns (LLNL, NWU)
- Optimizing shared access to tertiary storage (LBNL, ORNL)
- High-dimensional indexing techniques (LBNL)
- Multi-agent high-dimensional cluster analysis (ORNL)
- MPI I/O implementation based on file-level hints (ANL, NWU)
- Low-level API for grid I/O (ANL)
- Dimension reduction and sampling (LLNL, LBNL)
- Parallel I/O: improving parallel access from clusters (ANL, NWU)
- Adaptive file caching in a distributed system (LBNL)
- Optimization of low-level data storage, retrieval and transport (ORNL)
Grid Enabling Technology
5. Scientific Data Management ISIC
Participants
- DOE Labs: ANL, LBNL, LLNL, ORNL
- Universities: GTech, NCSU, NWU, SDSC
Scientific simulations and experiments (terabytes to petabytes)
- Climate Modeling
- Astrophysics
- Genomics and Proteomics
- High Energy Physics
SDM-ISIC Technology
- Optimizing shared access from mass storage systems
- Metadata and knowledge-based federations
- API for Grid I/O
- High-dimensional cluster analysis
- High-dimensional indexing
- Adaptive file caching
- Agents
Data Manipulation
- Getting files from tape archive
- Extracting subset of data from files
- Reformatting data
- Getting data from heterogeneous, distributed systems
- Moving data over the network
Goals
- Optimize and simplify
  - access to very large datasets
  - access to distributed data
  - access to heterogeneous data
  - data mining of very large datasets
Current
- Data Manipulation: 80% of time
- Scientific Analysis and Discovery: 20% of time
Goal (using SDM-ISIC technology)
- Data Manipulation: 20% of time
- Scientific Analysis and Discovery: 80% of time
6. Benefits to Applications
- Efficiency
  - Example: by removing I/O bottlenecks, matching storage structures to the application
- Effectiveness
  - Example: by making access to data from tertiary storage or various sites on the data grid transparent, more effective data exploration is possible
- New algorithms
  - Example: by developing a more effective high-dimensional clustering technique for large datasets, discovery of new correlations is possible
- Enabling ad-hoc exploration of data
  - Example: by enabling a "run and render" capability to visualize simulation output while the code is running, it is possible to monitor and steer a long-running simulation
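The "run and render" idea in the last example can be illustrated with a small sketch. This is not the center's software: the toy simulation, the function names `simulate` and `run_and_render`, and the steering rule (halve the time step once the value passes a limit) are all assumptions made up for illustration. The only point is the control flow: the simulation hands out a snapshot after every step, and the monitor can send a new parameter back while the run is still in progress.

```python
from typing import Dict, Generator, List, Optional


def simulate(steps: int, dt: float) -> Generator[Dict[str, float], Optional[float], None]:
    """Toy long-running simulation (hypothetical): Euler steps of dv/dt = v.

    Yields a snapshot after every step and accepts an optional new time
    step via send(), which is how the monitor steers the run.
    """
    value = 1.0
    for step in range(steps):
        value += value * dt                  # one explicit Euler step
        new_dt = yield {"step": step, "value": value, "dt": dt}
        if new_dt is not None:               # steering request from the monitor
            dt = new_dt


def run_and_render(steps: int, dt: float, limit: float) -> List[Dict[str, float]]:
    """Drive the simulation, record each snapshot, and steer it while it runs.

    Appending to `rendered` stands in for real visualization. The steering
    rule (halve dt when value exceeds limit) is an illustrative assumption.
    """
    rendered: List[Dict[str, float]] = []
    sim = simulate(steps, dt)
    try:
        snapshot = next(sim)
        while True:
            rendered.append(snapshot)
            steer = snapshot["dt"] / 2 if snapshot["value"] > limit else None
            snapshot = sim.send(steer)       # resume the run, optionally steering it
    except StopIteration:
        pass                                 # simulation finished
    return rendered
```

Calling `run_and_render(3, 0.5, 2.0)` returns three snapshots; the time step recorded in the last one is smaller than in the first because the monitor intervened mid-run.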
7. How to execute plan?
- Executive Committee
  - Made of area leaders
- Organize into projects
  - Led by area leaders
  - Common theme
  - Multiple tasks combine into common goal
  - All tasks covered (some in more than one project)
  - Initially focus on one primary application area (more is better)
  - Focus on one (or more) application scientist contacts
  - Focus on specific scenarios that represent real needs
- Conference calls
  - Every Monday
  - Cycle on Projects P1-P4
  - Open to all
  - (Arie and Ekow attend all)
- Quarterly reports
- Half-yearly all-hands
8. Organization of Projects P1, P2, P3, P4
- Task diagram from slide 4 (areas 1-5, tier levels a-d, and their tasks), shown again as the basis for organizing projects P1-P4
9. Projects and Primary Application Areas
- Organized ourselves into 4 projects
  - (P1) Heterogeneous Data Integration (Biology)
    - LLNL, SDSC, GATECH, NCSU, ORNL
  - (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
    - LLNL, ORNL, LBNL
  - (P3) Efficient Access from Large Datasets (HENP, Combustion)
    - LBNL, ORNL
  - (P4) Parallel Disk Access and Grid-IO (Astrophysics, Climate)
    - ANL, NWU, LLNL
10. Projects and Primary Application Areas
- Organized ourselves into 4 projects
  - (P1) Heterogeneous Data Integration (Biology)
    - LLNL: Terence
    - SDSC: Amarnath, Bertram, Ilkay
    - GATECH: Ling, Calton, students
    - NCSU: Mladen, students
    - ORNL: Tom
  - (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
    - LLNL: Chandrika, Ghaleb, Imola
    - ORNL: Nagiza, George, Tom
    - LBNL: Ekow
11. Projects and Primary Application Areas
- Organized ourselves into 4 projects
  - (P3) Efficient Access from Large Datasets (HENP, Combustion)
    - LBNL: John, Ekow, Arie, postdoc
    - ORNL: Randy, Dan
  - (P4) Parallel Disk Access and Grid-IO (Astrophysics, Climate)
    - ANL: Bill, Rob, Rajiv
    - NWU: Alok, Wei-keng, students
    - LLNL: Ghaleb
- Area leader at large
  - Tom
12. Focus on real needs
- Selected specific short-term goals and scenarios
  - (P1) Heterogeneous Data Integration (Biology)
    - Microarray analysis workflow scenario
  - (P2) Data Mining and Access Pattern Discovery (Climate, Astrophysics)
    - "Run and Render" scenario for Astrophysics
    - Dimensionality reduction for Climate model
  - (P3) Efficient Access from Large Datasets (HENP)
    - STAR analysis framework
  - (P4) Parallel Disk Access and Grid-IO (Astrophysics, Climate)
    - FLASH codes for Astrophysics
    - NetCDF using MPI-IO for Climate Modeling and Fusion
13. Application Scientists Contacts
- Close collaboration with individuals
  - Matt Coleman - LLNL (Biology)
  - Tony Mezzacappa - ORNL (Astrophysics)
  - Ben Santer - LLNL, John Drake - ORNL (Climate)
  - Doug Olson - LBNL, Wei-Ming Zhang - Kent (HENP)
  - Wendy Koegler - Sandia L. (Combustion)
  - Mike Papka - ANL (Astrophysics Vis)
  - Mike Zingale - U of Chicago (Astrophysics)
  - John Michalakes - NCAR (Climate)
14. Organization of Meeting
- First day
  - Applications perspective on data management needs
    - Explain why the need
    - Say what hurts the most
  - Technical details of current work and existing software
    - By project
    - Talks led by Area Leaders
- Second day
  - Discuss and develop plans: 4 breakout sessions
    - Specific technical goals in next half year
    - SDM-ISIC people involved
    - Application people involved
    - Estimated schedule
    - Longer term projections (2-3 years)
    - Identify potential new applications and future focus
  - Planning
    - Conference calls and reporting
    - Intellectual property
    - CVS repositories
15. Agenda - Morning
Day 1, March 26
- 8:00 Introduction and opening remarks: Arie Shoshani
- 8:15 Comments by DOE Program Manager: John Van Rosendale
- 8:30 Astrophysics Perspective: Tony Mezzacappa, ORNL
- 9:15 Climate Perspective: John Drake, ORNL
- 10:00 - 10:15 Break
- 10:15 HEP Perspective: Doug Olson, LBNL
- 11:00 Biology Perspective: Dave Nelson, LLNL
- 11:45 Putting software into production: Randy Burris, ORNL
- 12:00 Lunch
16. Agenda - Afternoon
- 1:00 PM (P1) Heterogeneous Data Access
  - Area Leader: Terence Critchlow
  - Supporting Heterogeneous Data Access in Genomics (presenter: Terence Critchlow)
  - Context-sensitive Service Composition for Support of Scientific Workflows (presenter: Mladen A. Vouk)
  - XWRAPComposer: A Wrapper Generation System for Integrating Bioinformatics Data Sources (presenter: Ling Liu)
  - Constructing Workflows by Integrating Interactive Information Sources (presenters: Amarnath Gupta and Ilkay Altintas)
- 2:00 PM (P2) Data Mining and Access Pattern Discovery
  - Area Leader: Nagiza Samatova
  - ASPECT: Adaptable Simulation Product Exploration and Control Toolkit (presenter: Nagiza Samatova)
  - Dimension Reduction and Sampling (presenter: Imola Fodor)
- 3:30 PM (P3) Efficient Access from Large Datasets
  - Area Leader: Arie Shoshani
  - Supporting Ad-hoc Data Exploration for Large Scientific Databases (presenter: Arie Shoshani)
  - Efficient Bitmap Indexing Techniques for Very Large Datasets (presenter: John Wu)
  - Shared Disk File Caching Taking into Account Delays in Space Reservations, Transfer, and Processing (presenter: Ekow Otoo)
  - Optimizing Shared Access to Tertiary Storage (presenter: Randy Burris)
- 4:30 PM (P4) Parallel Disk Access and Grid-IO
  - Area Leaders: Bill Gropp and Alok Choudhary
  - Parallel and Grid I/O Infrastructure (presenter: Rob Ross)
  - Enabling High Performance Application I/O (presenter: Wei-keng Liao)
- 5:30 PM Comments from application people (1 hour, free-form discussion)
17. Agenda - Day 2
- 8:00 Welcome and logistics
- 8:30 Recap and planning
- 9:30 Project breakout meetings (2 hours)
  - Specific technical goals in next half year
  - SDM-ISIC people involved
  - Application people involved
  - Estimated schedule
  - Longer term projections (2-3 years)
  - Identify potential new applications and future focus
- Lunch
- 1:00 Project breakout meetings (2 hours)
- 3:00 Summary of meetings (2 hours)
  - (30 min per project)
- 5:00 Conclusion and planning