Scientific Data Management - PowerPoint PPT Presentation

Provided by: sdm5
Learn more at: https://sdm.lbl.gov
Transcript and Presenter's Notes

Title: Scientific Data Management


1
Scientific Data Management Center (Integrated Software Infrastructure Center, ISIC)
Arie Shoshani
All Hands Meeting, March 26-27, 2002
http://sdm.lbl.gov/sdmcenter (http://sdmcenter.lbl.gov)
2
(No Transcript)
3
Original Goals and Framework
  • Coordinated framework for the unification, development, deployment, and
    reuse of scientific data management software
  • Framework
  • 4 areas (plus glue): very large, distributed, heterogeneous, data
    mining (plus agent technology)
  • 4 tier levels: storage, file, dataset, federated data

4
Task Diagram
Areas
  • 1) Storage and retrieval of very large datasets
  • 2) Access optimization of distributed data
  • 3) Data mining and discovery of access patterns
  • 4) Distributed, heterogeneous data access
  • 5) Agent technology
Tier levels
  • a) Storage Level
  • b) File Level
  • c) Dataset Level
  • d) Dataset Federation Level
Tasks
  • Multi-tier metadata system for querying heterogeneous data sources
    (LLNL, Georgia Tech)
  • Knowledge-based federation of heterogeneous databases (SDSC)
  • Analysis of application-level query patterns (LLNL, NWU)
  • Optimizing shared access to tertiary storage (LBNL, ORNL)
  • High-dimensional indexing techniques (LBNL)
  • Multi-agent high-dimensional cluster analysis (ORNL)
  • MPI I/O implementation based on file-level hints (ANL, NWU)
  • Low level API for grid I/O (ANL)
  • Dimension reduction and sampling (LLNL, LBNL)
  • Parallel I/O: improving parallel access from clusters (ANL, NWU)
  • Adaptive file caching in a distributed system (LBNL)
  • Optimization of low-level data storage, retrieval and transport (ORNL)
  • Enabling communication among tools and data (ORNL, NCSU)
  • Grid Enabling Technology
5
Scientific Data Management ISIC
Participants
  • DOE Labs: ANL, LBNL, LLNL, ORNL
  • Universities: GTech, NCSU, NWU, SDSC

Scientific simulations and experiments (terabytes, petabytes)
  • Climate Modeling
  • Astrophysics
  • Genomics and Proteomics
  • High Energy Physics

SDM-ISIC Technology
  • Optimizing shared access from mass storage
    systems
  • Metadata and knowledge-based federations
  • API for Grid I/O
  • High-dimensional cluster analysis
  • High-dimensional indexing
  • Adaptive file caching
  • Agents

Data Manipulation
  • Getting files from tape archive
  • Extracting subset of data from files
  • Reformatting data
  • Getting data from heterogeneous, distributed
    systems
  • Moving data over the network

Share of time
  • Current: Data Manipulation 80%, Scientific Analysis and Discovery 20%
  • Goal (using SDM-ISIC technology): Data Manipulation 20%, Scientific
    Analysis and Discovery 80%

Goals
  • Optimize and simplify
  • access to very large datasets
  • access to distributed data
  • access to heterogeneous data
  • data mining of very large datasets
6
Benefits to Applications
  • Efficiency
  • Example: by removing I/O bottlenecks and matching
    storage structures to the application
  • Effectiveness
  • Example: by making access to data from tertiary
    storage or various sites on the data grid
    transparent, more effective data exploration is
    possible
  • New algorithms
  • Example: by developing a more effective
    high-dimensional clustering technique for large
    datasets, discovery of new correlations is
    possible
  • Enabling ad-hoc exploration of data
  • Example: by enabling a "run and render"
    capability to visualize simulation output while
    the code is running, it is possible to monitor
    and steer a long-running simulation

7
How to execute plan?
  • Executive Committee
  • Made of area leaders
  • Organize into projects
  • Led by area leaders
  • Common theme
  • Multiple tasks combine into common goal
  • All tasks covered (some in more than one project)
  • Initially focus on one primary application area
    (more better)
  • Focus on one (or more) application scientist
    contacts
  • Focus on specific scenarios that represent real
    needs
  • Conference calls
  • Every Monday
  • Cycle through projects P1-P4
  • Open to all
  • (Arie and Ekow attend all)
  • Quarterly reports
  • Half yearly all-hands

8
Organization of Projects P1, P2, P3, P4
(Task diagram repeated from slide 4.)
9
Projects and Primary Application Areas
  • Organized ourselves into 4 projects
  • (P1) Heterogeneous Data Integration (biology)
  • LLNL, SDSC, GATECH, NCSU, ORNL
  • (P2) Data Mining and Access Pattern Discovery
    (Climate, Astrophysics)
  • LLNL, ORNL, LBNL
  • (P3) Efficient Access from Large Datasets
    (HENP, Combustion)
  • LBNL, ORNL
  • (P4) Parallel Disk Access and Grid-IO
    (Astrophysics, Climate)
  • ANL, NWU, LLNL

10
Projects and Primary Application Areas
  • Organized ourselves into 4 projects
  • (P1) Heterogeneous Data Integration (biology)
  • LLNL: Terence
  • SDSC: Amarnath, Bertram, Ilkay
  • GATECH: Ling, Calton, students
  • NCSU: Mladen, students
  • ORNL: Tom
  • (P2) Data Mining and Access Pattern Discovery
    (Climate, Astrophysics)
  • LLNL: Chandrika, Ghaleb, Imola
  • ORNL: Nagiza, George, Tom
  • LBNL: Ekow

11
Projects and Primary Application Areas
  • Organized ourselves into 4 projects
  • (P3) Efficient Access from Large Datasets
    (HENP, Combustion)
  • LBNL: John, Ekow, Arie, postdoc
  • ORNL: Randy, Dan
  • (P4) Parallel Disk Access and Grid-IO
    (Astrophysics, Climate)
  • ANL: Bill, Rob, Rajiv
  • NWU: Alok, Wei-keng, students
  • LLNL: Ghaleb
  • Area leader at large: Tom

12
Focus on real needs
  • Selected specific short-term goals and scenarios
  • (P1) Heterogeneous Data Integration (biology)
  • Microarray analysis workflow scenario
  • (P2) Data Mining and Access Pattern Discovery
    (Climate, Astrophysics)
  • "Run and Render" scenario for Astrophysics
  • Dimensionality reduction for Climate model
  • (P3) Efficient Access from Large Datasets (HENP)
  • STAR analysis framework
  • (P4) Parallel Disk Access and Grid-IO
    (Astrophysics, Climate)
  • FLASH codes for Astrophysics
  • NetCDF using MPI-IO for Climate Modeling and Fusion

13
Application Scientists Contacts
  • Close collaboration with individuals
  • Matt Coleman - LLNL (Biology)
  • Tony Mezzacappa - ORNL (Astrophysics)
  • Ben Santer - LLNL, John Drake - ORNL (Climate)
  • Doug Olson - LBNL, Wei-Ming Zhang - Kent (HENP)
  • Wendy Koegler - Sandia L. (Combustion)
  • Mike Papka - ANL (Astrophysics Vis)
  • Mike Zingale - U of Chicago (Astrophysics)
  • John Michalakes - NCAR (Climate)

14
Organization of Meeting
  • First day
  • Applications perspective on data management needs
  • Explain why the need
  • Say what hurts the most
  • Technical details of current work and existing
    software
  • By project
  • Talks led by Area Leaders
  • Second day
  • Discuss and develop plans: 4 breakout sessions
  • Specific technical goals in next half year
  • SDM-ISIC people involved
  • Application people involved
  • Estimated schedule
  • Longer term projections (2-3 years)
  • Identify potential new applications and future
    focus
  • Planning
  • Conference calls and reporting
  • Intellectual property
  • CVS repositories

15
Agenda - Morning
Day 1, March 26
  • 8:00 Introduction and opening remarks: Arie Shoshani
  • 8:15 Comments by DOE Program Manager: John Van Rosendale
  • 8:30 Astrophysics Perspective: Tony Mezzacappa, ORNL
  • 9:15 Climate Perspective: John Drake, ORNL
  • 10:00-10:15 Break
  • 10:15 HEP Perspective: Doug Olson, LBNL
  • 11:00 Biology Perspective: Dave Nelson, LLNL
  • 11:45 Putting software into production: Randy Burris, ORNL
  • 12:00 Lunch
16
Agenda - Afternoon
  • 1:00 PM
  • (P1) Heterogeneous Data Access
  • Area Leader: Terence Critchlow
  • - Supporting Heterogeneous Data Access in
    Genomics
  • Presenter: Terence Critchlow
  • - Context-sensitive Service Composition for Support
    of Scientific Workflows
  • Presenter: Mladen A. Vouk
  • - XWRAPComposer: A Wrapper Generation System for
    Integrating Bioinformatics Data Sources
  • Presenter: Ling Liu
  • - Constructing Workflows by Integrating
    Interactive Information Sources
  • Presenters: Amarnath Gupta and Ilkay Altintas
  • 2:00 PM
  • (P2) Data Mining and Access Pattern Discovery
  • Area Leader: Nagiza Samatova
  • - ASPECT: Adaptable Simulation Product
    Exploration and Control Toolkit
  • Presenter: Nagiza Samatova
  • - Dimension Reduction and Sampling
  • Presenter: Imola Fodor
  • 3:30 PM
  • (P3) Efficient Access from Large Datasets
  • Area Leader: Arie Shoshani
  • - Supporting Ad-hoc Data Exploration for Large
    Scientific Databases
  • Presenter: Arie Shoshani
  • - Efficient Bitmap Indexing Techniques for Very
    Large Datasets
  • Presenter: John Wu
  • - Shared Disk File Caching Taking into Account
    Delays in Space Reservations, Transfer, and
    Processing
  • Presenter: Ekow Otoo
  • - Optimizing Shared Access to Tertiary Storage
  • Presenter: Randy Burris
  • 4:30 PM
  • (P4) Parallel Disk Access and Grid-IO
  • Area Leaders: Bill Gropp and Alok Choudhary
  • - Parallel and Grid I/O Infrastructure
  • Presenter: Rob Ross
  • - Enabling High Performance Application I/O
  • Presenter: Wei-keng Liao
  • 5:30 PM
  • Comments from application people (1 hour)
    (free form discussion)
17
Agenda - Day 2
  • 8:00 Welcome and logistics
  • 8:30 Recap and planning
  • 9:30 Project breakout meetings (2 hours)
  • Specific technical goals in next half year
  • SDM-ISIC people involved
  • Application people involved
  • Estimated schedule
  • Longer term projections (2-3 years)
  • Identify potential new applications and future
    focus
  • Lunch
  • 1:00 Project breakout meetings (2 hours)
  • 3:00 Summary of meetings (2 hours)
  • (30 min per project)
  • 5:00 Conclusion and planning