Title: NPACI Data Intensive Computing Environment
1. NPACI Data Intensive Computing Environment
- Reagan W. Moore
- Associate Director, Data Intensive Computing
- San Diego Supercomputer Center
- moore_at_sdsc.edu
- http://www.npaci.edu/DICE
2. Current Infrastructure Development
- UCSD / SDSC
- Chaitan Baru - MIX - Mediation of Information using XML
- Amarnath Gupta - XML wrappers, video/image sources
- Bertram Ludaescher - BBQ - Blended Browsing and Querying
- Richard Marciano - BBQ/GIS interfaces
- Arcot Rajasekar - MCAT - Meta-data Catalog
- Wayne Schroeder - GSI - Grid Security Infrastructure
- Michael Wan - SRB - Storage Resource Broker
- UCSD / CSE
- Yannis Papakonstantinou - XML Matching and Structuring Language
- Victor Vianu - MIX
- Stanford
- Andreas Paepcke - SDLIP - Simple Digital Library Interoperability Protocol
- U Md
- Joel Saltz - ADR - Active Data Repository
3. Themes
- Scientific Data Collections
- Publication of scientific data sets
- Information discovery mechanisms
- Application of NSF DLI-II InterLib technology to NPACI
- Information Models for Data
- eXtensible Markup Language (XML) Document Type Definition (DTD)
- Information model for digital objects, data collections, and presentation interfaces
- Application to scientific data collections
- Digital sky, Protein Data Bank, Neuroscience brain images
- California Digital Library - Art Museum Image Consortium
4. Distributed Scientific Data Collections
5. Data Collections
6. Context Management using Collections
- For data to be useful, the context must be defined
- Data format - binary/integer representation
- Physical meaning - units
- Structure - geometry
- Relevance - feature annotation
- Semantics - data dictionary for attributes
- Context is preserved as meta-data attributes within a collection
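The idea that context travels with the data as meta-data attributes can be sketched as follows. This is a minimal illustration only: the class names, field names, and sample values are assumptions made for the example and are not taken from MCAT or any NPACI schema.

```python
from dataclasses import dataclass, field


@dataclass
class DataSetContext:
    """Context for one data set. All field names are hypothetical."""
    data_format: str                                  # binary/integer representation
    units: str                                        # physical meaning
    geometry: str                                     # structure
    annotations: list = field(default_factory=list)   # relevance: feature annotation
    dictionary: dict = field(default_factory=dict)    # semantics: attribute definitions


class Collection:
    """A collection that stores each data set's context next to its name."""

    def __init__(self):
        self._contexts = {}

    def register(self, name, context):
        self._contexts[name] = context

    def find(self, **criteria):
        """Return the names of data sets whose context matches every criterion."""
        return sorted(
            name
            for name, ctx in self._contexts.items()
            if all(getattr(ctx, key) == value for key, value in criteria.items())
        )


collection = Collection()
collection.register(
    "sky_tile_001",
    DataSetContext("int16 big-endian", "Jy/pixel", "2D image grid"),
)
collection.register(
    "brain_slice_17",
    DataSetContext("uint8", "intensity", "3D voxel grid"),
)

# Discovery by context rather than by file name.
matches = collection.find(units="Jy/pixel")
```

Because the context is stored in the collection rather than in the application, a data set can be discovered and interpreted by its attributes alone.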
7. XML Query Language
Joint development effort with UCSD CSE Database Lab (Yannis Papakonstantinou)
8. Themes
- Integration of Digital Library and Computational Grid Technology
- Information discovery mechanisms - SDLIP
- Inter-realm authentication - Grid Security Infrastructure
- Data handling systems - Storage Resource Broker
- Integration promoted through
- NSF DLI-II InterLib project
- Grid Forum
9. Information Management Architecture
- Digital library community technologies
- Distributed information resources
- Digital library interoperability protocols - SDLIP
- Mediation of information using XML - MIX
- Grid Forum technologies
- Support for distributed services / procedures
- Inter-realm authentication
- GSI - Grid Security Infrastructure
- Data handling system
- Storage Resource Broker, Meta-data Catalog
10. Evolution of Grid Architectures
Common User Environment
Heterogeneous User Environment
Multiple Data and Compute Grids
Single Compute Grid
Open Grid Architecture
11. Digital Library Architecture
Meta-data manipulation services
12. Open Grid Architecture
13. Open Grid Architecture
Application
Data Model Management
Remote Procedure Execution
Armada Dagents, FEL, ADR GRAM, SRB
Data Handling Systems
Information Discovery
LDAP, Database, Flat file, Object database
Condor, GASS, NILE, SRB, I-2 caching, ADR
(e.g., filtering)
Storage System Description
Dynamic Info Discovery
Storage Resources
DPSS, DFS, NFS, HPSS, ADSM, DMF, Unitree,
NASstore, DB2, Oracle, Informix, Sybase, O2,
ObjectStore, Objectivity
DTD, ADR, object class
GloPerf, Netlogger, NWS
14. Open Grid Architecture
API that provides glue to underlying data handling systems (security, scheduling, QoS, access protocol, data format/model, adaptivity, info discovery, location control)
Application
authentication, authorization
Data Model Management
Remote Procedure Execution
Armada Dagents, FEL, ADR GRAM, SRB
Data Handling Systems
Information Discovery
Condor, GASS, NILE, SRB, I-2 caching, ADR
LDAP, Database, Flat file, Object database
(e.g., filtering)
Storage System Description
Dynamic Info Discovery
API that provides glue to underlying storage,
QoS, etc. GASS, IBP, SRB
Storage Resources
DPSS, DFS, NFS, HPSS, ADSM, DMF, Unitree,
NASstore, DB2, Oracle, Informix, Sybase, O2,
ObjectStore, Objectivity
GloPerf, Netlogger, NWS
DTD, ADR, object class
15. Data Handling System
- SDSC Storage Resource Broker
- Protocol transparency
- Common API for access to remote data resources
- Explicit drivers for each type of storage system
- Name transparency
- Attribute based access to data
- Location transparency
- Distribution of collection across multiple physical resources
- Time transparency
- Minimization of latency for data access
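The first three transparencies above can be sketched as a small broker that hides protocol, name, and location behind one call. This is not the SRB API: `StorageDriver`, `InMemoryDriver`, `Broker`, and the resource names and paths are hypothetical stand-ins chosen for the illustration, and time transparency is not modelled here.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Protocol transparency: one explicit driver per type of storage system."""

    @abstractmethod
    def read(self, physical_path: str) -> bytes:
        """Return the bytes stored at physical_path."""


class InMemoryDriver(StorageDriver):
    """Stand-in for a real archive, file system, or database driver."""

    def __init__(self, blobs: dict):
        self._blobs = blobs

    def read(self, physical_path: str) -> bytes:
        return self._blobs[physical_path]


class Broker:
    """Common API: callers ask by attribute, never by resource or path."""

    def __init__(self):
        self._drivers = {}   # resource name -> StorageDriver
        self._catalog = []   # (attributes, resource name, physical path)

    def add_resource(self, name, driver):
        self._drivers[name] = driver

    def register(self, attributes, resource, physical_path):
        self._catalog.append((attributes, resource, physical_path))

    def get(self, **query) -> bytes:
        """Name and location transparency: resolve attributes to a resource."""
        for attributes, resource, path in self._catalog:
            if all(attributes.get(k) == v for k, v in query.items()):
                return self._drivers[resource].read(path)
        raise KeyError(f"no data set matches {query}")


broker = Broker()
broker.add_resource("archive", InMemoryDriver({"/hpss/a17": b"survey plate"}))
broker.add_resource("disk", InMemoryDriver({"/data/b02": b"protein entry"}))

# One collection spread across two physical resources.
broker.register({"project": "sky", "plate": 17}, "archive", "/hpss/a17")
broker.register({"project": "pdb", "entry": "b02"}, "disk", "/data/b02")

plate = broker.get(project="sky", plate=17)
entry = broker.get(project="pdb")
```

The caller never learns which resource held the data, so a data set can be moved between resources by updating the catalog alone.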
16. SDSC Storage Resource Broker Meta-data Catalog
Application
Resource
User
MCAT
Dublin Core
Application Meta-data
17. SRB Production Sites
- SRB Servers - 18 sites, 24 hosts, 45 resources, 90 users, 350,000 data sets
- SDSC - 4 hosts V1.1.4 (HPSS, DB2, Oracle, Illustra, UnixFS, C-90 Unicos)
- U. Maryland V1.1.4 (HPSS, UnixFS)
- U. Michigan V1.1.3 (ADSM, UnixFS)
- UIUC(NCSA) V1.1 (Oracle, UnixFS)
- Rutgers U. V1.1.2 (UnixFS)
- CalTech V1.1.4 (HPSS, UnixFS)
- UC Berkeley V1.1.4 (UnixFS)
- Montana State U V1.1.4 (UnixFS)
- UCLA V1.1.4 (UnixFS)
- UCSB V1.1.3 (UnixFS)
- U Texas, Austin V1.1.3 (DMF, UnixFS)
- UC Davis V1.1.3 (UnixFS)
- Washington U, StL V1.1.4 (UnixFS)
- U Houston V1.1.3 (UnixFS)
- UCSC V1.1.4 (Oracle, UnixFS)
- UCSD - 2 hosts V1.1.4 (UnixFS)
- LBL V1.1.3 (UnixFS)
- LLNL V1.1.3 (DB2, UnixFS)
18. Time Transparency
- How to minimize access time
- Prefetch data to local high performance disk, so that all accesses can be done at high speed from local resources
- How to maximize data delivery
- Composite or aggregate data into a single data set to avoid multiple accesses
- Stream data at high rates using parallel I/O, amortizing the access latency by the volume of data that is delivered
- How to avoid congestion
- Replicate data across multiple servers
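The benefit of aggregation and latency amortization can be made concrete with a simple cost model: each access pays a fixed latency plus the transfer time. The sketch below is only an illustration. The 10 s latency and 100 MB/s bandwidth are assumed values, not measurements of any system named in these slides.

```python
def access_time(volume_mb, latency_s, bandwidth_mb_s):
    """Seconds to deliver one request: fixed latency plus transfer time."""
    return latency_s + volume_mb / bandwidth_mb_s


def effective_rate(volume_mb, latency_s, bandwidth_mb_s):
    """Delivered MB/s once the access latency is included."""
    return volume_mb / access_time(volume_mb, latency_s, bandwidth_mb_s)


# Assumed values for illustration only.
LATENCY_S = 10.0
BANDWIDTH_MB_S = 100.0

# 1,000 data sets of 1 MB fetched one at a time, versus one 1,000 MB aggregate.
separate = 1000 * access_time(1.0, LATENCY_S, BANDWIDTH_MB_S)
aggregated = access_time(1000.0, LATENCY_S, BANDWIDTH_MB_S)

# Effective delivery rate for a small and a large transfer.
small_rate = effective_rate(1.0, LATENCY_S, BANDWIDTH_MB_S)
large_rate = effective_rate(10000.0, LATENCY_S, BANDWIDTH_MB_S)
```

Under these assumptions the separate accesses take about 10,010 s while the aggregate takes 20 s, and the effective rate rises from roughly 0.1 MB/s for a 1 MB transfer to roughly 91 MB/s for a 10,000 MB transfer.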
19. Integrating Cache and Collections (Collection Controlled Data)
Application
Data Model Management
GASS local data cache
ADR compositing cache
DPSS network data cache
SRB Collection Access
Database Collection
Archive Collection
File System Collection
20. Grid Security Infrastructure (GSI)
- Inter-realm certificate support
- X.509 certificate support
- Support for Kerberos, DCE access
- Secure communication
- SSL
- SDSC LibAID - Authentication and Integrity of Data
- Simplified interface library to GSS-API
- Authentication through two calls
- Provided in release 1.1.5 of Storage Resource Broker
21. DICE Roadmap
Info Discovery
Interactive Browsing
Metacomputing Services
Data Handling
CDL Distributed Query
Digital Libraries/Interlib
CDL FindingAids
Technologies
Distributed Collections
MIX/ICE
Advancement of Scientific Discovery
Information Discovery
SDLIP
Data Collections
XMAS/BBQ
Distributed Data Resources
SRB/MCAT
Internet
RDBMS
DB
Files
1999
2000
2001
2002
UNIX
Time
22. Roadmap - Goals
- Application of digital library technology to scientific data collections
- Creation of data collections (NS, AMICO, ESS)
- Support for education through CDL
- Common information structure model across presentation, collection, digital objects
- Application of MIX to construct user interfaces, define structure of data collection, and structure of objects
- Common information discovery interface
23. Roadmap - Information Management Hierarchy
- Presentation / Information Discovery
- Collaboration/Visualization - ICE
- Visualization - Shastra, 3D visualization tools
- Information model - MIX using XML DTDs
- Collection organization
- Meta-data catalog - MCAT
- Information model - XML DTD and database DDL
- Data handling
- Storage Resource Broker - SRB
- Storage
- Archival storage system - HPSS
- Digital object model - XML DTD
24. Roadmap - Integration with Metacomputing
- Common security infrastructure - GSI
- Integration of SRB with GSI - FY99
- Interoperable certificate authorities (NCSA, NPACI, CDL) - FY99
- Interoperable data access systems
- Integration of SRB with GASS - FY99
- Integration of SRB with Legion - FY00
- Remote procedure execution
- Naming, discovery, and application of procedures - FY00
- Linkage of procedures - FY00
- Application of XML DTD for object definition
25. Roadmap - DICE
- Digital Library
- Archive support at SDSC via SRB - ADL, ELIB, CDL (FY99)
- User Interfaces / Electronic notebooks - UCB, UCSB, UCSD (FY00)
- Data collection support
- AMICO (FY99) - support educational access to images
- NS (FY00) - develop data dictionary, schema
- Digital Sky (FY00) - develop XML DTDs for structure and access, store 2-20 TB of digital sky images
- ESS (FY00) - integration of HPSS archives (U Md, SDSC)
26. Management of Scientific Data
DX ICE AVS
Notebook GIS wrapper
XMAS XML structure
MIX
Presentation
CDL
ADL ELIB AMICO
PDB NS Publication API
Spatial query ADEPT
Extensible Schema
MCAT
Collection
Containers GSI GASS interface
ADR Parallel I/O Remote Proc.
Globus directory Info. Discovery API
Data Handling
SRB
Distributed Nameserver
Archive
HPSS
MPI interface
GPFS interface
1998
1999
2000
2001
2002
27. Coordination of Digital Library and Metacomputing Environments
- Grid Forum
- Common implementation practice defined by working groups
- Data Access Working Group - Chair Reagan Moore
- Security Working Group - Co-chair Andrew Grimshaw
- NPACI Database Workshop
- Integration of PTE and DICE data handling systems
- NPACI Storage Resource Broker Workshop
- Integration of data collections and archival storage
28. PTE / DICE Data Handling Integration
XML DTD for data set description
29. SRB Containers - Managing Archive Latency
SRB client
- Create container in a logical storage resource containing at least one cacheable resource
- Create objects in containers
- Cache daemon will move filled containers to archive
- synch and purge APIs
SRB Server
UNIX
HPSS
HPSS
container
Distributed Storage Resources
cached containers
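The container workflow on this slide (create a container on a cacheable resource, add objects, let a cache daemon move filled containers to the archive) can be sketched as below. This is a toy model under stated assumptions: `Container`, `LogicalResource`, and every method name are hypothetical and do not reproduce the actual SRB client or server calls.

```python
class Container:
    """Groups many small objects so the archive stores them as one unit."""

    def __init__(self, name, capacity_bytes):
        self.name = name
        self.capacity_bytes = capacity_bytes
        self.objects = {}  # object name -> bytes

    @property
    def size(self):
        return sum(len(data) for data in self.objects.values())

    def is_full(self):
        return self.size >= self.capacity_bytes


class LogicalResource:
    """A cacheable resource (such as disk) paired with an archive."""

    def __init__(self):
        self.cache = {}    # container name -> Container
        self.archive = {}  # container name -> archived copy of the objects

    def create_container(self, name, capacity_bytes):
        self.cache[name] = Container(name, capacity_bytes)
        return self.cache[name]

    def put(self, container_name, object_name, data):
        self.cache[container_name].objects[object_name] = data

    def sync(self, container_name):
        """Copy the cached container to the archive as a single unit."""
        self.archive[container_name] = dict(self.cache[container_name].objects)

    def purge(self, container_name):
        """Drop the cached copy, but only if the archive copy is current."""
        if self.archive.get(container_name) != self.cache[container_name].objects:
            raise RuntimeError("container must be synced before purge")
        del self.cache[container_name]

    def run_cache_daemon(self):
        """Move every filled container from cache to archive."""
        for name in [n for n, c in self.cache.items() if c.is_full()]:
            self.sync(name)
            self.purge(name)


resource = LogicalResource()
resource.create_container("c1", capacity_bytes=8)
resource.create_container("c2", capacity_bytes=1024)
resource.put("c1", "obj_a", b"1234")
resource.put("c1", "obj_b", b"5678")   # c1 is now full
resource.put("c2", "obj_c", b"abcd")   # c2 still has room
resource.run_cache_daemon()
```

After the daemon runs, the full container `c1` exists only in the archive while `c2` stays in the cache, so the archive sees one large write instead of many small ones.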
30. NPACI Collaborations
- NASA - Information Power Grid
- Promote integration of Globus and SRB authentication
- DOE ASCI Data Visualization Corridor
- Promote use of XML DTDs for scientific data
- NARA - Persistent Archive
- Collection based data management
- DOE NGI - Particle Physics Data Grid
- Replication of data across multiple servers
- NSF DLI-II - InterLib
- Interoperable services between digital libraries
- California Digital Library - AMICO
- Educational access to image collections
31. Education and Outreach
- California Digital Library - AMICO
- Educational access to image collections
- 1.5 TB of images
- Tunable interfaces for students, educators, researchers
- Digital Insight - U Wisconsin
- Provide access to class videos archived at SDSC
- 10-20 TB of videos
- NARA - Historical Collections
- Mediate information between local collections and NARA collections
32. For More Information
- http://www.npaci.edu/DICE/