GEMS and Data Mining: Building the Grid Infrastructure

Transcript and Presenter's Notes
1
GEMS and Data Mining: Building the Grid
Infrastructure
  • Chaitan Baru
  • Program Co-Director
  • Data and Knowledge Systems
  • San Diego Supercomputer Center

2
SDSC Organizational Structure (www.sdsc.edu)
  • Fran Berman, Director
  • Alan Blatecky, Executive Director
  • Richard Moore, NPACI Executive Director
  • Anke Kamrath, COO
  • 600 employees/students total

Office of the Director
  • Data and Knowledge Systems (DAKS): data integration,
    distributed data management, scientific databases,
    data mining, scientific data visualization
  • Integrative Computational Sciences (ICS): computational
    chemistry, applied math, ecoinformatics, environmental
    science, computational economics, user services
  • Integrative Biological Sciences (IBS): molecular biology,
    neuroscience, structural genomics, cell signaling,
    proteomics
  • High-End Computing (HEC)
  • Grids and Clusters (GC): cluster management, portals,
    grid middleware, production systems
  • Networking and Security (NS): production networking and
    security, research on network monitoring
  • Education and Training
  • Communications and Outreach
  • Business Office

3
The DAKS Program
  • Organized as a set of R&D Labs
  • Knowledge-based Integration (Bertram Ludaescher)
  • Advanced Query Processing (Amarnath Gupta)
  • Advanced Database Projects (David Archbell)
  • Data Mining (Tony Fountain)
  • Visualization (Michael Bailey)
  • Spatial Information Systems (Ilya Zaslavsky)
  • Geoinformatics (Dogan Seber)
  • Storage Resource Broker, SRB (Arcot Rajasekar)
  • Sustainable Archives and Digital library
    Technology (Richard Marciano)

4
Outline
  • Some distributed/grid computing environments
  • TeraGrid, NPACI Grid, GEON, BIRN, LTER Network
  • Hardware, software, middleware
  • Middleware for data management, exploration, and
    mining
  • Some data-oriented / data-intensive application
    use cases
  • Data-oriented middleware
  • SRB, SKIDLKit, GEMS

5
Prototype for Cyberinfrastructure
6
TeraGrid: Common TeraGrid Software Stack (CTSS)
  • OS: Linux (SuSE), but also others
  • Compilers: gcc, Intel C/C++, Intel Fortran
  • MPICH
  • Schedulers: OpenPBS, Maui
  • Grid Services: Globus GT2.2.4, gsi, Condor-G,
    CACL
  • Math Libs
  • I/O: HDF4/5, GPFS, PVFS
  • Collection Management: SRB client
  • Monitoring: Ganglia, Clumon

7
NPACI Grid Sites and Platforms
(Map of NPACI grid sites; platforms include Blue Horizon and DataStar)
8
SDSC DataStar
  • Next major acquisition at SDSC
  • IBM Power-based system, optimized for
    data-oriented applications (large I/O as well as
    DBMS)
  • Likely to be 7TF system
  • 128 x 8 processor nodes, 16GB/node (2TB memory)
  • 8 x 32 processor nodes (6 @ 64GB/node, 1 @ 128GB,
    1 @ 256GB) (768GB memory)
  • High-speed switch interconnect
  • FCS interfaces to SAN-based disk

9
NPACKage: Focus on impact, interoperability, and
usability
  • NPACKage
  • Interoperable collection of NPACI SW targeted for
    national-scale distribution
  • NPACKage Components
  • The Globus Toolkit.
  • GSI-OpenSSH.
  • Network Weather Service
  • DataCutter
  • Ganglia
  • LAPACK for Clusters (LFC)
  • MyProxy
  • GridConfig
  • Condor-G
  • Storage Resource Broker (SRB)
  • Grid Portal Toolkit (GridPort)
  • MPICH-G2
  • APST (AppLeS Parameter Sweep Template)
  • Kx509
  • Technology integration
  • All-to-all interoperability
  • Packaging and deployment
  • Maintenance
  • User support
  • Documentation
  • Consulting
  • Help-desk
  • User feedback key to improvement in FY04

10
Biomedical Informatics Research Network:
Participating Sites
PI of BIRN CC: Mark Ellisman. Co-Is of BIRN CC:
Chaitan Baru, Phil Papadopoulos, Amarnath Gupta,
Bertram Ludaescher
11
BIRN: Commonality is the Key
  • Hardware: HP DL380s, common Cisco switch,
    Netscout monitoring software, gigabit
    connectivity
  • Operating Systems: Red Hat Linux
  • Database: Oracle
  • Applications: Storage Resource Broker, data
    integration and mediators, variability in back-up
    solutions
  • BIRN Portal: common user interface, able to
    launch unique user applications

12
BIRN Project Objectives
  • Establish a stable, high performance network
    linking key Biotechnology Centers and General
    Clinical Research Centers
  • Establish distributed and linked data collections
    with partnering groups - create a Data GRID
  • Facilitate the use of "grid-based" computational
    infrastructure and integrate BIRN with other GRID
    middleware projects
  • Enable data mining from multiple data collections
    or databases on neuroimaging and bioinformatics
  • Build a stable software and hardware
    infrastructure that will allow centers to
    coordinate efforts to accumulate larger studies
    than can be carried out at one site.

13
The GEON Grid
  • OptIPuter / GEON Project: connect NASA Goddard
    to SDSC via optical fiber

SDSC PI: Chaitan Baru. SDSC co-PIs: Phil
Papadopoulos, Bertram Ludaescher, Michael Bailey
14
GEON Software Stack
  • OGSA
  • Information Integration software
  • IBM Information Integrator
  • SDSC GEMS
  • Grid Data services
  • Replication: Grid Movement and Replication (GMR)
  • Replica Location Services
  • Community Authorization Service
  • Grid Monitoring and Discovery, Network Weather
    Service,
  • GEON Portal Development
  • Search and Discovery interface
  • Workflow specification, customization, execution
  • Data and Information Visualization tools

15
GMR Architecture: Collaboration with IBM Almaden
(Inderpal Narang et al.)
(Diagram: Gravity Cache on PostgreSQL at the SDSC
node; Gravity DataSet on Oracle at the UTEP node)
16
Building the BIRN Portal
Schematic overview of the layered software
architecture leveraging Grid middleware
technologies to link users to distributed
resources
17
Application Use Cases
  • Different classes of I/O
  • Read and/or generate large, individual files
  • traditional supercomputing applications
  • Read large data collections
  • E.g., Digital sky, system log files
  • Database applications
  • E.g., Digital sky, Protein Data Bank
  • Remote vs. local data
  • Compute engines remote from data archives or
    data owners
  • Staging vs. prefetching vs. synchronous I/O (see
    the sketch below)
  • Ability to reserve disk vs. rewriting I/O calls
    vs. fast communications

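To make the staging/prefetching/synchronous I/O distinction concrete, here is a minimal Java sketch. It is illustrative only: the PrefetchSketch class, file paths, and compute() placeholder are assumptions, not anything from the slides; it contrasts a blocking stage-then-compute loop with one that stages the next input file while computing on the current one.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch; "remote" here just means a path on a slower shared or
// archival filesystem that must be staged to fast local disk before use.
public class PrefetchSketch {
    private static final ExecutorService stager = Executors.newSingleThreadExecutor();

    // Synchronous I/O: block until the whole file is local, then compute.
    static void processSynchronously(Path remote, Path localDir) throws IOException {
        Path local = localDir.resolve(remote.getFileName());
        Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING); // stage-in
        compute(local);
    }

    // Prefetching: start staging file i+1 before computing on file i,
    // so data movement overlaps with computation.
    static void processWithPrefetch(Path[] remoteFiles, Path localDir) throws Exception {
        Future<Path> next = stageAsync(remoteFiles[0], localDir);
        for (int i = 0; i < remoteFiles.length; i++) {
            Path current = next.get();          // waits only if staging lags behind compute
            if (i + 1 < remoteFiles.length) {
                next = stageAsync(remoteFiles[i + 1], localDir);
            }
            compute(current);
        }
    }

    static Future<Path> stageAsync(Path remote, Path localDir) {
        return stager.submit(() -> {
            Path local = localDir.resolve(remote.getFileName());
            Files.copy(remote, local, StandardCopyOption.REPLACE_EXISTING);
            return local;
        });
    }

    static void compute(Path input) { /* placeholder for the application's analysis step */ }
}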
18
Data Middleware
  • SDSC Storage Resource Broker (SRB)
  • See http://srb.sdsc.edu
  • SKIDLkit (SDSC Knowledge and Information
    Discovery Lab Kit)
  • Led by Tony Fountain
  • See http://www.sdsc.edu/SKIDL
  • Web-services based environment to provide access
    to data sources and analysis tools
  • SDSC Grid-Enabled Mediation Services (GEMS)

19
SDSC Storage Resource Broker

SRB clients: mySRB, UNIX shell (S-commands), inQ,
C/C++ libraries
(Architecture diagram components: User Application,
Metadata Extraction, Remote Proxies, MCAT,
DataCutter)
20
SKIDLkit and the LTER Project: Current Status
  • Work with four key LTER sites (NTL, VCR, AND,
    JRN)
  • Extend to other sites, and implement as Grid
    services
  • Based on Apache Tomcat 4.1.24, Apache Axis 1.1,
    JDK 1.4
  • Generic service wrappers using Java JDBC (a
    minimal wrapper sketch follows this list)
  • For the Oracle database at the NTL site, the MySQL
    database at the VCR site, and the SQL Server
    database at the AND site
  • Using each site's EML (Ecological Metadata
    Language) config file
  • Designed a simple XML standard to unify climate
    data expression across the four LTER sites
  • Access to some data mining tools

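The generic JDBC wrapper idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual SKIDLkit code: the ClimateDataWrapper class, the queryAsXml method, and the flat resultset/row XML layout are hypothetical stand-ins for the wrappers and the unified climate-data XML format described above; connection details would normally come from the site's EML-derived configuration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical generic wrapper: one class works against Oracle, MySQL, or SQL
// Server simply by changing the JDBC URL it is constructed with.
public class ClimateDataWrapper {
    private final String jdbcUrl, user, password;

    public ClimateDataWrapper(String jdbcUrl, String user, String password) {
        this.jdbcUrl = jdbcUrl;   // e.g. Oracle at NTL, MySQL at VCR, SQL Server at AND
        this.user = user;
        this.password = password;
    }

    // Run a read-only query and serialize the result set to a simple XML document,
    // a stand-in for a unified climate-data exchange format.
    public String queryAsXml(String sql) throws SQLException {
        StringBuilder xml = new StringBuilder("<resultset>\n");
        try (Connection con = DriverManager.getConnection(jdbcUrl, user, password);
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData md = rs.getMetaData();
            while (rs.next()) {
                xml.append("  <row>");
                for (int c = 1; c <= md.getColumnCount(); c++) {
                    String col = md.getColumnLabel(c);
                    xml.append('<').append(col).append('>')
                       .append(rs.getString(c))
                       .append("</").append(col).append('>');
                }
                xml.append("</row>\n");
            }
        }
        return xml.append("</resultset>").toString();
    }
}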
21
An International Computational Grid for Ecology
and the Environment
(Network diagram: sites include LTER-NTL (Madison, WI),
LTER-AND (Corvallis, OR), LTER-VCR (Charlottesville, VA),
CAS/CNIC (Beijing, China), NARC (Tsukuba, Japan), NCHC
(Hsinchu, Taiwan), and SDSC (La Jolla, CA), linked via
JDBC, JDBC/EML, SOAP/XML, and HTTP; underwater sensors
provide field data)
Legend:
  • SOAP Servers: where web services are deployed
  • Database Servers: where data sources are hosted
  • Sensor Data: from web cams deployed in the field
22
SDSC Grid-Enabled Mediation Services (GEMS)
  • Based on XML and XQuery (next generation of MIX,
    Mediation of Information using XML)
  • Defined in terms of a set of services that are
    used at:
  • Registration time
  • Dataset registration, schema registration,
    ontology registration
  • Source content and capability related services,
    e.g., term resolution service, capability
    description service
  • View definition time
  • Data integration services, discovery services
  • Query formulation time
  • Query runtime
  • Dynamic binding of logical to physical resources
  • Administrative services
  • Services to manage access controls, control
    replicas, etc. (an illustrative interface sketch
    follows this list)

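One way to picture the service groups listed above is as a set of interfaces grouped by when they are used. The sketch below is purely illustrative: the interface and method names are invented for this transcript and are not the actual GEMS API.

import java.util.List;

// Illustrative only: grouping of operations by registration time, view
// definition time, query formulation/runtime, and administration.
interface RegistrationServices {                        // used at registration time
    String registerDataset(String datasetUri, String schemaDocument);
    String registerOntology(String ontologyUri);
    String describeCapabilities(String datasetId);      // capability description service
    String resolveTerm(String term, String ontologyId); // term resolution service
}

interface ViewDefinitionServices {                      // used at view definition time
    String defineIntegratedView(String viewName, List<String> datasetIds, String mappingSpec);
    List<String> discoverDatasets(String keyword);      // discovery service
}

interface QueryServices {                               // query formulation time and runtime
    String formulateQuery(String viewName, String xquery);
    String execute(String queryId);                     // runtime binds the logical plan to physical resources
}

interface AdministrativeServices {                      // administrative services
    void grantAccess(String principal, String datasetId);
    void manageReplica(String datasetId, String targetResource);
}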
23
Integrate Geologic Data From Multiple Sources
Using Ontology and Map Assembly Web Services (to
be deployed by USGS)
(Diagram: a Mediator uses an Ontology, a Legend Generator,
and a Map Assembler as grid services for map integration;
ArcIMS services are wrapped in WSDL/SOAP)
24
GEON Information Integration
PaleoBiology
25
GeMS Components
(Architecture diagram)
  • Client
  • Registration Services, Deployment Services, Data
    Integration Services
  • Query Optimization and Plan Generation
  • Verification, Access Control, and Query Rewrite
  • Result Assembly (e.g., map generation)
  • Metadata Registry, Ontology Service
  • Community Authorization Service
  • Monitoring and Discovery Service, Replica Location
    Service, Network Weather Service
  • Distributed compute and storage resources (file
    systems, databases, and compute resources at each
    site)
26
GeMS Request Processing Scenario
(Diagram components: Client, GeMS Query Planner,
Mediator, GeMS Plan, GeMS logical-to-physical query
plan binding, Wrapper, Ontology Service(s))
27
Some Issues
  • Function shipping versus data shipping (see the
    sketch after this list)
  • Need to deal with different levels of access
    provided by different sites, for example:
  • Native API access to databases
  • JDBC
  • Web services (with full query vs. limited query
    access)
  • Read-only vs. read-write (dealing with temporary
    results, annotations)

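A rough sketch of how a mediator might act on these differences follows. The SourceSite interface, AccessLevel enum, and MediatorSketch class are hypothetical, but they illustrate choosing function shipping when a site accepts full queries (native API, JDBC, or a full-query web service) and falling back to data shipping when only limited access is available.

import java.util.List;
import java.util.stream.Collectors;

// Hypothetical access levels a source site might expose.
enum AccessLevel { NATIVE_API, JDBC, WEB_SERVICE_FULL_QUERY, WEB_SERVICE_LIMITED }

interface SourceSite {
    AccessLevel accessLevel();
    List<String[]> runQuery(String sql);        // valid only for full-query access levels
    List<String[]> fetchAll(String datasetId);  // always available, but potentially expensive
}

class MediatorSketch {
    // Evaluate a selection either at the source or at the mediator, depending on capability.
    List<String[]> select(SourceSite site, String datasetId, String predicateSql) {
        switch (site.accessLevel()) {
            case NATIVE_API:
            case JDBC:
            case WEB_SERVICE_FULL_QUERY:
                // Function shipping: the site can execute the predicate itself.
                return site.runQuery("SELECT * FROM " + datasetId + " WHERE " + predicateSql);
            default:
                // Data shipping: pull everything and filter locally at the mediator.
                return site.fetchAll(datasetId).stream()
                           .filter(row -> matches(row, predicateSql))
                           .collect(Collectors.toList());
        }
    }

    private boolean matches(String[] row, String predicateSql) {
        return true; // placeholder: a real mediator would evaluate the predicate against the row
    }
}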
28
Contact Info: Chaitan Baru, baru@sdsc.edu
29
SDSC Machine Room Data Architecture
  • 0.5 PB disk, 6 PB archive
  • 1 GB/s disk-to-tape
  • Optimized support for DB2 (Regatta) / Oracle (Sun
    15K)

(Diagram components: LAN (multiple GbE, TCP/IP); WAN
(30 Gb/s); SAN (2 Gb/s, SCSI; 30 MB/s per drive,
200 MB/s per controller); Local Disk (50 TB); FC Disk
Cache (400 TB); FC GPFS Disk (100 TB); Power4 DB nodes;
DataStar; Blue Horizon; Linux cluster (4 TF); Sun F15K;
HPSS; servers; vis engine; SCSI/IP or FC/IP access;
tape silos, 6 PB, 1 GB/s disk-to-tape, 32 tape drives)
30
Current DTF/ETF Sites