Data Intensive Computing at SDSC

Transcript and Presenter's Notes

Title: Data Intensive Computing at SDSC


1
Data Intensive Computing at SDSC
  • Chaitan Baru
  • Senior Principal Scientist
  • Data Intensive Computing Group
  • San Diego Supercomputer Center

2
NPACI Data management issues
  • Managing a small number of large data sets
  • e.g. high resolution X-ray images in neuroscience
  • Managing a large number of small data sets
  • e.g. digital sky surveys, large document
    collections
  • Querying metadata to identify data sets
  • move from file-oriented view to
    database-oriented view
  • Accessing large data collections distributed
    across a network (including parallel I/O)
  • Querying mediated views over distributed,
    heterogeneous information sources

3
Collections with large number of small data sets
  • Digital Sky surveys
  • About 2 billion objects corresponding to light
    sources in the sky
  • Each is about 1K in size
  • Patent documents
  • About 2 million documents
  • Each is about 75K in size
  • NARA document collections
  • HTML pages, word processing documents, email
    messages
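As a rough back-of-envelope check on the figures above, the sketch below multiplies object counts by per-object size. It assumes "K" means 1,000 bytes and reports totals in decimal units; the names are illustrative and do not come from the presentation.

```python
# Back-of-envelope collection sizes from the slide's figures.
# Assumption: "K" is taken as 1,000 bytes and totals use decimal units.
KB = 1_000
GB = 1_000_000_000
TB = 1_000_000_000_000

collections = {
    "digital sky survey": (2_000_000_000, 1 * KB),  # ~2 billion objects, ~1K each
    "patent documents": (2_000_000, 75 * KB),       # ~2 million documents, ~75K each
}


def total_bytes(count, size_each):
    """Total size of a collection of equally sized objects."""
    return count * size_each


sky_total = total_bytes(*collections["digital sky survey"])
patent_total = total_bytes(*collections["patent documents"])

print(f"sky survey: about {sky_total / TB:.0f} TB")
print(f"patents:    about {patent_total / GB:.0f} GB")
```

Under these assumptions the sky survey comes to roughly 2 TB and the patent collection to roughly 150 GB, both spread over very many small objects.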

4
DICE Technologies
  • Persistent archives
  • HPSS
  • DB2/HPSS (Oracle/HPSS)
  • Digital library
  • SRB/MCAT
  • InterLib
  • IBM DL
  • Information mediation
  • The MIX project
  • GIS data sources
  • Multimedia sources

5
DICE Applications
  • Molecular biology - PDB
  • Neuroscience - brain mapping images
  • Social science - census data sets
  • Digital library - ELIB, ADL, Infobus, CDL
  • NARA - Electronic records management
  • ASCI - Data visualization corridor
  • NASA - Information Power Grid (IPG)
  • GDE/Marconi - Image libraries
  • CDL - AMICO image collection
  • USPTO - Dist. Object Computation Testbed

6
Managing very large data sets
The IBM High Performance Storage System (HPSS)
  • Runs on 14-node IBM RS/6000 SP, including 8
    four-way SMP nodes (Silver nodes)
  • 1TB SSA disk, 3 StorageTek silos with 360TB
    capacity
  • HiPPI connected devices, parallel I/O
  • Over 70 (multithreaded) server processes
  • Multiple classes of service for managing file
    storage: <2MB, 2-200MB, 200MB-6GB, 6GB-100TB
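To illustrate how size-based classes of service might be applied, here is a minimal sketch that maps a file size to one of the four ranges listed above. It is not HPSS code: the function name, the binary units, and the choice to treat each upper bound as exclusive are assumptions made for the example.

```python
# Hypothetical class-of-service selection by file size (not the HPSS API).
# Assumptions: binary units, and each range's upper bound is exclusive.
MB = 2**20
GB = 2**30
TB = 2**40

# (exclusive upper bound in bytes, label taken from the slide)
CLASSES = [
    (2 * MB, "<2MB"),
    (200 * MB, "2-200MB"),
    (6 * GB, "200MB-6GB"),
    (100 * TB, "6GB-100TB"),
]


def class_of_service(size_bytes):
    """Return the label of the first class whose upper bound exceeds the size."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    for upper, label in CLASSES:
        if size_bytes < upper:
            return label
    raise ValueError("file exceeds the largest class of service")


for size in (1 * MB, 50 * MB, 5 * GB, 1 * TB):
    print(size, class_of_service(size))
```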

7
HPSS Archival Storage System
[Diagram: HPSS archival storage system configuration. Recoverable labels:]
  • Silver nodes acting as tape / disk movers, each with DCE / FTP / HSI,
    a log client, and SSA RAID
  • Silver node running Storage / Purge, Bitfile / Migration,
    Nameservice / PVL, and Log Daemon, with SSA RAID
  • RS6000 tape mover with PVR (9490)
  • 9490 robots with eight, seven, and four tape drives
  • 3490 tape
  • High Performance Gateway Node
  • High node disk mover and Wide node disk mover, each with a HiPPI driver
  • MaxStrat RAID
  • TrailBlazer3 switch and HiPPI switch
  • Disk capacities labeled in the diagram: 54 GB, 108 GB, 160 GB, 830 GB

8
Managing large number of small data sets
The Integrated DB2/HPSS System
[Diagram labels: database table with columns C1, C2, C3, C4, C5; DB2;
DB2 disk buffer; HPSS; HPSS disk cache]
Create Tablespace HPSS-SPACE Managed By
Database Using FILE (HPSS <hpss-filename>
<size> DISKBUF <path> <size>)
Create Table SAMPLE-TABLE (C1 int, C2 float, C3
char, C4 CLOB, C5 BLOB) In REGULAR-SPACE

9
Other DBMS/archival storage integration efforts
  • Oracle / AMASS
  • Oracle uses AMASS as a file server
  • Objectivity / HPSS
  • Being developed by Stanford SLAC
  • Implements an OO staging system between
    Objectivity and HPSS

10
Metadata-based access to data sets
The SDSC Storage Resource Broker (SRB)
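[Diagram: SRB architecture, top to bottom]
  • Application (SRB client)
  • SRB Middleware: MCAT, SRB Servers
  • Database systems: DB2, Oracle, Illustra, ObjectStore
  • Archival storage systems: HPSS, UniTree
  • File systems and protocols: UNIX, ftp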
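
The idea of metadata-based access can be sketched as a catalog lookup followed by a fetch through a storage-specific driver. This is a toy model, not the SRB or MCAT API: the class names, the in-memory "resources", and the example paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    resource: str   # which storage system holds the data
    location: str   # resource-specific path or key
    attributes: dict = field(default_factory=dict)


class MetadataCatalog:
    """Toy metadata catalog: find data sets by attribute values."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        self._entries.append(entry)

    def query(self, **conditions):
        return [e for e in self._entries
                if all(e.attributes.get(k) == v for k, v in conditions.items())]


class Broker:
    """Resolves a metadata query, then reads each hit through its driver."""

    def __init__(self, catalog, drivers):
        self.catalog = catalog
        self.drivers = drivers  # resource name -> callable(location) -> bytes

    def fetch(self, **conditions):
        return {e.name: self.drivers[e.resource](e.location)
                for e in self.catalog.query(**conditions)}


# Two in-memory stand-ins for heterogeneous storage systems (hypothetical paths).
archive = {"/hpss/brain/slice001": b"large image bytes"}
unix_fs = {"/data/census/ca.txt": b"census rows"}

catalog = MetadataCatalog()
catalog.register(CatalogEntry("slice001", "archive", "/hpss/brain/slice001",
                              {"domain": "neuroscience", "type": "image"}))
catalog.register(CatalogEntry("ca_census", "unix", "/data/census/ca.txt",
                              {"domain": "social science", "type": "table"}))

broker = Broker(catalog, {"archive": archive.__getitem__,
                          "unix": unix_fs.__getitem__})
found = broker.fetch(domain="neuroscience")
print(found)
```

The client names attributes, never a storage system or path; the catalog supplies both.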
11
Querying heterogeneous information sources
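[Diagram: mediator architecture, top to bottom]
  • User interfaces send a query to the mediator and receive results
  • Mediator (with views), with a local data repository
  • The mediator sends query fragments to the wrappers
  • Wrappers convert the incoming query and the outgoing data
  • Sources: SQL database, spreadsheet, HTML and other files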
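
A minimal sketch of the mediator and wrapper pattern described above: each wrapper answers the same simple query against a different kind of source, and the mediator fans the query out and merges the results. The class names, the single field-equals-value query form, and the sample data are assumptions for illustration only.

```python
import csv
import io


class TableWrapper:
    """Wraps an in-memory list of rows (a stand-in for a SQL database)."""

    def __init__(self, rows):
        self.rows = rows

    def query(self, field, value):
        return [dict(r) for r in self.rows if r.get(field) == value]


class CsvWrapper:
    """Wraps CSV text (a stand-in for a spreadsheet)."""

    def __init__(self, text):
        self.text = text

    def query(self, field, value):
        reader = csv.DictReader(io.StringIO(self.text))
        return [dict(row) for row in reader if row.get(field) == value]


class Mediator:
    """Sends the same query fragment to every wrapper and merges the answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers  # source name -> wrapper

    def query(self, field, value):
        results = []
        for name, wrapper in self.wrappers.items():
            for record in wrapper.query(field, value):
                results.append({"source": name, **record})
        return results


mediator = Mediator({
    "database": TableWrapper([{"state": "CA", "population": "4200"},
                              {"state": "NY", "population": "3100"}]),
    "spreadsheet": CsvWrapper("state,county\nCA,San Diego\nNY,Albany\n"),
})
merged = mediator.query("state", "CA")
print(merged)
```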
12
The MIX Project: Mediation of Information using
XML
  • TEAM
  • UCSD CSE: Yannis Papakonstantinou, Pavel
    Velikhov, Victor Vianu
  • SDSC DICE: Chaitan Baru, Amarnath Gupta, Bertram
    Ludaescher, Richard Marciano

13
MIX Components
  • Wrapper tool-kit
  • model information in a resource using XML DTD
    (or, XML schema), including a mapping of source
    data to DTD
  • provide mapping from XML query language to source
    query language / operations
  • Mediator tool-kit
  • allows definition of views across multiple
    resources
  • views are expressed in a declarative query
    language
  • provides a query engine (for composing results)
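
To make the wrapper's role concrete, the sketch below exports rows from a relational table as an XML document, using an in-memory SQLite database as a stand-in source. This is not the MIX wrapper tool-kit: the function name, table, and element names are hypothetical, and the mapping (one element per row, one child per column) is only one possible choice.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table, columns, root_tag, row_tag):
    """Export a table as XML: one row_tag element per row, one child per column.

    table and columns are trusted constants here, not user input.
    """
    sql = f"SELECT {', '.join(columns)} FROM {table} ORDER BY {columns[0]}"
    root = ET.Element(root_tag)
    for row in conn.execute(sql):
        item = ET.SubElement(root, row_tag)
        for col, val in zip(columns, row):
            ET.SubElement(item, col).text = str(val)
    return root


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tract (id INTEGER, state TEXT, population INTEGER)")
conn.executemany("INSERT INTO tract VALUES (?, ?, ?)",
                 [(1, "CA", 4200), (2, "NY", 3100)])

doc = rows_to_xml(conn, "tract", ["id", "state", "population"], "tracts", "tract")
xml_text = ET.tostring(doc, encoding="unicode")
print(xml_text)
```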

14
MIX components...
  • XML Matching And Structuring (XMAS) query
    language
  • operates on a given set of XML documents to
    produce new XML documents, using the XMAS algebra
  • DOM-VXD: DOM Virtual XML Document
  • a lazy implementation of DOM that supports
    browsing/navigation of XML documents with a
    server-side, compute-as-you-go model
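
The compute-as-you-go idea can be sketched with a node whose children are produced only when navigation asks for them. This is an illustration of lazy evaluation in general, not the DOM-VXD interface; the class and method names are hypothetical.

```python
class LazyNode:
    """A node whose children are computed only when navigation requests them."""

    def __init__(self, tag, make_children=None):
        self.tag = tag
        self._make_children = make_children  # zero-arg callable returning an iterable
        self._iter = None
        self._cache = []

    def child(self, index):
        """Return the child at index, computing only as many children as needed."""
        if self._iter is None:
            source = self._make_children() if self._make_children else ()
            self._iter = iter(source)
        while len(self._cache) <= index:
            try:
                self._cache.append(next(self._iter))
            except StopIteration:
                return None
        return self._cache[index]


computed_log = []  # records which children were actually materialized


def make_results():
    # Stands in for an expensive query result with a million elements.
    for i in range(1_000_000):
        computed_log.append(i)
        yield LazyNode(f"result{i}")


root = LazyNode("results", make_results)
third = root.child(2)
print(third.tag, len(computed_log))
```

Navigating to the third child materializes three children; the rest of the result is never computed unless the user browses further.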

15
MIX components...
  • Blended Browsing and Querying (BBQ) interface
  • supports navigation and querying of XML documents
  • generates XMAS queries on mediator views
  • generates XMAS queries modified by DOM-VXD
    operations to incrementally evaluate the result
    set, to support navigation of XML documents

16
Details of the MIX Scenario
[Diagram: MIX scenario, top to bottom]
  • BBQ interfaces (View 1, View 2) send XMAS queries to the mediator and
    receive XML data
  • Mediator: XMAS query engine, with a local data repository
  • The mediator sends XMAS query fragments to the wrappers and receives
    XML data
  • Wrappers convert the XMAS query to the local query language, e.g.
    SQL, and data in native format to XML
  • Sources: SQL database, spreadsheet, HTML files
17
The NARA Project
  • Electronic Records Management and Persistent
    Archives
  • Archive a variety of data
  • Census data (Tiger files)
  • E-mail (Usenet newsgroups used as proxy)
  • Congressional voting records (ASCII, HTML files)
  • Vietnam war casualty reports
  • Miscellaneous word processing documents
  • USPTO
  • ...

18
The NARA Usenet Collection
  • 1 million Usenet postings
  • Data archived in its original form
  • Designed an XML DTD based on Usenet standard and
    analysis of 1 million documents
  • Documents have headers with
  • 6 required keyword fields (e.g. From, Date,
    Subject)
  • 13 optional keyword fields (e.g. Followup-To,
    Keywords)
  • Rest are unrecognized keyword fields (e.g.
    Abuse-Reports-To). Found about 2200 in 1 million
    messages.
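
The header analysis above can be sketched as a classifier that tags each header field as required, optional, or unrecognized, and emits a simple XML record. The field lists below assume the "Usenet standard" is RFC 1036, and the element and attribute names are hypothetical; they are not the DTD designed for the NARA collection.

```python
from collections import Counter
from email import message_from_string
import xml.etree.ElementTree as ET

# Assumption: field lists follow RFC 1036 (6 required, 13 optional headers).
REQUIRED = {"from", "date", "newsgroups", "subject", "message-id", "path"}
OPTIONAL = {"followup-to", "expires", "reply-to", "sender", "references",
            "control", "distribution", "keywords", "summary", "approved",
            "lines", "xref", "organization"}


def classify(name):
    key = name.lower()
    if key in REQUIRED:
        return "required"
    if key in OPTIONAL:
        return "optional"
    return "unrecognized"


def posting_to_xml(raw):
    """Parse one posting and tag each header field with its kind."""
    msg = message_from_string(raw)
    root = ET.Element("posting")
    for name, value in msg.items():
        field = ET.SubElement(root, "field", name=name, kind=classify(name))
        field.text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    return root


SAMPLE = (
    "From: user@example.com\n"
    "Date: Thu, 28 Jan 1999 12:00:00 GMT\n"
    "Newsgroups: comp.text.xml\n"
    "Subject: Test\n"
    "Message-ID: <1@example.com>\n"
    "Path: news.example.com\n"
    "Keywords: xml\n"
    "Abuse-Reports-To: abuse@example.com\n"
    "\n"
    "Hello.\n"
)

root = posting_to_xml(SAMPLE)
counts = Counter(f.get("kind") for f in root.findall("field"))
print(dict(counts))
```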

19
Ingestion & Retrieval of Data Collections
[Diagram labels: Data Collection; Extract metadata (SGML/XML); HPSS;
Extract metadata; DBMS; Query Interface]

20
On-going work
  • Wrappers for GIS sources
  • modeling GIS information sources with XML DTDs
  • mapping from query language (XMAS) to operations
    supported by GIS
  • returning output from GIS in the form of XML
    documents
  • Mediation of GIS sources
  • dealing with sources with differing
    capabilities, e.g. coverage, resolution,
    themes/layers
  • techniques for specifying source capabilities and
    using that information in query processing
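
One way to picture capability-aware query processing is a filter that keeps only the sources able to answer a request. The sketch below is an assumption-laden illustration, not the MIX design: the capability record, its fields, and the sample sources are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceCapability:
    name: str
    bbox: tuple          # coverage as (min_lon, min_lat, max_lon, max_lat)
    resolution_m: float  # finest resolution offered, in meters
    layers: frozenset    # themes/layers the source can serve


def contains(outer, inner):
    """True if bounding box outer fully covers bounding box inner."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def select_sources(sources, region, layer, max_resolution_m):
    """Keep sources that cover the region, serve the layer, and are fine enough."""
    return [s.name for s in sources
            if layer in s.layers
            and s.resolution_m <= max_resolution_m
            and contains(s.bbox, region)]


sources = [
    SourceCapability("county_parcels", (-118.0, 32.0, -116.0, 34.0), 10.0,
                     frozenset({"parcels", "roads"})),
    SourceCapability("state_landuse", (-125.0, 32.0, -114.0, 42.0), 100.0,
                     frozenset({"landuse", "roads"})),
]
region = (-117.5, 32.5, -117.0, 33.0)

fine_roads = select_sources(sources, region, "roads", 50.0)
print(fine_roads)
```

A mediator could run such a filter before sending query fragments, so that only capable sources are contacted.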

21
Announcements
  • MIX Demo at the ACM SIGMOD '99 Intl. Conf. on
    Database Systems
  • May 31 - June 3, 1999, Philadelphia, PA
  • Birds of a Feather session on Data Modeling with
    XML
  • 12-1:30PM on Thursday 1/28 in Room 362, SDSC