Title: TG 06 Data Collections Tutorial
1. TG 06 Data Collections Tutorial
- Natasha Balac
- Roman Olschanowsky
2. TG Collections Present and Future
- Data collections represent permanent data storage
that is organized, searchable, and available to a
wide audience, either a collaborative group or
the scientific public in general
- The term "collections" is also used to refer to
the libraries or groupings of data in the Storage
Resource Broker (SRB), which is a
client/server-based suite of data storage and
movement tools
3. TG Collections Present and Future
- A number of data collection resources are
available at the TeraGrid sites
- The data collections table lists collections that
are available to or created by TeraGrid users
- The list contains a brief description or abstract
of each collection that is currently in
production at TeraGrid sites
- URL links connect directly to the collection
interface or to more detailed information
- http://www.teragrid.org/userinfo/data/collections.php
4. State of Collections
- 89 collections from 5 sites have been listed as
TeraGrid Collections on the web site
- http://www.teragrid.org/userinfo/guide_data_collections_table.html
- Many of the listed collections lack critical
pieces of information that they need in order to
be useful and usable by the user community
- A large number of the listed collections have
inadequate or completely missing documentation
- Many of the listed collections do not have any
apparent connection to the rest of TG's
resources or users
5. State of Collections
- This is due to the lack of any coherent
policy stating the requirements that a
collection must satisfy in order to be made
available on TG resources and become an
official TG collection
- TG Collections and usage models requirement
analysis team (RAT)
6. Collections RAT Charter
- Data Collections are a significant resource
providing one of the vital non-compute classes of
service to the TeraGrid user community
- Data Collections differ from the traditional
compute resources and therefore necessitate
special attention
- Potential and value must be presented to and
utilized by the scientific user community in an
effective manner
- Special requirements must be made clear to and
understood by both users and TeraGrid staff
7. Collections RAT Charter
- Data collections should represent permanent data
storage that is well organized, documented,
searchable, publicly available and valuable to a
wide audience
- All TG collections should provide a common look and
feel
- Since providing data collections as a resource is
a fairly new endeavor, there are many
uncertainties about collection usability,
infrastructure, and policies, and many practical
usage questions
8. Collections RAT Recommendations
- We worked on answering questions regarding what
defines and what constitutes a formally
designated TG collection
- What is an appropriate TG collection?
- What criteria must a collection meet to qualify
as a TG collection? What is the value added to
the user community?
- What does it mean to make data sets available on
TG? Who, how and where?
- What are the usage models for TG data collections,
whether they are tied to other TG resources or
not? Should the data be used in conjunction with
compute resources, visualization resources, other
data collections, science gateways or some other
TG resource?
9. TG Collection Categories
- The recommendation is that there should be several
designated categories of TG collections based on
usage model criteria
- Category 1: Grid-related Data Collections
- Category 2: Compute-related Data Collections
- Category 3: GPFS-WAN Data Collections
- Category 4: TG-affiliated Collections
- Category 5: Non-TG Collection: Other options
(DataCentral, etc.)
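As a rough illustration only, the five proposed categories could be encoded as an enumeration so that each catalog entry carries its category explicitly. The names `CollectionCategory`, `LABELS`, and `describe` are hypothetical and are not part of any TeraGrid software. The labels are copied from the category list above.

```python
from enum import Enum


class CollectionCategory(Enum):
    """The five categories proposed by the RAT (identifier names are illustrative)."""
    GRID_RELATED = 1
    COMPUTE_RELATED = 2
    GPFS_WAN = 3
    TG_AFFILIATED = 4
    NON_TG = 5


# Human-readable labels taken from the slide text.
LABELS = {
    CollectionCategory.GRID_RELATED: "Grid-related Data Collections",
    CollectionCategory.COMPUTE_RELATED: "Compute-related Data Collections",
    CollectionCategory.GPFS_WAN: "GPFS-WAN Data Collections",
    CollectionCategory.TG_AFFILIATED: "TG-affiliated Collections",
    CollectionCategory.NON_TG: "Non-TG Collection",
}


def describe(category: CollectionCategory) -> str:
    """Return a label such as 'Category 3: GPFS-WAN Data Collections'."""
    return f"Category {category.value}: {LABELS[category]}"
```

For example, `describe(CollectionCategory(3))` looks up the category by its number and returns its label.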
10. Category 1: Grid-related Data Collections
- Data is stored on TG resources and accessed over
the TG network
- Data is hosted on hardware connected to the TG
network, taking advantage of the high-speed
network
- Usage model: an interface is provided to
efficiently retrieve data from the repository
- Data stored at the resource might require or be
protected by Globus authentication
- The data collection provides low-level APIs for
accessibility or a resource-defined API (web
pages vs. JDBC or a CGI script returning XML)
- Data is accessible through standard TG software
stack utilities such as Globus, SRB,
GridFTP, etc.
11. Category 2: Compute-related Data Collections
- Collections using TG compute resources
- Collections using visualization or data analysis
resources provided by TG
- Collections might have different front ends:
portal, gateway, etc.
- Examples
- Purdue portal: consolidating several earth
observation data collections into one convenient
portal
- http://www.purdue.teragrid.org/portal
- Gateways
12. Category 3: GPFS-WAN Data Collections
- Collections sitting on GPFS-WAN
- Taking advantage of the proximity of the compute
resources
- Collections that are being computed on
- Allocations RAT is working on having a section of
GPFS-WAN disk space allocated for collections
13. Category 4: TG-affiliated Collections
- Collection belongs to a TG-related project
- Collection exhibits some tenuous link to TG
- Manifests a potential to be ingested into the
grid and become a true TG collection
14. Category 5: Non-TG Collection: Other options
- DataCentral and the data allocations process
15. What is Data Central?
- The first program of its kind to support research
and community data collections and databases
- Data Central makes it possible to store, manage,
analyze, mine, share and publish data collections,
thereby enabling access and collaboration in the
broader scientific community
16. Data Central at work
- Eligible researchers can request a data
allocation from SDSC (with or without a compute
allocation) that permits expanded access to
SDSC's Data Central facilities and services for
data collections management, data analysis and
data mining
17. Why SDSC Data Central?
- Today's scientists and engineers are increasingly
dependent on valued community data collections
and databases
- SDSC has experienced increasing demand from the
domain communities for collaborations on data
management, including
- publishing of data in digital libraries
- sharing of data through the Web and data grids
- creating, optimizing, and porting large-scale
databases
- analyzing and data mining large-scale data
18. A Deluge of Data
- Today, data comes from everywhere
- Scientific instruments
- Experiments
- Sensors and sensor nets
- New devices
- And is used by everyone
- Scientists
- Consumers
- Educators
- General public
- IT environments must support unprecedented
diversity, globalization, integration, scale, and
use
[Image labels: Life Sciences; Preservation and Archiving; Astronomy]
19. What does SDSC Data Central offer?
- SDSC has been actively collaborating with many
researchers and national-scale projects in their
data management efforts
- We offer Expertise and Resources for
- Public Data Collections and Database Hosting
- Long-term storage (tape and disk)
- Remote data management and access (SRB)
- Data Analysis and Data Mining
- Professional, qualified 24/7 support
20. SDSC Data Resources
- 540 TB Storage-area Network (SAN)
- 1 PB On-line disk
- 6 PB StorageTek tape library capacity
- DB2, Oracle, MySQL
- Storage Resource Broker
[Image captions: Petabyte-scale high-performance tape storage system; High-performance SATA SAN disk storage system]
21. Data Resources Available through DataCentral
- Disk
- 400 Terabytes SATA SAN, Fibre Channel attached
- Enables multiple high-end computers, using a
range of operating systems, to share data rapidly
and seamlessly
- Growing data storage capabilities are integrated
with high-end computational resources such as
SDSC's 15.6 Teraflop DataStar IBM supercomputer
and parallel I/O
- Accessible: Mounted, Web, SRB, GridFTP
- Tape
- 6 Petabyte capacity, high-speed robotic silos
- Disk cache front end, transparently mounted via
Sun SAMQFS file system
- Accessible: Mounted, Web, SRB, GridFTP
22. Data Resources Available through DataCentral
- Databases
- DB2, Oracle, MySQL servers
- High Availability, High Performance
- Accessible: Standard RDBMS connectivity, client
software installed on most systems
- Software
- Storage Resource Broker (SRB): State-of-the-art
data management and collaboration software for
grid file access
- Powerful software applications covering a range
of disciplines including bioscience, geoscience,
astronomy, chemistry, medicine, etc.
- A wide array of data analysis, mining and
visualization tools
23. Data Resources Available through DataCentral
- Expertise in
- High performance large data management
- Data migration, upload and sharing through the
grid
- Database application tuning, porting and
optimization
- SQL query tuning
- Schema design
- Data analysis and data mining
- Portal creation and collection publication
24. Data Resources Available through DataCentral
Quality User Support
- Consulting
- Phone, Web, e-mail
- M-F, 9 a.m. - 5 p.m.
- 24x7 Help Desk/Operational Support
- Training
- Documentation
- User Portals
- Targeted Optimization and Porting (TOP)
- Strategic Applications Collaborations (SAC)
- Strategic Community Collaborations (SCC)
25. SDSC Data Central Architecture
[Architecture diagram. Labels, in extraction order: data-login; web farm; DataStar; GPFS 108 TB; HPSS; DB2; Oracle; 35 TB; 13 TB; 25 TB; TeraGrid; GPFS 51 TB; 6 PB Tape Capacity; TeraGrid GPFS-WAN 210 TB; SamQFS; 400 TB; Blue Gene (Intimidata); GPFS 40 TB]
26. Partial list of databases and data collections
currently housed at SDSC
- Protein Data Bank (protein data)
- National Virtual Observatory (astronomical data)
- UCSD Libraries Image Collection (ArtStore)
- National Science Digital Library (education
collection)
- SCEC (earthquake data)
- BIRN (neuroscience data)
- Encyclopedia of Life (genomic data)
- TreeBase (phylogeny and ontology information)
- Transport Classification Database (protein
information)
- Library of Congress data
- CKAAPS (protein evolutionary information)
- AfCS Molecule Pages (protein information)
- SLACC-JCSG (structural genomics data)
- APOPTOSIS DB (proteins related to cell death
data)
- NAVDAT (geochemistry data)
- QRC (NSF data on Supercomputer Centers and PACI)
- Network Topology Data (Skitter project)
- UC Merced Library
- Biology Workbench Databases (mirrors and
originals of over 80 biology databases)
- 2 Micron All Sky Survey (astronomy data)
- Digital Palomar Observatory Sky Survey Collection
(astronomy data)
- Sloan Digital Sky Survey Collection (astronomy
data)
- Interpro Mirror (protein data)
- HPWREN (Wireless Network Analysis Data)
- HPWREN (sensor network data)
- Security logs and archives (security information)
- EarthRef Digital Archive (earth science
information)
- GERM (earth reservoir information)
- Braindata (Rutgers neuroscience collection)
- HyperLTER (hyperspectral images)
- SIO-Explorer (oceanographic voyages)
- Transana (classroom video)
- WebBase (web crawls)
- Alexandria Digital Library (photographs)
- Backscatter Data (from UCSD network telescope)
- Digital Earth Data Library (earth sciences
related datasets)
- GEON (PaleoGeographic Atlas project)
- IMDC (Internet measurement data catalog)
- Seamount Catalogue (bathymetric seamount maps)
- Hayden Planetarium Collection (astronomical data)
- TeraGrid Data (science and engineering
collections)
- Biocyc (collection of pathway/genome DBs)
- Digital Embryo (human embryology)
- National Archives (persistent archive)
- San Diego Conservation Resources Network
(sensitive species map server)
- LDAS (land data assimilation system)
- ROADNET (sensor data)
- NPACI Data Grid (scientific simulation output)
- Salk (biology data archive)
- Backbone Packet Header Traces (OC48, OC12)
- Teragrid (science and engineering collections)
- CHRONOS (analytical tools for chronostratigraphy)
- ERESE (educational Earth science portal)
- TeraBridge (Sensor stream data)
- C5 Landscape (UCSD Art dept)
27. Sites Using the SRB
28. SDSC SRB Projects (60 million, 0.5 PB)
- Digital Libraries
- UCB, Umich, UCSB, Stanford, CDL
- NSF NSDL - UCAR / DLESE
- NASA Information Power Grid
- Astronomy
- National Virtual Observatory
- 2MASS Project (2 Micron All Sky Survey)
- Particle Physics
- Particle Physics Data Grid (DOE)
- GriPhyN
- SLAC Synchrotron Data Repository
- Medicine
- Digital Embryo (NLM)
- Earth Systems Sciences
- ESIPS
- LTER
- Persistent Archives
- NARA
- LOC
29. Integrated Data Cyberinfrastructure
[Diagram labels: coordination; integration]
30. Getting an Allocation
- Who should apply?
- Open to researchers affiliated with US
educational institutions
- Proposals merit-reviewed quarterly by Data
Allocations Committee
- Types of Allocations
- Expedited Allocations
- 1 TB or less of disk/tape, 1st year
- 30 GB Database, 1st year
- Yearly review
- Medium Allocations
- Under 30 TB
- Large Allocations
- Larger than 30 TB
- Data Allocations
- Getting Started: http://datacentral.sdsc.edu
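The size thresholds above can be sketched as a small helper. This is only an illustration under stated assumptions, not SDSC's actual review logic. The function name `classify_allocation` is hypothetical. The slide does not say how a request of exactly 30 TB is treated, so the sketch assumes it counts as Large. The separate 30 GB database limit is ignored here.

```python
def classify_allocation(requested_tb: float) -> str:
    """Map a requested storage size in terabytes to an allocation type.

    Thresholds follow the slide: 1 TB or less is Expedited, under 30 TB
    is Medium, larger than 30 TB is Large.
    """
    if requested_tb <= 0:
        raise ValueError("requested size must be positive")
    if requested_tb <= 1:
        return "Expedited"
    if requested_tb < 30:
        return "Medium"
    # Assumption: exactly 30 TB is treated as Large (not stated in the source).
    return "Large"
```

A call such as `classify_allocation(12)` would fall in the Medium range, while `classify_allocation(0.5)` would qualify as Expedited.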
31. TG Data Collections Allocations
- How to provide a mechanism for the addition of
new Data Collections that satisfy the specified
requirements
- How should the current allocations process be
modified to accommodate the data collections
resource?
- What are the storage and allocations guidelines
and policies in conjunction with data
collections? Who is eligible to request a data
allocation?
- How to extend/terminate allocations? What kind of
review process is involved? What parameters
would be considered?
- Usage monitoring and usage tracking tools
32. TG Collection Process and Procedures
- Documentation: Well-Defined Metadata
- Yearly review for each collection
- Formal Allocation process
- Accounting process
- TG Collections coordinator
- SLA for each collection, renewal
- Parking vs. Collections vs. Project data
33. Other tools provided supporting collections
- What tools for management, analysis, mining,
access and documentation will be provided with
the collections?
- Who is responsible for providing, installing,
and maintaining tools?
- What data management tools are available?
34. Thank You
- Natasha Balac natashab_at_sdsc.edu
- Roman Olschanowsky roman2u_at_sdsc.edu