Title: Natasha Balac and Roman Olschanowsky
1- Natasha Balac and Roman Olschanowsky
- natashab_at_sdsc.edu, roman2u_at_sdsc.edu
- San Diego Supercomputer Center
http://datacentral.sdsc.edu
2 What is Data Central?
- Data Central makes it possible to store, manage,
analyze, mine, share and publish data collections,
thereby enabling access and collaboration in the
broader scientific community
- Eligible researchers can request a data
allocation from SDSC (with or without a compute
allocation) that permits expanded access to
SDSC's Data Central facilities and services for
data collections management, data analysis and
data mining
3 Why SDSC Data Central?
- Today's scientists and engineers are increasingly
dependent on valued community data collections
and databases
- SDSC has experienced increasing demand by the
domain communities for collaborations on data
management, including:
- publishing of data in digital libraries
- sharing of data through the Web and data grids
- creating, optimizing, porting large scale
databases
- analyzing and data mining large scale data
4 A Deluge of Data
- Today, data comes from everywhere
- Scientific instruments
- Experiments
- Sensors and sensor nets
- New devices
- And is used by everyone
- Scientists
- Consumers
- Educators
- General public
- IT environments must support unprecedented
diversity, globalization, integration, scale, and
use
Life Sciences
Preservation and Archiving
Astronomy
5 What does SDSC Data Central offer?
- SDSC has been actively working and
collaborating with many researchers and national
scale projects in their data management efforts
- We offer expertise and resources for:
- Public Data Collections and Database Hosting
- Long-term storage (tape and disk)
- Remote data management and access (SRB)
- Data Analysis and Data Mining
- Professional, qualified 24/7 support
6 SDSC Resources
- SDSC operates high-end computing and data
resources, with NSF support, for the nation's
academic community
- Allocations of time are made quarterly through
merit review of proposals by a panel of
computational scientists
SDSC's DataStar
7 SDSC Compute Resources
- DataStar: 15.6 TFlops
- 1,628 Power4 processors
- IBM p655 and p690 nodes
- 4 TB total memory
- Up to 2 GBps I/O to disk
- TeraGrid Cluster: 4.4 TFlops
- 512 Itanium2 IA-64 processors
- 1.5 TB total memory
- Intimidata: 5.7 TFlops
- Only academic IBM Blue Gene system
- 2,048 PowerPC processors
- 128 I/O nodes
Intimidata Installation
8 SDSC Data Resources
- 540 TB Storage-area Network (SAN)
- 1 PB On-line disk
- 6 PB StorageTek tape library capacity
- DB2, Oracle, MySQL
- Storage Resource Broker
- GPFS-WAN with 226 TB
Petabyte-scale high-performance tape storage
system
High-performance SATA SAN disk storage system
9 Data Resources Available through DataCentral
- Disk
- 400 Terabytes SATA SAN, Fibre Channel attached
- Enables multiple high-end computers, using a
range of operating systems, to share data rapidly
and seamlessly
- Growing data storage capabilities are integrated
with high-end computational resources such as
SDSC's 15.6 Teraflop DataStar IBM supercomputer
and parallel I/O
- Accessible: Mounted, Web, SRB, GridFTP
- Tape
- 6 Petabyte capacity, high-speed robotic silos
- Disk cache front end, transparently mounted via
Sun SAMQFS file system
- Accessible: Mounted, Web, SRB, GridFTP
10 Data Resources Available through DataCentral
- Databases
- DB2, Oracle, MySQL servers
- High Availability, High Performance
- Accessible: Standard RDBMS connectivity, client
software installed on most systems
- Software
- Storage Resource Broker (SRB): State-of-the-art
data management and collaboration software for
grid file access
- Powerful software applications covering a range
of disciplines including bioscience, geoscience,
astronomy, chemistry, medicine, etc.
- A wide array of data analysis, mining and
visualization tools
11 Data Resources Available through DataCentral
- Expertise in
- High performance large data management
- Data migration
- Database application tuning, porting and
optimization
- SQL query tuning
- Portal creation and collection publication
- Schema design
- Database selection (Oracle, DB2, MySQL or flat
files)
- Data migration, upload and sharing through the
grid
- Data analysis and data mining
12 Data Resources Available through DataCentral
Quality User Support
- Consulting
- Phone, Web, e-mail
- M-F, 9 a.m. - 5 p.m.
- 24x7 Help Desk/Operational Support
- Training
- Documentation
- User Portals
- Targeted Optimization and Porting (TOP)
- Strategic Applications Collaborations (SAC)
- Strategic Community Collaborations (SCC)
13 Quality User Support: Training
- Quarterly workshops on high-end computing
- Annual summer institute
- Focused work on participants' projects
- Student expenses paid
- Extreme I/O focus this year
- Web-based training
- CI Channel
- Staff-run workshops at remote sites
14 Strategic Collaborations
- Strategic Applications Collaborations (SAC)
- SDSC staff paired with domain scientists for
projects lasting 3-12 months
- Strategic Community Collaborations (SCC)
15 SCC: Encyclopedia of Life
- Information on the what and how of the proteins
associated with 800 genomes
- Protein annotation and 3D structure modeling of
complete or partial genomes through a
computational pipeline
- Information stored in a Web-accessible database
that can be mined
- International partner sites (Singapore, Japan)
will use the same EOL pipeline
16 Data Central SAC Project: Sloan Sky Survey
- The Sloan Digital Sky Survey (www.sdss.org) will
map one-quarter of the entire sky and perform a
redshift survey of galaxies, quasars, and stars
- The project requires a database migration from
SQL Server to DB2
- Moving 78 tables and 39 views containing 6 TB of
image data
- Converting 168 functions and 126 stored
procedures from T-SQL to PL SQL
17 What is Data Science?
- Data science applications rely on data for new
discovery
- Many compute applications are also data science
applications -- the computation uses data as
input or is the data source
- Data Science application models include
- Large computation requirements, access to or
generation of big data
- Extreme usage (very large data, very long-term
data)
18 What Science Communities are Doing
- National Virtual Observatory (NVO) collections
- Collections come from large-scale telescopes
- Telescopes provide a daily sweep of the sky;
scientists clean the data, which is then converted
from temporal to spatial data, allowing indexing
over the time dimension
- All NVO data on the website is available to the
public without restriction
- A long-term preservation model is being developed
19 Large computation Data Science: SCEC --
Modeling the big one
- Challenge: What is the potential damage of a
large-scale earthquake in Southern California?
- Project: Conduct a large-scale simulation of
magnitude 7.7 seismic wave propagation on the San
Andreas Fault to understand seismic consequences
- Project leadership: Tom Jordan (SCEC), Bernard
Minster (SIO), Reagan Moore (SDSC), Carl
Kesselman (ISI)
Project funded by NSF GEO
20 Major Earthquakes on the San Andreas Fault,
1680-present
SCEC TeraShake
- The SCEC TeraShake simulation is a result of
immense effort from the Geoscience community for
over 10 years
- Focus is on understanding big earthquakes and how
they will impact sediment-filled basins
- Simulation combines massive amounts of data,
high-resolution models, large-scale supercomputer
runs
- TeraShake results provide new information
enabling better:
- Estimation of seismic risk
- Emergency preparation, response and planning
- Design of next generation of earthquake-resistant
structures
- Such simulations provide potentially immense
benefits in saving many lives and avoiding billions
in economic losses
1906 M 7.8
1857 M 7.8
1680 M 7.7
How dangerous is the southern San Andreas Fault?
21 TeraShake
- Estimating the potential damage of a magnitude
7.7 Southern California earthquake
- Large-scale simulation of seismic wave
propagation on the San Andreas Fault
- 1.8 billion gridpoints
- 240 DataStar processors
- 1 TB memory
- 5 days
- 2 GB/s continuous I/O
- 47 TB output
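To put the TeraShake figures above in proportion, the short calculation below derives a few ratios from them. It is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes), treats "2 GB/s" as a sustained rate, and assumes output was produced evenly over the full 5 days. None of these assumptions is stated on the slide.

```python
# Figures quoted on the TeraShake slide.
OUTPUT_TB = 47        # total output
RUN_DAYS = 5          # wall-clock duration of the run
IO_RATE_GB_S = 2      # quoted continuous I/O rate
GRIDPOINTS = 1.8e9    # simulation grid size

# Assumption: decimal units (not stated on the slide).
TB = 1e12
GB = 1e9

output_bytes = OUTPUT_TB * TB
run_seconds = RUN_DAYS * 24 * 3600

# Average output rate if the 47 TB were written evenly over the whole run.
avg_rate_mb_s = output_bytes / run_seconds / 1e6

# Time needed to write 47 TB at the quoted 2 GB/s rate.
hours_at_quoted_rate = output_bytes / (IO_RATE_GB_S * GB) / 3600

# Output volume per gridpoint, accumulated over all saved time steps.
kb_per_gridpoint = output_bytes / GRIDPOINTS / 1e3

print(f"average output rate: {avg_rate_mb_s:.1f} MB/s")
print(f"time to write output at 2 GB/s: {hours_at_quoted_rate:.2f} hours")
print(f"output per gridpoint: {kb_per_gridpoint:.1f} KB")
```

Under these assumptions the average output rate is far below the quoted I/O rate, which suggests the 2 GB/s figure describes what the file system sustained during I/O phases rather than a rate held for the entire run.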
22 Enabling Data Science
- Many users with large data needs
- that extend above and beyond what their home
environments can support
- increasingly dependent on valued community data
collections and databases used community-wide
- Experiencing increasing demand by the domain
communities for collaborations on:
- publishing of data in digital libraries
- sharing of data through the Web and data grids
- creating, optimizing, porting large scale
databases
- analyzing and data mining large scale data
- Comprehensive data environment that incorporates
access to the full spectrum of data-enabling
resources
23 SDSC Data Allocations Environment (datacentral.sdsc.edu)
Services
Parallel File-system: High-speed, Temporary
Data Parking (SAN): High-speed, Short-term
Data Collections (SATA): Moderate-speed, Long-term
Data Sharing (SATA): Moderate-speed, Medium-term
Disk
Local Back-up (e.g., Tape)
HPSS/SAMFS
Offsite Back-up
24 Data Science Support Systems
Expertise, Networking, Visualization, Storage
and Compute Resources
Blue Gene/L: 5.7 TF
DataStar IBM Power4: 15.6 TF
Archival Systems: 6 PB
1 PB on-line disk
25 Partial list of databases and data collections
currently housed at SDSC
- Protein Data Bank (protein data)
- National Virtual Observatory (astronomical data)
- UCSD Libraries Image Collection (ArtStore)
- National Science Digital Library (education
collection)
- SCEC (earthquake data)
- BIRN (neuroscience data)
- Encyclopedia of Life (genomic data)
- TreeBase (phylogeny and ontology information)
- Transport Classification Database (protein
information)
- Library of Congress data
- CKAAPS (protein evolutionary information)
- AfCS Molecule Pages (protein information)
- SLACC-JCSG (structural genomics data)
- APOPTOSIS DB (data on proteins related to cell
death)
- NAVDAT (geochemistry data)
- QRC (NSF data on Supercomputer Centers and PACI)
- Network Topology Data (Skitter project)
- UC Merced Library
- Biology Workbench Databases (mirrors and
originals of over 80 biology databases)
- 2 Micron All Sky Survey (astronomy data)
- Digital Palomar Observatory Sky Survey Collection
(astronomy data)
- Sloan Digital Sky Survey Collection (astronomy
data)
- Interpro Mirror (protein data)
- HPWREN (Wireless Network Analysis Data)
- HPWREN (sensor network data)
- Security logs and archives (security information)
- EarthRef Digital Archive (earth science
information)
- GERM (earth reservoir information)
- Braindata (Rutgers neuroscience collection)
- HyperLTER (hyperspectral images)
- SIO-Explorer (oceanographic voyages)
- Transana (classroom video)
- WebBase (web crawls)
- Alexandria Digital Library (photographs)
- Backscatter Data (from UCSD network telescope)
- Digital Earth Data Library (earth sciences
related datasets)
- GEON (PaleoGeographic Atlas project)
- IMDC (Internet measurement data catalog)
- Seamount Catalogue (bathymetric seamount maps)
- Hayden Planetarium Collection (astronomical data)
- TeraGrid Data (science and engineering
collections)
- Biocyc (collection of pathway/genome DBs)
- Digital Embryo (human embryology)
- National Archives (persistent archive)
- San Diego Conservation Resources Network
(sensitive species map server)
- LDAS (land data assimilation system)
- ROADNET (sensor data)
- NPACI Data Grid (scientific simulation output)
- Salk (biology data archive)
- Backbone Packet Header Traces (OC48, OC12)
- Teragrid (science and engineering collections)
- CHRONOS (analytical tools for chronostratigraphy)
- ERESE (educational Earth science portal)
- TeraBridge (Sensor stream data)
- C5 Landscape (UCSD Art dept)
26 Integrated Data Cyberinfrastructure
coordination
integration
27 An Introduction to the
By Roman Olschanowsky roman2u_at_sdsc.edu
28 Sites Using the SRB
29 SDSC SRB Projects (60 million, .5 PB)
- Digital Libraries
- UCB, Umich, UCSB, Stanford, CDL
- NSF NSDL - UCAR / DLESE
- NASA Information Power Grid
- Astronomy
- National Virtual Observatory
- 2MASS Project (2 Micron All Sky Survey)
- Particle Physics
- Particle Physics Data Grid (DOE)
- GriPhyN
- SLAC Synchrotron Data Repository
- Medicine
- Digital Embryo (NLM)
- Earth Systems Sciences
- ESIPS
- LTER
- Persistent Archives
- NARA
- LOC
30 SDSC Storage Systems
Datastar GPFS: 108 TB
HPSS
Teragrid GPFS-WAN: 210 TB
35 TB
Teragrid GPFS: 51 TB
6 PB Tape Capacity
SamQFS: 400 TB
Bluegene (Intimidata) GPFS: 40 TB
31 SRB Helps external users manage their Data
Diagram: external users connect through SRB to SDSC (1 PB Disk, 6 PB Tape Capacity)
32 Storage Resource Broker (SRB)
- A distributed file system (Data Grid)
- Client-Server, Server-Server architecture
- Abstracts physical
- SRB provides the ability to transparently share
data across remote sites
- Heterogeneous Resources
- Single sign on
- Single logical file hierarchy
33 What we are familiar with
34 What we are not familiar with, yet
35 Interfaces to the Storage Resource Broker
- inQ: Windows Client
- Scommands: UNIX, DOS Command line Client
- Jargon: Java API and GUI components
- mySRB: Web Client
- Matrix: WSDL, Data Grid Workflows
- C, C++: C and C++ API
- Python: Python API
- Perl: Perl API
36 Common Scommands (69 total)
- Scp
- Smv (logical)
- Sphymove (physical)
- Srm
- Smkdir
- Srmdir
- Serror
- Schmod
- Sexit
- Sinit
- Senv
- Spwd
- Sls
- Scd
- Sget
- Sput
- Ssh
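As an illustration of how a few of the Scommands listed above might be combined, the sketch below assembles a simple upload, list and download round trip. It is a hedged example, not a transcript of a real session: the `tutorial_session` and `run` helpers are hypothetical, the collection path is borrowed from the File Replication slide, and the argument order shown for `Sput` and `Sget` should be checked against the help text of the installed SRB client. By default nothing is executed; the commands are only printed.

```python
import shlex
import subprocess
from typing import List


def tutorial_session(local_file: str, collection: str,
                     download_dir: str) -> List[List[str]]:
    """Build the Scommand invocations for an upload/list/download round trip."""
    name = local_file.rsplit("/", 1)[-1]
    return [
        ["Sinit"],                                        # start an SRB session
        ["Smkdir", collection],                           # create the collection
        ["Sput", local_file, collection],                 # upload a local file
        ["Sls", "-l", collection],                        # long listing (replicas)
        ["Sget", f"{collection}/{name}", download_dir],   # download the file
        ["Sexit"],                                        # end the session
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # Requires the SRB client tools on PATH and a configured account.
            subprocess.run(argv, check=True)


# Hypothetical file and collection names, used for illustration only.
session = tutorial_session("Doc.txt", "/home/Demo/SRB-Tutorial/files-2", ".")
run(session)  # dry run: prints the commands without contacting an SRB server
```

Keeping the command list separate from its execution makes the sequence easy to inspect or log before anything touches a real collection.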
37 mySRB
38 BIRN Portal (Perl based)
39 NEEScentral Portal (PHP based)
40 The BIRN SRB Data Grid
41 The BIRN Data Grid
42 The grid is in the details
43 File Replication
- Sls
- /home/Demo/SRB-Tutorial/files-2
- Doc.txt
- Sls -l
- /home/Demo/SRB-Tutorial/files-2
- romanoly 0 z-ucsd-ncmir-nas1 15 2003-07-09-05.15 Doc.txt
- romanoly 1 z-jhu-cis-nas0 15 2003-07-09-05.16 Doc.txt
- romanoly 2 z-stanford-lucas-nas 15 2003-07-09-05.16 Doc.txt
- romanoly 3 z-umn-cmrr-nas0 15 2003-07-09-05.16 Doc.txt
- romanoly 4 z-uci-bic-nas0 15 2003-07-09-05.17 Doc.txt
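The `Sls -l` listing above shows one logical file with five replicas on different storage resources. The sketch below parses such a listing into records so the replicas can be inspected programmatically. The column meanings (owner, replica number, resource, size, timestamp, name) are inferred from the slide rather than taken from SRB documentation, and the `Replica` class and `parse_sls_long` function are hypothetical helpers. File names containing spaces are not handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replica:
    owner: str
    replica_num: int
    resource: str
    size_bytes: int
    timestamp: str
    name: str


def parse_sls_long(lines: List[str]) -> List[Replica]:
    """Parse 'Sls -l' style rows; rows that do not have six fields are skipped."""
    replicas = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # e.g. the collection path header
        owner, num, resource, size, stamp, name = fields
        if not (num.isdigit() and size.isdigit()):
            continue  # not a replica row in the assumed layout
        replicas.append(Replica(owner, int(num), resource, int(size), stamp, name))
    return replicas


# Listing copied from the slide.
SAMPLE = """\
/home/Demo/SRB-Tutorial/files-2
romanoly 0 z-ucsd-ncmir-nas1 15 2003-07-09-05.15 Doc.txt
romanoly 1 z-jhu-cis-nas0 15 2003-07-09-05.16 Doc.txt
romanoly 2 z-stanford-lucas-nas 15 2003-07-09-05.16 Doc.txt
romanoly 3 z-umn-cmrr-nas0 15 2003-07-09-05.16 Doc.txt
romanoly 4 z-uci-bic-nas0 15 2003-07-09-05.17 Doc.txt
""".splitlines()

replicas = parse_sls_long(SAMPLE)
for r in replicas:
    print(f"replica {r.replica_num} of {r.name} on {r.resource}")
```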
44 SRB Location or Slave Server
Physical Resources: z-jhu-cis-nas0, z-jhu-cis-nas1, z-jhu-cis-nas2
Logical Resource: jhu-cis-nas
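The relationship between a logical resource and its physical resources can be pictured with a small data structure. The sketch below is a conceptual model only and does not reflect how SRB itself is implemented: the `LogicalResource` class is hypothetical, and the round-robin selection policy is an assumption chosen to keep the example simple. The resource names are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LogicalResource:
    """One logical name that stands for a pool of physical resources."""
    name: str
    physical: List[str]
    cursor: int = field(default=0, repr=False)

    def pick(self) -> str:
        """Choose one physical resource (assumed policy: round-robin)."""
        chosen = self.physical[self.cursor % len(self.physical)]
        self.cursor += 1
        return chosen

    def replication_targets(self) -> List[str]:
        """All physical resources, e.g. for writing a copy to each one."""
        return list(self.physical)


pool = LogicalResource(
    name="jhu-cis-nas",
    physical=["z-jhu-cis-nas0", "z-jhu-cis-nas1", "z-jhu-cis-nas2"],
)

# A client names only the logical resource; the pool decides where data lands.
print(pool.name, "->", pool.pick())
print("replicate to:", ", ".join(pool.replication_targets()))
```

The point of the indirection is that clients refer to `jhu-cis-nas` alone, so physical resources can be added or swapped without changing client-side names.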
45 Pooling physical resources
46 Logical / Compound Resources
instant replication
fast archival
resource pooling
My-Resource
47 In Conclusion
- SRB handles large data and provides the ability
to share and collaborate on distributed
heterogeneous resources.
- www.sdsc.edu/srb
- srb_at_sdsc.edu
48 Getting an Allocation: It's Free!
- Who should apply?
- Open to researchers affiliated with US
educational institutions
- Proposals merit-reviewed quarterly by the Data
Allocations Committee
- Types of Allocations
- Expedited Allocations
- 1 TB or less of disk and tape, 1st year
- 5 GB Database, 1st year
- Yearly review
- Medium Allocations
- Under 30 TB
- Large Allocations
- Larger than 30 TB
- Data Allocations
- Getting Started: http://datacentral.sdsc.edu
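The allocation tiers above can be written as a small decision rule. The sketch below is illustrative only: the `allocation_type` function is hypothetical, it considers disk and tape storage alone (not the 5 GB database limit), and the slide does not say which tier a request of exactly 30 TB falls into, so placing it in the large tier is an assumption.

```python
def allocation_type(storage_tb: float) -> str:
    """Map a requested disk/tape size in TB to an allocation tier from the slide."""
    if storage_tb <= 0:
        raise ValueError("requested storage must be positive")
    if storage_tb <= 1:
        return "expedited"   # 1 TB or less
    if storage_tb < 30:
        return "medium"      # under 30 TB
    # Assumption: exactly 30 TB is treated as large; the slide leaves it open.
    return "large"           # larger than 30 TB


for request_tb in (0.5, 10, 45):
    print(f"{request_tb} TB -> {allocation_type(request_tb)} allocation")
```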
49 Thank You
- SDSC Data Resources and Allocations
- http://datacentral.sdsc.edu/