Natasha Balac and Roman Olschanowsky - PowerPoint PPT Presentation

Provided by: sdsc

Transcript and Presenter's Notes

Title: Natasha Balac and Roman Olschanowsky


1
  • Natasha Balac and Roman Olschanowsky
  • natashab_at_sdsc.edu, roman2u_at_sdsc.edu
  • San Diego Supercomputer Center

http://datacentral.sdsc.edu
2
What is Data Central?
  • Data Central makes it possible to store, manage,
    analyze, mine, share and publish data collections
    thereby enabling access and collaboration in the
    broader scientific community
  • Eligible researchers can request a data
    allocation from SDSC (with or without a compute
    allocation) that permits expanded access to
    SDSC's Data Central facilities and services for
    data collections management, data analysis and
    data mining

3
Why SDSC Data Central?
  • Today's scientists and engineers are increasingly
    dependent on valued community data collections
    and databases
  • SDSC has experienced increasing demand by the
    domain communities for collaborations on data
    management including
  • publishing of data in digital libraries
  • sharing of data through the Web and data grids
  • creating, optimizing, porting large scale
    databases
  • analyzing and data mining large scale data

4
A Deluge of Data
  • Today, data comes from everywhere
  • Scientific instruments
  • Experiments
  • Sensors and sensor nets
  • New devices
  • And is used by everyone
  • Scientists
  • Consumers
  • Educators
  • General public
  • IT environments must support unprecedented
    diversity, globalization, integration, scale, and
    use

Life Sciences
Preservation and Archiving
Astronomy
5
What does SDSC Data Central offer?
  • SDSC has been actively working with and
    collaborating with many researchers and national
    scale projects in their data management efforts
  • We offer Expertise and Resources for
  • Public Data Collections and Database Hosting
  • Long-term storage (tape and disk)
  • Remote data management and access (SRB)
  • Data Analysis and Data Mining
  • Professional, qualified 24/7 support

6
SDSC Resources
  • SDSC operates high-end computing and data
    resources, with NSF support, for the nation's
    academic community
  • Allocations of time made quarterly through
    merit-review of proposals by panel of
    computational scientists

SDSC's DataStar
7
SDSC Compute Resources
  • DataStar: 15.6 TFlops
  • 1,628 Power4 processors
  • IBM p655 and p690 nodes
  • 4 TB total memory
  • Up to 2 GBps I/O to disk
  • TeraGrid Cluster: 4.4 TFlops
  • 512 Itanium2 IA-64 processors
  • 1.5 TB total memory
  • Intimidata: 5.7 TFlops
  • Only academic IBM Blue Gene system
  • 2,048 PowerPC processors
  • 128 I/O nodes

Intimidata Installation
8
SDSC Data Resources
  • 540 TB Storage-area Network (SAN)
  • 1 PB On-line disk
  • 6 PB StorageTek tape library capacity
  • DB2, Oracle, MySQL
  • Storage Resource Broker
  • GPFS-WAN with 226 TB

Petabyte-scale high-performance tape storage system
High-performance SATA SAN disk storage system
9
Data Resources Available through DataCentral
  • Disk
  • 400 terabytes of SATA SAN, Fibre Channel attached
  • Enables multiple high-end computers, using a
    range of operating systems, to share data rapidly
    and seamlessly
  • Growing data storage capabilities are integrated
    with high-end computational resources such as
    SDSC's 15.6-teraflop DataStar IBM supercomputer
    and parallel I/O
  • Accessible: mounted, Web, SRB, GridFTP
  • Tape
  • 6-petabyte-capacity high-speed robotic silos
  • Disk cache front end, transparently mounted via
    Sun SAM-QFS file system
  • Accessible: mounted, Web, SRB, GridFTP

10
Data Resources Available through DataCentral
  • Databases
  • DB2, Oracle, MySQL servers
  • High Availability, High Performance
  • Accessible: standard RDBMS connectivity, client
    software installed on most systems
  • Software
  • Storage Resource Broker (SRB): state-of-the-art
    data management and collaboration software for
    grid file access
  • Powerful software applications covering a range
    of disciplines including bioscience, geoscience,
    astronomy, chemistry, medicine, etc.
  • A wide array of data analysis, mining and
    visualization tools

11
Data Resources Available through DataCentral
  • Expertise in
  • High performance large data management
  • Data migration
  • Database application tuning, porting and
    optimization
  • SQL query tuning
  • Portal creation and collection publication
  • Schema design
  • Database selection (Oracle, DB2, MySQL or flat
    files)
  • Data migration, upload and sharing through the
    grid
  • Data analysis and data mining

12
Data Resources Available through DataCentral
Quality User Support
  • Consulting
  • Phone, Web, e-mail
  • M-F, 9 a.m. - 5 p.m.
  • 24x7 Help Desk/Operational Support
  • Training
  • Documentation
  • User Portals
  • Targeted Optimization and Porting (TOP)
  • Strategic Applications Collaborations (SAC)
  • Strategic Community Collaborations (SCC)

13
Quality User Support: Training
  • Quarterly workshops on high-end computing
  • Annual summer institute
  • Focused work on participants' projects
  • Student expenses paid
  • Extreme I/O focus this year
  • Web-based training
  • CI Channel
  • Staff-run workshops at remote site

14
Strategic Collaborations
  • Strategic Applications Collaborations (SAC)
  • SDSC staff paired with domain scientists for
    projects lasting 3-12 months
  • Strategic Community Collaborations (SCC)

15
SCC: Encyclopedia of Life
  • Information on the what and how of the proteins
    associated with 800 genomes
  • Protein annotation and 3D structure modeling of
    complete or partial genomes through a
    computational pipeline
  • Information stored in Web-accessible database
    that can be mined
  • International partner sites (Singapore, Japan)
    will use same EOL pipeline

16
Data Central SAC Project: Sloan Sky Survey
  • The Sloan Digital Sky Survey (www.sdss.org) will
    map one-quarter of the entire sky and perform a
    redshift survey of galaxies, quasars, and stars
  • The project requires a database migration from
    SQL Server to DB2
  • Moving 78 tables and 39 views containing 6 TB of
    image data
  • Converting 168 functions and 126 stored
    procedures from T-SQL to PL/SQL

17
What is Data Science?
  • Data science applications rely on data for new
    discovery
  • Many compute applications are also data science
    applications -- the computation uses data as
    input or is the data source
  • Data Science application models include
  • Large computation requirements, access to or
    generation of big data
  • Extreme usage (very large data, very long-term
    data)

18
What Science Communities are Doing
  • National Virtual Observatory (NVO) collections
  • Collections come from large-scale telescopes
  • Telescopes provide a daily sweep of the sky;
    scientists clean the data, which is then converted
    from temporal to spatial data, allowing indexing
    over the time dimension
  • All NVO data on website available to the public
    without restriction
  • Long-term preservation model is being developed

19
Large-Computation Data Science: SCEC --
Modeling the "Big One"
  • Challenge: What is the potential damage of a
    large-scale earthquake in Southern California?
  • Project: Conduct a large-scale simulation of
    magnitude 7.7 seismic wave propagation on the San
    Andreas Fault to understand seismic consequences
  • Project leadership: Tom Jordan (SCEC), Bernard
    Minster (SIO), Reagan Moore (SDSC), Carl
    Kesselman (ISI)

Project funded by NSF GEO
20
Major Earthquakes on the San Andreas Fault,
1680-present
SCEC TeraShake
  • The SCEC TeraShake simulation is a result of
    immense effort from the Geoscience community for
    over 10 years
  • Focus is on understanding big earthquakes and how
    they will impact sediment-filled basins
  • Simulation combines massive amounts of data,
    high-resolution models, large-scale supercomputer
    runs
  • TeraShake results provide new information
    enabling better
  • Estimation of seismic risk
  • Emergency preparation, response and planning
  • Design of next generation of earthquake-resistant
    structures
  • Such simulations provide potentially immense
    benefits in saving both many lives and billions
    in economic losses

1906 M 7.8
1857 M 7.8
1680 M 7.7
How dangerous is the southern San Andreas Fault?
21
TeraShake
  • Estimating the potential damage of a magnitude
    7.7 Southern California earthquake
  • Large-scale simulation of seismic wave
    propagation on the San Andreas Fault
  • 1.8 billion gridpoints
  • 240 DataStar processors
  • 1 TB memory
  • 5 days
  • 2 GB/s continuous I/O
  • 47 TB output
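
As a rough back-of-envelope check on the figures above, the following Python sketch derives a few per-processor and average rates. It assumes decimal units (1 TB = 10^12 bytes) and treats the 5 days as continuous wall-clock time; neither assumption is stated on the slide, and the names used are illustrative only.

```python
# Back-of-envelope arithmetic on the TeraShake figures quoted on the slide.
# Assumptions (not stated in the source): decimal units (1 TB = 1e12 bytes)
# and "5 days" treated as continuous wall-clock time.

GRIDPOINTS = 1.8e9
PROCESSORS = 240
MEMORY_BYTES = 1e12
RUN_SECONDS = 5 * 24 * 3600
OUTPUT_BYTES = 47e12
QUOTED_IO_BYTES_PER_S = 2e9


def terashake_estimates():
    """Return derived quantities as a dict of floats."""
    avg_output_rate = OUTPUT_BYTES / RUN_SECONDS
    return {
        "gridpoints_per_processor": GRIDPOINTS / PROCESSORS,
        "memory_bytes_per_gridpoint": MEMORY_BYTES / GRIDPOINTS,
        "avg_output_mb_per_s": avg_output_rate / 1e6,
        "avg_fraction_of_quoted_io": avg_output_rate / QUOTED_IO_BYTES_PER_S,
    }


if __name__ == "__main__":
    for name, value in terashake_estimates().items():
        print(f"{name}: {value:,.2f}")
```

Under these assumptions each processor handles 7.5 million gridpoints, and 47 TB spread evenly over the full run would average roughly 109 MB/s. The slide does not say how the 2 GB/s figure relates to the total output volume.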

22
Enabling Data Science
  • Many users with large data needs
  • that extend above and beyond what their home
    environments can support
  • increasingly dependent on valued community data
    collections and databases used community-wide
  • Experiencing increasing demand by the domain
    communities for collaborations on
  • publishing of data in digital libraries
  • sharing of data through the Web and data grids
  • creating, optimizing, porting large scale
    databases
  • analyzing and data mining large scale data
  • Comprehensive data environment that incorporates
    access to the full spectrum of data enabling
    resources

23
SDSC Data Allocations Environment (datacentral.sdsc.edu)
Services
  • Parallel File-system: high-speed, temporary
  • Data Parking (SAN): high-speed, short-term
  • Data Collections (SATA): moderate-speed, long-term
  • Data Sharing (SATA): moderate-speed, medium-term
Disk
Local Back-up (e.g., Tape)
HPSS/SAMFS
Offsite Back-up
24
Data Science Support Systems
  • Archival Systems: 6 PB
  • Blue Gene/L: 5.7 TF
  • DataStar IBM Power4: 15.6 TF
  • 1 PB on-line disk
  • Expertise, Networking, Visualization, Storage
    and Compute Resources
25
Partial list of databases and data collections
currently housed at SDSC
  • Protein Data Bank (protein data)
  • National Virtual Observatory (astronomical data)
  • UCSD Libraries Image Collection (ArtStore)
  • National Science Digital Library (education
    collection)
  • SCEC (earthquake data)
  • BIRN (neuroscience data)
  • Encyclopedia of Life (genomic data)
  • TreeBase (phylogeny and ontology information)
  • Transport Classification Database (protein
    information)
  • Library of Congress data
  • CKAAPS (protein evolutionary information)
  • AfCS Molecule Pages (protein information)
  • SLACC-JCSG (structural genomics data)
  • APOPTOSIS DB (proteins related to cell death
    data)
  • NAVDAT (geochemistry data)
  • QRC (NSF data on Supercomputer Centers and PACI)
  • Network Topology Data (Skitter project)
  • UC Merced Library
  • Biology Workbench Databases (mirrors and
    originals of over 80 biology databases)
  • 2 Micron All Sky Survey (astronomy data)
  • Digital Palomar Observatory Sky Survey Collection
    (astronomy data)
  • Sloan Digital Sky Survey Collection (astronomy
    data)
  • Interpro Mirror (protein data)
  • HPWREN (wireless network analysis data)
  • HPWREN (sensor network data)
  • Security logs and archives (security information)
  • EarthRef Digital Archive (earth science
    information)
  • GERM (earth reservoir information)
  • Braindata (Rutgers neuroscience collection)
  • HyperLTER (hyperspectral images)
  • SIO-Explorer (oceanographic voyages)
  • Transana (classroom video)
  • WebBase (web crawls)
  • Alexandria Digital Library (photographs)
  • Backscatter Data (from UCSD network telescope)
  • Digital Earth Data Library (earth sciences
    related datasets)
  • GEON (PaleoGeographic Atlas project)
  • IMDC (Internet measurement data catalog)
  • Seamount Catalogue (bathymetric seamount maps)
  • Hayden Planetarium Collection (astronomical data)
  • TeraGrid Data (science and engineering
    collections)
  • Biocyc (collection of pathway/genome DBs)
  • Digital Embryo (human embryology)
  • National Archives (persistent archive)
  • San Diego Conservation Resources Network
    (sensitive species map server)
  • LDAS (land data assimilation system)
  • ROADNET (sensor data)
  • NPACI Data Grid (scientific simulation output)
  • Salk (biology data archive)
  • Backbone Packet Header Traces (OC48, OC12)
  • Teragrid (science and engineering collections)
  • CHRONOS (analytical tools for chronostratigraphy)
  • ERESE (educational Earth science portal)
  • TeraBridge (Sensor stream data)
  • C5 Landscape (UCSD Art dept)

26
Integrated Data Cyberinfrastructure
coordination
integration
27
An Introduction to the
By Roman Olschanowsky roman2u_at_sdsc.edu
28
Sites Using the SRB
29
SDSC SRB Projects (60 million, 0.5 PB)
  • Digital Libraries
  • UCB, UMich, UCSB, Stanford, CDL
  • NSF NSDL - UCAR / DLESE
  • NASA Information Power Grid
  • Astronomy
  • National Virtual Observatory
  • 2MASS Project (2 Micron All Sky Survey)
  • Particle Physics
  • Particle Physics Data Grid (DOE)
  • GriPhyN
  • SLAC Synchrotron Data Repository
  • Medicine
  • Digital Embryo (NLM)
  • Earth Systems Sciences
  • ESIPS
  • LTER
  • Persistent Archives
  • NARA
  • LOC

30
SDSC Storage Systems
  • DataStar: GPFS 108 TB
  • HPSS
  • TeraGrid GPFS-WAN: 210 TB
  • 35 TB
  • TeraGrid: GPFS 51 TB
  • 6 PB tape capacity
  • SAM-QFS
  • 400 TB
  • Blue Gene (Intimidata): GPFS 40 TB
31
SRB Helps external users manage their Data
[Diagram: multiple SRB servers connected to SDSC storage (1 PB disk, 6 PB tape capacity)]
32
Storage Resource Broker (SRB)
  • A distributed file system (Data Grid)
  • Client-Server, Server-Server architecture.
  • Abstracts physical
  • SRB provides the ability to transparently share
    data across remote sites.
  • Heterogeneous Resources
  • Single sign on
  • Single logical file hierarchy

33
What we are familiar with
34
What we are not familiar with, yet
35
Interfaces to the Storage Resource Broker
  • inQ: Windows client
  • Scommands: UNIX, DOS command-line client
  • Jargon: Java API and GUI components
  • mySRB: Web client
  • Matrix: WSDL, Data Grid workflows
  • C, C++: C and C++ API
  • Python: Python API
  • Perl: Perl API

36
Common Scommands (69 total)
  • Scp
  • Smv (logical)
  • Sphymove (physical)
  • Srm
  • Smkdir
  • Srmdir
  • Serror
  • Schmod
  • Sexit
  • Sinit
  • Senv
  • Spwd
  • Sls
  • Scd
  • Sget
  • Sput
  • Ssh
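
To show how a handful of these commands might fit together, here is a minimal Python sketch that only builds and prints a plausible session; it does not run anything. The command names come from the list above, but the argument forms, the collection name, and the file names are assumptions made for illustration and should be checked against the SRB documentation.

```python
# Sketch of a typical Scommand session, expressed as argument lists.
# Only command names from the slide are used; the argument forms
# (collection name, local and remote file names) are assumptions.
import shlex


def build_session(collection, local_file, download_as):
    """Return the Scommand invocations for a simple upload/download round trip."""
    return [
        ["Sinit"],                              # start an SRB session
        ["Smkdir", collection],                 # create a collection
        ["Scd", collection],                    # make it the current collection
        ["Sput", local_file, local_file],       # upload a local file
        ["Sls"],                                # list the current collection
        ["Sget", local_file, download_as],      # download the object again
        ["Sexit"],                              # end the session
    ]


if __name__ == "__main__":
    for argv in build_session("demo-collection", "Doc.txt", "Doc-copy.txt"):
        print(shlex.join(argv))
```

Keeping each invocation as an argument list rather than a single string avoids quoting problems if the sequence is later handed to a process launcher.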

37
mySRB
38
BIRN Portal (perl based)
39
NEEScentral Portal (php based)
40
The BIRN SRB Data Grid
41
The BIRN Data Grid
42
The grid is in the details
43
File Replication
  • Sls
  • /home/Demo/SRB-Tutorial/files-2
  • Doc.txt
  • Sls -l
  • /home/Demo/SRB-Tutorial/files-2
  • romanoly 0 z-ucsd-ncmir-nas1 15 2003-07-09-05.15 Doc.txt
  • romanoly 1 z-jhu-cis-nas0 15 2003-07-09-05.16 Doc.txt
  • romanoly 2 z-stanford-lucas-nas 15 2003-07-09-05.16 Doc.txt
  • romanoly 3 z-umn-cmrr-nas0 15 2003-07-09-05.16 Doc.txt
  • romanoly 4 z-uci-bic-nas0 15 2003-07-09-05.17 Doc.txt
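
The long listing above shows one logical file, Doc.txt, with five replicas on different storage resources. The Python sketch below parses those lines into records. The column meanings (owner, replica number, resource, size, timestamp, object name) are an assumed reading of the listing, not a documented output format.

```python
# Parse the replica lines of the long listing shown on the slide.
# The column meanings (owner, replica number, resource, size,
# timestamp, object name) are an assumed reading of that listing.
from typing import NamedTuple


class Replica(NamedTuple):
    owner: str
    number: int
    resource: str
    size: int
    timestamp: str
    name: str


def parse_replica_line(line: str) -> Replica:
    """Split one whitespace-separated listing line into a Replica record."""
    owner, number, resource, size, timestamp, name = line.split()
    return Replica(owner, int(number), resource, int(size), timestamp, name)


LISTING = """\
romanoly 0 z-ucsd-ncmir-nas1 15 2003-07-09-05.15 Doc.txt
romanoly 1 z-jhu-cis-nas0 15 2003-07-09-05.16 Doc.txt
romanoly 2 z-stanford-lucas-nas 15 2003-07-09-05.16 Doc.txt
romanoly 3 z-umn-cmrr-nas0 15 2003-07-09-05.16 Doc.txt
romanoly 4 z-uci-bic-nas0 15 2003-07-09-05.17 Doc.txt
"""

replicas = [parse_replica_line(line) for line in LISTING.splitlines()]

if __name__ == "__main__":
    for r in replicas:
        print(f"replica {r.number} of {r.name} on {r.resource}")
```

Grouping the records by name would give, for each logical file, the set of resources that hold a copy.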

44
SRB Location or Slave Server
  • Physical resources: z-jhu-cis-nas0,
    z-jhu-cis-nas1, z-jhu-cis-nas2
  • Logical resource: jhu-cis-nas
45
Pooling physical resources
46
Logical / Compound Resources
  • Logical resource: My-Resource
  • Instant replication
  • Fast archival
  • Resource pooling
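
The idea of a logical resource that pools physical resources can be illustrated with a toy Python model. This is not SRB code and makes no claim about how SRB implements replication; the class and method names are hypothetical, and "store a copy on every member" is a simplifying assumption standing in for the replication shown on the slide.

```python
# Toy model of a logical (compound) resource that pools physical resources.
# This illustrates the idea on the slides only; it is not SRB code, and the
# class and method names are hypothetical.


class PhysicalResource:
    def __init__(self, name):
        self.name = name
        self.objects = {}          # logical path -> bytes

    def store(self, path, data):
        self.objects[path] = data


class LogicalResource:
    """Groups several physical resources under one name."""

    def __init__(self, name, members):
        self.name = name
        self.members = list(members)

    def put(self, path, data):
        """Store one replica on every member; return the resource names used."""
        for member in self.members:
            member.store(path, data)
        return [member.name for member in self.members]

    def get(self, path):
        """Read from the first member that holds the object."""
        for member in self.members:
            if path in member.objects:
                return member.objects[path]
        raise FileNotFoundError(path)


if __name__ == "__main__":
    pool = LogicalResource(
        "jhu-cis-nas",
        [PhysicalResource(f"z-jhu-cis-nas{i}") for i in range(3)],
    )
    print(pool.put("/home/Demo/SRB-Tutorial/files-2/Doc.txt", b"hello"))
```

Because callers address only the logical name, a copy lost on one member can still be served from another without the caller changing anything.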
47
In Conclusion
  • SRB handles large data and provides the ability
    to share and collaborate on distributed
    heterogeneous resources.
  • www.sdsc.edu/srb
  • srb_at_sdsc.edu

48
Getting an Allocation: It's Free!
  • Who should apply?
  • Open to researchers affiliated with US
    educational institutions
  • Proposals merit-reviewed quarterly by Data
    Allocations Committee
  • Types of Allocations
  • Expedited Allocations
  • 1 TB or less of disk/tape, 1st year
  • 5 GB Database 1st year
  • Yearly review
  • Medium Allocations
  • Under 30 TB
  • Large Allocations
  • Larger than 30 TB
  • Data Allocations
  • Getting Started: http://datacentral.sdsc.edu
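
The size thresholds above can be expressed as a small Python function. The thresholds come from the slide; how a request of exactly 30 TB is treated is not stated there, so this sketch assumes it counts as Large. The function name is illustrative, and the 5 GB database limit for expedited allocations is not modeled.

```python
# Classify a storage request into the allocation types listed on the slide.
# Thresholds come from the slide; how a request of exactly 30 TB is treated
# is not stated there, so this sketch assumes it counts as "Large".


def allocation_type(requested_tb: float) -> str:
    """Return "Expedited", "Medium", or "Large" for a request size in TB."""
    if requested_tb < 0:
        raise ValueError("requested size must be non-negative")
    if requested_tb <= 1:
        return "Expedited"
    if requested_tb < 30:
        return "Medium"
    return "Large"


if __name__ == "__main__":
    for size in (0.5, 12, 30, 100):
        print(f"{size} TB -> {allocation_type(size)}")
```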

49
Thank You
  • SDSC Data Resources and Allocations
  • http://datacentral.sdsc.edu/
