Extreme I/O

Transcript and Presenter's Notes


1
Extreme I/O
Phil Andrews, Director of High End Computing
Technologies San Diego Supercomputer
Center University of California, San
Diego andrews_at_sdsc.edu
2
Applications are becoming more complex; so are
their results
  • "Every code should have just one output: yes or
    no!" - Hans Bruijnes, NMFECC, 1983
  • Google search headings, 9 p.m., July 24, 2005
  • Results 1 - 100 of about 61,000 English pages
    for "importance of data". (0.69 seconds) 
  • Results 1 - 100 of about 978 English pages
    for "importance of computing". (0.38 seconds) 

3
Some Data numbers
  • Enzo (Mike Norman) can output >25 TB in a single
    run
  • Earthquake simulations can produce >50 TB
  • "The entire NVO archive will contain about 100
    terabytes of data to start, and grow to more than
    10 petabytes by 2008." - Brian Krebs

4
Computing: Data is extremely important!
5
Data has a life cycle!
  • A computation is an event; data is a living thing
    that is conceived, created, curated, consumed,
    and either deleted, archived, and/or forgotten.
  • For success in data handling, all aspects must be
    considered and facilitated from the beginning
    within an integrated infrastructure

6
SDSC TeraGrid Data Architecture
  • 1 PB disk
  • 6 PB archive
  • 1 GB/s disk-to-tape
  • Optimized support for DB2/Oracle
  • Philosophy: enable the SDSC configuration to serve
    the grid as a data center (a sizing check follows
    this list)
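A quick sizing check on those figures, using only the disk capacity and disk-to-tape rate above (a sketch in Python):

    # Time to migrate the full disk tier to tape at the quoted rate.
    disk_tb = 1000            # ~1 PB of disk
    tape_rate_gb_s = 1.0      # disk-to-tape bandwidth, GB/s

    seconds = disk_tb * 1000 / tape_rate_gb_s   # TB -> GB, then divide by rate
    print(f"full disk tier to tape: {seconds / 86400:.1f} days")
    # Roughly 11.6 days even at the full 1 GB/s, so archiving has to run
    # continuously rather than as occasional bulk copies.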

[Architecture diagram: DataStar, a 4 TF Linux cluster, BG/L, a Power4
DB node, and Database Engine, Data Miner, and Vis Engine nodes share a
LAN (multiple GbE, TCP/IP) and a SAN (2 Gb/s, SCSI; SCSI/IP or FC/IP);
storage comprises SAN GPFS disk (500 TB), FC GPFS disk (100 TB), local
disk (50 TB), and an FC disk cache (400 TB) in front of HPSS on a Sun
F15K with tape silos (6 PB, 1 GB/s disk to tape, 52 tape drives,
30 MB/s per drive, 200 MB/s per controller); a 30 Gb/s WAN connects
outward; design leveraged at other TG sites.]
7
Real Data Needs
  • Need balanced data transfers all the way from
    Memory Bandwidth to Archival Storage
  • Need memory bandwidth of better than 1 GB/s per
    Gflop/s of processor performance; ideally, two
    64-bit words per flop, i.e., 16 bytes per flop
    (a worked example follows this list)
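A worked example of that rule of thumb; the 6 Gflop/s peak used here is an illustrative figure, not a number from the talk:

    # Bandwidth-per-flop rule of thumb from the bullet above, applied to
    # a hypothetical 6 Gflop/s processor (illustrative figure only).

    def required_bandwidth_gb_s(peak_gflops, bytes_per_flop):
        """Memory bandwidth (GB/s) needed to feed peak_gflops."""
        return peak_gflops * bytes_per_flop

    peak = 6.0                                        # Gflop/s, hypothetical
    minimum = required_bandwidth_gb_s(peak, 1.0)      # 1 GB/s per Gflop/s rule
    ideal = required_bandwidth_gb_s(peak, 16.0)       # two 64-bit words per flop

    print(f"minimum balanced bandwidth: {minimum:.0f} GB/s")
    print(f"ideal (16 bytes per flop):  {ideal:.0f} GB/s")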

8
File System Needs
  • File systems must allow 10 GB/s transfer rates.
  • File systems must work across arbitrary networks
    (LAN or WAN)
  • File Systems must be closely integrated with
    Archival systems for parallel backups and/or
    automatic archiving

9
I/O subsystem needs
  • No system degradation when a disk fails (there will
    always be some down)
  • Tolerance of multiple disk failures per RAID set
    (likely with many thousands of disks per file
    system). There are over 7,000 spindles at SDSC
    (see the failure-rate sketch after this list).
  • Rapid transfers to Tape systems
  • Multi-TB tape cartridges, GB/s transfers
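Rough motivation for tolerating multiple failures per RAID set; the 3% annual failure rate per drive is an assumed, illustrative figure:

    # Expected drive failures given the spindle count from the slide and
    # an assumed annual failure rate.
    spindles = 7000          # >7,000 spindles at SDSC (from the slide)
    afr = 0.03               # assumed annual failure rate per drive

    failures_per_year = spindles * afr
    print(f"expected failures per year: {failures_per_year:.0f}")
    print(f"expected failures per week: {failures_per_year / 52:.1f}")
    # With roughly four failed drives in any given week, some RAID set is
    # almost always rebuilding, which is why surviving a second failure
    # in the same set matters.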

10
SDSC Configuration
[Configuration diagram: links to ANL / PSC / NCSA / CIT via a Juniper
T640 and two Force 10 12000 switches (16 x 1 Gb Ethernet); GridFTP,
SRB, and NFS servers; an FC-over-IP switch; two IA-64 Linux clusters,
an IBM Regatta, and a Sun Fire 15K running SAM-QFS; 11 Brocade 12000 /
Silkworm 12000 switches (1,408 2 Gb ports); SAN-GPFS, the ETF DB, and
HPSS with 32 FC tape drives; 350 Sun FC disk arrays (4,100 disks,
500 TB total).]
11
Parallel File Systems Across TeraGrid
  • General Parallel File System (GPFS)
    • High performance parallel I/O, over 10 GB/s at SDSC
    • SAN capability
    • Many redundancy features
    • Shared AIX-Linux
    • SDSC, NCSA, ANL
  • Parallel Virtual File System (PVFS)
    • Open source
    • Caltech, ANL, SDSC
  • HP Parallel File System (HP PFS)
    • Proprietary parallel file system for the TeraScale Computing
      System (Lemieux) at PSC
  • Message Passing Interface IO (MPI-IO)
    • High performance, portable, parallel I/O interface for MPI
      programs (a minimal usage sketch follows this list)
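As a minimal illustration of the MPI-IO interface listed above, a collective-write sketch using the mpi4py bindings (the file name and per-rank block size are arbitrary choices, not from the talk):

    # Minimal MPI-IO example: each rank writes its own block of a shared
    # file collectively. Run with e.g. `mpirun -n 4 python write_demo.py`.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    block = np.full(1 << 20, rank, dtype=np.uint8)    # 1 MiB per rank
    amode = MPI.MODE_WRONLY | MPI.MODE_CREATE

    fh = MPI.File.Open(comm, "testfile.dat", amode)
    fh.Write_at_all(rank * block.nbytes, block)       # collective write at offset
    fh.Close()

Each rank writes a disjoint, contiguous region, which is the large sequential pattern parallel file systems such as GPFS and PVFS are designed for.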

12
Local Data only Part of the Story
  • SDSC users are not captive; they move around
  • SDSC is the designated data lead for TeraGrid
  • Many SDSC users are part of multi-site
    collaborations, whose major intersection is via
    common data sets
  • Must extend the data reach across the USA

13
TeraGrid Network
14
Working on
  • Global file system via GPFS
  • GSI authentication for GPFS using UID/GID mapping
    to the Globus grid-mapfile (a mapping sketch
    follows this list)
  • Dedicated disk/servers for Grid Data using GPFS
    to serve data across the Grid
  • Automatic migration to Tape archives
  • Online DB2 database servers to provide remote DB
    services to Grid users
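A rough sketch of the mapping step involved: parse a Globus grid-mapfile (lines of the form "DN" localuser) and resolve each local account to a UID/GID via the password database. The path and parsing details here are illustrative assumptions, not the actual GPFS/GSI integration:

    # Hypothetical sketch: map certificate DNs from a Globus grid-mapfile
    # to local UID/GID pairs via the password database.
    import pwd
    import shlex

    def load_gridmap(path="/etc/grid-security/grid-mapfile"):
        """Return {distinguished_name: (uid, gid)} for each mapfile entry."""
        mapping = {}
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                parts = shlex.split(line)      # handles the quoted DN
                dn, local_user = parts[0], parts[1].split(",")[0]
                entry = pwd.getpwnam(local_user)
                mapping[dn] = (entry.pw_uid, entry.pw_gid)
        return mapping

    if __name__ == "__main__":
        for dn, (uid, gid) in load_gridmap().items():
            print(f"{dn} -> uid={uid} gid={gid}")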

15
Combine Resources Across TG
Per-site resources (home directory, node-local storage, scratch/staging
storage, parallel filesystem, archival system capacity):
  • ANL and Caltech: 140 GB NFS and 4 TB NFS home directories;
    70 GB/node, 64 GB/node (IA-64), and 132 GB/node (IA-32) node-local
    storage; 80 TB and 16 TB PVFS; 1.2 PB HPSS archive
  • SDSC: 2 TB NFS home; 35 GB/node local; 100 TB QFS scratch/staging;
    64 TB GPFS parallel filesystem; 6 PB HPSS / SAM-FS archive
  • PSC: 0.5 TB NFS home (TCS); 38 GB/node local (TCS); 24 TB SLASH
    scratch/staging; 30 TB PFS parallel filesystem; 4 PB DMF archive
16
TeraGrid Data Management Server
  • Sun Microsystems F15K
    • 72 x 900 MHz processors
    • 288 GB shared memory
    • 48 Fibre Channel SAN interfaces
    • Sixteen 1 Gb Ethernet interfaces
  • SAM-QFS
    • High performance parallel file systems linking directly to
      archival systems for transparent usage
    • SAM-QFS and SLASH/DMF running now
  • Storage Resource Broker
  • Archival Storage (SAM-FS)
    • Pool of storage with migration policies (like DMF)
    • 100 TB disk cache
    • 828 MB/s transfers to archive, using 23 9940B tape drives
      (see the per-drive arithmetic after this list)
  • Parallel Filesystem (QFS)
    • Concurrent R/W
    • Metadata traffic over GE
    • Data transferred directly over SAN (GPFS does this too)
    • Demonstrated 3.2 GB/s reads from a QFS file system with 30 TB
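For scale, the aggregate archive rate above divided across the drives (the 30 MB/s per-drive figure appears on the architecture slide):

    # Per-drive share of the 828 MB/s archive transfer rate quoted above.
    aggregate_mb_s = 828      # measured transfers to archive
    drives = 23               # STK 9940B tape drives in use

    print(f"~{aggregate_mb_s / drives:.0f} MB/s per drive")
    # About 36 MB/s per drive, i.e. every drive streaming at or above the
    # 30 MB/s per-drive figure on the architecture slide.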

17
What Users Want in Grid Data
  • Unlimited data capacity. We can almost do this.
  • Transparent, High Speed access anywhere on the
    Grid. We can do this.
  • Automatic Archiving and Retrieval (yep)
  • No latency. We can't do this.
  • (Measuring 60 ms round trip SDSC-NCSA; see the
    bandwidth-delay sketch after this list)
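A quick bandwidth-delay calculation using the figures from these slides shows what 60 ms means for a fast WAN path:

    # Bandwidth-delay product for the SDSC-NCSA path quoted above.
    link_gbps = 30            # TeraGrid backbone rate from the slides
    rtt_s = 0.060             # measured SDSC-NCSA round trip

    bdp_bytes = (link_gbps * 1e9 / 8) * rtt_s
    print(f"data in flight at full rate: {bdp_bytes / 1e6:.0f} MB")
    # ~225 MB must be outstanding to keep a 30 Gb/s pipe full across
    # 60 ms; small synchronous requests see the latency, not the bandwidth.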

18
How do we do this? One way
  • Large, Centralized Tape Archive at SDSC (6 PB,
    capable of 1 GB/s)
  • Large, Centralized Disk Cache at SDSC (400 TB,
    capable of 10 GB/s)
  • Local Disk Cache at remote sites for low-latency,
    High Performance file access
  • Connect all 3 in a multi-level HSM across
    TeraGrid with transparent archiving (reads and
    writes) across all 3 levels (a conceptual read
    path is sketched after this list)
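A purely conceptual sketch of the read path such a three-level HSM implies; the function and tier names are illustrative, not part of any actual SDSC software:

    # Conceptual three-tier read path: remote-site disk cache, central
    # SDSC disk cache, central tape archive. Names and logic are
    # illustrative only.

    def read_file(path, local_cache, central_cache, tape_archive):
        """Return file data, staging it up through the tiers as needed."""
        if path in local_cache:                  # low-latency hit at the remote site
            return local_cache[path]
        if path not in central_cache:            # recall from tape if necessary
            central_cache[path] = tape_archive.recall(path)
        data = central_cache[path]
        local_cache[path] = data                 # populate the remote cache
        return data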

19
Infinite Grid Storage?
  • Infinite (SDSC) storage available over the grid
  • Looks like local disk to grid sites
  • Use automatic migration with a large cache to
    keep files always online and accessible
  • Data automatically archived without user
    intervention
  • Want one pool of storage for all systems and
    functions
  • Combination of Global Parallel File System (GPFS)
    on both AIX and Linux with transparent archival
    migration would allow mounting of unlimited
    archival storage systems as local file systems
  • Users could have local parallel file system
    (highest performance, not backed up) and global
    parallel file system (integrated into HSM) both
    mounted for use
  • Need Linux, AIX clients

20
Global File Systems over WAN
  • Basis for some new Grids (DEISA)
  • User transparency (TeraGrid roaming)
  • On demand access to scientific data sets
  • Share scientific data sets and results
  • Access scientific results from geographically
    distributed instruments and sensors in real-time
  • No copying of files to here and there and there
  • What about UID, GID mapping?
  • Authentication
  • Initially use World Readable DataSets and common
    UIDs for some users. GSI coming
  • On demand Data
  • Instantly accessible and searchable
  • No need for local storage space
  • Need network bandwidth

21
SAM-FS
22
SC'02 Export of SDSC SAN across 10 Gbps WAN to
PSC booth
[Diagram: San Diego and the Baltimore show floor linked by FC/IP over
a 10 Gb IP WAN, with 8 Gb of Fibre Channel on the SAN side; the SDSC
booth and PSC booth joined by a fibre connection.]
  • Fibre Channel (FC) over IP boxes, FC over SONET
    encoding
  • Encapsulate FC frames within IP
  • Akara and Nishan, 8 Gb/s gear by Nishan Systems
  • 728 MB/s reads from disk to memory over the SAN;
    writes slightly slower (see the scaling check
    after this list)
  • 13 TB disk
  • 8 x 1 Gbps links
  • Single 1 Gb/s link gave 95 MB/s
  • Latency approx. 80 ms round trip
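A rough scaling check on those numbers, using only the figures quoted above:

    # How the SC'02 FC/IP result scales from one link to eight.
    single_link_mb_s = 95      # measured on one 1 Gb/s link
    links = 8
    aggregate_mb_s = 728       # measured reads over the full 8 Gb/s path

    ideal = single_link_mb_s * links
    print(f"ideal 8-link rate:  {ideal} MB/s")                    # 760 MB/s
    print(f"scaling efficiency: {aggregate_mb_s / ideal:.0%}")    # ~96%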

23
High Performance Grid-Enabled Data Movement with
GridFTP (Bandwidth Challenge 2003)
[Diagram: SDSC (128 dual 1.3 GHz Madison-processor nodes, 77 TB GPFS
on SAN) connected through the L.A. hub, the TeraGrid network, SCinet,
and the booth hub to the SDSC SC booth (40 dual 1.5 GHz
Madison-processor nodes, 40 TB GPFS on SAN), with Gigabit Ethernet and
Myrinet at each end, moving Southern California Earthquake Center data
for the Scalable Visualization Toolkit.]
24
Access to GPFS File Systems over the WAN
  • Goal: sharing GPFS file systems over the WAN
  • WAN adds 10-60 ms latency
    • but under load, storage latency is much higher than this anyway!
    • typical supercomputing I/O patterns are latency tolerant (large
      sequential reads/writes; see the sketch at the end of this slide)
  • New GPFS feature
    • GPFS NSD now allows both SAN and IP access to storage
    • SAN-attached nodes go direct
    • Non-SAN nodes use NSD over IP
  • Work in progress
    • Technology demo at SC03
    • Work toward possible product release

Roger Haskin, IBM
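To see why large sequential transfers tolerate WAN latency, a small model of strictly synchronous requests over a 60 ms round trip; the request sizes and the 100 MB/s per-stream service rate are illustrative assumptions:

    # Effective throughput when only one request is outstanding at a
    # time, so every request pays the full WAN round trip. The 100 MB/s
    # per-stream rate and the request sizes are assumptions.

    def effective_mb_s(request_mb, rtt_s, stream_mb_s=100.0):
        """Throughput of back-to-back synchronous requests of request_mb."""
        return request_mb / (request_mb / stream_mb_s + rtt_s)

    for size_mb in (0.064, 1, 16, 256):          # 64 KB up to 256 MB
        rate = effective_mb_s(size_mb, rtt_s=0.060)
        print(f"{size_mb:>8} MB requests -> {rate:6.1f} MB/s")
    # Small requests are dominated by the 60 ms RTT; multi-MB sequential
    # requests approach the per-stream rate, which is the GPFS-over-WAN case.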
25
On Demand File Access over the Wide Area with
GPFS (Bandwidth Challenge 2003)
[Diagram: SDSC (128 dual 1.3 GHz Madison-processor nodes, 77 TB GPFS,
16 Network Shared Disk servers) connected through the L.A. hub, the
TeraGrid network, SCinet, and the booth hub to the SDSC SC booth
(40 dual 1.5 GHz Madison-processor nodes, GPFS mounted over the WAN,
no duplication of data), with Gigabit Ethernet and Myrinet at each end
and a SAN at SDSC, serving Southern California Earthquake Center data
to the Scalable Visualization Toolkit.]
26
Global TG GPFS over 10 Gb/s WAN (SC03 Bandwidth
Challenge Winner)
27
GridFTP Across 10 Gb/s WAN (SC 03 Bandwidth
Challenge Winner)
28
PSC's DMF HSM is interfaced to SDSC's tape
archival system
  • Used FC/IP encoding via WAN-SAN to attach 6 SDSC
    tape drives to PSC's DMF archival system
  • Approximately 19 MB/s aggregate to tapes at first
    try (per-drive arithmetic below)
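Per drive, that first-try figure sits well below the native streaming rate quoted on the architecture slide:

    # Per-drive share of the first-try WAN-attached tape rate above.
    aggregate_mb_s = 19       # measured aggregate to SDSC tapes from PSC
    drives = 6                # SDSC tape drives attached via FC/IP

    print(f"~{aggregate_mb_s / drives:.1f} MB/s per drive, "
          f"vs ~30 MB/s native streaming per drive")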

29
StorCloud Demo SC04
  • StorCloud 2004
  • Major initiative at SC2004 to highlight the use
    of storage area networking in high-performance
    computing
  • 1 PB of storage from major vendors for use by
    SC04 exhibitors
  • StorCloud Challenge competition to award entrants
    that best demonstrate the use of storage (similar
    to Bandwidth Challenge)
  • SDSC-IBM StorCloud Challenge
  • A workflow demo that highlights multiple
    computation sites on a grid sharing storage at a
    storage siteIBM computing and storage hardware
    and the 30 Gb/s communications backbone of the
    Teragrid

[Photos: 40-node GPFS server cluster; installing the DS4300 storage;
IBM DS4300 storage in the StorCloud booth]
30
SC 04 Demo IBM-SDSC-NCSA
[Diagram: the SC 04 SDSC booth (4 racks of nodes plus 1 rack of
networking: 40 dual 1.3 GHz Itanium2 Linux nodes, 1 GE and 3 FC per
node, acting as GPFS NSD servers exporting /gpfs-sc04) connects
through SCinet, L.A., and the TeraGrid network to SDSC's DataStar
(176 8-way Power4 p655 AIX nodes mounting /gpfs-sc04 and 11 32-way
Power4 p690 AIX nodes, possibly 7 with 10 GE adapters, on a Federation
SP switch, Gigabit Ethernet, and a Brocade SAN switch) and to the
StorCloud booth (15 racks of disks plus 1.5 racks of SAN switch:
3 Brocade 24000 switches, 128 ports each, 360 ports used; 181 TB raw
FastT600 disk, 4 controllers per rack, 4 FC per controller, 240 FC
total from disks, 2 Ethernet ports per controller, 120 Ethernet ports
total).]
31
SC 04 Demo IBM-SDSC-NCSA
  • Nodes scheduled using GUR
  • ENZO computation on DataStar, output written to
    StorCloud GPFS served by nodes in SDSC's SC 04
    booth
  • Visualization performed at NCSA using StorCloud
    GPFS and displayed to the showroom floor

[Diagram: DataStar at SDSC (176 8-way Power4 p655 AIX nodes and 7
32-way Power4 p690 AIX nodes with 10 GE adapters, /gpfs-sc04 mounted,
Federation SP switch, Brocade SAN switch) connects over the TeraGrid
network through L.A., Chicago, and SCinet (10 Gigabit and Gigabit
Ethernet) to the SC 04 SDSC booth (40 dual 1.3 GHz Itanium2 Linux
nodes acting as GPFS NSD servers), the SC 04 StorCloud booth (160 TB
FastT600 disk, 15 racks, 2 controllers per rack, 3 Brocade 24000
switches), and NCSA (40 dual 1.5 GHz Itanium2 nodes for visualization,
/gpfs-sc04 mounted).]
32
WAN-GFS POC continued at SC04
33
SDSC now serving 0.5 PB GFS disk
  • Initially served across TeraGrid and mounted by
    ANL and NCSA
  • Plan to start hosting large datasets for the
    scientific community
  • One of the first will be NVO: 50 TB of night-sky
    information, a read-only dataset available for
    computation across TeraGrid
  • Extend rapidly with other datasets
  • Hoping for 1 PB soon

34
Global File System For the TeraGrid
[Diagram: 0.5 PB of FastT100 storage behind IA-64 GPFS servers at
SDSC, attached to DataStar (11 TFlops), TeraGrid Linux (3 TFlops), and
BG/L (6 TFlops, 128 I/O nodes), and exported through a Force 10 12000
and a Juniper T640 onto the TeraGrid network (30 Gb/s to Los Angeles)
to NCSA, PSC, and ANL; the parallel file system is exported from San
Diego and mounted at all TeraGrid sites.]