1
MURI Hardware Resources
  • Ray Garcia
  • Erik Olson

Space Science and Engineering Center at the
University of Wisconsin-Madison
2
Resources for Researchers
  • CPU cycles
  • Memory
  • Storage space
  • Network
  • Software
  • Compilers
  • Models
  • Visualization programs

3
Original MURI hardware
  • 16 PIII processors
  • Storage server with 0.5 TB
  • Gigabit networking
  • Purpose:
  • Provide a working environment for collaborative
    development.
  • Enable running of the large multiprocessor MM5
    model.
  • Gain experience working with clustered systems.

4
Capabilities and Limitations
  • Successfully ran initial MM5 model runs,
    algorithm development (the fast model), and
    modeling of GIFTS optics (FTS simulator).
  • MM5 model runs for 140 by 140 domains; one 270
    by 270 run with very limited time steps.
  • OpenPBS system scheduling hundreds of jobs (a
    submission sketch follows this slide).
  • Idle CPU time given to FDTD raytracing.
  • Expanded to 28 processors using funding from B.
    Baum, IPO, and others.
  • However, MM5 runtime limited domain size, and
    storage space limited the number of output time
    steps.
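
A minimal sketch of how a sweep of parameterized jobs might be
queued through OpenPBS. qsub and its -q/-N/-l/-v options are
standard OpenPBS, but the queue name, run script, and parameter
sweep here are hypothetical, not the actual MURI setup.

```python
#!/usr/bin/env python
"""Queue a sweep of parameterized runs through OpenPBS (sketch)."""
import subprocess

QUEUE = "workq"                # hypothetical queue name
RUN_SCRIPT = "./run_case.sh"   # hypothetical per-job run script

def submit(case_id):
    """Submit one job; returns the PBS job identifier."""
    result = subprocess.run(
        ["qsub",
         "-q", QUEUE,
         "-N", "case-%04d" % case_id,   # job name shown by qstat
         "-l", "nodes=1:ppn=2",         # one dual-CPU node per job
         "-v", "CASE_ID=%d" % case_id,  # pass the case to the script
         RUN_SCRIPT],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    # Queue a few hundred cases; the scheduler soaks up idle CPUs.
    for case in range(300):
        print(submit(case))
```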

5
CY2003 Upgrade
  • NASA provided funding for 11 dual-Pentium 4
    nodes
  • 4 GB DDR RAM
  • 2.4 GHz CPUs
  • Purchased expressly for running large IHOP field
    program simulations (400 by 400 grid point
    domain).

6
Cluster Mark 2
  • Gains:
  • Larger-scale model runs and instrument
    simulations as needed for IHOP
  • Terabytes of experimental and simulation data
    online through NAS-hosted RAID arrays
  • Limitations to further work at even larger scale:
  • Interconnect limitations slowed large model runs
  • 32-bit memory limitation on huge model set-up
    jobs for MM5 and WRF (see the rough arithmetic
    after this slide)
  • Increasing number of small storage arrays
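
A rough, illustrative calculation of why 32-bit addressing caps
domain size; the field count (~40 three-dimensional
single-precision fields) and level count (~35) are assumptions
for illustration, not MM5/WRF specifics:

```latex
% Illustrative only: ~40 3-D single-precision fields on ~35 levels.
\[
  400 \times 400 \times 35 \times 40 \times 4\,\mathrm{B}
    \approx 0.9\,\mathrm{GB}
  \qquad
  1600 \times 1600 \times 35 \times 40 \times 4\,\mathrm{B}
    \approx 14\,\mathrm{GB}
\]
```

A 32-bit process has at most 4 GB of address space (roughly 3 GB
usable on Linux), so the first case fits while the second cannot,
no matter how much RAM is installed.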

7
3 Years of Cluster Work
  • Inexpensive:
  • Adding CPUs to the system
  • Costly:
  • Adding users to the system
  • Adding storage to the system
  • Easily understood:
  • Matlab
  • Not so well understood:
  • Distributed system (computing, storage)
    capabilities

8
Along comes DURIP
  • H.-L. Huang / R. Garcia DURIP proposal awarded
    May 2004.
  • Purpose: Provide hardware for next-generation
    research and education programs.
  • Scope: Identify computing and storage systems to
    serve the need to expand simulation, algorithm
    research, data assimilation, and limited
    operational product generation experiments.

9
Selecting Computing Hardware
  • Cluster options for numerical modeling were
    evaluated and found to require significant time
    investment.
  • Purchased an SGI Altix in fall 2004 after
    extensive test runs with WRF and MM5.
  • 24 Itanium 2 processors running Linux
  • 192 GB of RAM
  • 5 TB of FC/SATA disk
  • Recently upgraded to 32 CPUs and 10 TB of
    storage.

10
SGI Altix Capabilities
  • Large, contiguous RAM allows a 1600 by 1600 grid
    point domain (greater than the CONUS area at 4
    km resolution).
  • Largest run so far is 1070 by 1070.
  • NUMAlink interconnect provides fast turnaround
    for model runs.
  • Presents itself as a single 32-CPU Linux machine
    (see the sketch after this slide).
  • Intel compilers for ease of porting and
    optimizing Fortran/C on 32-bit and 64-bit
    hardware.
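
Because the Altix appears to the operating system as one 32-CPU
Linux host, ordinary shared-memory tooling applies with no MPI or
cluster middleware. A minimal sketch in Python; the per-tile work
function is a stand-in, not model code:

```python
"""Sketch: on a single-system-image host such as the Altix, a
plain process pool sees every CPU; no MPI or batch fabric needed."""
from multiprocessing import Pool, cpu_count

def process_tile(tile_id):
    """Stand-in for per-tile numerical work, not real model code."""
    return float(sum(i * i for i in range(50000)))

if __name__ == "__main__":
    # cpu_count() reports every CPU because the OS sees one machine
    # (32 on the upgraded Altix).
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(process_tile, range(1024))
    print("processed %d tiles on %d CPUs" % (len(results), cpu_count()))
```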

11
Storage Class: Home Directory
  • Small size, for source code (preferably also
    held under CVS control) and critical documents
  • Nightly incremental backups (a sketch follows
    this slide)
  • Quota enforcement
  • Current implementation:
  • Local disks on the cluster head
  • Backups handled by the Technical Computing (TC)
    group
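
One conventional way to implement the nightly incrementals; the
paths are hypothetical, and rsync's --link-dest option stands in
for whatever mechanism TC actually uses:

```python
#!/usr/bin/env python
"""Nightly incremental home-directory backup (sketch; run from
cron). Paths are hypothetical; --link-dest hard-links files
unchanged since the previous night's snapshot, so each night
looks like a full tree but costs only the delta."""
import subprocess
from datetime import date, timedelta

SRC = "/home/"             # hypothetical source tree
DEST = "/backup/home"      # hypothetical backup root

today = date.today().isoformat()
yesterday = (date.today() - timedelta(days=1)).isoformat()

subprocess.run(
    ["rsync", "-a", "--delete",
     "--link-dest=%s/%s" % (DEST, yesterday),  # reuse unchanged files
     SRC, "%s/%s" % (DEST, today)],
    check=True)
```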

12
Storage Class: Workspace
  • Optimized for speed
  • Automatic flushing of unused files (a sketch
    follows this slide)
  • No insurance against disk failure
  • Users expected to move important results to
    long-term storage
  • Current implementation:
  • RAID5 or RAID0 drive arrays within the cluster
    systems
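
A minimal sketch of the automatic flushing policy; the mount
point and idle threshold are hypothetical, and the age test uses
last-access time rather than modification time:

```python
#!/usr/bin/env python
"""Flush workspace files untouched for N days (sketch; run from
cron). The mount point and retention window are hypothetical."""
import os
import time

WORKSPACE = "/workspace"   # hypothetical scratch mount
MAX_IDLE_DAYS = 30         # hypothetical retention window

cutoff = time.time() - MAX_IDLE_DAYS * 86400
for dirpath, dirnames, filenames in os.walk(WORKSPACE):
    for name in filenames:
        path = os.path.join(dirpath, name)
        try:
            if os.stat(path).st_atime < cutoff:  # last access, not mtime
                os.remove(path)
        except OSError:
            pass  # file vanished or permission denied; skip it
```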

13
Storage Class: Long-term
  • Large amount of space
  • Redundant, preferably backed up to tape
  • Managed directory system, preferably with
    metadata
  • Current implementation:
  • Many project-owned NAS devices with partial
    redundancy (RAID5)
  • NFS spaghetti
  • Ad hoc tape backup

14
DURIP Phase 2 Storage
  • Long-term storage scaling and management goals:
  • Reduce or eliminate NFS spaghetti
  • Include hardware phase-in / phase-out strategy in
    purchase decision
  • Acquire the hardware to seed a Storage Area
    Network (SAN) in the Data Center, improving
    uniformity and scalability
  • Reduce overhead costs (principally human time)
  • Work closely with Technical Computing group on
    system setup and operations for a long-term
    facility

15
Immediate Options
  • Red Hat GFS
  • Size limitations and hardware/software
    mix-and-match issues; support costs offset the
    free source code.
  • HP Lustre
  • More likely to be a candidate for workspace.
    Expensive.
  • SDSC SRB (Storage Resource Broker)
  • Stability, documentation, and maturity at the
    time of testing were found to be inadequate.
  • Apple Xsan
  • Plays well with third-party storage hardware.
    Straightforward to configure and maintain.
    Affordable.

16
Dataset Storage Purchase Plan
  • 64-bit storage servers and a metadata server
  • QLogic Fibre Channel switch
  • Moves data between hosts and drive arrays
  • SAN software to provide a distributed filesystem
  • Focusing on Apple Xsan for a 1-3 year span
  • Follow up with a one-year assessment, with the
    option of re-competing
  • Storage arrays:
  • Competing Apple XRAID and Western Scientific
    Tornado

17
Target System for 2006
  • Scalable dataset storage accessible from
    clusters, workstations, and supercomputer
  • Backup strategy
  • Update existing cluster nodes to ROCKS
  • Simplifies management and improves uniformity
  • Proven on other clusters deployed by SSEC
  • Retire/repurpose slower cluster nodes
  • Reduce bottlenecks to workspace disk
  • Improve ease of use and understanding

18
Long-term Goals
  • 64-bit shared-memory system scaled to huge job
    requirements (the Altix)
  • Complementary compute farm migrating to x86-64
    (Opteron) hardware
  • Improved workspace performance
  • Scalable storage with full metadata for long-term
    and published datasets
  • Software development tools for multiprocessor
    algorithm development