Title: NPACI All Hands Meeting 2002 User Feedback Session
1. NPACI All Hands Meeting 2002: User Feedback Session
- Session Chair: Dr. Jay Boisseau, TACC
- Speakers: Dr. Kent Milfeld, TACC
- Dr. Bill Martin, U. Michigan
- Don Frederick, SDSC
- Friday, March 8, 2002
2. Organization of HPC Resources at TACC
Kent Milfeld, milfeld@tacc.utexas.edu
University of Texas at Austin
Texas Advanced Computing Center
3. TACC Resources
Allocatable Machines
- SV1, Vector Processing
- T3E, MPP (RISC) Processing
- April: Regatta (Power4)
Infrastructure
- Storage Archive (GigE machine room network)
- IA-32/64 Linux Cluster Computing (NPACI Rocks)
- Integrated Programming Environment (modules; looks and feels like an NPACI machine)
- Visualization Facilities (world class)
- Mini TeraGrid (12-mile OC-48 network experiment next month)
- Grid Team: Development, Support, and Portals
- NPACI/TACC Training and Consulting
4. NPACI-TACC Resources
SV1: 16 CPUs, 16 GB memory (Vector)
- Memory bandwidth
- Gather/scatter
- 300 MHz x 4 flop/CP = 1200 MFLOPS
- OpenMP, MPI (see the sketch after this list)
T3E (RISC)
- RISC with streams
- Cache
- 300 MHz x 2 flop/CP = 600 MFLOPS
- High-speed interconnect
- MPI
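The OpenMP and MPI bullets above are the programming models for these machines: OpenMP shares work among the CPUs of one node, MPI passes messages between nodes. The sketch below is a minimal, hypothetical C illustration of combining the two, not code from the slides; the compile command (an MPI wrapper plus an OpenMP flag such as -fopenmp or -qsmp=omp) depends on the system.

/* Hypothetical sketch: each MPI rank sums part of a vector with an
 * OpenMP-parallel loop, then the partial sums are combined with MPI. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;
    /* OpenMP: threads within one SMP node share this loop. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < 1000000; i++)
        local += 1.0;

    /* MPI: combine the per-rank partial sums across nodes. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks x %d threads, total = %g\n",
               nprocs, omp_get_max_threads(), total);

    MPI_Finalize();
    return 0;
}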
5. TACC Regatta HPC: longhorn.tacc.utexas.edu
- 64 IBM Power4 1.3 GHz processors
- Arranged as 4 x 16-way SMPs (now)
- 32 GB memory/node (128 GB total)
- 1 TB disk
- 1/3 TFLOPS peak (see the arithmetic sketch below)
- Early Summer: interconnected by a high-speed switch (1-2 GB/sec point-to-point, theoretical)
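The 1/3 TFLOPS peak figure is just clock rate x floating-point operations per cycle x processor count. The short C check below assumes 4 flops per cycle per Power4 core (two fused multiply-add units); that per-core rate is my assumption, not something stated on the slide.

/* Back-of-the-envelope check of the "1/3 TFLOPS (peak)" figure above. */
#include <stdio.h>

int main(void)
{
    const double clock_hz      = 1.3e9; /* 1.3 GHz Power4 */
    const int    flops_per_cyc = 4;     /* assumption: 2 FMA units x 2 flops each */
    const int    cpus          = 64;

    double peak = clock_hz * flops_per_cyc * cpus; /* about 3.3e11 flop/s */
    printf("Peak: %.0f GFLOPS (~%.2f TFLOPS)\n", peak / 1e9, peak / 1e12);
    return 0;
}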
6. TACC Regatta HPC
7. Storage Robot: petabyte capacity
8. IBM p690 HPC Design Configuration
- 135 watts/die x 4: HOT!!!
9. TACC Visualization Lab
- SGI Onyx2
- 24 CPUs, 6 Infinite Reality 2 Graphics Pipelines
- 24 GB Memory, 1TB Disk
- Front and Rear Projection Systems
- 3x1 cylindrically-symmetric Power Wall
- 5x2 large-screen 16:9-panel Power Wall
10. Power4
11. TACC IA-32 System: 64 Compute Processors
- 32 compute nodes: 2-way SMPs, 1 GHz Pentium III (IBM x330), 1 GB memory/node, 18 GB local disk per node
- Filesystems: /home (20 GB), /work (20 GB), and /gpfs (3/4 TB) served by 2 GPFS nodes (IBM x340); see the MPI-IO sketch below
- Login node (x340), reached over 100Base-T (12.5 MB/sec)
- Networks: GigE switch (125 MB/sec, 32 lines) and Myrinet through an M3-SW16 switch (250 MB/sec, 32 lines)
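A shared parallel filesystem such as the /gpfs space above is commonly written through MPI-IO so that all ranks contribute to one file. The sketch below is a hypothetical minimal example: the path under /gpfs and the block size are my placeholders, not details from the slide.

/* Hypothetical sketch: each MPI rank writes its own contiguous block of a
 * shared file on the GPFS filesystem via MPI-IO. */
#include <mpi.h>

#define N 1024 /* doubles per rank (arbitrary) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[N];
    for (int i = 0; i < N; i++)
        buf[i] = rank + i * 1e-6; /* rank-specific test data */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "/gpfs/scratch/example.dat", /* placeholder path */
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Rank r writes at offset r * N doubles, so blocks do not overlap. */
    MPI_Offset offset = (MPI_Offset)rank * N * sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, N, MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}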
12. TACC IA-64 System: 40 Compute Processors
- 20 compute nodes: 2-way SMPs, 800 MHz Intel Itanium (IBM x380), 2 GB memory/node, 32 GB local disk per node
- Filesystems: /home (23 GB), /work (140 GB), and /gpfs (size TBD) served by 2 GPFS nodes (IBM x340)
- Login node; Fast Ethernet (100Base-T, 12.5 MB/sec) and a GigE switch (125 MB/sec, 20 lines)
- Myrinet through an M3-SW16 switch (250 MB/sec, 20 lines); delivered bandwidth can be checked with a ping-pong test (see the sketch below)
- Available late spring 2002
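The GigE and Myrinet numbers in these diagrams are peak link rates; what MPI actually delivers between two nodes is usually measured with a ping-pong test like the hypothetical sketch below (the 1 MB message size and repetition count are arbitrary choices, not slide content).

/* Hypothetical sketch: measure point-to-point MPI bandwidth between two
 * nodes (e.g., over the Myrinet or GigE links above). Run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int nbytes = 1 << 20; /* 1 MB message */
    const int reps   = 100;

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = calloc(nbytes, 1);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double dt = MPI_Wtime() - t0;
    if (rank == 0) /* two messages of nbytes per repetition */
        printf("Bandwidth: %.1f MB/sec\n", 2.0 * reps * nbytes / dt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}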
13. User Feedback Session, AHM 2002
- Bill Martin
- Director, NPACI Midrange Site
- Director, Center for Advanced Computing
- University of Michigan
- March 8, 2002
14. Who we are ...
- Tom Hacker, head of Systems Support team
- Rod Mach
- Matt Britt
- Abhijit Bose, head of User Support team
- Randy Crawford
- David Woodcock
- Contributing faculty:
- Quentin Stout (EECS)
- John Volakis (EECS)
- Linda Abriola (Civil and Environmental Engineering)
15. The UM Mid-Range Site: operate and maintain HPC equipment
- 112 cpu SP2 (160 MHz) system, including a 64 cpu SP2 from SDSC; soon to be 176 nodes with the 64 cpu SP2 from SDSC (via Texas)
- 24 cpu (3 8-way nodes) Nighthawk (375 MHz) system; will add 4 interactive nodes soon
- Built and operated a 100 cpu (soon to be 128 cpu) Intel cluster (Pentium III) during the past year
- Operate mass store system (Timberwolf/Tivoli)
16. Systems support: local and distributed
- Three full-time staff
- Operate and self-maintain all IBM equipment
- Developed a joint job submission system with Texas and their SP2 for NPACI allocations
- Participate on the development team for SRB (ported to Tivoli)
- Use SRB for the Visible Human Project
17. User support and expert consultation
- Three full-time user support staff (2 PhDs, 1 MS)
- Assist in the NPACI 800 hotline (260 Remedy tickets in 2001) for all NPACI platforms, including data resources
- Work at the algorithm and numerical methods level
- Monte Carlo photon cancer treatment therapy (Y. Dewarja)
- Gene sequence alignment and optimization (R. Goldstein)
- Environmental remediation simulator (MISER) for EPA
- More demand than capacity for user support
- Absolutely critical for effective utilization of parallel systems; recall the quote by Charlie Catlett yesterday: "User support, user support, user support"
18. Workshops and Distance Training
- Developed several web-based modules for parallel computing
- Using the UM SP2 system
- Domain decomposition
- OpenMP
- Parallel Object-Oriented Programming
- Linux Clusters
- Parallel computing workshops (at Michigan)
- Fall NPACI Workshops (2x): 106 signed up, 87 attended
- Summer parallelization workshop: 42 attendees
19. Michigan and Texas collaboration has yielded an improved user interface
- Co-scheduling SP2 systems (one virtual SP2 system) with a single queue (LoadLeveler); enables load balancing between sites
- Shared file space (single AFS cell)
- Data-intensive computing infrastructure (SRB, AFS)
- Coordinated account management and accounting systems
- May be viewed as developing, testing, and deploying prototype Grid technologies in a production environment
20. New high-end cluster at Michigan ...
- 256 node AMD cluster (Athlon, 32-bit)
- 1.55 GHz, 1 GB/cpu, Myrinet 2000
- Assembled by Atipa; the first installment (100 CPUs) is now operational
- Partnering with other UM research groups to increase the size to > 500 cpu
- Will exceed 2 teraflops peak
- Allocatable NPACI resource: 2/3 of system
21. (No transcript)
22. AHM 02 - NPACI User Feedback Session
SDSC Current and Future Resources
Donald Frederick, Scientific Computing Department
858-534-5020, frederik@sdsc.edu
23. Current SDSC Resources - 2002
24. SDSC TeraGrid System: Future Resource 2003
- ANL: 1 TF, 0.25 TB memory, 25 TB disk
- Caltech: 0.5 TF, 0.4 TB memory, 86 TB disk
- NCSA: 6+2 TF, 4 TB memory, 240 TB disk
- SDSC: 4.1 TF, 2 TB memory, 225 TB SAN
- Chicago and LA DTF core switch/routers: Cisco 65xx Catalyst switch (256 Gb/s crossbar)
- External networks: vBNS, Abilene, CalREN, ESnet (OC-12 and OC-3 links)
- SDSC site detail: HPSS (300 TB), Myrinet, 1176-processor IBM SP Blue Horizon (1.7 TFLOPs), Sun server, 2 x Sun E10K
25. All TeraGrid Sites Have Focal Points
- SDSC: The TeraGrid Data Place
- Large-scale and high-performance data analysis/handling
- Every cluster node is directly attached to the SAN
26. Basic Cluster Components
- Systems: actual HW configuration not settled
- IA-64 McKinley-based IBM node is the candidate CPU
- 2-3 GB memory/CPU
- Connectivity
- Gigabit Ethernet in every node (multiple?)
- Myrinet network in every node (multiple?)
- Storage
- Local disk (> 73 GB)
- Access to large secondary and tertiary storage
- Primarily Open Source software stack
- Linux, cluster software, Grid software
- Proprietary where it makes sense (compilers, debugger, etc.)
27. TeraGrid Storage
- Storage in 4 flavors
- Local node: up to 91 GB/node (2 SCSI drives/node)
- Secondary storage at each site: 0.6 PB across the sites, locally accessible at each site
- Secondary storage from remote sites
- Metadata management requires serious effort
- Data location and replication with SRB
- Unique SDSC configuration with a dedicated Sun Starcat server
- Expected to be a major use of the WAN
- Tertiary storage at each site: locally accessible, needs to be integrated with TeraGrid