Clusters to Supercomputers

Transcript and Presenter's Notes
1
Clusters to Supercomputers
Schenk's System Administration
April 2008
  • Matthew Woitaszek
  • University of Colorado, Boulder
  • NCAR Computer Science Section
  • mattheww@ucar.edu

Presented for Chris Schenk's CSCI 4113 Unix
System Administration
2
We're Hiring!
  • System Administrators
  • Software Developers
  • Web Technology Geeks
  • Nobody does server-side refresh anymore
  • Job positions at NCAR and CU
  • Full-time, part-time, and occasional

3
Outline
  • Motivation
  • My other computer is a
  • Parallel Computing
  • Processors
  • Networks
  • Storage
  • Software
  • Grid Computing
  • Software
  • Platforms

4
James Demmel's Reasons for HPC
  • Traditional scientific and engineering paradigm
  • Do theory or paper design
  • Perform experiments or build system
  • Replacing both by numerical experiments
  • Real phenomena are too complicated to model by
    hand
  • Real experiments are
  • too hard, e.g., build large wind tunnels
  • too expensive, e.g., build a throw-away passenger
    jet
  • too slow, e.g., wait for climate or galactic
    evolution
  • too dangerous, e.g., weapons, drug design

5
Time-Critical Simulations
Realtime computing: model and simulate to predict.
The forecast simulation has to be done before the
weather happens.
  • NCAR's time-critical HPC simulations
  • Mesoscale meteorology
  • Global climate
  • My favorite: traffic simulations
  • Require more than a single processor to complete
    in a reasonable amount of time

6
Performance: Vector vs. Parallel MM5 (1999)
J. Dorband, J. Kouatchou, J. Michalakes, and U.
Ranawake, Implementing MM5 on NASA Goddard Space
Flight Center computing systems: a performance
study, 1999.
7
Performance: POP 640x768 (2003)
POP on Xeon: memory bandwidth limit
M. Woitaszek, M. Oberg, and H. M. Tufo,
Comparing Linux Clusters for the Community
Climate Systems Model, 2003.
8
Realtime Computing by Parallelization
Minneapolis I-494 Highway Simulation (1997)
Traffic flow simulation: 15.5 miles, 17 exit
ramps, 19 entry ramps, Δt = 0.5 s, Δd = 100 ft,
T = 24 hr
9
Performance: Realtime Computing by Parallelization
  • Legacy Code and Hardware
  • Intel P133, single processor
  • 65.7 minutes (3942 seconds) simulation time
  • Massively Parallel Implementation
  • Cray T3E, 450 MHz Alpha 21164
  • 67.04 seconds with 1 PE (60x faster than the P133)
  • 6.26 seconds with 16 PEs (629x)
  • 2.39 seconds with 256 PEs (1649x)
  • Wait! If it takes about a minute on 1 PE, shouldn't
    it take 67 seconds / 16 ≈ 4.2 s on 16 PEs, or
    67 seconds / 256 ≈ 0.26 s on 256 PEs?

C. Johnston and A. Chronopoulos, The
parallelization of a highway traffic flow
simulation, 1999.
10
Speedup and Overhead
  • Amdahl's Law (see the formula below)
  • The part you don't optimize comes back to haunt
    you!
  • Speedup is limited by
  • Memory latency
  • Disk I/O bottlenecks
  • Network bandwidth and latency
  • Algorithm
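For reference, Amdahl's law in its standard form (a general statement of the law, not taken from the slides): if a fraction p of the work parallelizes perfectly across N processors and the rest is serial, then

    S(N) = \frac{1}{(1 - p) + p / N}, \qquad S(N) \to \frac{1}{1 - p} \text{ as } N \to \infty

Even a small serial fraction caps the achievable speedup, which is why the traffic simulation on 256 PEs ran about 1649x faster than the P133 but only about 28x faster than a single T3E PE.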

11
Performance: HOMME on BlueGene/L (2007)
G. Bhanot, J.M. Dennis, J. Edwards, W. Grabowski,
M. Gupta, K. Jordan, R.D. Loft, J. Sexton, A.
St-Cyr, S.J. Thomas, H.M. Tufo, T. Voran, R.
Walkup, and A.A. Wyszogrodski, "Early Experiences
with the 360TF IBM BlueGene/L Platform,"
International Journal of Computational Methods,
2006.
12
Performance: HOMME on BlueGene/L (2007)
13
Outline
  • Motivation
  • My other computer is a
  • Parallel Computing
  • Processors
  • Networks
  • Storage
  • Software
  • Grid Computing
  • Software
  • Platforms

14
Processors
Source: Gregory Pfister's In Search of Clusters
15
Parallel Architectures
A processor: a single CPU with its RAM.
16
Parallel Architectures
A cache-coherent non-uniform memory access (ccNUMA)
distributed shared memory cluster of chip-multiprocessor
(CMP) symmetric multiprocessors (SMPs).
[Diagram: two SMP nodes, each with two CPUs, per-CPU L1
caches, a shared Ln cache, and local RAM]
17
Scalable Parallel Architectures
64 racks in system
32 node cards per rack
32 chips per node card
Two chips per card
Two CPUs per chip
  • Emerging massively parallel architectures
  • IBM BlueGene/L: 65,536 chips (131,072 processors,
    two per chip)
  • Multi-core commodity architectures
  • AMD Opteron, now Intel

Source: IBM
18
Networks
  • Network types
  • Message passing (MPI)
  • File system
  • Job control
  • System monitoring
  • Technologies and Competitors
  • 1Gbps Ethernet and RDMA
  • 10Gbps Ethernet
  • Fixed Topology (3D Torus, Tree, Scali, etc.)
  • Switched (Infiniband, Myrinet)

19
Gigabit Ethernet Performance (2006)
[Chart: throughput comparison of RDMA, legacy, and motherboard NICs]
  • RDMA has the highest throughput (switched
    configuration): 110 MB/s RDMA, 66 MB/s legacy,
    45 MB/s motherboard

M. Oberg, H. M. Tufo, T. Voran, and M. Woitaszek,
Evaluation of RDMA Over Ethernet Technology for
Building Cost Effective Linux Clusters, May
2006.
20
RDMA for High-Performance Applications
  • Single network interface for all communications
  • RDMA for MPI, DAPL (Direct Access Programming
    Library), and Sockets Direct Protocol (SDP)
  • RDMA bypasses operating system kernel
  • Legacy interface for standard operating system
    TCP/IP

[Diagram: on each host, the user-space application reaches the
RDMA NIC either directly (RDMA path) or through the OS kernel
(legacy TCP/IP path)]
  • Zero-copy, interrupt-free RDMA for MPI
    applications

21
Interconnect Performance (2006)
Atoll benchmarking results vs. manufacturers' ratings
(note: 1.5 Mbps ≈ 192 KB/s and 1 Gbps = 125 MB/s)
22
10Gbps Ethernet Performance (2007)
  • Ethernet approaches 10Gbps (and can be trunked!)
  • InfiniBand (4x) reported at 8 Gbps sustainable

23
Storage
Archival Storage: Tape silo systems (3 PB)
Supercomputers and local working storage (1-100
TB per system)
Archive Management and disk cache controller
Visualization Systems and local working storage
Grid Gateway: GridFTP Servers
Shared Storage Cluster with shared file
system (100-500 TB)
24
Thousands of Disks
25
The Single Server Limitation
[Diagram: a single file server, first dedicated to one user, then
shared with others]
Aggregate bandwidth decreases with increasing
concurrent use!
26
Cluster File Systems Read Rate (2005)
27
Cluster File Systems Write Rate (2005)
28
Table of Administrator Pain and Agony
  • Original goal was to fit the file system into the
    existing environment
  • File system influences operating system stack
  • GPFS required a commercial OS and a specific
    kernel version
  • Lustre required a commercial OS and a specific
    kernel patch
  • TerraFS required a custom kernel

29
Bullet Points of Administrator Pain and Agony
  • Remain responsive even in failure conditions
  • Filesystem failure should not interrupt standard
    UNIX commands used by administrators
  • ls -la /mnt or df should not hang the console
    (see the check sketch below)
  • Zombies should respond to kill -s 9
  • Support clean normal and abnormal termination
  • Support both service start and shutdown commands
  • Provide an Emergency Stop feature
  • Cut losses and let the administrators fix things

Never hang the Linux reboot command
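To illustrate the "never hang the console" requirement, here is a minimal shell sketch an administrator might use to probe a parallel file system mount without risking a wedged session; it assumes GNU coreutils timeout is available, and the mount point /mnt/pfs is illustrative.

    # Probe a possibly-hung parallel file system mount with a time limit.
    # timeout kills df if the mount does not answer within 5 seconds,
    # so the check itself can never hang the console.
    if timeout 5 df -h /mnt/pfs > /dev/null 2>&1; then
        echo "/mnt/pfs is responsive"
    else
        echo "WARNING: /mnt/pfs is not responding" | logger -t fscheck
    fi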
30
Block-Based Access is Complicated!
[Figure: logical access size vs. filesystem block size; one file
shared for writes on four nodes with a cyclic mapping; logical
file view and physical block placement on two servers]
Consider the overhead of correlating blocks to
servers (Example: where is the first byte of the
red data stored?)
(Adapted from May, 2001, p. 79)
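To make the correlation overhead concrete, a minimal shell sketch of the arithmetic a client does for a cyclic block mapping; the offset, block size, and server count below are illustrative assumptions, not values from the figure.

    # For a file striped cyclically across NSERVERS servers in
    # BLOCKSIZE-byte blocks, find which server holds a given byte.
    OFFSET=1048576       # byte offset into the logical file (example)
    BLOCKSIZE=65536      # filesystem block size in bytes (example)
    NSERVERS=2           # number of storage servers (example)
    BLOCK=$(( OFFSET / BLOCKSIZE ))    # logical block number
    SERVER=$(( BLOCK % NSERVERS ))     # cyclic placement
    echo "byte $OFFSET is in block $BLOCK on server $SERVER"

Every request that is not aligned to the block size has to repeat this mapping for each block it touches, which is part of why block-based shared access is complicated.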
31
Blue Gene/L Single-Partition Performance (2008)
32
Blue Gene/L Storage Performance (2008)
33
Software
  • Parallel Execution
  • MPI
  • Job Control
  • Batch queues: PBS, Torque/Maui
  • Libraries
  • Optimized math routines
  • BLAS, LAPACK
  • The next slides show what we tell the users
    (a small build-and-run sketch follows)
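As a hedged sketch of how a user might build a program against these pieces before touching the queue system: the source file name is illustrative, and the exact BLAS/LAPACK link flags depend on the libraries installed at a given site.

    # Compile with the MPI wrapper and link against LAPACK/BLAS
    # (generic library names; site-specific optimized builds may differ).
    mpicc -O2 -o mymodel mymodel.c -llapack -lblas

    # Execution goes through the batch system rather than the head node;
    # see the PBS job script sketched with the next slide.
    qsub mymodel.pbs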

34
The Batch Queue System
  • Batch queues control access to compute nodes
    (a sample job script is sketched below)
  • Please don't ssh to a node and run programs
  • Please don't mpirun on the head node itself
  • People expect to have the whole node for
    performance runs!
  • Resource management
  • Flags and disables offline nodes (down or
    administrative)
  • Matches job requests to nodes
  • Reserves nodes, preventing oversubscription
  • Scheduling
  • Queue prioritization spreads CPUs among users
  • Queue limits prevent a single user from hogging
    the cluster
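For concreteness, a minimal Torque/PBS job script sketch of the kind submitted with qsub; the job name, program, and resource requests are illustrative, and friendlyq is the default queue described on the next slide.

    #!/bin/bash
    #PBS -N mymodel_test              # job name (illustrative)
    #PBS -q friendlyq                 # default queue (16 nodes, 24 hours)
    #PBS -l nodes=2:ppn=2             # 2 nodes, 2 processors per node
    #PBS -l walltime=01:00:00         # wall-clock limit

    cd $PBS_O_WORKDIR                 # PBS starts jobs in $HOME by default
    mpirun -np 4 -machinefile $PBS_NODEFILE ./mymodel

Submit it with qsub mymodel.pbs and watch it with qstat -a (slide 36).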

35
PBS Queues on CSC Systems
  • speedq: debugging queue for HPSC students,
    limited to 8 nodes, 10 minutes
  • friendlyq: default queue for friendly jobs,
    limited to 16 nodes, 24 hours
  • workq: queue for large and long-running jobs,
    no resource limit, only 1 running job
  • reservedq: queue for users with special projects
    approved by the people in charge
36
PBS Commands Queue Status
  • What jobs are running?
  • What jobs are waiting?

    matthew@hemisphere$ qstat -a
    hemisphere.cs.colorado.edu
                                                              Req'd  Req'd   Elap
    Job ID          Username Queue    Jobname    SessID NDS TSK Memory  Time S  Time
    --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
    320102.hemisphe jkihm    friendly WL_SIMP_17   7032   1   1 1024mb 24:00 R 09:52
    320103.hemisphe jkihm    friendly WL_SIMP_18   7078   1   1 1024mb 24:00 R 09:52
    320355.hemisphe jkihm    friendly WL_SIMP_17   4537   1   1 1024mb 24:00 R 08:18
    320388.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
    320389.hemisphe jkihm    friendly WL_SIMP_25     --   1   1 1024mb 24:00 Q    --
    320390.hemisphe jkihm    friendly WL_SIMP_30     --   1   1 1024mb 24:00 Q    --
    321397.hemisphe barcelos workq    missile     21769  16  32     -- 01:12 R 00:04
37
Playing Nicely in the Cluster Sandbox
  • Security considerations
  • Don't share your account or your files (o+rw)
  • Don't put the current directory (.) in your path
  • Compute time considerations
  • Don't submit more than 10-60 jobs to PBS at a
    time
  • Don't submit from a shell script without a sleep
    1 statement (see the loop sketch below)
  • Storage space considerations
  • Keep large input and output sets in /quicksand,
    not /home
  • Don't keep large files around forever: compress
    or delete
  • Please store your personal media collections
    elsewhere

Don't use a password you have ever used anywhere
else!
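As an illustration of the sleep 1 rule, a minimal submission-loop sketch; the script names and job count are made up.

    # Submit a batch of jobs politely: pausing between qsub calls keeps
    # a tight loop from flooding the PBS server.
    for i in $(seq 1 20); do
        qsub run_case_${i}.pbs
        sleep 1
    done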
38
Outline
  • Motivation
  • My other computer is a
  • Parallel Computing
  • Processors
  • Networks
  • Storage
  • Software
  • Grid Computing
  • Software
  • Platforms

39
Sharing Computing and Data with Grids
  • Grids link computers together: more than just a
    network!
  • Networks connect computers
  • Grids allow distant computers to work on a single
    problem
  • Services look like web servers
  • HTTP for data transfer
  • XML Simple Object Access Protocol (SOAP) instead
    of HTML
  • Grid services
  • Metadata and Discovery Services (WS MDS)
  • Job execution (WS GRAM)
  • Data transfer (GridFTP; see the transfer sketch
    below)
  • Workflow management (that's what we do!)
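For a flavor of the grid tooling, a hedged example of a GridFTP transfer using the Globus Toolkit's globus-url-copy client; the host name and paths are hypothetical, and a valid proxy credential is assumed.

    # Create a short-lived proxy credential from the user's grid certificate.
    grid-proxy-init

    # Copy a dataset from a remote GridFTP server to local disk.
    globus-url-copy gsiftp://gridftp.example.edu/data/run42.nc \
        file:///home/user/run42.nc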

40
Grid-BGC Carbon Cycle Model
J. Cope, C. Hartsough, S. McCreary, P. Thornton,
H. M. Tufo, N. Wilhelmi, and M. Woitaszek,
Experiences from Simulating the Global Carbon
Cycle in a Grid Computing Environment, 2005.
41
TeraGrid: Extensible Terascale Facility
42
A National Research Priority
All figures are in millions of dollars.
  • 2000: $36M, Terascale Computing System (PSC)
  • 2001: $45M, Distributed Terascale Facility
    (NCSA, SDSC, ANL, CalTech)
  • 2002: $35M, Extensible Terascale Facility (PSC)
  • 2003: $150M, TeraGrid Extension ($10M Ops)
    (IU, Purdue, ORNL, TACC)
  • 2007: $65M, Track 2 Mid-Range HPC (ORNL, TACC,
    NCAR)
  • 2007: $208M, Track 1 Blue Waters Petascale
    (UIUC / NCSA)

http://www.nsf.gov/news/news_summ.jsp?cntn_id=109850
http://www.nsf.gov/news/news_summ.jsp?cntn_id=106875
43
A Few TeraGrid Resources
44
Challenges and Definitions
  • Power consumption
  • BlueVista: 276 kilowatts
  • Average U.S. home: 10.5 kilowatts
  • Physical space
  • What's the difference between a cluster and a
    supercomputer?
  • Price
  • Number of SMP processors in a compute node
  • Network used to connect nodes in the cluster

45
(No Transcript)
46
Cluster Administration
  • Parallel and distributed shells (see the usage
    sketch below)
  • pdsh
  • dsh
  • sudo pdsh -w node[001-027] /etc/init.d/sshd
    restart
  • Configuration file management
  • IBM CSM
  • xCAT
  • Automated operating system installation
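A small pdsh usage sketch to go with the sshd example above; dshbak ships with pdsh and is assumed to be installed, and the node range and file name are illustrative.

    # Run a command everywhere and collapse identical output, so 27
    # copies of the same line are reported only once.
    pdsh -w node[001-027] uptime | dshbak -c

    # Spot-check configuration drift: checksum a file on every node.
    pdsh -w node[001-027] md5sum /etc/ntp.conf | dshbak -c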

47
Cluster Security
  • The most important question
  • Centralized, inaccessible logging
  • Intrusion detection
  • Custom scripts (a sketch follows below)
  • Network monitoring: difficult at 10 Gbps
  • Desperate measures
  • Extreme firewalling (but don't depend on it!)
  • Virtual hosting for services
  • One-time passwords (RSA SecurID, CryptoCard)

How do you know if you've been compromised?
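In the spirit of the custom-scripts bullet, a minimal integrity-check sketch; the binary list and baseline location are illustrative, and the baseline is assumed to live on read-only or off-host media so an intruder cannot quietly regenerate it.

    # Compare checksums of security-critical binaries against a trusted
    # baseline; any difference is a strong hint of a compromise.
    md5sum /bin/login /usr/sbin/sshd /bin/ps /bin/ls > /tmp/current.md5
    diff /mnt/cdrom/baseline.md5 /tmp/current.md5 \
        || echo "WARNING: system binaries differ from baseline" | logger -t ids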
48
Questions?
Schenk's System Administration
April 2008
  • Matthew Woitaszek
  • mattheww@ucar.edu
  • Thanks to my CU and NCAR colleagues:
  • Jason Cope, John Dennis, Bobby House,
  • Rory Kelly, Dustin Leverman, Paul Marshal,
  • Michael Oberg, Henry Tufo, and Theron Voran