Title: Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters
1. Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters
- Philip M. Papadopoulos
- Mason J. Katz
- Greg Bruno
2. Who We Are
- Philip Papadopoulos
  - Parallel message passing expert (PVM and Fast Messages)
- Mason Katz
  - Network protocol expert (x-kernel, Scout and Fast Messages)
- Greg Bruno
  - 10 years experience with NCR's Teradata systems
  - Builders of clusters which drive very large commercial databases
- All three of us have worked together for the past 2 years building NT and Linux clusters
3. Who is NPACI Rocks?
- Key people from the UCB Millennium Group
  - Prof. David Culler
  - Eric Fraser
  - Brent Chun
  - Matt Massie
  - Albert Goto
- People from SDSC
  - Bruno, Katz, Papadopoulos (Distributed Computing Group)
  - Kenneth Yoshimoto (Scheduling)
  - Keith Thompson, Bill Link (Grid)
  - Storage Resource Broker (SRB) Group
- You!
4. Why We Do Clusters: Frankly, We Love It
- Building high-performance systems which have killer price/performance is a gas
- NPACI is about building pervasive infrastructure. Supported, transferable cluster infrastructure was missing from our portfolio.
- Enabling others to build their own clusters and do scientific simulation is a blast.
- We wanted a management system that would allow us to rapidly experiment with new low-level system software (and recover when things didn't go quite right)
- Protect ourselves from ourselves?
5. What We'll Cover
- Rocks philosophies
- Hardware components
- Software packages
- Theory and practice
- Lab
6. What We Thought We Learned
- Clusters are phenomenal price/performance computational engines, but are hard to manage
- Cluster management is a full-time job which gets linearly harder as one scales out
- Heterogeneous nodes are a bummer (network, memory, disk, MHz, current kernel version)
7. You Must Unlearn What You Have Learned
8. Installation/Management
- Need to have a strategy for managing cluster nodes
- Pitfalls
  - Installing each node by hand
    - Difficult to keep software on nodes up to date
    - Management effort increases as node count increases
  - Disk imaging techniques (e.g., VA Disk Imager)
    - Difficult to handle heterogeneous nodes
    - Treats the OS as a single monolithic system
  - Specialized installation programs (e.g., IBM's LUI, or RWCP's multicast installer)
    - Let Linux packaging vendors do their job
- Penultimate: RedHat Kickstart
  - Define the packages needed for the OS on nodes; Kickstart gives a reasonable measure of control
- Need to fully automate to scale out (Rocks)
9. Scaling Out
- Evolve to management of two systems
  - The front end(s)
    - Login host
    - Users' home areas, passwords, groups
    - Cluster configuration information
  - The compute nodes
    - Disposable OS image
    - Let software manage node heterogeneity
    - Parallel (re)installation
- Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, ...)
10. NPACI Rocks Toolkit (rocks.npaci.edu)
- Techniques and software for easy installation, management, monitoring and update of clusters
- Installation
  - Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster
- Management and update philosophies
  - Trivial to completely reinstall any (all) nodes
  - Nodes are 100% automatically configured
    - Use of DHCP, NIS for configuration
  - Use RedHat's Kickstart to define the set of software that defines a node
  - All software is delivered in a RedHat Package (RPM)
    - Encapsulates configuration for a package (e.g., Myrinet)
    - Manages dependencies
  - Never try to figure out if node software is consistent
    - If you ever ask yourself this question, reinstall the node
11. More Rocksisms
- Leverage widely-used (standard) software wherever possible
  - Everything is in RedHat Packages (RPM)
  - RedHat's Kickstart installation tool
  - SSH, Telnet, existing open source tools
- Write only the software that we need to write
- Focus on simplicity
  - Commodity components
    - For example: x86 compute servers, Ethernet, Myrinet
  - Minimal
    - For example: no additional diagnostic or proprietary networks
- Rocks is a collection point of software for people building clusters
  - It will evolve to include cluster software and packaging from more than just SDSC and UCB
  - <your-software.i386.rpm, your-software.src.rpm here>
12. Hardware
13. Many Variations on a Basic Layout
- (Diagram) A basic layout contains:
  - Front-end node(s)
  - Power distribution (network-addressable units as an option)
  - Public Ethernet
  - Fast Ethernet switching complex
  - Gigabit network switching complex
14. Frontend and Compute Nodes
- Choices
  - Uni- or dual-processor, Intel processors
    - Linux is, in reality, an Intel OS
  - Rackmount vs. desktop chassis
    - Rackmount essential for large installations
  - SCSI vs. IDE
    - Performance is a non-issue
    - Price and serviceability are the real considerations
    - Note: rackmount servers usually are SCSI
  - User integration versus system integrator
- Our nodes
  - Dual PIIIs (733, 800 and 933 MHz; Compaq, IBM)
    - 1.0 GHz as we expand
  - ½ GB per node (1 GB would be better)
  - Hot-swap SCSI on these nodes
  - We integrate our hardware
15. Networks
- High-performance networks
  - Myrinet, Giganet, Servernet, Gigabit Ethernet, etc.
  - Ethernet only? Beowulf-class
- Management networks (Light Side)
  - Ethernet (100 Mbit)
    - Management network used to manage compute nodes and launch jobs
    - Nodes are in private IP (192.168.x.x) space; the front-end does NAT (see the sketch below)
  - Ethernet 802.11b
    - Easy access to the cluster via laptops
    - Plus, wireless will change your life
- Evil management networks (Dark Side)
  - A serial console network is not necessary
  - A KVM (keyboard/video/mouse) switching system adds too much complexity, cables, and cost
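
One common way a Linux front-end of this era provided NAT was kernel masquerading; a minimal ipchains sketch (the interface roles, the private subnet, and the choice of ipchains over iptables are assumptions about the setup, not taken from Rocks):

    # Hypothetical front-end NAT with ipchains (2.2-era kernels; 2.4 kernels use iptables MASQUERADE instead).
    echo 1 > /proc/sys/net/ipv4/ip_forward                 # enable IP forwarding
    /sbin/ipchains -P forward DENY                         # do not forward anything by default
    /sbin/ipchains -A forward -s 192.168.0.0/16 -j MASQ    # masquerade traffic from the private cluster network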
16. Power Distribution
- (Diagram: a network-addressable power distribution unit with an Ethernet port and power sockets)
- Highly desirable to have network-addressable power distribution units
  - Can remotely power cycle compute nodes
  - Instrumented, which helps determine power needs
17. Other Helpful Hardware: When All Else Fails
- When a node appears to be sick
  - Issue a reinstall command over the network
  - If still dead, instruct the network-addressable power distribution unit to power cycle the node (this reinstalls the OS)
  - If still dead, roll up the crash cart
    - Monitor and keyboard
18. Leatherman: A Must-Have For Any Self-Respecting Clusters Person
19. Current Configuration of the Meteor Cluster
- Rocks v2.0
- 2 frontends
- 100 nodes
- 50 GB RAM
- Ethernet
  - For management
- Myrinet
- Servernet
  - Working through some bugs
20. Software
21. RedHat-Supplied Software
- 7.0 base + updates
- RPM
  - RedHat Package Manager
- Kickstart
  - Method for unattended server installation
22. Community Software
- Myricom's General Messaging (GM)
- MPICH
  - GM device
  - Ethernet device
- Portable Batch System
- Maui
- PVM
- Intel's Math Kernel Library
  - Math functions tuned for Intel processors
23. NPACI Rocks Software
- cluster-dist
  - A tool used to assemble the latest RedHat, community and Rocks packages into a distribution which is used by compute nodes during reinstallation
- shoot-node and eKV (Ethernet Keyboard and Video)
  - Initiate a compute node reinstallation
  - Monitor compute node reinstallations over Ethernet with telnet
- cluster-admin and cluster-ssl
  - Tools to create user accounts and user SSL certificates
- Rexec (UC Berkeley)
  - Launch and control parallel jobs (SSL-based authentication)
- Ganglia (UC Berkeley)
  - Cluster monitoring
24. Software Details
25. cluster-dist
- Integrates RedHat packages from:
  - RedHat (mirror): base distribution + updates
  - Contrib directory
  - Locally produced packages
  - Packages from rocks.npaci.edu
- Produces a single updated distribution that resides on the front-end
  - It is a RedHat distribution with patches and updates applied
- Different Kickstart files and different distributions can co-exist on a front-end to add flexibility in configuring nodes
26. Remote Re-installation: shoot-node and eKV
- Rocks provides a simple method to remotely reinstall a node (once it has been installed the first time)
- By default, hard power cycling will cause a node to reinstall itself
- With no serial (or KVM) console, we are able to watch a node as it installs (see the sketch below)
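
The exact shoot-node command line is Rocks-specific, but the pattern it automates can be sketched roughly as follows (the host name, the reboot-triggers-reinstall behavior described above, and the eKV telnet port are assumptions for illustration):

    # Rough sketch of a frontend-driven reinstall (not the actual shoot-node code).
    NODE=compute-0-0                      # example node name
    ssh $NODE '/sbin/shutdown -r now'     # reboot; by default the node reinstalls itself on boot
    telnet $NODE 8000                     # watch the install over Ethernet via eKV (port number assumed)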
27. Remote Re-installation: shoot-node and eKV
- (Screenshot: remotely starting reinstallation on two nodes, 192.168.254.254 and 192.168.254.253)
28. Starting Jobs
- SSH-based MPI-Launch
  - Provides full integration with the Myrinet port reservation capability of Usher/Patron
- SSL-based Rexec
  - Better control of jobs on remote nodes
  - Sane signal propagation
- Batch system: PBS + Maui
  - PBS provides queue definition and node monitoring
  - Maui has rich scheduling policies
    - Standing and future reservations
    - Query the number of nodes available now
29. PBS (Portable Batch System)
- Three standard components to PBS
  - MOM: node health reporting daemon and job launch daemon on every node
  - Server: on the front-end; queue definition, aggregation of node information
  - Scheduler: policies for what job to run out of which queue at what time
- We added a fourth
  - Configuration: get cluster node configuration from our SQL database
30. PBS RPM Packaging
- Repackaged PBS (sane packaging enhancements)
  - Added chkconfig-compatible start-up scripts (see the sketch below)
- 4 packages
  - pbs (server and scheduler; should be divided again)
  - pbs-mom
  - pbs-config-sql (Python script to generate the database report)
  - pbs-common (files needed by all three packages)
- A Rocks 2.0 base installation (automatically) defines a default queue with all nodes available in the queue
- http://pbs.mrj.com is a good starting point for PBS
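
Being chkconfig-compatible means each init script carries a chkconfig header so it can be enabled per runlevel; a minimal sketch of registering and starting such a service (the pbs-mom script name comes from the package list above, and the runlevels come from the header inside the script):

    # Register and start a chkconfig-managed PBS init script.
    /sbin/chkconfig --add pbs-mom     # register the script under /etc/rc.d/init.d
    /sbin/chkconfig pbs-mom on        # enable it for the script's default runlevels
    /etc/rc.d/init.d/pbs-mom start    # start the daemon now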
31. PBS Server Defaults (and changing them)
- Startup script: /etc/rc.d/init.d/pbs-server start
- /usr/apps/pbs/pbs.default
  - Sourced every time PBS is started

    # $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38 bruno Exp $
    #
    # A basic PBS setup that creates a queue called "default" and starts scheduling.
    #
    # Create queues and set their attributes.
    #
    # Create and define queue "default": 1 node default, 1 hr walltime.
    create queue default
    set queue default queue_type = Execution
    set queue default resources_default.nodes = 1
    set queue default resources_default.walltime = 1:00:00
    set queue default enabled = True
    set queue default started = True
32. pbs.default (cont'd)

    #
    # Set server attributes.
    #
    # Assume the maui scheduler will be installed.
    set server managers = maui@frontend-0
    set server operators = maui@frontend-0
    set server default_queue = default
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server scheduling = false

- PBS will ignore queue creation if a queue already exists.
33. Modifying the Default Setup (simple queue creation)
- Use qmgr to create a new queue

    /usr/apps/pbs/bin/qmgr
    Max open servers: 4
    Qmgr: create queue single
    Qmgr: set queue single queue_type = execution
    Qmgr: set queue single enabled = true
    Qmgr: set queue single acl_hosts = compute-1-0
    Qmgr: set queue single started = true

- Use qmgr to save the configuration

    /usr/apps/pbs/bin/qmgr -c "print server" > /usr/apps/pbs/pbs.default
34. Maui Scheduler
- We use Maui as our scheduler for PBS
  - mauischeduler.sourceforge.net
  - http://havi.supercluster.org/documentation/maui
- Add the "single" queue definition so that Maui understands it. This is in /usr/spool/maui/maui.cfg

    SRNAME[0]     single
    SRHOSTLIST[0] compute-1-0

- Restart Maui
  - /etc/rc.d/init.d/maui restart
- Submit a job to PBS
  - /usr/apps/pbs/bin/qsub -q single mytest.sh
35. Monitoring Your Cluster
- PBS has a GUI called xpbsmon. It gives a nice graphical view of the up/down state of nodes
- SNMP status
  - Use the extensive SNMP MIB defined by the Linux community to find out many things about a node (see the sketch below)
    - Installed software
    - Uptime
    - Load
- Ganglia (UCB): an IP multicast-based monitoring system
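
As an illustration, the sort of per-node queries this enables (using UCD-SNMP-era tools; the community string, OID names, and the assumption that each node runs snmpd are mine, not from the slides):

    # Hypothetical SNMP queries against a compute node (UCD-SNMP 4.x command syntax).
    snmpwalk compute-0-0 public system              # uptime, sysDescr, contact info
    snmpwalk compute-0-0 public laLoad              # load averages (UCD MIB)
    snmpwalk compute-0-0 public hrSWInstalledName   # installed software (Host Resources MIB)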
36. Ganglia (http://www.millennium.berkeley.edu/ganglia/)
- Dendrite on each node
  - Multicasts the state of the machine on significant changes
    - Load averages, disk consumption, memory, etc.
  - Beacons every minute if there are no significant deltas
- Axons
  - Collection daemons (at least one per cluster)
- Ganglia client: sorts the measured variables to find a set of hosts that match a desired criterion
  - E.g., X MB free memory, load below Y
  - Can act as a vexec resource for Rexec
37. Ganglia Text Output

    phil@slic01$ /usr/sbin/ganglia load_one
    compute-1-5   0.07
    compute-0-9   0.08
    compute-1-3   0.14
    compute-2-0   0.15
    compute-2-8   0.18
    compute-2-5   0.27
    frontend-0    0.36
    compute-3-11  0.82
    compute-23    1.06
    compute-22    1.19
    compute-3-4   1.96
    compute-3-9   1.99
    compute-3-10  1.99
    compute-3-2   2.00
    compute-3-3   2.09
    compute-3-7   2.12
    compute-3-5   2.99
    compute-3-6   3.0
38. Hidden Software
39. Some Tools That Assist in Automation
- Users generally will not see these tools
  - Profile scripts run at a user's first login
  - Usher/Patron (Myrinet port reservation)
  - insert-ethers (adds nodes to a cluster)
  - cluster-sql package
    - Reports to build service-specific config files
  - cluster-admin
    - Node reinstallation
    - Creating accounts (NIS, auto.home map creation)
  - cluster-ssl
    - Generates keys for SSL authentication (rexec)
40. Usher/Patron
- Tool to simplify using installed Myricom hardware
- Eliminates a central database to decide which Myrinet ports are currently in use
  - (The Myricom driver is installed with a separate source RPM)
- The usher daemon runs on each compute node; it takes reservation requests for access to the limited set of Myrinet ports (RPC-based)
  - Reservations time out if not claimed
- Patron works with usher to request and claim ports
- Integrated with MPI-Launch
  - Automatically creates the node file needed for MPICH-GM
41. First Login Profile Scripts
- On first login, all users, including root, are prompted to build an SSH public/private key pair (see the sketch below)
  - Makes sense because SSH is the only way to gain login access to the nodes
- NIS is updated (passwd, auto.home, etc.)
- Additionally, if it is the first time root has logged in, an SSL certificate authority is generated which is used to sign users' SSL certificates
- The SSL certificate and root's public SSH key are then propagated to the compute node kickstart file
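
The key-pair step the profile script drives amounts to something like the following (the key type, file paths, and authorized_keys step are standard OpenSSH practice; the exact prompts Rocks issues may differ):

    # Roughly what the first-login profile script walks a user through.
    ssh-keygen -t rsa -f $HOME/.ssh/id_rsa                      # generate the key pair
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys     # allow password-less ssh to the nodes
    chmod 600 $HOME/.ssh/authorized_keys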
42. insert-ethers
- Used to populate the nodes MySQL table
- Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages (see the sketch below)
- Extracts the MAC addr and, if it is not in the table, adds the MAC addr and hostname to the table
- For every new entry
  - Rebuilds /etc/hosts and /etc/dhcpd.conf
  - Reconfigures NIS
  - Restarts DHCP and PBS
- Hostname is
  - <basename>-<cabinet>-<chassis>
  - Configurable to change the hostname
    - E.g., when adding new cabinets
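
The discovery step can be pictured with a one-liner of this flavor (the real insert-ethers also records the MAC in MySQL and assigns a hostname; the log format shown is typical ISC dhcpd syslog output):

    # List the unique MAC addresses that have sent DHCPDISCOVER requests so far.
    grep DHCPDISCOVER /var/log/messages |
      sed -e 's/.*DHCPDISCOVER from \([0-9a-f:]*\).*/\1/' |
      sort -u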
43. dhcp_options: One More Important MySQL Table
- Created by the frontend kickstart file (based on user input from the Rocks configuration web page)
- Used by makedhcp to construct the header in /etc/dhcpd.conf
44. Configuration Derived from the Database
- (Diagram) Automated node discovery: insert-ethers watches nodes 0 through N and populates the MySQL DB
- Report generators then read the database to produce service-specific files (see the sketch below):
  - makehosts -> /etc/hosts
  - makedhcp -> /etc/dhcpd.conf
  - pbs-config-sql -> PBS node list
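
For a flavor of what such a report generator does, a sketch of the makehosts idea (the database name, table, and columns are guesses for illustration, not the actual Rocks schema):

    # Hypothetical /etc/hosts report pulled from MySQL (schema names are assumed).
    mysql --batch --skip-column-names cluster \
      -e 'SELECT ip, name FROM nodes ORDER BY name' |
      awk '{ printf "%s\t%s\n", $1, $2 }' > /etc/hosts.new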
45. Futures
- Attack the storage problem
  - Keep the global view of storage that NFS gives us, but address the scalability problem
- Source high bandwidth from the cluster into the WAN
- Apply our cluster bring-up automation to easily attach clusters to the grid
- Continue to improve cluster monitoring
  - Configure a monitoring GUI (e.g., NetSaint) to extract data from Ganglia
  - Get node health (fan speed, temperature, disk error rate) into Ganglia
- Technologies
  - Processors: IA-64 and Alpha
  - Networks: Infiniband
  - 2.4 kernel (will rev our distribution at RedHat 7.1)
46. Lab
47. Front-end Node
- Node seen by the external world
  - Performs Network Address Translation (NAT)
- NFS server(s) for user home areas
  - Beware of scalability issues!
- Compilers, libraries
- Configuration for nodes
  - DHCP server, NIS domain controller, NTP server, web server, MySQL server
- Installation server for defining the system on nodes
- Method(s) to start jobs on compute nodes
  - Batch system (PBS + Maui)
  - Interactive launching of jobs
48. Installing a Front-end Machine
- Build ks.cfg from https://rocks.npaci.edu/site.htm
  - Define your root password
  - NIS domain
  - Public IP address
- Boot the CD
  - Full ISO image available for download. Burn your own!
- Enter "frontend" at the boot prompt
- Sit back. Time varies depending on the speed of the CPU and CDROM of the frontend
  - The entire distribution is being copied to /home/install/cluster-dist
49. Building a Distribution with cluster-dist
- Directory structure
- Build the mirror
  - From the mirror host
  - Emulates mirroring from rocks
- Build the distro
  - cluster-dist dist
50. Installing Compute Nodes
- Log in as root on the frontend
- Execute tail -f /var/log/messages and insert-ethers
- Back on the compute node
  - Boot the CD
- From a laptop
  - Examine the MySQL database through a browser
51. Reinstalling Compute Nodes
- With shoot-node
  - Frontend-driven reinstallation
- With power on/off
  - Hail Mary to recover from a bad software state
52. Structure of a Rocks Kickstart File
- site.h
- Look at ks.cfg.in (see the skeleton below)
  - preamble
  - %packages
  - %pre
  - %post
  - %post and %post --nochroot
  - %include
- For more info
  - http://redhat.com/support/manuals/RHL-7-Manual/ref-guide/ch-kickstart2.html
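
A bare-bones Kickstart skeleton showing those sections (everything in it is illustrative: the install URL, partitioning, and package names are assumptions, not the generated Rocks ks.cfg):

    # Minimal RedHat Kickstart skeleton (illustrative only).
    # --- preamble: installation method, partitioning, networking, ...
    install
    url --url http://frontend-0/install/cluster-dist   # install server path assumed
    clearpart --all
    part / --size 4096

    %packages            # package set installed on the node
    @ Base
    openssh-server

    %pre                 # shell commands run before installation starts

    %post                # commands run inside the freshly installed system (chroot)
    echo "node configured" > /root/ks-post.log

    %post --nochroot     # commands run outside the chroot, against the installer environment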
53. Updating and Augmenting Your Distribution with cluster-dist
- Add a new package to the compute node in ks.cfg.in
  - Kickstart the compute node
- Add a new package to the distro, then have the compute node pick it up
  - Put a package in contrib
  - Build distro -> kickstart
- Add a new local package (usher)
  - Bump the version number
  - Build the RPM
  - Show where the RPM gets put
  - Build distro
  - Show the new home for the usher RPM
  - Build distro -> kickstart
54. Adding Users with create-account
- As root
  - create-account yoda
  - passwd yoda
  - ssl-genuser yoda
- ypcat
  - Shows how the data made it into NIS
  - ypcat passwd
  - ypcat auto.home
55. Look at Cluster Nodes with Netscape
56. Launching and Controlling Jobs
- Rexec
  - rexec sleeper
- PBS
  - Add a new queue to Maui
57. Cluster Monitoring with ganglia
- Look at multiple different values from ganglia
  - ganglia
    - Lists commands
  - ganglia load_one
  - ganglia cpu_system cpu_idle cpu_nice
- Go to millennium.berkeley.edu to see a live demo
58. Auto Configuration of a Node
- Pop in a Myrinet card
- After reinstallation, a startup script uses lspci to determine if a Myrinet card is on the PCI bus (see the sketch below)
- If yes
  - Automatically compile the Myrinet device driver from the source RPM
  - Install the Myrinet module
  - The module is then guaranteed compatible with the running kernel
- Eliminates managing binary device drivers for different kernel configurations
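
The detection-and-rebuild idea can be sketched like this (the startup script itself, the GM source RPM name, and the module name are assumptions; only the lspci-then-rebuild pattern comes from the slide):

    # Sketch of Myrinet auto-configuration at boot (not the actual Rocks script).
    if /sbin/lspci | grep -qi myri; then
        rpm --rebuild /usr/src/redhat/SRPMS/gm-*.src.rpm     # build the driver against the running kernel
        rpm -ivh /usr/src/redhat/RPMS/i386/gm-*.i386.rpm     # install the freshly built binary RPM
        /sbin/insmod gm                                      # load the Myrinet module (name assumed)
    fi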