Title: Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters
1. Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters
- Philip M. Papadopoulos
- Mason J. Katz
- Greg Bruno
2. Who We Are
- Philip Papadopoulos
  - Parallel message passing expert (PVM and Fast Messages)
- Mason Katz
  - Network protocol expert (x-kernel, Scout and Fast Messages)
- Greg Bruno
  - 10 years experience with NCR's Teradata systems
  - Builders of clusters which drive very large commercial databases
- All three of us have worked together for the past 2 years building NT and Linux clusters
3. Who is NPACI Rocks?
- Key people from the UCB Millennium Group
  - Prof. David Culler
  - Eric Fraser
  - Brent Chun
  - Matt Massie
  - Albert Goto
- People from SDSC
  - Bruno, Katz, Papadopoulos (Distributed Computing Group)
  - Kenneth Yoshimoto (Scheduling)
  - Keith Thompson, Bill Link (Grid)
  - Storage Resource Broker (SRB) Group
- You!
4. Why We Do Clusters: Frankly, We Love It
- Building high-performance systems which have killer price/performance is a gas
- NPACI is about building pervasive infrastructure. Supported, transferable cluster infrastructure was missing from our portfolio.
- Enabling others to build their own clusters and do scientific simulation is a blast.
- We wanted a management system that would allow us to rapidly experiment with new low-level system software (and recover when things didn't go quite right)
- Protect ourselves from ourselves?
5. What We'll Cover
- Rocks philosophies
- Hardware components
- Software packages
- Theory and practice
- Lab
6. What We Thought We Learned
- Clusters are phenomenal price/performance computational engines, but are hard to manage
- Cluster management is a full-time job which gets linearly harder as one scales out
- Heterogeneous nodes are a bummer (network, memory, disk, MHz, current kernel version)
7. You Must Unlearn What You Have Learned
8. Installation/Management
- Need to have a strategy for managing cluster nodes
- Pitfalls
  - Installing each node by hand
    - Difficult to keep software on nodes up to date
    - Management effort increases as node count increases
  - Disk imaging techniques (e.g., VA Disk Imager)
    - Difficult to handle heterogeneous nodes
    - Treats the OS as a single monolithic system
  - Specialized installation programs (e.g., IBM's LUI, or RWCP's multicast installer)
    - Let Linux packaging vendors do their job
- Penultimate: RedHat Kickstart
  - Define the packages needed for the OS on nodes; Kickstart gives a reasonable measure of control
- Need to fully automate to scale out (Rocks)
9. Scaling Out
- Evolve to management of two systems
  - The front end(s)
    - Login host
    - Users' home areas, passwords, groups
    - Cluster configuration information
  - The compute nodes
    - Disposable OS image
    - Let software manage node heterogeneity
    - Parallel (re)installation
- Cluster-wide configuration files derived through reports from a MySQL database (DHCP, hosts, PBS nodes, ...)
10. NPACI Rocks Toolkit (rocks.npaci.edu)
- Techniques and software for easy installation, management, monitoring and update of clusters
- Installation
  - Bootable CD + floppy which contains all the packages and site configuration info to bring up an entire cluster
- Management and update philosophies
  - Trivial to completely reinstall any (all) nodes
  - Nodes are 100% automatically configured
    - Use of DHCP, NIS for configuration
  - Use RedHat's Kickstart to define the set of software that defines a node
  - All software is delivered in a RedHat Package (RPM)
    - Encapsulates configuration for a package (e.g., Myrinet)
    - Manages dependencies
  - Never try to figure out if node software is consistent
    - If you ever ask yourself this question, reinstall the node
11. More Rocksisms
- Leverage widely-used (standard) software wherever possible
  - Everything is in RedHat Packages (RPM)
  - RedHat's Kickstart installation tool
  - SSH, Telnet, existing open source tools
- Write only the software that we need to write
- Focus on simplicity
  - Commodity components
    - For example: x86 compute servers, Ethernet, Myrinet
  - Minimal
    - For example: no additional diagnostic or proprietary networks
- Rocks is a collection point of software for people building clusters
  - It will evolve to include cluster software and packaging from more than just SDSC and UCB
  - <your-software.i386.rpm, your-software.src.rpm here>
12. Hardware
13. Many Variations on a Basic Layout
- (Diagram) A basic layout contains:
  - Front-end node(s)
  - Power distribution (network-addressable units as an option)
  - Public Ethernet
  - Fast Ethernet switching complex
  - Gigabit network switching complex
14. Frontend and Compute Nodes
- Choices
  - Uni- or dual-processor, Intel processors
    - Linux is, in reality, an Intel OS
  - Rackmount vs. desktop chassis
    - Rackmount essential for large installations
  - SCSI vs. IDE
    - Performance is a non-issue
    - Price and serviceability are the real considerations
    - Note: rackmount servers usually are SCSI
  - User integration versus system integrator
- Our nodes
  - Dual PIIIs (733, 800 and 933 MHz; Compaq, IBM)
    - 1.0 GHz as we expand
  - ½ GB per node (1 GB would be better)
  - Hot-swap SCSI on these nodes
  - We integrate our hardware
15. Networks
- High-performance networks
  - Myrinet, Giganet, Servernet, Gigabit Ethernet, etc.
  - Ethernet only? Beowulf-class
- Management networks (Light Side)
  - Ethernet (100 Mbit)
    - Management network used to manage compute nodes and launch jobs
    - Nodes are in private IP (192.168.x.x) space; the front-end does NAT (see the sketch below)
  - Ethernet 802.11b
    - Easy access to the cluster via laptops
    - Plus, wireless will change your life
- Evil management networks (Dark Side)
  - A serial console network is not necessary
  - A KVM (keyboard/video/mouse) switching system adds too much complexity, cables, and cost
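
One common way a Linux front-end of this era provided NAT was kernel masquerading; a minimal ipchains sketch (the interface roles, the private subnet, and the choice of ipchains over iptables are assumptions about the setup, not taken from Rocks):

    # Hypothetical front-end NAT with ipchains (2.2-era kernels; 2.4 kernels use iptables MASQUERADE instead).
    echo 1 > /proc/sys/net/ipv4/ip_forward                 # enable IP forwarding
    /sbin/ipchains -P forward DENY                         # do not forward anything by default
    /sbin/ipchains -A forward -s 192.168.0.0/16 -j MASQ    # masquerade traffic from the private cluster network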
16. Power Distribution
- (Diagram: a network-addressable power distribution unit with an Ethernet port and power sockets)
- Highly desirable to have network-addressable power distribution units
  - Can remotely power cycle compute nodes
  - Instrumented, which helps determine power needs
17. Other Helpful Hardware: When All Else Fails
- When a node appears to be sick
  - Issue a reinstall command over the network
  - If still dead, instruct the network-addressable power distribution unit to power cycle the node (this reinstalls the OS)
  - If still dead, roll up the crash cart
    - Monitor and keyboard
18. Leatherman: A Must-Have For Any Self-Respecting Clusters Person
19. Current Configuration of the Meteor Cluster
- Rocks v2.0
- 2 frontends
- 100 nodes
- 50 GB RAM
- Ethernet
  - For management
- Myrinet
- Servernet
  - Working through some bugs
20. Software
21. RedHat-Supplied Software
- 7.0 base + updates
- RPM
  - RedHat Package Manager
- Kickstart
  - Method for unattended server installation
22. Community Software
- Myricom's General Messaging (GM)
- MPICH
  - GM device
  - Ethernet device
- Portable Batch System
- Maui
- PVM
- Intel's Math Kernel Library
  - Math functions tuned for Intel processors
23. NPACI Rocks Software
- cluster-dist
  - A tool used to assemble the latest RedHat, community and Rocks packages into a distribution which is used by compute nodes during reinstallation
- shoot-node and eKV (Ethernet Keyboard and Video)
  - Initiate a compute node reinstallation
  - Monitor compute node reinstallations over Ethernet with telnet
- cluster-admin and cluster-ssl
  - Tools to create user accounts and user SSL certificates
- Rexec (UC Berkeley)
  - Launch and control parallel jobs (SSL-based authentication)
- Ganglia (UC Berkeley)
  - Cluster monitoring
24. Software Details
25. cluster-dist
- Integrates RedHat packages from:
  - RedHat (mirror): base distribution + updates
  - Contrib directory
  - Locally produced packages
  - Packages from rocks.npaci.edu
- Produces a single updated distribution that resides on the front-end
  - It is a RedHat distribution with patches and updates applied
- Different Kickstart files and different distributions can co-exist on a front-end to add flexibility in configuring nodes
26. Remote Re-installation: shoot-node and eKV
- Rocks provides a simple method to remotely reinstall a node (once it has been installed the first time)
- By default, hard power cycling will cause a node to reinstall itself
- With no serial (or KVM) console, we are able to watch a node as it installs (see the sketch below)
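
The exact shoot-node command line is Rocks-specific, but the pattern it automates can be sketched roughly as follows (the host name, the reboot-triggers-reinstall behavior described above, and the eKV telnet port are assumptions for illustration):

    # Rough sketch of a frontend-driven reinstall (not the actual shoot-node code).
    NODE=compute-0-0                      # example node name
    ssh $NODE '/sbin/shutdown -r now'     # reboot; by default the node reinstalls itself on boot
    telnet $NODE 8000                     # watch the install over Ethernet via eKV (port number assumed)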
27. Remote Re-installation: shoot-node and eKV
- (Screenshot: remotely starting reinstallation on two nodes, 192.168.254.254 and 192.168.254.253)
28. Starting Jobs
- SSH-based MPI-Launch
  - Provides full integration with the Myrinet port reservation capability of Usher/Patron
- SSL-based Rexec
  - Better control of jobs on remote nodes
  - Sane signal propagation
- Batch system: PBS + Maui
  - PBS provides queue definition and node monitoring
  - Maui has rich scheduling policies
    - Standing and future reservations
    - Query the number of nodes available now
29. PBS (Portable Batch System)
- Three standard components to PBS
  - MOM: node health reporting daemon and job launch daemon on every node
  - Server: on the front-end; queue definition, aggregation of node information
  - Scheduler: policies for what job to run out of which queue at what time
- We added a fourth
  - Configuration: get cluster node configuration from our SQL database
30. PBS RPM Packaging
- Repackaged PBS (sane packaging enhancements)
  - Added chkconfig-compatible start-up scripts (see the sketch below)
- 4 packages
  - pbs (server and scheduler; should be divided again)
  - pbs-mom
  - pbs-config-sql (Python script to generate the database report)
  - pbs-common (files needed by all three packages)
- A Rocks 2.0 base installation (automatically) defines a default queue with all nodes available in the queue
- http://pbs.mrj.com is a good starting point for PBS
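
Being chkconfig-compatible means each init script carries a chkconfig header so it can be enabled per runlevel; a minimal sketch of registering and starting such a service (the pbs-mom script name comes from the package list above, and the runlevels come from the header inside the script):

    # Register and start a chkconfig-managed PBS init script.
    /sbin/chkconfig --add pbs-mom     # register the script under /etc/rc.d/init.d
    /sbin/chkconfig pbs-mom on        # enable it for the script's default runlevels
    /etc/rc.d/init.d/pbs-mom start    # start the daemon now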
31. PBS Server Defaults (and changing them)
- Startup script: /etc/rc.d/init.d/pbs-server start
- /usr/apps/pbs/pbs.default
  - Sourced every time PBS is started

    # $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38 bruno Exp $
    #
    # A basic PBS setup that creates a queue called "default" and starts scheduling.
    #
    # Create queues and set their attributes.
    #
    # Create and define queue "default": 1 node default, 1 hr walltime.
    create queue default
    set queue default queue_type = Execution
    set queue default resources_default.nodes = 1
    set queue default resources_default.walltime = 1:00:00
    set queue default enabled = True
    set queue default started = True
32. pbs.default (cont'd)

    #
    # Set server attributes.
    #
    # Assume the maui scheduler will be installed.
    set server managers = maui@frontend-0
    set server operators = maui@frontend-0
    set server default_queue = default
    set server log_events = 511
    set server mail_from = adm
    set server scheduler_iteration = 600
    set server scheduling = false

- PBS will ignore queue creation if a queue already exists.
33. Modifying the Default Setup (simple queue creation)
- Use qmgr to create a new queue

    /usr/apps/pbs/bin/qmgr
    Max open servers: 4
    Qmgr: create queue single
    Qmgr: set queue single queue_type = execution
    Qmgr: set queue single enabled = true
    Qmgr: set queue single acl_hosts = compute-1-0
    Qmgr: set queue single started = true

- Use qmgr to save the configuration

    /usr/apps/pbs/bin/qmgr -c "print server" > /usr/apps/pbs/pbs.default
34. Maui Scheduler
- We use Maui as our scheduler for PBS
  - mauischeduler.sourceforge.net
  - http://havi.supercluster.org/documentation/maui
- Add the "single" queue definition so that Maui understands it. This is in /usr/spool/maui/maui.cfg

    SRNAME[0]     single
    SRHOSTLIST[0] compute-1-0

- Restart Maui
  - /etc/rc.d/init.d/maui restart
- Submit a job to PBS
  - /usr/apps/pbs/bin/qsub -q single mytest.sh
35. Monitoring Your Cluster
- PBS has a GUI called xpbsmon. It gives a nice graphical view of the up/down state of nodes
- SNMP status
  - Use the extensive SNMP MIB defined by the Linux community to find out many things about a node (see the sketch below)
    - Installed software
    - Uptime
    - Load
- Ganglia (UCB): an IP multicast-based monitoring system
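
As an illustration, the sort of per-node queries this enables (using UCD-SNMP-era tools; the community string, OID names, and the assumption that each node runs snmpd are mine, not from the slides):

    # Hypothetical SNMP queries against a compute node (UCD-SNMP 4.x command syntax).
    snmpwalk compute-0-0 public system              # uptime, sysDescr, contact info
    snmpwalk compute-0-0 public laLoad              # load averages (UCD MIB)
    snmpwalk compute-0-0 public hrSWInstalledName   # installed software (Host Resources MIB)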
36. Ganglia (http://www.millennium.berkeley.edu/ganglia/)
- Dendrite on each node
  - Multicasts the state of the machine on significant changes
    - Load averages, disk consumption, memory, etc.
  - Beacons every minute if there are no significant deltas
- Axons
  - Collection daemons (at least one per cluster)
- Ganglia client: sorts the measured variables to find a set of hosts that match a desired criterion
  - E.g., X MB free memory, load below Y
  - Can act as a vexec resource for Rexec
37. Ganglia Text Output

    phil@slic01$ /usr/sbin/ganglia load_one
    compute-1-5   0.07
    compute-0-9   0.08
    compute-1-3   0.14
    compute-2-0   0.15
    compute-2-8   0.18
    compute-2-5   0.27
    frontend-0    0.36
    compute-3-11  0.82
    compute-23    1.06
    compute-22    1.19
    compute-3-4   1.96
    compute-3-9   1.99
    compute-3-10  1.99
    compute-3-2   2.00
    compute-3-3   2.09
    compute-3-7   2.12
    compute-3-5   2.99
    compute-3-6   3.0
38. Hidden Software
39. Some Tools That Assist in Automation
- Users generally will not see these tools
  - Profile scripts run at a user's first login
  - Usher/Patron (Myrinet port reservation)
  - insert-ethers (adds nodes to a cluster)
  - cluster-sql package
    - Reports to build service-specific config files
  - cluster-admin
    - Node reinstallation
    - Creating accounts (NIS, auto.home map creation)
  - cluster-ssl
    - Generates keys for SSL authentication (rexec)
40. Usher/Patron
- Tool to simplify using installed Myricom hardware
- Eliminates a central database to decide which Myrinet ports are currently in use
  - (The Myricom driver is installed with a separate source RPM)
- The usher daemon runs on each compute node; it takes reservation requests for access to the limited set of Myrinet ports (RPC-based)
  - Reservations time out if not claimed
- Patron works with usher to request and claim ports
- Integrated with MPI-Launch
  - Automatically creates the node file needed for MPICH-GM
41. First Login Profile Scripts
- On first login, all users, including root, are prompted to build an SSH public/private key pair (see the sketch below)
  - Makes sense because SSH is the only way to gain login access to the nodes
- NIS is updated (passwd, auto.home, etc.)
- Additionally, if it is the first time root has logged in, an SSL certificate authority is generated which is used to sign users' SSL certificates
- The SSL certificate and root's public SSH key are then propagated to the compute node kickstart file
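
The key-pair step the profile script drives amounts to something like the following (the key type, file paths, and authorized_keys step are standard OpenSSH practice; the exact prompts Rocks issues may differ):

    # Roughly what the first-login profile script walks a user through.
    ssh-keygen -t rsa -f $HOME/.ssh/id_rsa                      # generate the key pair
    cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys     # allow password-less ssh to the nodes
    chmod 600 $HOME/.ssh/authorized_keys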
42. insert-ethers
- Used to populate the nodes MySQL table
- Parses a file (e.g., /var/log/messages) for DHCPDISCOVER messages (see the sketch below)
- Extracts the MAC addr and, if it is not in the table, adds the MAC addr and hostname to the table
- For every new entry
  - Rebuilds /etc/hosts and /etc/dhcpd.conf
  - Reconfigures NIS
  - Restarts DHCP and PBS
- Hostname is
  - <basename>-<cabinet>-<chassis>
  - Configurable to change the hostname
    - E.g., when adding new cabinets
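
The discovery step can be pictured with a one-liner of this flavor (the real insert-ethers also records the MAC in MySQL and assigns a hostname; the log format shown is typical ISC dhcpd syslog output):

    # List the unique MAC addresses that have sent DHCPDISCOVER requests so far.
    grep DHCPDISCOVER /var/log/messages |
      sed -e 's/.*DHCPDISCOVER from \([0-9a-f:]*\).*/\1/' |
      sort -u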
43. dhcp_options: One More Important MySQL Table
- Created by the frontend kickstart file (based on user input from the Rocks configuration web page)
- Used by makedhcp to construct the header in /etc/dhcpd.conf
44. Configuration Derived from the Database
- (Diagram) Automated node discovery: insert-ethers watches nodes 0 through N and populates the MySQL DB
- Report generators then read the database to produce service-specific files (see the sketch below):
  - makehosts -> /etc/hosts
  - makedhcp -> /etc/dhcpd.conf
  - pbs-config-sql -> PBS node list
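
For a flavor of what such a report generator does, a sketch of the makehosts idea (the database name, table, and columns are guesses for illustration, not the actual Rocks schema):

    # Hypothetical /etc/hosts report pulled from MySQL (schema names are assumed).
    mysql --batch --skip-column-names cluster \
      -e 'SELECT ip, name FROM nodes ORDER BY name' |
      awk '{ printf "%s\t%s\n", $1, $2 }' > /etc/hosts.new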
45. Futures
- Attack the storage problem
  - Keep the global view of storage that NFS gives us, but address the scalability problem
- Source high bandwidth from the cluster into the WAN
- Apply our cluster bring-up automation to easily attach clusters to the grid
- Continue to improve cluster monitoring
  - Configure a monitoring GUI (e.g., NetSaint) to extract data from Ganglia
  - Get node health (fan speed, temperature, disk error rate) into Ganglia
- Technologies
  - Processors: IA-64 and Alpha
  - Networks: Infiniband
  - 2.4 kernel (will rev our distribution at RedHat 7.1)
46. Lab
47. Front-end Node
- Node seen by the external world
  - Performs Network Address Translation (NAT)
- NFS server(s) for user home areas
  - Beware of scalability issues!
- Compilers, libraries
- Configuration for nodes
  - DHCP server, NIS domain controller, NTP server, web server, MySQL server
- Installation server for defining the system on nodes
- Method(s) to start jobs on compute nodes
  - Batch system (PBS + Maui)
  - Interactive launching of jobs
48. Installing a Front-end Machine
- Build ks.cfg from https://rocks.npaci.edu/site.htm
  - Define your root password
  - NIS domain
  - Public IP address
- Boot the CD
  - Full ISO image available for download. Burn your own!
- Enter "frontend" at the boot prompt
- Sit back. Time varies depending on the speed of the CPU and CDROM of the frontend
  - The entire distribution is being copied to /home/install/cluster-dist
49. Building a Distribution with cluster-dist
- Directory structure
- Build the mirror
  - From the mirror host
  - Emulates mirroring from rocks
- Build the distro
  - cluster-dist dist
50. Installing Compute Nodes
- Log in as root on the frontend
- Execute tail -f /var/log/messages and insert-ethers
- Back on the compute node
  - Boot the CD
- From a laptop
  - Examine the MySQL database through a browser
51. Reinstalling Compute Nodes
- With shoot-node
  - Frontend-driven reinstallation
- With power on/off
  - Hail Mary to recover from a bad software state
52. Structure of a Rocks Kickstart File
- site.h
- Look at ks.cfg.in (see the skeleton below)
  - preamble
  - %packages
  - %pre
  - %post
  - %post and %post --nochroot
  - %include
- For more info
  - http://redhat.com/support/manuals/RHL-7-Manual/ref-guide/ch-kickstart2.html
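
A bare-bones Kickstart skeleton showing those sections (everything in it is illustrative: the install URL, partitioning, and package names are assumptions, not the generated Rocks ks.cfg):

    # Minimal RedHat Kickstart skeleton (illustrative only).
    # --- preamble: installation method, partitioning, networking, ...
    install
    url --url http://frontend-0/install/cluster-dist   # install server path assumed
    clearpart --all
    part / --size 4096

    %packages            # package set installed on the node
    @ Base
    openssh-server

    %pre                 # shell commands run before installation starts

    %post                # commands run inside the freshly installed system (chroot)
    echo "node configured" > /root/ks-post.log

    %post --nochroot     # commands run outside the chroot, against the installer environment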
53. Updating and Augmenting Your Distribution with cluster-dist
- Add a new package to the compute node in ks.cfg.in
  - Kickstart the compute node
- Add a new package to the distro, then have the compute node pick it up
  - Put a package in contrib
  - Build distro -> kickstart
- Add a new local package (usher)
  - Bump the version number
  - Build the RPM
  - Show where the RPM gets put
  - Build distro
  - Show the new home for the usher RPM
  - Build distro -> kickstart
54. Adding Users with create-account
- As root
  - create-account yoda
  - passwd yoda
  - ssl-genuser yoda
- ypcat
  - Shows how the data made it into NIS
  - ypcat passwd
  - ypcat auto.home
55. Look at Cluster Nodes with Netscape
56. Launching and Controlling Jobs
- Rexec
  - rexec sleeper
- PBS
  - Add a new queue to Maui
57. Cluster Monitoring with ganglia
- Look at multiple different values from ganglia
  - ganglia
    - Lists commands
  - ganglia load_one
  - ganglia cpu_system cpu_idle cpu_nice
- Go to millennium.berkeley.edu to see a live demo
58. Auto Configuration of a Node
- Pop in a Myrinet card
- After reinstallation, a startup script uses lspci to determine if a Myrinet card is on the PCI bus (see the sketch below)
- If yes
  - Automatically compile the Myrinet device driver from the source RPM
  - Install the Myrinet module
  - The module is then guaranteed compatible with the running kernel
- Eliminates managing binary device drivers for different kernel configurations
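
The detection-and-rebuild idea can be sketched like this (the startup script itself, the GM source RPM name, and the module name are assumptions; only the lspci-then-rebuild pattern comes from the slide):

    # Sketch of Myrinet auto-configuration at boot (not the actual Rocks script).
    if /sbin/lspci | grep -qi myri; then
        rpm --rebuild /usr/src/redhat/SRPMS/gm-*.src.rpm     # build the driver against the running kernel
        rpm -ivh /usr/src/redhat/RPMS/i386/gm-*.i386.rpm     # install the freshly built binary RPM
        /sbin/insmod gm                                      # load the Myrinet module (name assumed)
    fi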