Transcript and Presenter's Notes

Title: Introduction to the NPACI Rocks Clustering Toolkit: Building Manageable COTS Clusters


1
Introduction to the NPACI Rocks Clustering
ToolkitBuilding Manageable COTS Clusters
  • Philip M. Papadopoulos,
  • Mason J. Katz,
  • Greg Bruno

2
Who We Are
  • Philip Papadopoulos
  • Parallel message passing expert (PVM and Fast
    Messages)
  • Mason Katz
  • Network protocol expert (x-kernel, Scout and Fast
    Messages)
  • Greg Bruno
  • 10 years of experience with NCR's Teradata Systems
  • Builders of clusters which drive very large
    commercial databases
  • All three of us have worked together for the past
    2 years building NT and Linux clusters

3
Who is NPACI Rocks?
  • Key people from the UCB Millennium Group
  • Prof. David Culler
  • Eric Fraser
  • Brent Chun
  • Matt Massie
  • Albert Goto
  • People from SDSC
  • Bruno, Katz, Papadopoulos (Distributed Computing
    Group)
  • Kenneth Yoshimoto (Scheduling)
  • Keith Thompson, Bill Link (Grid)
  • Storage Resource Broker (SRB) Group
  • You !

4
Why We Do Clusters? Frankly, we love it
  • Building high-performance systems which have
    killer price/performance is a gas
  • NPACI is about building pervasive infrastructure.
    Supported, transferable cluster infrastructure
    was missing from our portfolio.
  • Enabling others to build their own clusters and
    do scientific simulation is a blast.
  • We wanted a management system that would allow us
    to rapidly experiment with new low-level system
    software (and recover when things didn't go quite
    right)
  • Protect ourselves from ourselves?

5
What We'll Cover
  • Rocks philosophies
  • Hardware components
  • Software packages
  • Theory and practice
  • Lab

6
What We Thought We Learned
  • Clusters are phenomenal price/performance
    computational engines, but are hard to manage
  • Cluster management is a full-time job which gets
    linearly harder as one scales out.
  • Heterogeneous Nodes are a bummer (network,
    memory, disk, MHz, current kernel version).

7
You Must Unlearn What You Have Learned
8
Installation/Management
  • Need to have a strategy for managing cluster
    nodes
  • Pitfalls
  • Installing each node by hand
  • Difficult to keep software on nodes up to date
  • Management increases as node count increases
  • Disk Imaging techniques (e.g., VA Disk Imager)
  • Difficult to handle heterogeneous nodes
  • Treats OS as a single monolithic system
  • Specialized installation programs (e.g., IBM's
    LUI or RWCP's multicast installer)
  • Let Linux packaging vendors do their job
  • Penultimate
  • RedHat Kickstart
  • Define packages needed for OS on nodes, kickstart
    gives a reasonable measure of control.
  • Need to fully automate to scale out (Rocks)

9
Scaling out
  • Evolve to management of two systems
  • The front end(s)
  • Log in host
  • Users' home areas, passwords, groups
  • Cluster configuration information
  • The compute nodes
  • Disposable OS image
  • Let software manage node heterogeneity
  • Parallel (re)installation
  • Cluster-wide configuration files derived through
    reports from a MySQL database (DHCP, hosts, PBS
    nodes, ...)

10
NPACI Rocks Toolkit - rocks.npaci.edu
  • Techniques and software for easy installation,
    management, monitoring and update of clusters
  • Installation
  • Bootable CD and floppy which contain all the
    packages and site configuration info to bring up
    an entire cluster
  • Management and update philosophies
  • Trivial to completely reinstall any (all) nodes.
  • Nodes are 100% automatically configured
  • Use of DHCP, NIS for configuration
  • Use RedHat's Kickstart to define the set of
    software that defines a node.
  • All software is delivered in a RedHat Package
    (RPM)
  • Encapsulate configuration for a package (e.g.,
    Myrinet)
  • Manage dependencies
  • Never try to figure out if node software is
    consistent
  • If you ever ask yourself this question, reinstall
    the node

11
More Rocksisms
  • Leverage widely-used (standard) software wherever
    possible
  • Everything is in RedHat Packages (RPM)
  • RedHat's kickstart installation tool
  • SSH, Telnet, existing open source tools
  • Write only the software that we need to write
  • Focus on simplicity
  • Commodity components
  • For example x86 compute servers, Ethernet,
    Myrinet
  • Minimal
  • For example no additional diagnostic or
    proprietary networks
  • Rocks is a collection point of software for
    people building clusters
  • It will evolve to include cluster software and
    packaging from more than just SDSC and UCB
  • <your-software.i386.rpm and your-software.src.rpm
    here>

12
Hardware
13
Many variations on a basic layout
(Diagram: basic cluster layout showing front-end node(s), the
public Ethernet, a Fast-Ethernet switching complex, a gigabit
network switching complex, and power distribution units
(net-addressable units as an option).)
14
Frontend and Compute Nodes
  • Choices
  • Uni or Dual, Intel Processors
  • Linux is, in reality, an Intel OS
  • Rackmount vs. Desktop chassis
  • Rackmount essential for large installations
  • SCSI vs. IDE
  • Performance is a non-issue
  • Price and serviceability are the real
    considerations
  • Note: rackmount servers usually are SCSI
  • User integration versus system integrator
  • Our Nodes
  • Dual PIIIs (733, 800 and 933 MHz; Compaq, IBM)
  • 1.0 GHz as we expand
  • ½ GB/node (1 GB would be better)
  • Hot swap SCSI on these nodes
  • We integrate our hardware

15
Networks
  • High-performance networks
  • Myrinet, Giganet, Servernet, Gigabit Ethernet,
    etc.
  • Ethernet only -> Beowulf-class
  • Management Networks (Light Side)
  • Ethernet 100 Mbit
  • Management network used to manage compute nodes
    and launch jobs
  • Nodes are in Private IP (192.168.x.x) space,
    front-end does NAT
  • Ethernet 802.11b
  • Easy access to the cluster via laptops
  • Plus, wireless will change your life
  • Evil Management Networks (Dark Side)
  • A serial console network is not necessary
  • A KVM (keyboard/video/mouse) switching system
    adds too much complexity, cables, and cost

16
Power Distribution
Ethernet port
Power sockets
  • Highly desirable to have network addressable
    power distribution units
  • Can remotely power cycle compute nodes
  • Instrumented, which helps determine power needs

17
Other Helpful Hardware: When All Else Fails
  • When a node appears to be sick
  • Issue a reinstall command over the network
  • If still dead, instruct the network addressable
    power distribution unit to power cycle the node
    (this reinstalls the OS)
  • If still dead, roll up the crash cart
  • Monitor and keyboard

18
Leatherman: A Must-Have For Any Self-Respecting
Clusters Person
19
Current Configuration of the Meteor Cluster
  • Rocks v2.0
  • 2 Frontends
  • 100 nodes
  • 50 GB RAM
  • Ethernet
  • For management
  • Myrinet
  • Servernet
  • Working through some bugs

20
Software
21
RedHat Supplied Software
  • 7.0 Base + Updates
  • RPM
  • RedHat Package Manager
  • Kickstart
  • Method for unattended server installation

22
Community Software
  • Myricom's General Messaging (GM)
  • MPICH
  • GM device
  • Ethernet device
  • Portable Batch System
  • Maui
  • PVM
  • Intel's Math Kernel Library
  • Math functions tuned for Intel processors

23
NPACI Rocks Software
  • Cluster-dist
  • A tool used to assemble the latest RedHat,
    community and Rocks packages into a distribution
    which is used by compute nodes during
    reinstallation
  • Shoot-node and eKV (Ethernet Keyboard and Video)
  • Initiate a compute node reinstallation
  • Monitor compute node reinstallations over
    Ethernet with telnet
  • Cluster-admin and cluster-ssl
  • Tools to create user accounts and user SSL
    certificates
  • Rexec (UC Berkeley)
  • Launch and control parallel jobs (SSL-based
    authentication)
  • Ganglia (UC Berkeley)
  • Cluster monitoring

24
Software Details
25
Cluster-dist
  • Integrate RedHat Packages from
  • RedHat (mirror): base distribution + updates
  • Contrib directory
  • Locally produced packages
  • Packages from rocks.npaci.edu
  • Produces a single updated distribution that
    resides on front-end
  • Is a RedHat Distribution with patches and updates
    applied
  • Different Kickstart files and different
    distributions can co-exist on a front-end to add
    flexibility in configuring nodes.

26
Remote re-installation: Shoot-node and eKV
  • Rocks provides a simple method to remotely
    reinstall a node (once it has been installed the
    first time); see the example below
  • By default, hard power cycling will cause a node
    to reinstall itself.
  • With no serial (or KVM) console, we are able to
    watch a node as it installs
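A minimal example of what this looks like from the frontend (the
exact shoot-node invocation is an assumption; check your Rocks
version):

    # run as root on the frontend; node names are illustrative
    shoot-node compute-0-0 compute-0-1
    # eKV then lets you watch each node's kickstart over plain
    # Ethernet with telnet instead of a serial/KVM console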

27
Remote re-installation: Shoot-node and eKV
(Screenshot: remotely starting reinstallation on two nodes,
192.168.254.253 and 192.168.254.254.)
28
Starting Jobs
  • SSH-based MPI-Launch
  • Provides full integration with Myrinet
    reservation capability of Usher/Patron
  • SSL-Based Rexec
  • Better control of jobs on remote nodes
  • Sane signal propagation
  • Batch System: PBS + Maui
  • PBS provides queue definition and node monitoring
  • Maui has rich scheduling policies
  • Standing and Future Reservations
  • Query the number of nodes available now

29
PBS: Portable Batch System
  • Three standard components to PBS
  • MOM: node health reporting daemon and job launch
    daemon on every node
  • Server: on the front-end; queue definition and
    aggregation of node information
  • Scheduler: policies for what job to run out of
    which queue at what time
  • We added a fourth
  • Configuration: get cluster node configuration
    from our SQL database.

30
PBS RPM Packaging
  • Repackaged PBS (Sane packaging enhancements)
  • Added chkconfig-compatible start-up scripts (see
    the sketch below)
  • 4 packages
  • pbs (server and scheduler) (should be divided
    again)
  • pbs-mom
  • pbs-config-sql (Python script to generate
    database report)
  • pbs-common (files needed by all three packages)
  • A Rocks 2.0 base installation (automatically)
    defines a default queue with all nodes being
    available in the queue
  • http://pbs.mrj.com is a good starting point for
    PBS
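To illustrate what a chkconfig-compatible start-up script looks
like, here is a generic sketch (not the actual script shipped in
the pbs-mom package; paths are illustrative):

    #!/bin/sh
    # chkconfig: 345 90 10
    # description: start and stop the PBS MOM daemon
    case "$1" in
      start)   /usr/apps/pbs/sbin/pbs_mom ;;
      stop)    killall pbs_mom ;;
      restart) $0 stop; $0 start ;;
      *)       echo "Usage: $0 {start|stop|restart}"; exit 1 ;;
    esac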

31
PBS Server defaults (and changing them)
  • Startup script: /etc/rc.d/init.d/pbs-server
    start
  • /usr/apps/pbs/pbs.default
  • Sourced every time PBS is started (see the sketch
    after this list)
  • $Id: pbs.default.in,v 1.5 2001/02/16 19:59:38
    bruno Exp $
  • A basic pbs setup that creates a queue called
    default and starts scheduling
  • Create queues and set their attributes.
  • Create and define queue default
  • 1 node default, 1hr walltime
  • create queue default
  • set queue default queue_type = Execution
  • set queue default resources_default.nodes = 1
  • set queue default resources_default.walltime =
    1:00:00
  • set queue default enabled = True
  • set queue default started = True
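How the start-up script applies this file is not shown here; one
plausible sketch (an assumption, not the actual Rocks script) is
simply to replay it through qmgr after pbs_server comes up:

    # replay the saved queue/server configuration on every start
    /usr/apps/pbs/bin/qmgr < /usr/apps/pbs/pbs.default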

32
pbs.default (cont'd)
  • Set server attributes.
  • Assume maui scheduler will be installed
  • set server managers = maui@frontend-0
  • set server operators = maui@frontend-0
  • set server default_queue = default
  • set server log_events = 511
  • set server mail_from = adm
  • set server scheduler_iteration = 600
  • set server scheduling = false
  • PBS will ignore queue creation if a queue already
    exists.

33
Modifying the default setup (simple queue
creation)
  • Use qmgr to create a new queue
  • /usr/apps/pbs/bin/qmgr
  • Max open servers: 4
  • Qmgr: create queue single
  • Qmgr: set queue single queue_type = execution
  • Qmgr: set queue single enabled = true
  • Qmgr: set queue single acl_hosts = compute-1-0
  • Qmgr: set queue single started = true
  • Use qmgr command to save configuration
  • /usr/apps/pbs/bin/qmgr -c "print server" >
    /usr/apps/pbs/pbs.default

34
Maui Scheduler
  • We use Maui as our scheduler for PBS
  • mauischeduler.sourceforge.net
  • http://havi.supercluster.org/documentation/maui
  • Add the single queue definition so that Maui
    understands it. This is in /usr/spool/maui/maui.cfg
  • SRNAME[0] single
  • SRHOSTLIST[0] compute-1-0
  • Restart Maui
  • /etc/rc.d/init.d/maui restart
  • Submit a job to PBS
  • /usr/apps/pbs/bin/qsub -q single mytest.sh

35
Monitoring your cluster
  • PBS has a GUI called xpbsmon. Gives a nice
    graphical view of the up/down state of nodes
  • SNMP status
  • Use the extensive SNMP MIB defined by the Linux
    community to find out many things about a node
    (see the example after this list)
  • Installed software
  • Uptime
  • Load
  • Ganglia (UCB) IP Multicast-based monitoring
    system
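For example, the one-minute load averages live in the UCD load
table; a hedged sketch with the ucd-snmp/net-snmp tools (community
string, host name, and exact snmpwalk syntax depend on your
version and site):

    # UCD-SNMP-MIB laTable (.1.3.6.1.4.1.2021.10) holds load averages
    snmpwalk -v 1 -c public compute-0-0 .1.3.6.1.4.1.2021.10
    # uptime and installed packages are in HOST-RESOURCES-MIB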

36
Ganglia - http://www.millennium.berkeley.edu/ganglia/
  • Dendrite on each node
  • Multicasts state of the machine on significant
    changes
  • Load averages, disk consumption, memory, etc.
  • Beacons every minute, if no significant deltas
  • Axons
  • Collection daemons (at least one/cluster)
  • Ganglia client: sort the measured variables to
    find a set of hosts that match desired criteria
  • E.g., X MB free memory, load below Y
  • Can act as a vexec resource for Rexec.

37
Ganglia text output
  • phil@slic01 /usr/sbin/ganglia load_one
  • compute-1-5 0.07
  • compute-0-9 0.08
  • compute-1-3 0.14
  • compute-2-0 0.15
  • compute-2-8 0.18
  • compute-2-5 0.27
  • frontend-0 0.36
  • compute-3-11 0.82
  • compute-23 1.06
  • compute-22 1.19
  • compute-3-4 1.96
  • compute-3-9 1.99
  • compute-3-10 1.99
  • compute-3-2 2.00
  • compute-3-3 2.09
  • compute-3-7 2.12
  • compute-3-5 2.99
  • compute-3-6 3.0

38
Hidden Software
39
Some Tools that assist in automation.
  • Users generally will not see these tools
  • Profile scripts run at user's first login
  • Usher-patron (Myrinet port reservation)
  • Insert-ethers (Add nodes to a cluster)
  • Cluster-sql package
  • Reports to build service-specific config files
  • Cluster-admin
  • Node reinstallation
  • Creating accounts (NIS, auto.home map creation)
  • Cluster-ssl
  • Generate keys for SSL authentication (rexec)

40
Usher/Patron
  • Tool to simplify using installed Myricom Hardware
  • Eliminates the need for a central database to
    decide which Myrinet ports are currently in use
  • (Myricom driver installed with a separate source
    RPM)
  • Usher daemon runs on each compute node. Takes
    reservation requests for access to the limited
    set of Myrinet ports (RPC-based)
  • Reservations time out, if not claimed.
  • Patron works with usher to request and claim
    ports
  • Integrated with MPI-Launch
  • Automatically creates the node file needed for
    MPICH-GM

41
First Login Profile Scripts
  • On first login, all users, including root, are
    prompted to build an SSH public/private key pair
  • Makes sense because ssh is the only way to gain
    login access to the nodes
  • NIS is updated (passwd, auto.home, etc.)
  • Additionally, if it's the first time root has
    logged in, an SSL certificate authority is
    generated which is used to sign users' SSL
    certificates
  • The SSL certificate and root's public SSH key are
    then propagated to the compute node kickstart file

42
insert-ethers
  • Used to populate the nodes MySQL table
  • Parses a file (e.g., /var/log/messages) for
    DHCPDISCOVER messages
  • Extracts the MAC addr and, if not in the table,
    adds the MAC addr and a hostname to the table
    (see the sketch after this list)
  • For every new entry
  • Rebuilds /etc/hosts and /etc/dhcpd.conf
  • Reconfigures NIS
  • Restarts DHCP and PBS
  • Hostname is
  • <basename>-<cabinet>-<chassis>
  • Configurable to change hostname
  • E.g., when adding new cabinets
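The discovery step boils down to something like the following (a
simplified sketch of the idea, not the actual insert-ethers code;
the awk field assumes the usual "DHCPDISCOVER from <MAC> via <if>"
log format):

    # pull candidate MAC addresses out of dhcpd's log lines
    grep DHCPDISCOVER /var/log/messages | awk '{ print $(NF-2) }' | sort -u
    # insert-ethers then checks each MAC against the nodes table,
    # adds new ones with a generated hostname, rebuilds /etc/hosts
    # and /etc/dhcpd.conf, reconfigures NIS, and restarts dhcpd/PBS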

43
dhcp_options: One More Important MySQL Table
  • Created by the Frontend kickstart file (based on
    user input from Rocks configuration web page)
  • Used by makedhcp to construct the header in
    /etc/dhcpd.conf (see the example below)
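The generated header is essentially a standard ISC dhcpd subnet
declaration for the private network; the values below are purely
illustrative (on a real frontend they come from dhcp_options):

    # /etc/dhcpd.conf header (illustrative values)
    option domain-name "local";
    subnet 192.168.0.0 netmask 255.255.0.0 {
        option routers 192.168.254.254;
        next-server 192.168.254.254;   # kickstart/install server
    }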

44
Configuration Derived from Database
(Diagram: automated node discovery. insert-ethers watches nodes
0..N and populates the MySQL DB; the report generators makehosts,
makedhcp, and pbs-config-sql then produce /etc/hosts,
/etc/dhcpd.conf, and the PBS node list. A sketch of one such
report follows.)
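A sketch of one such report, assuming a nodes table with ip and
name columns in a database called cluster (the real schema and
names may differ):

    # emit /etc/hosts-style lines from the database
    mysql --batch --skip-column-names cluster \
      -e 'SELECT ip, name FROM nodes' | \
      awk '{ print $1 "\t" $2 }'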
45
Futures
  • Attack the storage problem
  • Keep the global view of storage that NFS gives
    us, but address the scalability problem
  • Source high bandwidth from the cluster into the
    WAN
  • Apply our cluster bring-up automation to easily
    attach clusters to the grid
  • Continue to improve cluster monitoring
  • Configure a monitoring GUI (e.g., NetSaint) to
    extract data from Ganglia
  • Get node health (Fan Speed, Temp., Disk Error
    rate) into Ganglia
  • Technologies
  • Processors: IA-64 and Alpha
  • Networks: Infiniband
  • 2.4 kernel (Will rev our distribution at RedHat
    7.1)

46
Lab
47
Front-end Node
  • Node seen by external world
  • Performs Network Address Translation (NAT)
  • NFS Server(s) for user home areas
  • Beware of scalability issues!
  • Compilers, libraries
  • Configuration for Nodes
  • DHCP Server, NIS Domain Controller, NTP Server,
    Web Server, MySQL Server
  • Installation Server for defining system on nodes
  • Method(s) to start jobs on compute nodes
  • Batch System (PBS + Maui)
  • Interactive launching of jobs

48
Installing a Front-end Machine
  • Build ks.cfg from https://rocks.npaci.edu/site.htm
  • Define your root password
  • NIS Domain
  • Public IP Address
  • Boot CD
  • Full ISO image for download. Burn your own!
  • Enter 'frontend' at the boot prompt.
  • Sit back. Time varies depending on speed of the
    CPU and CDROM of frontend
  • Entire distribution is being copied to
    /home/install/cluster-dist

49
Building a Distribution with cluster-dist
  • Directory structure
  • Build mirror
  • From mirror host
  • Emulates mirroring from rocks
  • Build distro
  • cluster-dist dist

50
Installing Compute Nodes
  • Log in as root to the frontend
  • Execute tail -f /var/log/messages and run
    insert-ethers
  • Back on the compute node
  • Boot CD
  • From laptop
  • Examine MySQL database through browser

51
Reinstalling Compute Nodes
  • With shoot-node
  • Frontend-driven reinstallation
  • With power on/off
  • Hail Mary to recover from bad software state

52
Structure of a Rocks Kickstart File
  • site.h
  • Look at ks.cfg.in (a skeleton sketch follows this
    list)
  • preamble
  • %packages
  • %pre
  • %post
  • %post and %post --nochroot
  • %include
  • For more info
  • http://redhat.com/support/manuals/RHL-7-Manual/ref-guide/ch-kickstart2.html
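A minimal skeleton of this kind of file (a generic RedHat
kickstart sketch, not the actual Rocks ks.cfg):

    # preamble: install method, language, partitioning, rootpw, ...
    install
    lang en_US
    %packages
    @ Base
    pbs-mom
    %pre
    # shell commands run before the packages are installed
    %post
    # shell commands run inside the freshly installed system (chroot)
    %post --nochroot
    # commands run in the installer environment instead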

53
Updating and Augmenting Your Distribution with
cluster-dist
  • Add new package to compute node in ks.cfg.in
  • Kickstart the compute node
  • Add new package to distro, then have compute node
    pick it up
  • Put a package in contrib
  • Build distro -> kickstart
  • Add a new local package (usher)
  • Bump version number
  • Build rpm
  • Show where the RPM gets put
  • Build distro
  • Show new home for usher RPM
  • Build distro -> kickstart

54
Adding Users with create-account
  • As root
  • create-account yoda
  • passwd yoda
  • ssl-genuser yoda
  • ypcat
  • Shows how data made it into NIS (example output
    below)
  • ypcat passwd
  • ypcat auto.home
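Illustrative output (the UID, password field, and /export/home
path are assumptions; your maps will differ):

    # ypcat passwd
    yoda:<crypted-password>:502:502::/home/yoda:/bin/bash
    # ypcat auto.home
    yoda    frontend-0:/export/home/yoda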

55
Look at cluster nodes with Netscape
  • Database info

56
Launching and Controlling Jobs
  • REXEC
  • Rexec sleeper
  • PBS
  • Add a new queue to Maui

57
Cluster Monitoring with ganglia
  • Look at multiple different values from ganglia
  • ganglia
  • Lists commands
  • ganglia load_one
  • ganglia cpu_system cpu_idle cpu_nice
  • Go to millennium.berkeley.edu to see live demo

58
Auto configuration of node
  • Pop in Myrinet Card
  • After reinstallation, a startup script uses lspci
    to determine if a Myrinet card is on the PCI bus
    (see the sketch below)
  • If yes
  • Automatically compile Myrinet device driver from
    source rpm
  • Install Myrinet module
  • Module is then guaranteed compatible with running
    kernel
  • Eliminates managing binary device drivers for
    different kernel configurations
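The logic amounts to roughly the following (a sketch; the actual
Rocks start-up script, GM package names, and paths may differ):

    # detect a Myrinet interface on the PCI bus
    if /sbin/lspci | grep -qi myri ; then
        # rebuild the GM driver from its source RPM against the running kernel
        rpm --rebuild /usr/src/redhat/SRPMS/gm-*.src.rpm
        rpm -ivh /usr/src/redhat/RPMS/i386/gm-*.i386.rpm
        /sbin/modprobe gm    # load the freshly built module
    fi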