Title: Leveraging Standard Core Technologies to Programmatically Build Linux Cluster Appliances
1. Leveraging Standard Core Technologies to Programmatically Build Linux Cluster Appliances
- University of Zurich
- May 5, 2003
2. Outline
- Problem definition
- What is so hard about clusters?
- Distinction between
- Software Packages (bits)
- System Configuration (functionality and state)
- Programmatic software installation with
- XML, SQL, HTTP, Kickstart
- Future Work
3. Build this cluster
- Build a 128 node cluster
- Known configuration
- Consistent configuration
- Repeatable configuration
- Do this in an afternoon
- Problems
- How to install software?
- How to configure software?
- We manage clusters with (re)installation
- So we care a lot about this problem
- Other strategies still must solve this
4. The Myth of the Homogeneous COTS Cluster
- Hardware is not homogeneous
- Different chipset revisions
- Chipset of the day (e.g. Linksys Ethernet cards)
- Different disk sizes (e.g. changing sector sizes)
- Vendors do not know this is happening!
- Entropy happens
- Hardware components fail
- Cannot replace with the same components past a single Moore cycle
- A cluster is not just compute nodes (appliances)
- Fileserver Nodes
- Management Nodes
- Login Nodes
5. What Heterogeneity Means
- Hardware
- Cannot blindly replicate machine software
- AKA system imaging / disk cloning
- Requires patching the system after cloning
- Need to manage system software at a higher level
- Software
- Subsets of a cluster have unique software configuration
- One golden image cannot build a cluster
- Multiple images replicate common configuration
- Need to manage system software at a higher level
6. Description-Based Software Installation
7. Packages vs. Configuration
[Figure: two halves combine to build appliances. One half is the RPMs, the collection of all possible software packages (AKA the distribution); the other is the kickstart file, the descriptive information to configure a node. Together they define appliances such as a compute node, IO server, or web server.]
8. Software Packages
[Same figure, highlighting the packages half: the RPMs, the collection of all possible software packages (AKA the distribution).]
9. System Configuration
[Same figure, highlighting the configuration half: the kickstart file, the descriptive information to configure a node.]
10. What is a Kickstart File?
- Setup & Packages (20%)
- cdrom
- zerombr yes
- bootloader --location mbr --useLilo
- skipx
- auth --useshadow --enablemd5
- clearpart --all
- part /boot --size 128
- part swap --size 128
- part / --size 4096
- part /export --size 1 --grow
- lang en_US
- langsupport --default en_US
- keyboard us
- mouse genericps/2
- timezone --utc GMT
- rootpw --iscrypted nrDq4Vb42jjQ.
- text
- install
- Post Configuration (80%)
- %post
- cat > /etc/nsswitch.conf << 'EOF'
- passwd: files
- shadow: files
- group: files
- hosts: files dns
- bootparams: files
- ethers: files
- EOF
- cat > /etc/ntp.conf << 'EOF'
- server ntp.ucsd.edu
- server 127.127.1.1
- fudge 127.127.1.1 stratum 10
- authenticate no
- driftfile /etc/ntp/drift
- EOF
11. Issues
- High-level description of software installation
- List of packages (RPMs)
- System configuration (network, disk, accounts, …)
- Post installation scripts
- De facto standard for Linux
- Single ASCII file
- Simple, clean, and portable
- Installer can handle simple hardware differences
- Monolithic
- No macro language (as of RedHat 7.3 this is changing)
- Differences require forking (and code replication)
- Cut-and-paste is not a code re-use model
12. XML Kickstart
13. It looks something like this
[Figure: visualization of the configuration graph.]
14. Implementation
- Nodes
- Single-purpose modules
- Kickstart file snippets (XML tags map to kickstart commands)
- Over 100 node files in Rocks
- Graph
- Defines interconnections for nodes
- Think OOP or dependencies (class, include)
- A single default graph in Rocks
- Macros
- SQL Database holds site and node specific state
- Node files may contain <var name="state"/> tags (see the sketch below)
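A minimal sketch of how this macro step could work, assuming the site- and node-specific state is available as flat name/value pairs. The table name app_globals and the variable name Kickstart_PublicNTPHost are illustrative, not the actual Rocks schema, and sqlite3 stands in for the cluster database:

import re
import sqlite3

def load_vars(db_path):
    # Read site- and node-specific state as a flat name -> value mapping.
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, value FROM app_globals").fetchall()
    conn.close()
    return dict(rows)

def expand_vars(snippet, variables):
    # Replace each <var name="X"/> tag with the value bound to X.
    return re.sub(r'<var\s+name="([^"]+)"\s*/>',
                  lambda m: variables[m.group(1)], snippet)

print(expand_vars('server <var name="Kickstart_PublicNTPHost"/>',
                  {"Kickstart_PublicNTPHost": "ntp.ucsd.edu"}))
# prints: server ntp.ucsd.edu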
15. Composition
- Aggregate Functionality
- Scripting
- IsA perl-development
- IsA python-development
- IsA tcl-development
16. Functional Differences
- Specify only the deltas
- Desktop IsA
- Standalone
- Laptop IsA
- Standalone
- Pcmcia
17. Architecture Differences
- Conditional inheritance
- Annotate edges with target architectures
- if i386
- Base IsA lilo
- if ia64
- Base IsA elilo
18. Putting it all together
- Complete appliances (compute, NFS, frontend, desktop, …)
- Some key shared configuration nodes (slave-node, node, base)
19. Sample Node File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@KICKSTART_DTD@" [<!ENTITY ssh "openssh">]>
<kickstart>
  <description>
  Enable SSH
  </description>
  <package>&ssh;</package>
  <package>&ssh;-clients</package>
  <package>&ssh;-server</package>
  <package>&ssh;-askpass</package>
  <post>
  <!-- default client setup -->
  cat > /etc/ssh/ssh_config << 'EOF'
  Host *
      ForwardX11 yes
      ForwardAgent yes
  EOF
  chmod o+rx /root
  mkdir /root/.ssh
  chmod o+rx /root/.ssh
  </post>
</kickstart>
20. Sample Graph File
<?xml version="1.0" standalone="no"?>
<!DOCTYPE kickstart SYSTEM "@GRAPH_DTD@">
<graph>
  <description>
  Default Graph for NPACI Rocks.
  </description>
  <edge from="base" to="scripting"/>
  <edge from="base" to="ssh"/>
  <edge from="base" to="ssl"/>
  <edge from="base" to="lilo" arch="i386"/>
  <edge from="base" to="elilo" arch="ia64"/>
  <edge from="node" to="base" weight="80"/>
  <edge from="node" to="accounting"/>
  <edge from="slave-node" to="node"/>
  <edge from="slave-node" to="nis-client"/>
  <edge from="slave-node" to="autofs-client"/>
  <edge from="slave-node" to="dhcp-client"/>
  <edge from="slave-node" to="snmp-server"/>
  <edge from="slave-node" to="node-certs"/>
  <edge from="compute" to="slave-node"/>
  <edge from="compute" to="usher-server"/>
  <edge from="master-node" to="node"/>
  <edge from="master-node" to="x11"/>
  <edge from="master-node" to="usher-client"/>
</graph>
21. Cluster SQL Database
22. Nodes and Groups
[Figure: the Nodes table joins to the Memberships table.]
23. Groups and Appliances
[Figure: the Memberships table joins to the Appliances table.]
24. Simple key-value pairs
- Used to configure DHCP and to customize appliance kickstart files (a schema sketch follows)
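A sketch of the kind of schema the last three slides describe, again using sqlite3 as a stand-in for the cluster database; table and column names are illustrative, not the real Rocks schema:

import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
  CREATE TABLE appliances  (id INTEGER PRIMARY KEY, name TEXT);
  CREATE TABLE memberships (id INTEGER PRIMARY KEY, name TEXT,
                            appliance REFERENCES appliances(id));
  CREATE TABLE nodes       (id INTEGER PRIMARY KEY, name TEXT, mac TEXT,
                            ip TEXT, membership REFERENCES memberships(id));
  CREATE TABLE app_globals (name TEXT, value TEXT);  -- simple key-value pairs
""")
db.execute("INSERT INTO appliances VALUES (1, 'compute')")
db.execute("INSERT INTO memberships VALUES (1, 'Compute', 1)")
db.execute("INSERT INTO nodes VALUES "
           "(1, 'compute-0-0', '00:50:8b:01:02:03', '10.1.255.254', 1)")
db.execute("INSERT INTO app_globals VALUES "
           "('Kickstart_PublicNTPHost', 'ntp.ucsd.edu')")

# Map a DHCP client (identified by MAC) to the appliance graph to build:
print(db.execute("""
  SELECT nodes.name, appliances.name
  FROM nodes JOIN memberships ON nodes.membership = memberships.id
             JOIN appliances  ON memberships.appliance = appliances.id
  WHERE nodes.mac = ?""", ("00:50:8b:01:02:03",)).fetchone())
# prints: ('compute-0-0', 'compute')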
25. Putting it together
26. Space-Time and HTTP
[Space-time diagram: a node appliance requests DHCP and receives an IP address and a kickstart URL from a frontend/server; it sends a kickstart request, and the server generates the file (kpp and kgen, backed by the SQL DB); the node then requests packages, the server serves them, and the node installs them, runs post configuration, and reboots.]
- HTTP
- Kickstart URL (generator) can be anywhere
- Package server can be (a different) anywhere (see the sketch below)
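A sketch of the installer-side sequence in the diagram, with placeholder hostnames and package names; the point is only that both servers speak plain HTTP and need not be the same machine, or even inside the cluster:

from urllib.request import urlopen

# 1. DHCP hands the node an IP address and a kickstart URL.
kickstart_url = "http://frontend.example.org/install/kickstart.cgi"

# 2. Fetch the kickstart file; the server runs the graph traversal and
#    macro expansion (kpp/kgen + SQL DB) on demand and returns plain text.
kickstart = urlopen(kickstart_url).read().decode()

# 3. Pull packages, possibly from a different (closer or better-connected)
#    HTTP server, then install them, run the %post scripts, and reboot.
package_server = "http://mirror.example.org/rocks/RPMS/"
rpm = urlopen(package_server + "openssh-3.1p1-2.i386.rpm").read()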
27. Practice
28. 256-Node Scaling
- Attempt a TOP500 run on two fused 128-node PIII (1 GHz, 1 GB mem) clusters
- 100 Mbit Ethernet, Gigabit to the frontend
- Myrinet 2000; 128-port switch on each cluster
- Questions
- What LINPACK performance could we get?
- Would Rocks scale to 256 nodes?
- Could we set up, tear down, and run benchmarks in the allotted 48 hours?
- SDSC's TeraGrid Itanium2 system is about this size
29. Setup
[Diagram: a new frontend joins the two 128-node clusters (120 nodes each on Myrinet) via 8 Myrinet cross connects.]
- Fri night: built new frontend; physically rewired Myrinet, added Ethernet switch
- Sat: initial LINPACK runs and debugging of hardware failures; 240-node Myrinet run
- Sun: submitted 256-node Ethernet run, re-partitioned clusters, complete re-installation (40 min)
30. Some Results
240 dual PIII (1 GHz, 1 GB), Myrinet
- 285 GFlops
- 59.5% of peak
- Over 22 hours of continuous computing
31. Installation, Reboot, Performance
- < 15 minutes to reinstall a 32-node subcluster (rebuilt the Myrinet driver)
- 2.3 min for a 128-node reboot
[Chart: timeline of a 32-node re-install, marking Start, Reboot, Finish, and Start HPL.]
32. Future Work
- Other backend targets
- Solaris Jumpstart
- Windows Installation
- Supporting on-the-fly system patching
- Cfengine approach
- But using the XML graph for programmability
- Traversal order
- Subtleties with order of evaluation for XML nodes
- Ordering requirements ≠ code reuse requirements
- Dynamic cluster re-configuration
- Node re-targets appliance type according to system need
- Autonomous clusters?
33. Summary
- Installation/customization is done in a straightforward, programmatic way
- Leverages existing standard technologies
- Scaling is excellent
- HTTP is used as a transport for reliability/performance
- Configuration server does not have to be in the cluster
- Package server does not have to be in the cluster
- (Sounds grid-like)
34. www.rocksclusters.org