Title: Administration Tools for Managing Large Scale Linux Cluster
1. Administration Tools for Managing Large Scale Linux Cluster
- CRC, KEK, Japan
- S. Kawabata, A. Manabe
- atsushi.manabe@kek.jp
2. Linux PC Clusters in KEK
3. PC Clusters 1 and 2
- PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
- PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
4. PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
5. PC Cluster 4 (neutron simulation)
- Fujitsu TS225, 50 nodes
- Pentium III 1GHz x 2CPU 
- 512MB memory 
- 31GB disk 
- 100BaseTX x 2 
- 1U rack-mount model 
- RS232C x2 
- Remote BIOS setting 
- Remote reset/power-off 
6. PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
7. PC Cluster 6: blade server (3U), LP Pentium III 700 MHz, 40 CPUs (40 nodes)
8. PC clusters
- More than 400 nodes (>800 CPUs) of Linux PC clusters are already installed.
- Only middle-size or larger PC clusters are counted.
- A major experiment group (Belle) plans to install several hundred blade-server nodes this year.
- All PC clusters are managed by the individual user groups themselves.
9. Center Machine (KEK CRC)
- Currently, the machines in the KEK Computing Research Center (CRC) are UNIX (Solaris, AIX) servers.
- We plan to have a >1000-node Linux computing cluster in the near future (2004).
- It will be installed under a 4-year rental contract (HW update every 2 years?).
10. Center Machine
- The system will be shared among many user groups (not dedicated to only one group).
- Their demand for CPU power varies from month to month (high demand before an international conference, and so on).
- Of course, we use a load-balancing batch system.
- Big groups use their own software frameworks.
- Their jobs run only under restricted versions of the OS (Linux), middleware, and configuration.
11. R&D system
- Frequent changes of system configuration / CPU partitioning.
- To manage a PC cluster of this size, with such user requests, we need sophisticated admin tools.
12. Necessary admin tools
- System (SW) installation/update
- Configuration
- Status monitoring / system health check
- Command execution
13. Installation tool
14. Installation tool
- Two types of installation tool:
- Disk cloning
- Application package installer
- The system (kernel) is an application in this terminology.
15. Installation tool (cloning)
- Image cloning: install the system and applications on a master host, then copy the master's disk image to every node.
16. Installation tool (package installer)
(Diagram: clients send requests to a package server; the server returns images and control data, backed by a package information DB and a package archive.)
17. Remote Installation via NW (public domain software)
- Cloning a disk image:
- SystemImager (VA) http://systemimager.sourceforge.net/
- CATS-i (Soongsil Univ.)
- CloneIt http://www.ferzkopp.net/Software/CloneIt/
- Commercial: ImageCast, Ghost, ...
- Packages/applications installation:
- Kickstart + rpm (RedHat)
- LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
- Lucie (TITech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
- LCFGng, Arusha
18. Dolly+
- We developed dolly+, an installer that clones disk images via the network.
- WHY ANOTHER?
- We install/update, maybe frequently (according to user needs), 100-1000 nodes simultaneously.
- Making packages for our own software is tedious.
- Traditional server/client software suffers from a server bottleneck.
- Multicast copy of a GB-size image seems unstable (no free software?).
19. (Few) server - (many) client model
- The server can be a daemon process (you don't need to start it by hand).
- Performance is not scalable against the # of nodes:
- server bottleneck, network congestion.
Multicasting or broadcasting
- No server bottleneck.
- Gets the maximum performance out of a network whose switch fabrics support multicasting.
- A node failure does not affect the whole process very much, so it can be robust.
- But a failed node needs re-transfer, so speed is governed by the slowest node, as in a RING topology.
- Not TCP but UDP, so the application must take care of transfer reliability.
20. Dolly and Dolly+
- Dolly+
- A Linux application to copy/clone files and/or disk images among many PCs through a network.
- Dolly was originally developed by the CoPs project at ETH (Switzerland) and is open software.
- Dolly+ features:
- Transfers/copies sequential files (no 2 GB size limitation) and/or normal files (optional decompress and untar on the fly) via a TCP/IP network.
- Virtual RING network connection topology, to cope with the server bottleneck problem.
- Pipeline and multi-threading mechanism for speed-up.
- Fail recovery mechanism for robust operation.
21. Dolly+: virtual ring topology
- The master is the host having the original image; the physical network connection is whatever you like.
- Logically, dolly+ makes a ring chain of the nodes, specified in dolly's config file, and sends data node to node, bucket-relay style.
- Though each transfer is only between two adjacent nodes, this utilizes the maximum performance of a switching network with full-duplex ports.
- Good for a network complex of many switches.
(Figure: node PCs and network hubs/switches; the physical connections vs. the logical (virtual) ring connection.)
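To make the bucket relay concrete, here is a minimal Python sketch of what one ring member conceptually does. It is not dolly+'s actual code; the port number, chunk handling, and command-line interface are my assumptions.

    # ring_relay.py: sketch of one ring member (not dolly+'s real protocol).
    # Assumed: every node listens on port 9998; the last node gets no successor.
    import socket
    import sys

    CHUNK = 4 * 1024 * 1024   # the image travels in 4 MB chunks (slide 27)

    def relay(outfile, next_host=None):
        srv = socket.socket()
        srv.bind(("", 9998))
        srv.listen(1)
        upstream, _ = srv.accept()        # the previous node in the ring
        down = socket.create_connection((next_host, 9998)) if next_host else None
        with open(outfile, "wb") as f:
            while True:
                buf = upstream.recv(CHUNK)
                if not buf:               # upstream closed: end of the image
                    break
                f.write(buf)              # store the data locally and ...
                if down:
                    down.sendall(buf)     # ... relay it to the next node
        if down:
            down.close()

    if __name__ == "__main__":
        relay(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None)

The real dolly+ overlaps the disk write and the downstream send with threads (the pipelining of slide 23); this serial loop only shows the data path.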
22. Cascade topology
- The server bottleneck can be overcome.
- Cannot reach maximum network performance, but better than a many-clients-to-one-server topology.
- Weak against a node failure: a failure spreads down the cascade and is difficult to recover from.
23. Pipelining & multi-threading
24. Performance of dolly+
HW: Fujitsu TS225, Pentium III 1 GHz x2, SCSI disk, 512 MB memory, 100BaseT network
26. Fail recovery mechanism
- A single node failure can be a show stopper in a RING (series connection) topology.
- Dolly+ provides an automatic short-cut mechanism against node trouble:
- When a node fails, the upstream node detects it by a timeout while sending.
- The upstream node then negotiates with the downstream node for reconnection and re-transfer of a file chunk.
- The RING topology makes this easy to implement.
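A rough sketch of the short-cut idea from the upstream node's point of view; the timeout value, port, and ring bookkeeping are assumptions for illustration, not dolly+'s actual implementation.

    # shortcut.py: upstream side of the automatic short-cut (illustrative only).
    import socket

    PORT = 9998      # assumed ring port, as in the relay sketch above
    TIMEOUT = 30.0   # assumed send timeout in seconds

    def send_chunk(conn, chunk, successors):
        # successors[0] is the node conn points at; the rest follow in ring order.
        while True:
            try:
                conn.settimeout(TIMEOUT)
                conn.sendall(chunk)           # normal case: chunk delivered
                return conn
            except OSError:                   # timeout or broken connection
                conn.close()
                successors.pop(0)             # drop the failed successor, then
                conn = socket.create_connection((successors[0], PORT))
                # loop around: re-send the whole chunk to the short-cut target

Because the ring fixes the order of the nodes, the upstream node always knows which node comes after the failed one; this is why the RING topology makes the implementation easy.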
27. Re-transfer with short-cutting
(Figure: the file is split into 4 MB chunks numbered 1, 2, 3, ... from BOF to EOF. Chunks are pipelined over the network from the server through Node 1, Node 2, ... to the next node; when a node is short-cut, the chunk in flight is re-sent to the following node. This works even with a sequential file.)
28. Dolly+: how do you start it on Linux?
- Server side (which has the original file):
    dollyS -v -f config_file
- Node side:
    dollyC -v

Config file example:

    iofiles 3
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    client 20
    n001
    n002
    ...
    n020
    endconfig

(Annotations from the slide: 'iofiles 3' gives the # of files to transfer; 'server' names the master; 'client 20' gives the # of client nodes, followed by the clients' names; 'endconfig' ends the file.)

The left side of '>' is the input file on the server; the right side is the output file on the clients. '>' means dolly+ does not modify the image; '>>' indicates dolly+ should cook (decompress, untar, ...) the file according to its file name.
29. How does dolly+ clone the system after booting?
- Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
- The PXE/DHCP server responds to the nodes with each node's IP address and the kernel download server.
- The kernel and a RAM disk image are multicast-TFTPed to the nodes, and the kernel starts.
- The kernel hands off to an installation script which runs a disk tool and dolly+ (the scripts and applications are in the RAM disk image).
30. How does dolly+ start after rebooting?
- The code partitions the hard drive, creates file systems, and starts the dolly+ client on the node.
- You start the dolly+ master on the master host to kick off the disk-cloning process.
- The code then configures unique node information, such as the host name and IP address, from the DHCP information.
- The node is then ready to boot from its hard drive for the first time.
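As a toy illustration of that personalization step, the sketch below derives the host name from the DHCP-assigned address via reverse DNS; the approach and the file path are my assumptions, and the real script may read the DHCP lease directly.

    # personalize.py: derive the node's identity from its DHCP-assigned address.
    import socket

    def set_identity():
        ip = socket.gethostbyname(socket.gethostname())   # address set by DHCP
        name = socket.gethostbyaddr(ip)[0].split(".")[0]  # e.g. "n001"
        with open("/etc/hostname", "w") as f:             # path varies per distro
            f.write(name + "\n")

    if __name__ == "__main__":
        set_identity()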
31. PXE Trouble
- BY THE WAY: we sometimes suffered PXE mtftp transfer failures when >20 nodes boot simultaneously.
- If you have the same trouble, please mail me.
- We have started rewriting the mtftp client code of the RedHat Linux PXE server.
32. Configuration
33. (Sub)system Configuration
- Linux (Unix) has a lot of configuration files for configuring its sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files.
- To manage them, three types of solution:
- 1. A centralized information service server (like NIS).
- Needs support from the sub-system (nsswitch).
- 2. Automatic remote editing of the raw config files (like cfengine), as sketched below.
- Must take care of each node's files separately.
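As a flavor of solution 2, here is a minimal Python sketch of the idempotent "ensure this line exists" edit that cfengine-style tools perform on raw config files; the file name and NTP server are hypothetical.

    # ensure_line.py: cfengine-flavored idempotent edit of a raw config file.
    def ensure_line(path, line):
        with open(path) as f:
            if line in (l.rstrip("\n") for l in f):
                return False            # already configured: change nothing
        with open(path, "a") as f:      # otherwise append it exactly once
            f.write(line + "\n")
        return True

    # e.g. ensure_line("/etc/ntp.conf", "server ntp1.kek.jp")   # hypothetical

Run a second time it changes nothing, which is what lets such a tool be applied repeatedly to 1000 nodes; but, as noted above, every node's file still has to be visited separately.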
34. Configuration: a new proposal from CS
- 3. Program (configure) the whole system from source code in an O.O. way:
- systematic & uniform configuration;
- source reuse (inheritance) as much as possible;
- templates;
- overriding of other sites' configurations.
- Arusha (http://ark.sourceforge.net)
- LCFGng (http://www.lcfg.org)
35. LCFGng (Univ. Edinburgh)
(Diagram of the LCFGng workflow: edit the source, then compile.)
36. LCFGng
- Good things:
- The author says that it works on 1000 nodes.
- Fully automatic (you just edit the source code and compile it on one host).
- Differences between sub-systems are hidden from the user (administrator), or moved into components (DB -> actual config file).
37. LCFGng
- Bad things:
- The configuration language is too primitive:
- Hostname.Component.Parameter Value (see the example below).
- Components are not so many, or you must write your own component scripts for each sub-system yourself.
- It is far easier to write the config file itself than to write a component.
- The activation timing of a config change cannot be controlled.
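For illustration, one LCFGng-style source line binds one value to one parameter of one component on one host (the names here are hypothetical, not from the slides):

    n001.syslog.server   loghost.kek.jp

Every setting on every host is one such triple plus a value, which is why the language feels primitive as soon as a sub-system needs many related parameters.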
38. Status monitoring
39. Status Monitoring
- System state monitoring:
- CPU/memory/disk/network utilization
- Ganglia[1], Palantir[2]
- (Sub-)system service sanity check:
- PIKT[3] / PICA[4] / cfengine
[1] http://ganglia.sourceforge.net  [2] http://www.netsonde.com  [3] http://pikt.org  [4] http://pica.sourceforge.net/wtf.html
40. Ganglia (Univ. California)
- gmond (on each node):
- All nodes multicast their system status info to each other, so each node holds the current status of all nodes. -> good redundancy and robustness
- Declared to work on 1000 nodes.
- Meta-daemon (on a Web server):
- Stores the volatile gmond data in a round-robin DB and presents an XML image of all nodes' activity.
- Web interface.
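A toy sketch of the gmond idea: every node multicasts its own status and caches everyone else's. The multicast group and port mirror Ganglia's well-known defaults to the best of my knowledge, and the JSON message format is mine, not gmond's.

    # gstat.py: gmond-flavored status exchange over IP multicast (toy version).
    import json, os, socket, struct, threading, time

    GROUP, PORT = "239.2.11.71", 8649

    def sender():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        while True:
            msg = {"host": socket.gethostname(), "load": os.getloadavg()[0]}
            s.sendto(json.dumps(msg).encode(), (GROUP, PORT))
            time.sleep(10)                    # announce our own status

    def receiver():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        cache = {}                            # each node sees all nodes
        while True:
            msg = json.loads(s.recv(65535))
            cache[msg["host"]] = msg

    threading.Thread(target=sender, daemon=True).start()
    receiver()

Since every node holds the full picture, the loss of any single node loses no monitoring state, which is the redundancy the slide refers to.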
42. Palantir (network adaptation)
- Quick understanding of the system status from one Web page.
43. Remote Execution
44. Remote execution
- Administrators sometimes need to issue a command to all (or part of the) nodes urgently.
- Remote execution could be rsh / ssh / PIKT / cfengine / SUT (mpich)* / gexec ...
- The points are:
- making it easy to see the execution result (failure or success) at a glance;
- parallel execution among the nodes: otherwise, if a command takes 1 sec on each node, it takes 1000 sec over 1000 nodes (see the sketch below).
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
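That last point is the whole motivation for parallel tools such as SUT, or WANI on the next slide. A minimal sketch of parallel fan-out with at-a-glance results, assuming password-less ssh and hypothetical node names:

    # prun.py: run one command on many nodes in parallel, summarize pass/fail.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run(node, cmd):
        try:
            r = subprocess.run(["ssh", node, cmd], capture_output=True,
                               text=True, timeout=60)
            return node, r.returncode
        except subprocess.TimeoutExpired:
            return node, -1                      # hung node counts as a failure

    nodes = ["n%03d" % i for i in range(1, 21)]  # hypothetical n001..n020
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for node, rc in pool.map(lambda n: run(n, "uptime"), nodes):
            print(node, "OK" if rc == 0 else "FAIL(%d)" % rc)

With 1000 nodes this finishes in roughly the time of the slowest single node rather than the sum over all of them.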
45. WANI
- A Web-based remote command executor:
- easy to select the nodes concerned;
- easy to specify a script, or to type in command lines, to execute on the nodes;
- issues the commands to the nodes in parallel;
- collects the results, with error/failure detection.
- Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
46. WANI is implemented on the Webmin GUI
(Screenshot: start button, command input, node selection.)
47. Command execution result
(Screenshot: results from 200 nodes on one page, listed by host name, with a switch to further pages.)
48. Error detection
(Screenshot: failures are flagged by the background color.)
50. WANI implementation
(Diagram: a Web browser talks to the Webmin server; commands go through the PIKT server (piktc) to piktc_svc on the node hosts for execution; results come back through an error detector and a print_filter to lpd.)
51. Summary
- I reviewed admin tools which can be used on a 1000-node Linux PC cluster.
- Installation: dolly+
- installs/updates/switches >100 node hosts very quickly.
- Configuration manager:
- not matured yet, but we can expect a lot from DataGrid research.
- Status monitor:
- several good pieces of software already exist;
- at the cost of extra daemons and network traffic.
52. Summary
- Remote command execution:
- results visible at a glance are important for quick iteration;
- parallel execution is important.
- Some programs and links are / will be at:
- http://corvus.kek.jp/~manabe
- Thank you for listening.