Title: Administration Tools for Managing Large Scale Linux Cluster
1. Administration Tools for Managing Large Scale Linux Cluster
- CRC, KEK, Japan
- S. Kawabata, A. Manabe
- atsushi.manabe@kek.jp
2. Linux PC Clusters in KEK
3. PC Clusters 1 and 2
- PC Cluster 2: Pentium III 800 MHz, 80 CPUs (40 nodes)
- PC Cluster 1: Pentium III Xeon 500 MHz, 144 CPUs (36 nodes)
4. PC Cluster 3 (Belle): Pentium III Xeon 700 MHz, 320 CPUs (80 nodes)
5. PC Cluster 4 (neutron simulation)
- Fujitsu TS225, 50 nodes
- Pentium III 1GHz x 2CPU 
- 512MB memory 
- 31GB disk 
- 100BaseTX x 2 
- 1U rack-mount model 
- RS232C x2 
- Remote BIOS setting 
- Remote reset/power-off 
6. PC Cluster 5 (Belle): 1U servers, Pentium III 1.2 GHz, 256 CPUs (128 nodes)
7. PC Cluster 6: blade server (3U), LP Pentium III 700 MHz, 40 CPUs (40 nodes)
8. PC clusters
- More than 400 nodes (>800 CPUs) of Linux PC clusters are already installed.
- Only middle-size or larger PC clusters are counted.
- A major experiment group (Belle) plans to install several hundred blade-server nodes this year.
- All PC clusters are managed by the individual user groups themselves.
9. Center Machine (KEK CRC)
- Currently, the machines in the KEK Computing Research Center (CRC) are UNIX (Solaris, AIX) servers.
- We plan to have a >1000-node Linux computing cluster in the near future (2004).
- It will be installed under a 4-year rental contract (HW update every 2 years?).
10. Center Machine
- The system will be shared among many user groups (not dedicated to only one group).
- Their demand for CPU power varies from month to month (high demand before an international conference, and so on).
- Of course, we use a load-balancing batch system.
- Big groups use their own software frameworks.
- Their jobs run only under restricted versions of the OS (Linux), middleware, and configuration.
11. R&D system
- Frequent changes of system configuration / CPU partitioning.
- To manage a PC cluster of this size, with such user requests, we need sophisticated admin tools.
12. Necessary admin tools
- System (SW) installation/update
- Configuration
- Status monitoring / system health check
- Command execution
13. Installation tool
14. Installation tool
- Two types of installation tool:
- Disk cloning
- Application package installer
- The system (kernel) is an application in this terminology.
15. Installation tool (cloning)
- Image cloning: install the system and applications on a master host, then copy the master's disk image to every node.
16. Installation tool (package installer)
(Diagram: clients send requests to a package server; the server returns images and control data, backed by a package information DB and a package archive.)
17. Remote Installation via NW (public domain software)
- Cloning a disk image:
- SystemImager (VA) http://systemimager.sourceforge.net/
- CATS-i (Soongsil Univ.)
- CloneIt http://www.ferzkopp.net/Software/CloneIt/
- Commercial: ImageCast, Ghost, ...
- Packages/applications installation:
- Kickstart + rpm (RedHat)
- LUI (IBM) http://oss.software.ibm.com/developerworks/projects/lui
- Lucie (TITech) http://matsu-www.is.titech.ac.jp/~takamiya/lucie/
- LCFGng, Arusha
18. Dolly+
- We developed dolly+, an installer that clones disk images via the network.
- WHY ANOTHER?
- We install/update, maybe frequently (according to user needs), 100-1000 nodes simultaneously.
- Making packages for our own software is tedious.
- Traditional server/client software suffers from a server bottleneck.
- Multicast copy of a GB-size image seems unstable (no free software?).
19. (Few) server - (many) client model
- The server can be a daemon process (you don't need to start it by hand).
- Performance is not scalable against the # of nodes:
- server bottleneck, network congestion.
Multicasting or broadcasting
- No server bottleneck.
- Gets the maximum performance out of a network whose switch fabrics support multicasting.
- A node failure does not affect the whole process very much, so it can be robust.
- But a failed node needs re-transfer, so speed is governed by the slowest node, as in a RING topology.
- Not TCP but UDP, so the application must take care of transfer reliability.
20. Dolly and Dolly+
- Dolly+
- A Linux application to copy/clone files and/or disk images among many PCs through a network.
- Dolly was originally developed by the CoPs project at ETH (Switzerland) and is open software.
- Dolly+ features:
- Transfers/copies sequential files (no 2 GB size limitation) and/or normal files (optional decompress and untar on the fly) via a TCP/IP network.
- Virtual RING network connection topology, to cope with the server bottleneck problem.
- Pipeline and multi-threading mechanism for speed-up.
- Fail recovery mechanism for robust operation.
21. Dolly+: virtual ring topology
- The master is the host having the original image; the physical network connection is whatever you like.
- Logically, dolly+ makes a ring chain of the nodes, specified in dolly's config file, and sends data node to node, bucket-relay style.
- Though each transfer is only between two adjacent nodes, this utilizes the maximum performance of a switching network with full-duplex ports.
- Good for a network complex of many switches.
(Figure: node PCs and network hubs/switches; the physical connections vs. the logical (virtual) ring connection.)
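To make the bucket relay concrete, here is a minimal Python sketch of what one ring member conceptually does. It is not dolly+'s actual code; the port number, chunk handling, and command-line interface are my assumptions.

    # ring_relay.py: sketch of one ring member (not dolly+'s real protocol).
    # Assumed: every node listens on port 9998; the last node gets no successor.
    import socket
    import sys

    CHUNK = 4 * 1024 * 1024   # the image travels in 4 MB chunks (slide 27)

    def relay(outfile, next_host=None):
        srv = socket.socket()
        srv.bind(("", 9998))
        srv.listen(1)
        upstream, _ = srv.accept()        # the previous node in the ring
        down = socket.create_connection((next_host, 9998)) if next_host else None
        with open(outfile, "wb") as f:
            while True:
                buf = upstream.recv(CHUNK)
                if not buf:               # upstream closed: end of the image
                    break
                f.write(buf)              # store the data locally and ...
                if down:
                    down.sendall(buf)     # ... relay it to the next node
        if down:
            down.close()

    if __name__ == "__main__":
        relay(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None)

The real dolly+ overlaps the disk write and the downstream send with threads (the pipelining of slide 23); this serial loop only shows the data path.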
22. Cascade topology
- The server bottleneck can be overcome.
- Cannot reach maximum network performance, but better than a many-clients-to-one-server topology.
- Weak against a node failure: a failure spreads down the cascade and is difficult to recover from.
23. Pipelining & multi-threading
24. Performance of dolly+
HW: Fujitsu TS225, Pentium III 1 GHz x2, SCSI disk, 512 MB memory, 100BaseT network
26. Fail recovery mechanism
- A single node failure can be a show stopper in a RING (series connection) topology.
- Dolly+ provides an automatic short-cut mechanism against node trouble:
- When a node fails, the upstream node detects it by a timeout while sending.
- The upstream node then negotiates with the downstream node for reconnection and re-transfer of a file chunk.
- The RING topology makes this easy to implement.
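A rough sketch of the short-cut idea from the upstream node's point of view; the timeout value, port, and ring bookkeeping are assumptions for illustration, not dolly+'s actual implementation.

    # shortcut.py: upstream side of the automatic short-cut (illustrative only).
    import socket

    PORT = 9998      # assumed ring port, as in the relay sketch above
    TIMEOUT = 30.0   # assumed send timeout in seconds

    def send_chunk(conn, chunk, successors):
        # successors[0] is the node conn points at; the rest follow in ring order.
        while True:
            try:
                conn.settimeout(TIMEOUT)
                conn.sendall(chunk)           # normal case: chunk delivered
                return conn
            except OSError:                   # timeout or broken connection
                conn.close()
                successors.pop(0)             # drop the failed successor, then
                conn = socket.create_connection((successors[0], PORT))
                # loop around: re-send the whole chunk to the short-cut target

Because the ring fixes the order of the nodes, the upstream node always knows which node comes after the failed one; this is why the RING topology makes the implementation easy.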
27. Re-transfer with short-cutting
(Figure: the file is split into 4 MB chunks numbered 1, 2, 3, ... from BOF to EOF. Chunks are pipelined over the network from the server through Node 1, Node 2, ... to the next node; when a node is short-cut, the chunk in flight is re-sent to the following node. This works even with a sequential file.)
28. Dolly+: how do you start it on Linux?
- Server side (which has the original file):
    dollyS -v -f config_file
- Node side:
    dollyC -v

Config file example:

    iofiles 3
    /dev/hda1 > /tmp/dev/hda1
    /data/file.gz >> /data/file
    boot.tar.Z >> /boot
    server n000.kek.jp
    firstclient n001.kek.jp
    lastclient n020.kek.jp
    client 20
    n001
    n002
    ...
    n020
    endconfig

(Annotations from the slide: 'iofiles 3' gives the # of files to transfer; 'server' names the master; 'client 20' gives the # of client nodes, followed by the clients' names; 'endconfig' ends the file.)

The left side of '>' is the input file on the server; the right side is the output file on the clients. '>' means dolly+ does not modify the image; '>>' indicates dolly+ should cook (decompress, untar, ...) the file according to its file name.
29. How does dolly+ clone the system after booting?
- Nodes broadcast over the LAN in search of an installation server (Pre-eXecution Environment, PXE).
- The PXE/DHCP server responds to the nodes with each node's IP address and the kernel download server.
- The kernel and a RAM disk image are multicast-TFTPed to the nodes, and the kernel starts.
- The kernel hands off to an installation script which runs a disk tool and dolly+ (the scripts and applications are in the RAM disk image).
30. How does dolly+ start after rebooting?
- The code partitions the hard drive, creates file systems, and starts the dolly+ client on the node.
- You start the dolly+ master on the master host to kick off the disk-cloning process.
- The code then configures unique node information, such as the host name and IP address, from the DHCP information.
- The node is then ready to boot from its hard drive for the first time.
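As a toy illustration of that personalization step, the sketch below derives the host name from the DHCP-assigned address via reverse DNS; the approach and the file path are my assumptions, and the real script may read the DHCP lease directly.

    # personalize.py: derive the node's identity from its DHCP-assigned address.
    import socket

    def set_identity():
        ip = socket.gethostbyname(socket.gethostname())   # address set by DHCP
        name = socket.gethostbyaddr(ip)[0].split(".")[0]  # e.g. "n001"
        with open("/etc/hostname", "w") as f:             # path varies per distro
            f.write(name + "\n")

    if __name__ == "__main__":
        set_identity()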
31. PXE Trouble
- BY THE WAY: we sometimes suffered PXE mtftp transfer failures when >20 nodes boot simultaneously.
- If you have the same trouble, please mail me.
- We have started rewriting the mtftp client code of the RedHat Linux PXE server.
32. Configuration
33. (Sub)system Configuration
- Linux (Unix) has a lot of configuration files for configuring its sub-systems. If you have 1000 nodes, you have to manage (many) x 1000 config files.
- To manage them, three types of solution:
- 1. A centralized information service server (like NIS).
- Needs support from the sub-system (nsswitch).
- 2. Automatic remote editing of the raw config files (like cfengine), as sketched below.
- Must take care of each node's files separately.
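As a flavor of solution 2, here is a minimal Python sketch of the idempotent "ensure this line exists" edit that cfengine-style tools perform on raw config files; the file name and NTP server are hypothetical.

    # ensure_line.py: cfengine-flavored idempotent edit of a raw config file.
    def ensure_line(path, line):
        with open(path) as f:
            if line in (l.rstrip("\n") for l in f):
                return False            # already configured: change nothing
        with open(path, "a") as f:      # otherwise append it exactly once
            f.write(line + "\n")
        return True

    # e.g. ensure_line("/etc/ntp.conf", "server ntp1.kek.jp")   # hypothetical

Run a second time it changes nothing, which is what lets such a tool be applied repeatedly to 1000 nodes; but, as noted above, every node's file still has to be visited separately.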
34. Configuration: a new proposal from CS
- 3. Program (configure) the whole system from source code in an O.O. way:
- systematic & uniform configuration;
- source reuse (inheritance) as much as possible;
- templates;
- overriding of other sites' configurations.
- Arusha (http://ark.sourceforge.net)
- LCFGng (http://www.lcfg.org)
35. LCFGng (Univ. Edinburgh)
(Diagram of the LCFGng workflow: edit the source, then compile.)
36. LCFGng
- Good things:
- The author says that it works on 1000 nodes.
- Fully automatic (you just edit the source code and compile it on one host).
- Differences between sub-systems are hidden from the user (administrator), or moved into components (DB -> actual config file).
37. LCFGng
- Bad things:
- The configuration language is too primitive:
- Hostname.Component.Parameter Value (see the example below).
- Components are not so many, or you must write your own component scripts for each sub-system yourself.
- It is far easier to write the config file itself than to write a component.
- The activation timing of a config change cannot be controlled.
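For illustration, one LCFGng-style source line binds one value to one parameter of one component on one host (the names here are hypothetical, not from the slides):

    n001.syslog.server   loghost.kek.jp

Every setting on every host is one such triple plus a value, which is why the language feels primitive as soon as a sub-system needs many related parameters.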
38. Status monitoring
39. Status Monitoring
- System state monitoring:
- CPU/memory/disk/network utilization
- Ganglia[1], Palantir[2]
- (Sub-)system service sanity check:
- PIKT[3] / PICA[4] / cfengine
[1] http://ganglia.sourceforge.net  [2] http://www.netsonde.com  [3] http://pikt.org  [4] http://pica.sourceforge.net/wtf.html
40. Ganglia (Univ. California)
- gmond (on each node):
- All nodes multicast their system status info to each other, so each node holds the current status of all nodes. -> good redundancy and robustness
- Declared to work on 1000 nodes.
- Meta-daemon (on a Web server):
- Stores the volatile gmond data in a round-robin DB and presents an XML image of all nodes' activity.
- Web interface.
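A toy sketch of the gmond idea: every node multicasts its own status and caches everyone else's. The multicast group and port mirror Ganglia's well-known defaults to the best of my knowledge, and the JSON message format is mine, not gmond's.

    # gstat.py: gmond-flavored status exchange over IP multicast (toy version).
    import json, os, socket, struct, threading, time

    GROUP, PORT = "239.2.11.71", 8649

    def sender():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
        while True:
            msg = {"host": socket.gethostname(), "load": os.getloadavg()[0]}
            s.sendto(json.dumps(msg).encode(), (GROUP, PORT))
            time.sleep(10)                    # announce our own status

    def receiver():
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        s.bind(("", PORT))
        mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
        s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
        cache = {}                            # each node sees all nodes
        while True:
            msg = json.loads(s.recv(65535))
            cache[msg["host"]] = msg

    threading.Thread(target=sender, daemon=True).start()
    receiver()

Since every node holds the full picture, the loss of any single node loses no monitoring state, which is the redundancy the slide refers to.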
42. Palantir (network adaptation)
- Quick understanding of the system status from one Web page.
43. Remote Execution
44. Remote execution
- Administrators sometimes need to issue a command to all (or part of the) nodes urgently.
- Remote execution could be rsh / ssh / PIKT / cfengine / SUT (mpich)* / gexec ...
- The points are:
- making it easy to see the execution result (failure or success) at a glance;
- parallel execution among the nodes: otherwise, if a command takes 1 sec on each node, it takes 1000 sec over 1000 nodes (see the sketch below).
*) Scalable Unix Tools for clusters: http://www-unix.mcs.anl.gov/sut/
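That last point is the whole motivation for parallel tools such as SUT, or WANI on the next slide. A minimal sketch of parallel fan-out with at-a-glance results, assuming password-less ssh and hypothetical node names:

    # prun.py: run one command on many nodes in parallel, summarize pass/fail.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run(node, cmd):
        try:
            r = subprocess.run(["ssh", node, cmd], capture_output=True,
                               text=True, timeout=60)
            return node, r.returncode
        except subprocess.TimeoutExpired:
            return node, -1                      # hung node counts as a failure

    nodes = ["n%03d" % i for i in range(1, 21)]  # hypothetical n001..n020
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for node, rc in pool.map(lambda n: run(n, "uptime"), nodes):
            print(node, "OK" if rc == 0 else "FAIL(%d)" % rc)

With 1000 nodes this finishes in roughly the time of the slowest single node rather than the sum over all of them.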
45. WANI
- A Web-based remote command executor:
- easy to select the nodes concerned;
- easy to specify a script, or to type in command lines, to execute on the nodes;
- issues the commands to the nodes in parallel;
- collects the results, with error/failure detection.
- Currently the software is a prototype built from combinations of existing protocols and tools. (Anyway, it works!)
46. WANI is implemented on the Webmin GUI
(Screenshot: start button, command input, node selection.)
47. Command execution result
(Screenshot: results from 200 nodes on one page, listed by host name, with a switch to further pages.)
48. Error detection
(Screenshot: failures are flagged by the background color.)
50. WANI implementation
(Diagram: a Web browser talks to the Webmin server; commands go through the PIKT server (piktc) to piktc_svc on the node hosts for execution; results come back through an error detector and a print_filter to lpd.)
51. Summary
- I reviewed admin tools which can be used on a 1000-node Linux PC cluster.
- Installation: dolly+
- installs/updates/switches >100 node hosts very quickly.
- Configuration manager:
- not matured yet, but we can expect a lot from DataGrid research.
- Status monitor:
- several good pieces of software already exist;
- at the cost of extra daemons and network traffic.
52. Summary
- Remote command execution:
- results visible at a glance are important for quick iteration;
- parallel execution is important.
- Some programs and links are / will be at:
- http://corvus.kek.jp/~manabe
- Thank you for listening.